Describe the bug
When a model naturally ends a thinking block (emits </think> on its own before exhausting the budget), the ThinkingBudgetStateHolder state machine fails to track subsequent thinking blocks. A second <think> block in the same completion is never recognized as "in think" mode, so the budget is never enforced on it.
This means a model that naturally ends one thinking block early can then open a new <think> block and reason indefinitely with no budget enforcement.
Root Cause
After a natural </think>, start_thinking and end_thinking retain their values from Block 1 and are never reset. When Block 2's <think> appears, _find_last_sequence_index searches from position 0 (since scan_offset is not set for natural ends), finds the original Block 1 <think>, and since start_thinking (0) < end_thinking (7), hits the "exiting think mode" branch — so Block 2 is never recognized as entering think mode.
Compare with forced-end re-entry (fixed in #43757): scan_offset advances past Block 1 after the forced close completes (line 461), so the re-entry <think> is correctly detected as new.
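The index comparison described above can be illustrated with a small standalone sketch. This is not vLLM's implementation: `last_index` and `in_think` are hypothetical helpers written for this issue, and the token IDs are the same placeholders used in the reproduction. It only shows why comparing the stale Block 1 positions (0, 7) reports "not thinking", while positions recomputed after Block 2's <think> would report "thinking".

```python
THINK_START, THINK_END = 100, 200  # placeholder IDs, same as the reproduction


def last_index(tokens, target):
    """Return the index of the last occurrence of target, or -1 if absent."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == target:
            return i
    return -1


def in_think(start_idx, end_idx):
    """A think block is open when the latest start comes after the latest end."""
    return start_idx > end_idx


# Block 1 (6 think tokens, natural end), 3 content tokens, then Block 2 opens.
tokens = [THINK_START] + [60] * 6 + [THINK_END] + [50] * 3 + [THINK_START]

stale = (0, 7)  # positions recorded for Block 1 and never refreshed
fresh = (last_index(tokens, THINK_START), last_index(tokens, THINK_END))

print(in_think(*stale))  # False: looks like "exiting think mode"
print(in_think(*fresh))  # True: Block 2 is open
```

Under this toy model, any fix has to make the holder see positions equivalent to `fresh` (for example by advancing the scan past Block 1 or resetting the stored indices) once a new <think> arrives.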
Reproduction
import torch
from dataclasses import dataclass
from unittest.mock import MagicMock
from vllm.v1.sample.thinking_budget_state import ThinkingBudgetStateHolder

THINK_START = 100
THINK_END = [200]
BUDGET = 10


@dataclass
class FakeReasoningConfig:
    reasoning_start_token_ids: list
    reasoning_end_token_ids: list
    enabled: bool = True


cfg = FakeReasoningConfig(
    reasoning_start_token_ids=[THINK_START],
    reasoning_end_token_ids=THINK_END,
)
holder = ThinkingBudgetStateHolder(
    reasoning_config=cfg, max_num_seqs=8,
    num_spec_tokens=0, device=torch.device("cpu"), is_pin_memory=False,
)
params = MagicMock()
params.thinking_token_budget = BUDGET
batch_update = MagicMock(removed=[], added=[(0, params, None, [])], moved=[])
holder.sync_batch(batch_update)

output = []

# Block 1: 6 tokens + natural </think>
output.append(THINK_START)
holder.update_state([list(output)], None, None)
for _ in range(6):
    output.append(60)  # think token
    holder.update_state([list(output)], None, None)
output.append(THINK_END[0])  # natural end
holder.update_state([list(output)], None, None)

# Content
for _ in range(3):
    output.append(50)
    holder.update_state([list(output)], None, None)

# Block 2: re-entry
output.append(THINK_START)
holder.update_state([list(output)], None, None)
for _ in range(14):
    output.append(60)
    holder.update_state([list(output)], None, None)

state = holder._state[0]
assert state["in_end"], f"Block 2 should be budget-enforced after 14 tokens (budget={BUDGET}), but in_end={state['in_end']}"
Expected behavior
Block 2 should be budget-enforced. Either:
- Cumulative: tokens from Block 1 (6) count toward the total, so Block 2 is cut after 4 tokens
- Per-block reset: Block 2 gets a fresh budget of 10, enforced after 10 tokens
Either policy is acceptable; what's not acceptable is zero enforcement.
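To make the two acceptable policies concrete, here is a minimal standalone sketch of how many think tokens each block would get under either one. `think_tokens_allowed` is a hypothetical helper for illustration only and does not reflect how ThinkingBudgetStateHolder is structured; it simply stops counting once the budget is used up, where a real implementation would force the end sequence.

```python
def think_tokens_allowed(tokens, start_id, end_id, budget, per_block):
    """Return, for each think block, how many think tokens fit in the budget."""
    allowed = []
    used = 0
    thinking = False
    for tok in tokens:
        if tok == start_id:
            thinking = True
            allowed.append(0)
            if per_block:
                used = 0  # per-block policy: every block starts fresh
        elif tok == end_id:
            thinking = False
        elif thinking and used < budget:
            used += 1
            allowed[-1] += 1
        # Think tokens past the budget are not counted; a real
        # implementation would force the end sequence at this point.
    return allowed


# Same shape as the reproduction: 6 think tokens, natural end,
# 3 content tokens, then a second block that tries 14 think tokens.
tokens = [100] + [60] * 6 + [200] + [50] * 3 + [100] + [60] * 14

print(think_tokens_allowed(tokens, 100, 200, budget=10, per_block=False))  # [6, 4]
print(think_tokens_allowed(tokens, 100, 200, budget=10, per_block=True))   # [6, 10]
```

Either result would be consistent with the expectation above; the current behavior corresponds to Block 2 having no limit at all.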
Actual behavior
Block 2 never enters in_think=True, so the budget countdown never starts. The model can reason indefinitely in the second block.
Environment
- vLLM: vllm/vllm-openai:latest (also confirmed on current main)
- Affects all models using thinking_token_budget that can produce multiple think blocks in one completion
Related
- thinking_token_budget enforcement fails on multi-turn conversations when max_completion_tokens >> thinking_token_budget with ignore_eos:true #43708: forced-end re-entry (fix in PR "[Bugfix][Reasoning] Fix thinking_token_budget not enforced on re-entry after forced end" #43757, scoped to the scan_offset > 0 case)