
[Bug]: thinking_token_budget not enforced on re-entry after natural </think> #45974

@ashwing

Describe the bug

When a model ends a thinking block naturally (it emits </think> on its own before exhausting the budget), the ThinkingBudgetStateHolder state machine stops tracking later thinking blocks. A second <think> block in the same completion is never recognized as being in think mode, so the budget is never enforced on it.

This means a model that naturally ends one thinking block early can then open a new <think> block and reason indefinitely with no budget enforcement.

Root Cause

After a natural </think>, start_thinking and end_thinking retain their Block 1 values and are never reset. When Block 2's <think> appears, _find_last_sequence_index searches from position 0 (scan_offset is not set for natural ends) and finds the original Block 1 <think>. Because start_thinking (0) < end_thinking (7), the code takes the "exiting think mode" branch, so Block 2 is never recognized as entering think mode.

Compare with forced-end re-entry (fixed in #43757): scan_offset advances past Block 1 after the forced close completes (line 461), so the re-entry <think> is correctly detected as new.
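The difference between the two paths can be illustrated with a small standalone model. This is only a sketch of the behavior described above, not vLLM's code: `find_first_from` and `is_in_think` are hypothetical helpers, and the token layout mirrors the reproduction below (Block 1 at indices 0-7, content at 8-10, Block 2 starting at 11).

```python
THINK_START, THINK_END = 100, 200


def find_first_from(tokens, target, offset=0):
    """Hypothetical helper: index of the first `target` at or after `offset`, else -1."""
    for i in range(offset, len(tokens)):
        if tokens[i] == target:
            return i
    return -1


def is_in_think(start_pos, end_pos):
    """A block counts as open only if a start marker exists with no later end marker."""
    return start_pos >= 0 and start_pos > end_pos


# Block 1 (start, 6 think tokens, natural end), 3 content tokens, Block 2 start + 2 tokens
tokens = [THINK_START] + [60] * 6 + [THINK_END] + [50] * 3 + [THINK_START] + [60] * 2

# Natural end: the scan still begins at 0, so Block 1's markers are compared again.
stale_start = find_first_from(tokens, THINK_START, 0)
stale_end = find_first_from(tokens, THINK_END, 0)
stale_result = is_in_think(stale_start, stale_end)

# Forced-end style: the scan offset is advanced past Block 1's end marker first.
offset = stale_end + 1
new_start = find_first_from(tokens, THINK_START, offset)
new_end = find_first_from(tokens, THINK_END, offset)
fresh_result = is_in_think(new_start, new_end)

print(stale_result, fresh_result)
```

With the scan anchored at 0, the comparison is 0 < 7 and the re-entry is missed. Once the offset is moved past Block 1, the only start marker visible is the one at index 11, with no end marker after it, so the block is treated as open.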

Reproduction

import torch
from dataclasses import dataclass
from unittest.mock import MagicMock
from vllm.v1.sample.thinking_budget_state import ThinkingBudgetStateHolder

THINK_START = 100
THINK_END = [200]
BUDGET = 10

@dataclass
class FakeReasoningConfig:
    reasoning_start_token_ids: list
    reasoning_end_token_ids: list
    enabled: bool = True

cfg = FakeReasoningConfig(
    reasoning_start_token_ids=[THINK_START],
    reasoning_end_token_ids=THINK_END,
)
holder = ThinkingBudgetStateHolder(
    reasoning_config=cfg, max_num_seqs=8,
    num_spec_tokens=0, device=torch.device("cpu"), is_pin_memory=False,
)

params = MagicMock()
params.thinking_token_budget = BUDGET
batch_update = MagicMock(removed=[], added=[(0, params, None, [])], moved=[])
holder.sync_batch(batch_update)

output = []

# Block 1: 6 tokens + natural </think>
output.append(THINK_START)
holder.update_state([list(output)], None, None)
for _ in range(6):
    output.append(60)  # think token
    holder.update_state([list(output)], None, None)
output.append(THINK_END[0])  # natural end
holder.update_state([list(output)], None, None)

# Content
for _ in range(3):
    output.append(50)
    holder.update_state([list(output)], None, None)

# Block 2: re-entry
output.append(THINK_START)
holder.update_state([list(output)], None, None)
for _ in range(14):  # 14 think tokens, more than BUDGET
    output.append(60)
    holder.update_state([list(output)], None, None)

state = holder._state[0]
assert state["in_end"], f"Block 2 should be budget-enforced after 14 tokens (budget={BUDGET}), but in_end={state['in_end']}"

Expected behavior

Block 2 should be budget-enforced. Either:

  • Cumulative: the 6 tokens from Block 1 count toward the total, so Block 2 is cut off after 4 tokens
  • Per-block reset: Block 2 gets a fresh budget of 10 and is cut off after 10 tokens

Either policy is acceptable; what's not acceptable is zero enforcement.
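To make the two policies concrete, here is a minimal sketch of the arithmetic for the reproduction's numbers (budget 10, Block 1 wants 6 tokens, Block 2 wants 14). The function `allowed_thinking_tokens` and the policy names are hypothetical and exist only for illustration; they are not part of vLLM.

```python
def allowed_thinking_tokens(requested, budget, policy):
    """Return how many thinking tokens each block may emit under a policy.

    requested: tokens each block would emit if left unbounded.
    policy: "cumulative" shares one budget across all blocks;
            "per_block" gives every block a fresh budget.
    """
    if policy not in ("cumulative", "per_block"):
        raise ValueError(f"unknown policy: {policy}")
    allowed = []
    remaining = budget
    for wanted in requested:
        if policy == "per_block":
            remaining = budget  # fresh budget for every block
        granted = min(wanted, remaining)
        allowed.append(granted)
        remaining -= granted  # only carries over under "cumulative"
    return allowed


cumulative = allowed_thinking_tokens([6, 14], budget=10, policy="cumulative")
per_block = allowed_thinking_tokens([6, 14], budget=10, policy="per_block")
print(cumulative, per_block)
```

Under the cumulative policy the result is [6, 4]; under the per-block policy it is [6, 10]. The current behavior corresponds to neither, since Block 2 runs all 14 tokens unchecked.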

Actual behavior

Block 2 never enters in_think=True, so the budget countdown never starts. The model can reason indefinitely in the second block.

Environment

  • vLLM: vllm/vllm-openai:latest (also confirmed on current main)
  • Affects all models using thinking_token_budget that can produce multiple think blocks in one completion
