Add support for reasoning models #1711

Draft
RobinPicard wants to merge 1 commit into main from add_thinking_mode
RobinPicard (Contributor) commented Aug 6, 2025

Expose two new keywords for generation:

  • end_thinking_tag: a string giving the tag the reasoning model emits when it has finished thinking (the point at which we should start constraining the generation)
  • thinking_max_tokens: an int giving the maximum number of tokens the model can think for. Once that number is reached, we force the generation of the end-of-thinking token

Not supported:

  • Models for which the end of the thinking does not correspond to a single token

In the future, once we return an object with several attributes instead of just the text output, we may want to capture the content of the thinking. In that case we could add a start_thinking_tag argument for the models that use one.
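The behavior of the two keywords can be read as a small per-step decision rule. Below is a minimal sketch for a single sequence, assuming logits are a plain Python list and `generated_ids` holds only the tokens generated so far (not the prompt). The function name `thinking_step` and its return convention are hypothetical and chosen for illustration; this is not the implementation in this PR.

```python
import math
from typing import List, Optional, Tuple


def thinking_step(
    logits: List[float],
    generated_ids: List[int],
    end_thinking_token_id: int,
    thinking_max_tokens: Optional[int],
) -> Tuple[List[float], bool]:
    """Return (logits, constrain) for one sequence at one decoding step.

    `constrain` is True once thinking is over, meaning the caller should
    apply the structured-generation constraints from now on.
    """
    # Thinking is over once the end-of-thinking token has been generated.
    if end_thinking_token_id in generated_ids:
        return logits, True

    # Budget exhausted: mask every token except the end-of-thinking token.
    if thinking_max_tokens is not None and len(generated_ids) >= thinking_max_tokens:
        forced = [-math.inf] * len(logits)
        forced[end_thinking_token_id] = logits[end_thinking_token_id]
        return forced, False

    # Still thinking within budget: leave the logits untouched.
    return logits, False
```

Because the end of thinking is detected by a single token id, this sketch shares the limitation listed under "Not supported": a tag that spans several tokens would need a different check.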

The tag the model uses to indicate the end of the thinking process.
Only used when running a thinking model.
thinking_max_tokens: int | None
The maximum number of tokens the model can think about. Only used when

Member

Suggested change
- The maximum number of tokens the model can think about. Only used when
+ The maximum number of tokens the model can think for. Only used when

Comment on lines +41 to +44
end_thinking_token_id: int | None
The id of the end thinking token
thinking_max_tokens: int | None
The maximum number of tokens the model can think about

Member

Isn't it possible to build only a specialized logits processor that the backends are unaware of? You should be able to skip the logits biasing function as long as </think> has not been generated, and enforce the token limit from within the logits processor.

Contributor Author

Initially I wanted to wrap the logits processor in another one that would not bias anything until we encounter the token, and would then call the logits processor it wraps. The problem is that this does not work for batching, as the different sequences may not all stop thinking at the same time.
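To illustrate the batching issue: whether thinking has ended has to be tracked per sequence, not once for the whole batch. Here is a rough sketch, assuming list-based logits and a hypothetical `constrain_row` callable that can bias one row at a time. Real logits processors may not support being called on a subset of rows, which is part of the difficulty.

```python
from typing import Callable, List

Row = List[float]


def finished_thinking(batch_ids: List[List[int]], end_thinking_token_id: int) -> List[bool]:
    """Per-sequence flag: has the end-of-thinking token been generated yet?"""
    return [end_thinking_token_id in ids for ids in batch_ids]


def bias_finished_rows(
    batch_ids: List[List[int]],
    batch_logits: List[Row],
    end_thinking_token_id: int,
    constrain_row: Callable[[List[int], Row], Row],  # hypothetical per-row processor
) -> List[Row]:
    """Apply the constraint only to rows whose sequence has stopped thinking."""
    done = finished_thinking(batch_ids, end_thinking_token_id)
    return [
        constrain_row(ids, row) if is_done else row
        for ids, row, is_done in zip(batch_ids, batch_logits, done)
    ]
```

With `batch_ids = [[5, 9, 2], [5, 7]]` and `end_thinking_token_id = 9`, only the first row is passed to `constrain_row`; the second is still thinking and keeps its raw logits.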
