feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683
Open
jeffreysijuntan wants to merge 9 commits into
Open
feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683jeffreysijuntan wants to merge 9 commits into
jeffreysijuntan wants to merge 9 commits into
Conversation
…tive-mode support Add `rllm.renderers`: a single token-level renderer interface with a registry that routes each model to the strongest backend, unifying model coverage for training across prime-rl and the tinker/Fireworks renderers. - rllm/renderers/: `Renderer` protocol + RenderedTokens/ParsedResponse/ToolSpec; `resolve()`/`select_backend()` with RL-first precedence (prime-rl exact-match → tinker/Fireworks adapter → DefaultRenderer). Lazy backend imports (find_spec) so importing the package pulls no transformers/torch until first use. - Generic cross-turn bridge for tinker/Fireworks renderers, built from the same primitives `build_generation_prompt` uses (`render_message` + `_get_generation_suffix`). Validated byte-for-byte against prime-rl's hand-coded bridge for Qwen3/3.5; preserves verbatim sampled history (incl. thinking) where a full re-render would strip it. Returns None on assistant-in-new-slice / truncation-without-close / multimodal. - FireworksEngine: fall back to the tinker/Fireworks renderer when the tokenizer has no Jinja chat_template (fixes the DeepSeek-V4 rollout crash in ChatTemplateParser). Chat-template models are unchanged. - Gateway cumulative token mode stays dependency-free: `create_app` accepts an injected renderer; the in-process GatewayManager builds it via `rllm.renderers.resolve()` and injects it, so models prime-rl lacks (DeepSeek-V4, Gemma-4, Ministral-3) get a working bridge without the gateway importing tinker/Fireworks. - Tests: tests/test_renderers.py (routing, parity, bridge) and rllm-model-gateway/tests/unit/test_renderer_injection.py (injection contract). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-message render_message was wrong for tool turns: DeepSeek-V4's render_message rejects raw `tool` role (tool results must be merged into a user turn first), and tinker's renderers don't reproduce turn-level tool merging/grouping when called per message. Render the new turn's delta via `build_generation_prompt(sentinel + new_messages)` and take everything after the sentinel's N-th close token (N = close-token count the sentinel emits, stable regardless of current-vs-historical assistant rendering). This lets the renderer do its own turn-level preprocessing — merging consecutive tool results into one user turn, pairing them with the assistant's calls, role handling — so the bridge is correct for tool turns. Validated: DeepSeek-V4 merges consecutive tool results into a single user turn (verbatim history preserved); Qwen3.5 tool-result bridge equals prime-rl's hand-coded bridge byte-for-byte. Adds tool-call parity tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The TokenAccumulator reset log conflated two distinct causes, making frequent resets hard to diagnose. Now each reset records why: - Prefix not append-only: `divergence()` reports the first message index that changed vs the verified prefix, plus its role (the agent mutated history, not just appended). - History shrank: the conversation got shorter (summarization / unwind). - Bridge returned None: logs the new-slice roles + content types (e.g. a structured/list content tripping the multimodal guard, or an unsupported role); distinguished from "no bridgeable new messages" (all-assistant slice). Stores per-message fingerprints (rehash only on reset, off the hot path) to locate the divergence index. Gateway-only; no rllm/tinker deps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… messages In cumulative token mode the gateway rewrote chat/completions to /v1/completions and dumped the raw completion text into message.content. For thinking / tool models (e.g. DeepSeek-V4) that text is "reasoning</think>answer" or raw tool-call markup, so turn-1+ responses had a different shape than the chat-path first turn (which the inference server parses into content / reasoning_content / tool_calls) — breaking the agent's action and tool-call parsing on every cumulative turn. Parse the completion via the injected renderer's parse_response (the gateway already has the renderer + completion_token_ids) into clean content + reasoning_content + structured tool_calls, matching the chat endpoint's shape. Falls back to raw text if the renderer can't parse it. Non-streaming path only; the streaming cumulative path is unchanged (still raw). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A request byte-identical to the already-folded turn (len == prefix, all messages match) was logged as the cryptic "non-cumulative message list (len=N <= prefix N)". This case is specifically a duplicate/replayed request — the agent re-sent an already-processed turn, typically a retried sampling call. divergence() now returns a distinct "duplicate" kind and the reset log says so, pointing at the agent-side error that triggered the retry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…resetting A request byte-identical to the already-folded turn (a retried/replayed sampling call) used to reset the TokenAccumulator and fall back to the chat path, breaking the cumulative chain and splitting the trajectory into two training rows. A duplicate is just an idempotent re-do of the same turn, so handle it that way: re-sample from the turn's existing prompt (prev_prompt_ids), swap the completion in place via resample_completion() (turn_count / message_count / prefix unchanged), and overwrite that turn's trace by reusing its trace_id (both stores upsert by trace_id). The cumulative chain is preserved and the trajectory keeps one row for the turn. Non-streaming path; streaming duplicates still reset (documented). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-sample
The "Re-sampling duplicate request" log didn't say which session it was (one
stuck session vs many resending once) or *why* the agent retried. Add session_id
and the prior turn's sampling outcome (finish_reason + completion length, tracked
via record_outcome on every fold/resample) to the log:
- completion=0 -> empty/failed response (deployment/infra)
- finish_reason=length -> still hitting max_tokens
- finish_reason=stop, completion>0 -> a valid response the agent rejected
(cause is agent-side, e.g. action parsing)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ueued A task is consumed as one GRPO group, which prefix-merges into training rows; previously only batch-level merge metrics were visible. Log per-task at INFO when the group is finalized: number of groups, trajectories, steps (turns), and datums (rows after prefix-merge), plus steps/datum. _segment_count() applies the same prefix-extension check the backend transform uses, so the per-task datum count matches what's trained and the merge ratio is visible per task, not just batch aggregate (steps/datum == 1 is a clean cumulative trajectory; >1 a prefix break). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Adds
rllm.renderers— a single token-level renderer interface with a registry that routes each model to the strongest backend — and uses it to unblock DeepSeek-V4 (and other Fireworks-only models) for both rollout and gateway cumulative-token mode.Motivated by two concrete problems:
chat_template, soChatTemplateParserraisedValueError: tokenizer.chat_template is not setinFireworksEngine.__init__.renderersdoesn't cover — only prime-rl had abridge_to_next_turn.Approach
rllm/renderers/— nativeRendererprotocol (render_ids/parse_response/get_stop_token_ids/bridge_to_next_turn) +resolve()/select_backend()with RL-first precedence:prime-rl exact-match (token bridge)→tinker/Fireworks adapter→DefaultRenderer. Backends import lazily (find_spec), soimport rllm.rendererspulls no transformers/torch until first use.Generic cross-turn bridge for any tinker-cookbook/Fireworks renderer. The next-turn prompt is the prior turn's sampled tokens kept verbatim (anchored at the turn-close token) followed by the new turn's framing. The delta is rendered via
build_generation_prompt(sentinel + new_messages), split on the sentinel's N-th close token — not per-messagerender_message— so the renderer does its own turn-level preprocessing (merging consecutive tool results into one user turn, pairing them with the assistant's calls, role handling). This makes the bridge correct for tool turns, which terminal-rl needs. Validated byte-for-byte against prime-rl's hand-coded bridge for Qwen3/Qwen3.5 (incl. tool-result turns); DeepSeek-V4 merges consecutive tool results into a single user turn while preserving sampled history. ReturnsNone(safe full-re-render fallback) on assistant-in-new-slice / truncation-without-close / multimodal / any renderer error.FireworksEngine falls back to the tinker/Fireworks renderer (TinkerEngine's existing non-bypass render+parse path) when there's no chat template — fixing the DeepSeek-V4 crash. Chat-template models are unchanged.
Gateway stays dependency-free.
rllmdepends onrllm-model-gateway(so the gateway can't importrllm— circular), and the gateway is intentionally lightweight. So instead of putting tinker/Fireworks logic in the gateway,create_app(..., renderer=...)accepts an injected renderer; the in-processGatewayManagerbuilds it viarllm.renderers.resolve()and injects it. The gateway gains zero new deps (still justrenderers). The subprocess/verl path keeps the prime-rl-only build.Files
rllm/renderers/— protocol/types,_prime,_tinker(+ generic tool-aware bridge),registry,_fw_register,_commonrllm/engine/rollout/fireworks_engine.py— no-chat_template fallbackrllm/gateway/manager.py— build + inject renderer (in-process mode)rllm-model-gateway/.../server.py,models.py— accept injected renderer;_renderer_has_bridgeguardtests/test_renderers.py,rllm-model-gateway/tests/unit/test_renderer_injection.pyRENDERER_MERGE_PLAN.md— design write-upCommits
feat(renderers)— native renderer layer + DeepSeek-V4 rollout & cumulative-mode support (injection wiring)fix(renderers)— correct tinker bridge delta for tool turns (per-message →build_generation_prompt-based) + tool-call parity testsdocs(renderers)— note tool-call bridge validation in the merge planTesting
tests/test_renderers.py— 17 pass: routing precedence,apply_chat_templateparity for prime + tinker, bridge prefix property, generic bridge == prime gold for Qwen3.5 (plain and tool-result turns), DeepSeek-V4 render + bridge with tool-result mergingtests/engine/test_tinker_engine.py(12) unchanged; ruff cleanFireworksEngineconstructs/renders/parses DeepSeek-V4 (no crash)Failure-mode semantics (for reviewers)
The bridge contract is: return tokens that start byte-for-byte with
prev_prompt + prev_completion, orNone. OnNone, the gateway resets and re-runs the turn on the normal chat path; the training-side prefix-extension merger (_process_trajectory) then sees a non-extension and splits the trajectory into an extra row (safe, just less efficient — no corruption). The fix matters because the previous per-message delta did not returnNonefor tool turns — it crashed on DeepSeek-V4 (render_messagerejects rawtoolrole) and silently mis-grouped on Qwen multi-tool. Withraise_on_error=false, a crash is absorbed into an error episode and filtered out, so a broken bridge silently yields ~zero usable training data rather than failing loudly. The new delta renders correctly, and wraps renderer calls intry/exceptso any future unrenderable case degrades to the gracefulNonesplit instead of crashing.Remaining / notes
None(full re-render fallback) on anything it can't render, so they won't corrupt, just won't get the cumulative optimization until verified.build_generation_promptmerges correctly.🤖 Generated with Claude Code