feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support by jeffreysijuntan · Pull Request #683 · rllm-org/rllm

jeffreysijuntan · 2026-06-22T08:28:07Z

What & why

Adds rllm.renderers — a single token-level renderer interface with a registry that routes each model to the strongest backend — and uses it to unblock DeepSeek-V4 (and other Fireworks-only models) for both rollout and gateway cumulative-token mode.

Motivated by two concrete problems:

DeepSeek-V4 rollout crashed: its tokenizer ships no Jinja chat_template, so ChatTemplateParser raised ValueError: tokenizer.chat_template is not set in FireworksEngine.__init__.
No drift-free multi-turn (cumulative token mode) for models prime-rl's renderers doesn't cover — only prime-rl had a bridge_to_next_turn.

Approach

rllm/renderers/ — native Renderer protocol (render_ids / parse_response / get_stop_token_ids / bridge_to_next_turn) + resolve() / select_backend() with RL-first precedence:
prime-rl exact-match (token bridge) → tinker/Fireworks adapter → DefaultRenderer. Backends import lazily (find_spec), so import rllm.renderers pulls no transformers/torch until first use.

Generic cross-turn bridge for any tinker-cookbook/Fireworks renderer. The next-turn prompt is the prior turn's sampled tokens kept verbatim (anchored at the turn-close token) followed by the new turn's framing. The delta is rendered via build_generation_prompt(sentinel + new_messages), split on the sentinel's N-th close token — not per-message render_message — so the renderer does its own turn-level preprocessing (merging consecutive tool results into one user turn, pairing them with the assistant's calls, role handling). This makes the bridge correct for tool turns, which terminal-rl needs. Validated byte-for-byte against prime-rl's hand-coded bridge for Qwen3/Qwen3.5 (incl. tool-result turns); DeepSeek-V4 merges consecutive tool results into a single user turn while preserving sampled history. Returns None (safe full-re-render fallback) on assistant-in-new-slice / truncation-without-close / multimodal / any renderer error.

FireworksEngine falls back to the tinker/Fireworks renderer (TinkerEngine's existing non-bypass render+parse path) when there's no chat template — fixing the DeepSeek-V4 crash. Chat-template models are unchanged.

Gateway stays dependency-free. rllm depends on rllm-model-gateway (so the gateway can't import rllm — circular), and the gateway is intentionally lightweight. So instead of putting tinker/Fireworks logic in the gateway, create_app(..., renderer=...) accepts an injected renderer; the in-process GatewayManager builds it via rllm.renderers.resolve() and injects it. The gateway gains zero new deps (still just renderers). The subprocess/verl path keeps the prime-rl-only build.

Files

rllm/renderers/ — protocol/types, _prime, _tinker (+ generic tool-aware bridge), registry, _fw_register, _common
rllm/engine/rollout/fireworks_engine.py — no-chat_template fallback
rllm/gateway/manager.py — build + inject renderer (in-process mode)
rllm-model-gateway/.../server.py, models.py — accept injected renderer; _renderer_has_bridge guard
Tests: tests/test_renderers.py, rllm-model-gateway/tests/unit/test_renderer_injection.py
RENDERER_MERGE_PLAN.md — design write-up

Commits

feat(renderers) — native renderer layer + DeepSeek-V4 rollout & cumulative-mode support (injection wiring)
fix(renderers) — correct tinker bridge delta for tool turns (per-message → build_generation_prompt-based) + tool-call parity tests
docs(renderers) — note tool-call bridge validation in the merge plan

Testing

tests/test_renderers.py — 17 pass: routing precedence, apply_chat_template parity for prime + tinker, bridge prefix property, generic bridge == prime gold for Qwen3.5 (plain and tool-result turns), DeepSeek-V4 render + bridge with tool-result merging
gateway injection contract + existing cumulative-mode tests pass; tests/engine/test_tinker_engine.py (12) unchanged; ruff clean
Verified FireworksEngine constructs/renders/parses DeepSeek-V4 (no crash)

Failure-mode semantics (for reviewers)

The bridge contract is: return tokens that start byte-for-byte with prev_prompt + prev_completion, or None. On None, the gateway resets and re-runs the turn on the normal chat path; the training-side prefix-extension merger (_process_trajectory) then sees a non-extension and splits the trajectory into an extra row (safe, just less efficient — no corruption). The fix matters because the previous per-message delta did not return None for tool turns — it crashed on DeepSeek-V4 (render_message rejects raw tool role) and silently mis-grouped on Qwen multi-tool. With raise_on_error=false, a crash is absorbed into an error episode and filtered out, so a broken bridge silently yields ~zero usable training data rather than failing loudly. The new delta renders correctly, and wraps renderer calls in try/except so any future unrenderable case degrades to the graceful None split instead of crashing.

Remaining / notes

Parity-checked the bridge on DeepSeek-V4 and Qwen; the other tinker-served FW models (Gemma-4, Ministral-3, Kimi-K2.7-code) aren't parity-checked yet — the bridge returns None (full re-render fallback) on anything it can't render, so they won't corrupt, just won't get the cumulative optimization until verified.
tinker's Qwen renderer doesn't group multi-tool, but Qwen routes to prime-rl (which does), so only tinker-served models rely on the bridge's delta path — and DeepSeek's build_generation_prompt merges correctly.

🤖 Generated with Claude Code

…tive-mode support Add `rllm.renderers`: a single token-level renderer interface with a registry that routes each model to the strongest backend, unifying model coverage for training across prime-rl and the tinker/Fireworks renderers. - rllm/renderers/: `Renderer` protocol + RenderedTokens/ParsedResponse/ToolSpec; `resolve()`/`select_backend()` with RL-first precedence (prime-rl exact-match → tinker/Fireworks adapter → DefaultRenderer). Lazy backend imports (find_spec) so importing the package pulls no transformers/torch until first use. - Generic cross-turn bridge for tinker/Fireworks renderers, built from the same primitives `build_generation_prompt` uses (`render_message` + `_get_generation_suffix`). Validated byte-for-byte against prime-rl's hand-coded bridge for Qwen3/3.5; preserves verbatim sampled history (incl. thinking) where a full re-render would strip it. Returns None on assistant-in-new-slice / truncation-without-close / multimodal. - FireworksEngine: fall back to the tinker/Fireworks renderer when the tokenizer has no Jinja chat_template (fixes the DeepSeek-V4 rollout crash in ChatTemplateParser). Chat-template models are unchanged. - Gateway cumulative token mode stays dependency-free: `create_app` accepts an injected renderer; the in-process GatewayManager builds it via `rllm.renderers.resolve()` and injects it, so models prime-rl lacks (DeepSeek-V4, Gemma-4, Ministral-3) get a working bridge without the gateway importing tinker/Fireworks. - Tests: tests/test_renderers.py (routing, parity, bridge) and rllm-model-gateway/tests/unit/test_renderer_injection.py (injection contract). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per-message render_message was wrong for tool turns: DeepSeek-V4's render_message rejects raw `tool` role (tool results must be merged into a user turn first), and tinker's renderers don't reproduce turn-level tool merging/grouping when called per message. Render the new turn's delta via `build_generation_prompt(sentinel + new_messages)` and take everything after the sentinel's N-th close token (N = close-token count the sentinel emits, stable regardless of current-vs-historical assistant rendering). This lets the renderer do its own turn-level preprocessing — merging consecutive tool results into one user turn, pairing them with the assistant's calls, role handling — so the bridge is correct for tool turns. Validated: DeepSeek-V4 merges consecutive tool results into a single user turn (verbatim history preserved); Qwen3.5 tool-result bridge equals prime-rl's hand-coded bridge byte-for-byte. Adds tool-call parity tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The TokenAccumulator reset log conflated two distinct causes, making frequent resets hard to diagnose. Now each reset records why: - Prefix not append-only: `divergence()` reports the first message index that changed vs the verified prefix, plus its role (the agent mutated history, not just appended). - History shrank: the conversation got shorter (summarization / unwind). - Bridge returned None: logs the new-slice roles + content types (e.g. a structured/list content tripping the multimodal guard, or an unsupported role); distinguished from "no bridgeable new messages" (all-assistant slice). Stores per-message fingerprints (rehash only on reset, off the hot path) to locate the divergence index. Gateway-only; no rllm/tinker deps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… messages In cumulative token mode the gateway rewrote chat/completions to /v1/completions and dumped the raw completion text into message.content. For thinking / tool models (e.g. DeepSeek-V4) that text is "reasoning</think>answer" or raw tool-call markup, so turn-1+ responses had a different shape than the chat-path first turn (which the inference server parses into content / reasoning_content / tool_calls) — breaking the agent's action and tool-call parsing on every cumulative turn. Parse the completion via the injected renderer's parse_response (the gateway already has the renderer + completion_token_ids) into clean content + reasoning_content + structured tool_calls, matching the chat endpoint's shape. Falls back to raw text if the renderer can't parse it. Non-streaming path only; the streaming cumulative path is unchanged (still raw). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A request byte-identical to the already-folded turn (len == prefix, all messages match) was logged as the cryptic "non-cumulative message list (len=N <= prefix N)". This case is specifically a duplicate/replayed request — the agent re-sent an already-processed turn, typically a retried sampling call. divergence() now returns a distinct "duplicate" kind and the reset log says so, pointing at the agent-side error that triggered the retry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…resetting A request byte-identical to the already-folded turn (a retried/replayed sampling call) used to reset the TokenAccumulator and fall back to the chat path, breaking the cumulative chain and splitting the trajectory into two training rows. A duplicate is just an idempotent re-do of the same turn, so handle it that way: re-sample from the turn's existing prompt (prev_prompt_ids), swap the completion in place via resample_completion() (turn_count / message_count / prefix unchanged), and overwrite that turn's trace by reusing its trace_id (both stores upsert by trace_id). The cumulative chain is preserved and the trajectory keeps one row for the turn. Non-streaming path; streaming duplicates still reset (documented). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e-sample The "Re-sampling duplicate request" log didn't say which session it was (one stuck session vs many resending once) or *why* the agent retried. Add session_id and the prior turn's sampling outcome (finish_reason + completion length, tracked via record_outcome on every fold/resample) to the log: - completion=0 -> empty/failed response (deployment/infra) - finish_reason=length -> still hitting max_tokens - finish_reason=stop, completion>0 -> a valid response the agent rejected (cause is agent-side, e.g. action parsing) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ueued A task is consumed as one GRPO group, which prefix-merges into training rows; previously only batch-level merge metrics were visible. Log per-task at INFO when the group is finalized: number of groups, trajectories, steps (turns), and datums (rows after prefix-merge), plus steps/datum. _segment_count() applies the same prefix-extension check the backend transform uses, so the per-task datum count matches what's trained and the merge ratio is visible per task, not just batch aggregate (steps/datum == 1 is a clean cumulative trajectory; >1 a prefix break). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jeffreysijuntan and others added 9 commits June 22, 2026 08:27

docs(renderers): note tool-call bridge validation in merge plan

6a88e2b

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683

feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683
jeffreysijuntan wants to merge 9 commits into
terminal-rlfrom
feat/native-renderers

jeffreysijuntan commented Jun 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jeffreysijuntan commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Approach

Files

Commits

Testing

Failure-mode semantics (for reviewers)

Remaining / notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jeffreysijuntan commented Jun 22, 2026 •

edited

Loading