Skip to content

feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683

Open
jeffreysijuntan wants to merge 9 commits into
terminal-rlfrom
feat/native-renderers
Open

feat(renderers): native renderer layer + DeepSeek-V4 rollout & cumulative-mode support#683
jeffreysijuntan wants to merge 9 commits into
terminal-rlfrom
feat/native-renderers

Conversation

@jeffreysijuntan

@jeffreysijuntan jeffreysijuntan commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What & why

Adds rllm.renderers — a single token-level renderer interface with a registry that routes each model to the strongest backend — and uses it to unblock DeepSeek-V4 (and other Fireworks-only models) for both rollout and gateway cumulative-token mode.

Motivated by two concrete problems:

  1. DeepSeek-V4 rollout crashed: its tokenizer ships no Jinja chat_template, so ChatTemplateParser raised ValueError: tokenizer.chat_template is not set in FireworksEngine.__init__.
  2. No drift-free multi-turn (cumulative token mode) for models prime-rl's renderers doesn't cover — only prime-rl had a bridge_to_next_turn.

Approach

rllm/renderers/ — native Renderer protocol (render_ids / parse_response / get_stop_token_ids / bridge_to_next_turn) + resolve() / select_backend() with RL-first precedence:
prime-rl exact-match (token bridge)tinker/Fireworks adapterDefaultRenderer. Backends import lazily (find_spec), so import rllm.renderers pulls no transformers/torch until first use.

Generic cross-turn bridge for any tinker-cookbook/Fireworks renderer. The next-turn prompt is the prior turn's sampled tokens kept verbatim (anchored at the turn-close token) followed by the new turn's framing. The delta is rendered via build_generation_prompt(sentinel + new_messages), split on the sentinel's N-th close token — not per-message render_message — so the renderer does its own turn-level preprocessing (merging consecutive tool results into one user turn, pairing them with the assistant's calls, role handling). This makes the bridge correct for tool turns, which terminal-rl needs. Validated byte-for-byte against prime-rl's hand-coded bridge for Qwen3/Qwen3.5 (incl. tool-result turns); DeepSeek-V4 merges consecutive tool results into a single user turn while preserving sampled history. Returns None (safe full-re-render fallback) on assistant-in-new-slice / truncation-without-close / multimodal / any renderer error.

FireworksEngine falls back to the tinker/Fireworks renderer (TinkerEngine's existing non-bypass render+parse path) when there's no chat template — fixing the DeepSeek-V4 crash. Chat-template models are unchanged.

Gateway stays dependency-free. rllm depends on rllm-model-gateway (so the gateway can't import rllm — circular), and the gateway is intentionally lightweight. So instead of putting tinker/Fireworks logic in the gateway, create_app(..., renderer=...) accepts an injected renderer; the in-process GatewayManager builds it via rllm.renderers.resolve() and injects it. The gateway gains zero new deps (still just renderers). The subprocess/verl path keeps the prime-rl-only build.

Files

  • rllm/renderers/ — protocol/types, _prime, _tinker (+ generic tool-aware bridge), registry, _fw_register, _common
  • rllm/engine/rollout/fireworks_engine.py — no-chat_template fallback
  • rllm/gateway/manager.py — build + inject renderer (in-process mode)
  • rllm-model-gateway/.../server.py, models.py — accept injected renderer; _renderer_has_bridge guard
  • Tests: tests/test_renderers.py, rllm-model-gateway/tests/unit/test_renderer_injection.py
  • RENDERER_MERGE_PLAN.md — design write-up

Commits

  1. feat(renderers) — native renderer layer + DeepSeek-V4 rollout & cumulative-mode support (injection wiring)
  2. fix(renderers) — correct tinker bridge delta for tool turns (per-message → build_generation_prompt-based) + tool-call parity tests
  3. docs(renderers) — note tool-call bridge validation in the merge plan

Testing

  • tests/test_renderers.py — 17 pass: routing precedence, apply_chat_template parity for prime + tinker, bridge prefix property, generic bridge == prime gold for Qwen3.5 (plain and tool-result turns), DeepSeek-V4 render + bridge with tool-result merging
  • gateway injection contract + existing cumulative-mode tests pass; tests/engine/test_tinker_engine.py (12) unchanged; ruff clean
  • Verified FireworksEngine constructs/renders/parses DeepSeek-V4 (no crash)

Failure-mode semantics (for reviewers)

The bridge contract is: return tokens that start byte-for-byte with prev_prompt + prev_completion, or None. On None, the gateway resets and re-runs the turn on the normal chat path; the training-side prefix-extension merger (_process_trajectory) then sees a non-extension and splits the trajectory into an extra row (safe, just less efficient — no corruption). The fix matters because the previous per-message delta did not return None for tool turns — it crashed on DeepSeek-V4 (render_message rejects raw tool role) and silently mis-grouped on Qwen multi-tool. With raise_on_error=false, a crash is absorbed into an error episode and filtered out, so a broken bridge silently yields ~zero usable training data rather than failing loudly. The new delta renders correctly, and wraps renderer calls in try/except so any future unrenderable case degrades to the graceful None split instead of crashing.

Remaining / notes

  • Parity-checked the bridge on DeepSeek-V4 and Qwen; the other tinker-served FW models (Gemma-4, Ministral-3, Kimi-K2.7-code) aren't parity-checked yet — the bridge returns None (full re-render fallback) on anything it can't render, so they won't corrupt, just won't get the cumulative optimization until verified.
  • tinker's Qwen renderer doesn't group multi-tool, but Qwen routes to prime-rl (which does), so only tinker-served models rely on the bridge's delta path — and DeepSeek's build_generation_prompt merges correctly.

🤖 Generated with Claude Code

jeffreysijuntan and others added 9 commits June 22, 2026 08:27
…tive-mode support

Add `rllm.renderers`: a single token-level renderer interface with a registry
that routes each model to the strongest backend, unifying model coverage for
training across prime-rl and the tinker/Fireworks renderers.

- rllm/renderers/: `Renderer` protocol + RenderedTokens/ParsedResponse/ToolSpec;
  `resolve()`/`select_backend()` with RL-first precedence (prime-rl exact-match →
  tinker/Fireworks adapter → DefaultRenderer). Lazy backend imports (find_spec)
  so importing the package pulls no transformers/torch until first use.
- Generic cross-turn bridge for tinker/Fireworks renderers, built from the same
  primitives `build_generation_prompt` uses (`render_message` +
  `_get_generation_suffix`). Validated byte-for-byte against prime-rl's
  hand-coded bridge for Qwen3/3.5; preserves verbatim sampled history (incl.
  thinking) where a full re-render would strip it. Returns None on
  assistant-in-new-slice / truncation-without-close / multimodal.
- FireworksEngine: fall back to the tinker/Fireworks renderer when the tokenizer
  has no Jinja chat_template (fixes the DeepSeek-V4 rollout crash in
  ChatTemplateParser). Chat-template models are unchanged.
- Gateway cumulative token mode stays dependency-free: `create_app` accepts an
  injected renderer; the in-process GatewayManager builds it via
  `rllm.renderers.resolve()` and injects it, so models prime-rl lacks
  (DeepSeek-V4, Gemma-4, Ministral-3) get a working bridge without the gateway
  importing tinker/Fireworks.
- Tests: tests/test_renderers.py (routing, parity, bridge) and
  rllm-model-gateway/tests/unit/test_renderer_injection.py (injection contract).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-message render_message was wrong for tool turns: DeepSeek-V4's render_message
rejects raw `tool` role (tool results must be merged into a user turn first), and
tinker's renderers don't reproduce turn-level tool merging/grouping when called
per message.

Render the new turn's delta via `build_generation_prompt(sentinel + new_messages)`
and take everything after the sentinel's N-th close token (N = close-token count
the sentinel emits, stable regardless of current-vs-historical assistant
rendering). This lets the renderer do its own turn-level preprocessing — merging
consecutive tool results into one user turn, pairing them with the assistant's
calls, role handling — so the bridge is correct for tool turns.

Validated: DeepSeek-V4 merges consecutive tool results into a single user turn
(verbatim history preserved); Qwen3.5 tool-result bridge equals prime-rl's
hand-coded bridge byte-for-byte. Adds tool-call parity tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The TokenAccumulator reset log conflated two distinct causes, making frequent
resets hard to diagnose. Now each reset records why:

- Prefix not append-only: `divergence()` reports the first message index that
  changed vs the verified prefix, plus its role (the agent mutated history,
  not just appended).
- History shrank: the conversation got shorter (summarization / unwind).
- Bridge returned None: logs the new-slice roles + content types (e.g. a
  structured/list content tripping the multimodal guard, or an unsupported
  role); distinguished from "no bridgeable new messages" (all-assistant slice).

Stores per-message fingerprints (rehash only on reset, off the hot path) to
locate the divergence index. Gateway-only; no rllm/tinker deps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… messages

In cumulative token mode the gateway rewrote chat/completions to /v1/completions
and dumped the raw completion text into message.content. For thinking / tool
models (e.g. DeepSeek-V4) that text is "reasoning</think>answer" or raw tool-call
markup, so turn-1+ responses had a different shape than the chat-path first turn
(which the inference server parses into content / reasoning_content / tool_calls)
— breaking the agent's action and tool-call parsing on every cumulative turn.

Parse the completion via the injected renderer's parse_response (the gateway
already has the renderer + completion_token_ids) into clean content +
reasoning_content + structured tool_calls, matching the chat endpoint's shape.
Falls back to raw text if the renderer can't parse it. Non-streaming path only;
the streaming cumulative path is unchanged (still raw).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A request byte-identical to the already-folded turn (len == prefix, all
messages match) was logged as the cryptic "non-cumulative message list
(len=N <= prefix N)". This case is specifically a duplicate/replayed request —
the agent re-sent an already-processed turn, typically a retried sampling call.
divergence() now returns a distinct "duplicate" kind and the reset log says so,
pointing at the agent-side error that triggered the retry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…resetting

A request byte-identical to the already-folded turn (a retried/replayed sampling
call) used to reset the TokenAccumulator and fall back to the chat path, breaking
the cumulative chain and splitting the trajectory into two training rows.

A duplicate is just an idempotent re-do of the same turn, so handle it that way:
re-sample from the turn's existing prompt (prev_prompt_ids), swap the completion
in place via resample_completion() (turn_count / message_count / prefix unchanged),
and overwrite that turn's trace by reusing its trace_id (both stores upsert by
trace_id). The cumulative chain is preserved and the trajectory keeps one row for
the turn. Non-streaming path; streaming duplicates still reset (documented).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-sample

The "Re-sampling duplicate request" log didn't say which session it was (one
stuck session vs many resending once) or *why* the agent retried. Add session_id
and the prior turn's sampling outcome (finish_reason + completion length, tracked
via record_outcome on every fold/resample) to the log:

  - completion=0          -> empty/failed response (deployment/infra)
  - finish_reason=length  -> still hitting max_tokens
  - finish_reason=stop, completion>0 -> a valid response the agent rejected
    (cause is agent-side, e.g. action parsing)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ueued

A task is consumed as one GRPO group, which prefix-merges into training rows;
previously only batch-level merge metrics were visible. Log per-task at INFO when
the group is finalized: number of groups, trajectories, steps (turns), and datums
(rows after prefix-merge), plus steps/datum. _segment_count() applies the same
prefix-extension check the backend transform uses, so the per-task datum count
matches what's trained and the merge ratio is visible per task, not just batch
aggregate (steps/datum == 1 is a clean cumulative trajectory; >1 a prefix break).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant