feat(cli): eval → curate → SFT loop + unified SFT trainer (tinker/fireworks) by jeffreysijuntan · Pull Request #673 · rllm-org/rllm

jeffreysijuntan · 2026-06-20T22:00:29Z

Summary

Adds the eval → curate → SFT loop to the rLLM CLI: run an eval with k samples per task, curate the trajectories by aggregate metrics into an SFT dataset, and fine-tune on it — all from the CLI. Also refactors SFT around a backend-agnostic dispatcher with pluggable backends (tinker + fireworks; verl deferred).

rllm eval math500 --model <m> --attempts 8
rllm dataset from-eval <run_id> --name math500-rft --filter "0 < avg < 1" --select correct
rllm sft math500-rft --backend fireworks --epochs 3

Design doc: design/sft-distillation.md.

What's included

Curation (rllm dataset from-eval)

rllm/eval/curation.py — load eval run dirs, pool attempts by stable task_id across runs, filter tasks, select trajectories (correct/best/best-n/shortest/all), lazy-load only chosen episodes, emit {"messages": ...} rows with provenance, register via DatasetRegistry.
rllm/eval/filter_dsl.py — safe per-task boolean DSL over aggregates (avg, best, worst, solved, n, n_correct, and budget-aware pass@k via a name@k rewrite + AST node-whitelist). avg@k is treated as k-invariant.
--dry-run, task-level train/val holdout.

Unified SFT trainer

rllm/trainer/sft/ — SFTSpec (backend-agnostic input), SFTBackend ABC (each backend owns its fit()), and AgentSFTTrainer as the dispatcher (mirrors the RL stack's AgentTrainer/launcher seam).
TinkerSFTBackend (migrated loop) and FireworksSFTBackend(TinkerSFTBackend) — Fireworks provisions through the training-shape path (init_fireworks_infra("sft", …) with a fireworks_infra doc), shares the tinker-cookbook data pipeline, and runs a synchronous pipelined loop.
rllm sft CLI (registered in main.py); panel shows the backend-resolved model.
Clean break: removed the old ad-hoc AgentSFTTrainer._train_verl/_train_tinker and the deprecated Tinker SFT trainer/dataset; kept the AgentSFTTrainer name.

Verified

Fireworks SFT confirmed end-to-end against the live service (Qwen3.5-9B LoRA): provisions via the shape path, trains real steps (forward_backward/optim_step, loss from result.metrics), saves DCP checkpoints, tears down the trainer on exit.
Curation engine, filter DSL, dispatcher, and CLI: 74 unit tests (GPU-free; the Fireworks provision doc is parsed offline as a regression guard). ruff clean.

Deferred (follow-ups)

VerlSFTBackend + torchrun launcher (--backend verl currently returns a clear "not wired yet").
Fireworks trainer/checkpoint reuse (skip the ~7 min re-provision via --keep-trainer/--fireworks-job-id; cross-run resume).
Default Fireworks model is Qwen3.5-9B because Fireworks ships no 3.5-4B shape; tinker default stays Qwen3.5-4B.

🤖 Generated with Claude Code

Add the eval -> curate -> SFT loop's curation half (design doc + milestones 1-2). Milestone 3 (unified SFT trainer) follows on this branch. - rllm/eval/filter_dsl.py: safe per-task boolean DSL over aggregate metrics (avg, best, worst, solved, n, n_correct, and budget-aware pass@k via a name@k rewrite + AST node-whitelist). avg@k is treated as k-invariant. - rllm/eval/curation.py: curate() loads eval run dirs, pools attempts by stable task_id across runs, filters tasks, selects trajectories (correct/best/best-n/shortest/all), lazy-loads only chosen episodes, and emits {"messages": ...} rows with provenance. - rllm dataset from-eval: thin CLI over curate() with --dry-run and a task-level train/val holdout; registers the result via DatasetRegistry. - design/sft-distillation.md: full design (curation engine + unified SFT trainer mirroring the RL dispatcher/launcher seam). - tests: 42 for the engine/DSL, 8 for the CLI command. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Rewrite SFT around a backend-agnostic spec + dispatcher, mirroring the RL stack's AgentTrainer/launcher seam. Clean break: the old ad-hoc AgentSFTTrainer._train_verl/_train_tinker and the deprecated Tinker SFT trainer/dataset are removed; the new dispatcher keeps the AgentSFTTrainer name. - rllm/trainer/sft/spec.py: SFTSpec (backend-agnostic; the only input). - rllm/trainer/sft/backend.py: SFTBackend ABC (each backend owns its fit()), SFTConfigError, validate_messages_dataset. - rllm/trainer/sft/tinker_backend.py + tinker_dataset.py + config/tinker.yaml: TinkerSFTBackend with the migrated tinker SFT loop; heavy imports lazy so the dispatcher/CLI import without the tinker stack. - rllm/trainer/agent_sft_trainer.py: AgentSFTTrainer is now the dispatcher (SFTSpec + backend). tinker works; verl/fireworks report "milestone 4". - rllm/cli/sft.py: `rllm sft` speaks SFTSpec; registered in main.py. - Remove deprecated/tinker_sft_{trainer,dataset}.py and their re-export shims (deprecated/__init__, tinker/__init__); update the archive example. - tests: dispatcher dispatch + tinker build_config/validate (14), CLI resolution/dispatch (4). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add FireworksSFTBackend, the second managed SFT backend, mirroring how the RL stack's FireworksBackend extends TinkerBackend. - rllm/trainer/sft/fireworks_backend.py: FireworksSFTBackend(TinkerSFTBackend) reuses validate_spec/build_config/prepare_data and the shared data pipeline; overrides fit() with a synchronous pipelined loop over Fireworks' SDK-managed client (build_service_client -> create_training_client -> ReconnectableClient -> TrainingCheckpoints). requires_distributed=False (hosted, like tinker). Requires FIREWORKS_API_KEY; SDK imports deferred to fit(). - rllm/trainer/sft/tinker_backend.py: extract build_sft_data() + a _config_template() hook so tinker and fireworks share the tinker-cookbook renderer/dataset pipeline and the spec->config mapping. - rllm/trainer/sft/config/fireworks.yaml: native template. - rllm/trainer/agent_sft_trainer.py: dispatch 'fireworks' (now implemented; only verl remains planned). - Default managed-SFT model is now Qwen/Qwen3.5-4B (both backends + SFTSpec/CLI). - tests: fireworks dispatch + build_config/validation + default-model (7 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Fireworks SFT backend now provisions exactly like the RL FireworksBackend: parse a `fireworks_infra` document with `trainers.policy.training_shape_id` and call `init_fireworks_infra("sft", ...)`. Because the doc names a training shape, the SDK takes the training-shape path and never falls back to the manual-infra path that force-sends `skipValidations=true` (which standard accounts can't do). The previous `build_service_client` + `create_training_client` route had no shape, hit the manual path, and failed with HTTP 400 "Only superuser can skip validations" — a symptom of the wrong provisioning route, not an account limit (RL on the same account works). - config/fireworks.yaml: FW model path + HF tokenizer + training shape (qwen3p5-4b, following swe-rl's qwen3p5-9b pattern) + a fireworks_infra provision doc (common/trainers.policy/recipe.sft). - fireworks_backend: build provision doc -> load_yaml_provision("sft") -> init_fireworks_infra("sft"); loop over infra.policy; checkpoint via TrainingCheckpoints; infra.close() on exit. build_config keeps the FW model path unless --model is itself a FW path. - build_sft_data: tokenize from model.tokenizer_model when set (FW model.name is a non-HF path). - tests: offline provision-doc parse guard + FW model/override assertions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Fireworks' catalog has no 3.5-4B training shape, so default the Fireworks SFT backend to the 9B identifiers: accounts/fireworks/models/qwen3p5-9b + Qwen/Qwen3.5-9B + accounts/fireworks/trainingShapes/qwen3p5-9b-256k. (Tinker SFT keeps Qwen/Qwen3.5-4B.) Update the provision-doc parse test accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-lora) The shape path now resolves (no more skipValidations), but the non-`-lora` shape has no LoRA-validated version, so LORA_TRAINER creation 400s with "no validated training shape exists". Switch the default to the `-lora` shape (matches swe-rl's RL recipe, which trains LoRA rank 32 on the same model). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

len(TinkerSFTDataset) floors examples//batch_size, so a dataset smaller than one batch yielded 0 batches and the SFT loop ran vacuously (exit 0, nothing trained). Clamp n_batches to >=1 in both tinker and fireworks backends so the final partial batch is trained. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Fireworks' forward_backward(loss_fn="cross_entropy") returns aggregate metrics (loss:sum / response_tokens), not the per-token loss_fn_outputs["logprobs"] the tinker SDK exposes. The copied-from-tinker logprobs path raised KeyError on the first step. Compute train/val loss from result.metrics, matching the cookbook's own sft_loop collect. (Tinker backend keeps logprobs — that's correct for it.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The panel printed the raw --model default (e.g. Qwen/Qwen3.5-4B) even when the Fireworks backend resolves to a different FW model path + HF tokenizer. Add AgentSFTTrainer.prepare() (build/configure the backend locally, cached, no provisioning) and a backend.config property; the CLI renders the panel from the resolved config (model name + tokenizer row) and reuses the same backend for train(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jeffreysijuntan and others added 9 commits June 20, 2026 22:03

jeffreysijuntan force-pushed the feat/cli-sft-distillation branch from 42cb172 to 42bb4aa Compare June 20, 2026 22:03

jeffreysijuntan changed the base branch from feat/echo-algorithm to main June 20, 2026 22:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(cli): eval → curate → SFT loop + unified SFT trainer (tinker/fireworks)#673

feat(cli): eval → curate → SFT loop + unified SFT trainer (tinker/fireworks)#673
jeffreysijuntan wants to merge 9 commits into
mainfrom
feat/cli-sft-distillation

jeffreysijuntan commented Jun 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jeffreysijuntan commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Verified

Deferred (follow-ups)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jeffreysijuntan commented Jun 20, 2026 •

edited

Loading