
feat(cli): eval → curate → SFT loop + unified SFT trainer (tinker/fireworks) #673

Open
jeffreysijuntan wants to merge 9 commits into main from feat/cli-sft-distillation

Conversation

@jeffreysijuntan

@jeffreysijuntan jeffreysijuntan commented Jun 20, 2026

Contributor

Summary

Adds the eval → curate → SFT loop to the rLLM CLI: run an eval with k samples per task, curate the trajectories by aggregate metrics into an SFT dataset, and fine-tune on it — all from the CLI. Also refactors SFT around a backend-agnostic dispatcher with pluggable backends (tinker + fireworks; verl deferred).

rllm eval math500 --model <m> --attempts 8
rllm dataset from-eval <run_id> --name math500-rft --filter "0 < avg < 1" --select correct
rllm sft math500-rft --backend fireworks --epochs 3

Design doc: design/sft-distillation.md.

What's included

Curation (rllm dataset from-eval)

  • rllm/eval/curation.py — load eval run dirs, pool attempts by stable task_id across runs, filter tasks, select trajectories (correct/best/best-n/shortest/all), lazy-load only chosen episodes, emit {"messages": ...} rows with provenance, register via DatasetRegistry.
  • rllm/eval/filter_dsl.py — safe per-task boolean DSL over aggregates (avg, best, worst, solved, n, n_correct, and budget-aware pass@k via a name@k rewrite + AST node-whitelist). avg@k is treated as k-invariant.
  • --dry-run, task-level train/val holdout.
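A minimal sketch of how a filter such as "0 < avg < 1" could be evaluated safely, assuming the approach described above (a name@k rewrite followed by an AST node whitelist). The function names, the `pass_at_4` identifier form, and the exact whitelist are illustrative assumptions, not the contents of filter_dsl.py.

```python
import ast
import re

# Hypothetical whitelist: boolean logic, comparisons, names, numeric constants.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)

_AT_K = re.compile(r"\b([A-Za-z_]\w*)@(\d+)\b")


def rewrite_at_k(expr: str) -> str:
    """Turn `name@k` into a plain identifier so ast.parse accepts it.

    `avg@k` collapses to `avg`, mirroring the k-invariance noted above.
    """
    def _sub(match: re.Match) -> str:
        name, k = match.group(1), match.group(2)
        return "avg" if name == "avg" else f"{name}_at_{k}"

    return _AT_K.sub(_sub, expr)


def eval_filter(expr: str, aggregates: dict) -> bool:
    """Evaluate a per-task boolean filter against that task's aggregates."""
    tree = ast.parse(rewrite_at_k(expr), mode="eval")
    for node in ast.walk(tree):
        # Reject anything outside the whitelist (calls, attributes, subscripts...).
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in aggregates:
            raise ValueError(f"unknown metric: {node.id}")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
    code = compile(tree, "<filter>", "eval")
    # No builtins are exposed; only the aggregate names are visible.
    return bool(eval(code, {"__builtins__": {}}, dict(aggregates)))
```

In this sketch the caller supplies precomputed aggregates, for example `eval_filter("0 < avg < 1", {"avg": 0.5})`; how pass@k is derived from the attempt budget is left out.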

Unified SFT trainer

  • rllm/trainer/sft/SFTSpec (backend-agnostic input), SFTBackend ABC (each backend owns its fit()), and AgentSFTTrainer as the dispatcher (mirrors the RL stack's AgentTrainer/launcher seam).
  • TinkerSFTBackend (migrated loop) and FireworksSFTBackend(TinkerSFTBackend) — Fireworks provisions through the training-shape path (init_fireworks_infra("sft", …) with a fireworks_infra doc), shares the tinker-cookbook data pipeline, and runs a synchronous pipelined loop.
  • rllm sft CLI (registered in main.py); panel shows the backend-resolved model.
  • Clean break: removed the old ad-hoc AgentSFTTrainer._train_verl/_train_tinker and the deprecated Tinker SFT trainer/dataset; kept the AgentSFTTrainer name.
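The spec/backend/dispatcher seam described above can be sketched as follows. These are simplified stand-ins that reuse the PR's class names for readability; the fields on SFTSpec, the registry dict, and the EchoBackend are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class SFTSpec:
    """Backend-agnostic description of one SFT run (illustrative fields)."""
    dataset: str
    model: Optional[str] = None
    epochs: int = 1


class SFTBackend(ABC):
    """Each backend owns its own training loop behind fit()."""

    @abstractmethod
    def fit(self, spec: SFTSpec) -> dict:
        ...


class EchoBackend(SFTBackend):
    """Hypothetical backend that trains nothing; it only echoes the spec."""

    def fit(self, spec: SFTSpec) -> dict:
        return {"backend": "echo", "dataset": spec.dataset, "epochs": spec.epochs}


# Hypothetical registry; real backends would be registered under their own names.
_BACKENDS = {"echo": EchoBackend}


class AgentSFTTrainer:
    """Dispatcher: resolves a backend name and hands it the spec."""

    def __init__(self, spec: SFTSpec, backend: str):
        if backend not in _BACKENDS:
            raise ValueError(f"unknown SFT backend: {backend!r}")
        self.spec = spec
        self.backend = _BACKENDS[backend]()

    def train(self) -> dict:
        return self.backend.fit(self.spec)
```

The point of the design is that the dispatcher never branches on backend internals: adding a backend means adding one class and one registry entry.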

Verified

  • Fireworks SFT confirmed end-to-end against the live service (Qwen3.5-9B LoRA): provisions via the shape path, trains real steps (forward_backward/optim_step, loss from result.metrics), saves DCP checkpoints, tears down the trainer on exit.
  • Curation engine, filter DSL, dispatcher, and CLI: 74 unit tests (GPU-free; the Fireworks provision doc is parsed offline as a regression guard). ruff clean.

Deferred (follow-ups)

  • VerlSFTBackend + torchrun launcher (--backend verl currently reports a clear "not wired yet" message).
  • Fireworks trainer/checkpoint reuse (skip the ~7 min re-provision via --keep-trainer/--fireworks-job-id; cross-run resume).
  • Default Fireworks model is Qwen3.5-9B because Fireworks ships no 3.5-4B shape; tinker default stays Qwen3.5-4B.

🤖 Generated with Claude Code

jeffreysijuntan and others added 9 commits June 20, 2026 22:03
Add the eval -> curate -> SFT loop's curation half (design doc +
milestones 1-2). Milestone 3 (unified SFT trainer) follows on this branch.

- rllm/eval/filter_dsl.py: safe per-task boolean DSL over aggregate metrics
  (avg, best, worst, solved, n, n_correct, and budget-aware pass@k via a
  name@k rewrite + AST node-whitelist). avg@k is treated as k-invariant.
- rllm/eval/curation.py: curate() loads eval run dirs, pools attempts by
  stable task_id across runs, filters tasks, selects trajectories
  (correct/best/best-n/shortest/all), lazy-loads only chosen episodes, and
  emits {"messages": ...} rows with provenance.
- rllm dataset from-eval: thin CLI over curate() with --dry-run and a
  task-level train/val holdout; registers the result via DatasetRegistry.
- design/sft-distillation.md: full design (curation engine + unified SFT
  trainer mirroring the RL dispatcher/launcher seam).
- tests: 42 for the engine/DSL, 8 for the CLI command.
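For readers skimming the commit, a rough sketch of the pool → select → emit flow. The attempt fields (`task_id`, `run_id`, `is_correct`, `reward`, `n_tokens`, `messages`), the provenance layout, and the reading of "shortest" as "shortest correct attempt" are all assumptions; curate() may differ on each of these.

```python
from collections import defaultdict


def pool_by_task(runs):
    """Group attempts from several eval runs under their stable task_id."""
    pooled = defaultdict(list)
    for run in runs:
        for attempt in run:
            pooled[attempt["task_id"]].append(attempt)
    return dict(pooled)


def select(attempts, strategy="correct", n=1):
    """Pick which attempts of one task become SFT rows."""
    if strategy == "all":
        return list(attempts)
    correct = [a for a in attempts if a["is_correct"]]
    if strategy == "correct":
        return correct
    if strategy == "shortest":
        # Assumption: shortest among the correct attempts.
        return sorted(correct, key=lambda a: a["n_tokens"])[:1]
    ranked = sorted(attempts, key=lambda a: a["reward"], reverse=True)
    if strategy == "best":
        return ranked[:1]
    if strategy == "best-n":
        return ranked[:n]
    raise ValueError(f"unknown selection strategy: {strategy!r}")


def to_rows(pooled, strategy="correct", n=1):
    """Emit {"messages": ...} rows, each carrying where it came from."""
    rows = []
    for task_id, attempts in pooled.items():
        for a in select(attempts, strategy, n):
            rows.append({
                "messages": a["messages"],
                "provenance": {"task_id": task_id, "run_id": a["run_id"]},
            })
    return rows
```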

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrite SFT around a backend-agnostic spec + dispatcher, mirroring the RL
stack's AgentTrainer/launcher seam. Clean break: the old ad-hoc
AgentSFTTrainer._train_verl/_train_tinker and the deprecated Tinker SFT
trainer/dataset are removed; the new dispatcher keeps the AgentSFTTrainer name.

- rllm/trainer/sft/spec.py: SFTSpec (backend-agnostic; the only input).
- rllm/trainer/sft/backend.py: SFTBackend ABC (each backend owns its fit()),
  SFTConfigError, validate_messages_dataset.
- rllm/trainer/sft/tinker_backend.py + tinker_dataset.py + config/tinker.yaml:
  TinkerSFTBackend with the migrated tinker SFT loop; heavy imports lazy so the
  dispatcher/CLI import without the tinker stack.
- rllm/trainer/agent_sft_trainer.py: AgentSFTTrainer is now the dispatcher
  (SFTSpec + backend). tinker works; verl/fireworks report "milestone 4".
- rllm/cli/sft.py: `rllm sft` speaks SFTSpec; registered in main.py.
- Remove deprecated/tinker_sft_{trainer,dataset}.py and their re-export shims
  (deprecated/__init__, tinker/__init__); update the archive example.
- tests: dispatcher dispatch + tinker build_config/validate (14), CLI
  resolution/dispatch (4).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add FireworksSFTBackend, the second managed SFT backend, mirroring how the RL
stack's FireworksBackend extends TinkerBackend.

- rllm/trainer/sft/fireworks_backend.py: FireworksSFTBackend(TinkerSFTBackend)
  reuses validate_spec/build_config/prepare_data and the shared data pipeline;
  overrides fit() with a synchronous pipelined loop over Fireworks' SDK-managed
  client (build_service_client -> create_training_client -> ReconnectableClient
  -> TrainingCheckpoints). requires_distributed=False (hosted, like tinker).
  Requires FIREWORKS_API_KEY; SDK imports deferred to fit().
- rllm/trainer/sft/tinker_backend.py: extract build_sft_data() + a
  _config_template() hook so tinker and fireworks share the tinker-cookbook
  renderer/dataset pipeline and the spec->config mapping.
- rllm/trainer/sft/config/fireworks.yaml: native template.
- rllm/trainer/agent_sft_trainer.py: dispatch 'fireworks' (now implemented;
  only verl remains planned).
- Default managed-SFT model is now Qwen/Qwen3.5-4B (both backends + SFTSpec/CLI).
- tests: fireworks dispatch + build_config/validation + default-model (7 new).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Fireworks SFT backend now provisions exactly like the RL FireworksBackend:
parse a `fireworks_infra` document with `trainers.policy.training_shape_id` and
call `init_fireworks_infra("sft", ...)`. Because the doc names a training shape,
the SDK takes the training-shape path and never falls back to the manual-infra
path that force-sends `skipValidations=true` (which standard accounts can't do).

The previous `build_service_client` + `create_training_client` route had no
shape, hit the manual path, and failed with HTTP 400 "Only superuser can skip
validations" — a symptom of the wrong provisioning route, not an account limit
(RL on the same account works).

- config/fireworks.yaml: FW model path + HF tokenizer + training shape
  (qwen3p5-4b, following swe-rl's qwen3p5-9b pattern) + a fireworks_infra
  provision doc (common/trainers.policy/recipe.sft).
- fireworks_backend: build provision doc -> load_yaml_provision("sft") ->
  init_fireworks_infra("sft"); loop over infra.policy; checkpoint via
  TrainingCheckpoints; infra.close() on exit. build_config keeps the FW model
  path unless --model is itself a FW path.
- build_sft_data: tokenize from model.tokenizer_model when set (FW model.name
  is a non-HF path).
- tests: offline provision-doc parse guard + FW model/override assertions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fireworks' catalog has no 3.5-4B training shape, so default the Fireworks SFT
backend to the 9B identifiers: accounts/fireworks/models/qwen3p5-9b +
Qwen/Qwen3.5-9B + accounts/fireworks/trainingShapes/qwen3p5-9b-256k. (Tinker SFT
keeps Qwen/Qwen3.5-4B.) Update the provision-doc parse test accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-lora)

The shape path now resolves (no more skipValidations), but the non-`-lora`
shape has no LoRA-validated version, so LORA_TRAINER creation 400s with
"no validated training shape exists". Switch the default to the `-lora` shape
(matches swe-rl's RL recipe, which trains LoRA rank 32 on the same model).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
len(TinkerSFTDataset) floors examples//batch_size, so a dataset smaller than
one batch yielded 0 batches and the SFT loop ran vacuously (exit 0, nothing
trained). Clamp n_batches to >=1 in both tinker and fireworks backends so the
final partial batch is trained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fireworks' forward_backward(loss_fn="cross_entropy") returns aggregate metrics
(loss:sum / response_tokens), not the per-token loss_fn_outputs["logprobs"] the
tinker SDK exposes. The copied-from-tinker logprobs path raised KeyError on the
first step. Compute train/val loss from result.metrics, matching the cookbook's
own sft_loop collect. (Tinker backend keeps logprobs — that's correct for it.)
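A sketch of the metrics-based loss computation, assuming the metrics mapping exposes the two aggregates named above under the keys `loss:sum` and `response_tokens`. The exact key names and the shape of result.metrics are assumptions here and should be checked against the SDK.

```python
def mean_token_loss(metrics: dict) -> float:
    """Per-token loss from aggregate metrics: summed loss over response tokens."""
    tokens = metrics.get("response_tokens", 0)
    if not tokens:
        return float("nan")  # no response tokens, so the mean is undefined
    return metrics["loss:sum"] / tokens


def mean_loss_over_steps(step_metrics: list) -> float:
    """Token-weighted mean across steps (sum the sums, then divide once)."""
    total_loss = sum(m["loss:sum"] for m in step_metrics)
    total_tokens = sum(m["response_tokens"] for m in step_metrics)
    return total_loss / total_tokens if total_tokens else float("nan")
```

Dividing once over the summed totals keeps steps with more response tokens from being under-weighted, which a plain average of per-step means would do.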

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The panel printed the raw --model default (e.g. Qwen/Qwen3.5-4B) even when the
Fireworks backend resolves to a different FW model path + HF tokenizer. Add
AgentSFTTrainer.prepare() (build/configure the backend locally, cached, no
provisioning) and a backend.config property; the CLI renders the panel from the
resolved config (model name + tokenizer row) and reuses the same backend for
train().
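The prepare-then-train pattern can be illustrated with a toy dispatcher. Class names and the `config` keys are hypothetical; only the idea (build the backend once, locally, and reuse it for both the panel and training) comes from the commit.

```python
class FakeBackend:
    """Hypothetical backend: config is resolved locally, nothing is provisioned."""

    built = 0  # counts constructions so the caching is observable

    def __init__(self):
        FakeBackend.built += 1
        self.config = {
            "model": "accounts/fireworks/models/qwen3p5-9b",
            "tokenizer": "Qwen/Qwen3.5-9B",
        }

    def fit(self) -> str:
        return f"trained {self.config['model']}"


class CachedDispatcher:
    """prepare() builds the backend once; train() reuses the same instance."""

    def __init__(self, backend_factory):
        self._factory = backend_factory
        self._backend = None

    def prepare(self):
        if self._backend is None:
            self._backend = self._factory()
        return self._backend

    def train(self) -> str:
        return self.prepare().fit()
```

A CLI following this pattern would call `prepare()` first, render the panel from `backend.config`, and then call `train()` on the same object.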

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jeffreysijuntan jeffreysijuntan force-pushed the feat/cli-sft-distillation branch from 42cb172 to 42bb4aa June 20, 2026 22:03
@jeffreysijuntan jeffreysijuntan changed the base branch from feat/echo-algorithm to main June 20, 2026 22:03

1 participant