feat(cli): eval → curate → SFT loop + unified SFT trainer (tinker/fireworks) #673
Open
jeffreysijuntan wants to merge 9 commits into
Add the eval -> curate -> SFT loop's curation half (design doc +
milestones 1-2). Milestone 3 (unified SFT trainer) follows on this branch.
- rllm/eval/filter_dsl.py: safe per-task boolean DSL over aggregate metrics
(avg, best, worst, solved, n, n_correct, and budget-aware pass@k via a
name@k rewrite + AST node-whitelist). avg@k is treated as k-invariant.
- rllm/eval/curation.py: curate() loads eval run dirs, pools attempts by
stable task_id across runs, filters tasks, selects trajectories
(correct/best/best-n/shortest/all), lazy-loads only chosen episodes, and
emits {"messages": ...} rows with provenance.
- rllm dataset from-eval: thin CLI over curate() with --dry-run and a
task-level train/val holdout; registers the result via DatasetRegistry.
- design/sft-distillation.md: full design (curation engine + unified SFT
trainer mirroring the RL dispatcher/launcher seam).
- tests: 42 for the engine/DSL, 8 for the CLI command.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
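A minimal sketch of the `name@k` rewrite plus AST node-whitelist idea described in this commit, not the code in `rllm/eval/filter_dsl.py`. The function names, the helper names `pass_at`/`avg_at`, the "score >= 1.0 means correct" rule, the choice of the standard unbiased pass@k estimator, and the reading of "budget-aware" as "reject k larger than the number of attempts" are all assumptions made for illustration.

```python
import ast
import re
from math import comb

# Only these node types may appear in a filter expression (assumed whitelist).
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant, ast.Call,
)
_CALLABLE_NAMES = {"pass_at", "avg_at"}
# "pass@4" is not valid Python, so it is rewritten to "pass_at(4)" before parsing.
_AT_PATTERN = re.compile(r"\b([A-Za-z_]+)@(\d+)\b")


def pass_at_k(n, c, k):
    """Unbiased pass@k from n attempts with c correct (assumed estimator)."""
    if k > n:
        raise ValueError(f"pass@{k} needs at least {k} attempts, got {n}")
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def evaluate_filter(expr, scores):
    """Evaluate a boolean filter over one task's pooled attempt scores."""
    if not scores:
        raise ValueError("a task needs at least one attempt")
    n = len(scores)
    n_correct = sum(1 for s in scores if s >= 1.0)
    avg = sum(scores) / n
    env = {
        "avg": avg, "best": max(scores), "worst": min(scores),
        "solved": n_correct > 0, "n": n, "n_correct": n_correct,
        "pass_at": lambda k: pass_at_k(n, n_correct, k),
        "avg_at": lambda k: avg,  # avg@k is k-invariant
    }
    tree = ast.parse(_AT_PATTERN.sub(r"\1_at(\2)", expr), mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call) and not (
            isinstance(node.func, ast.Name) and node.func.id in _CALLABLE_NAMES
        ):
            raise ValueError("only name@k helpers may be called")
        if isinstance(node, ast.Name) and node.id not in env:
            raise ValueError(f"unknown name: {node.id}")
    code = compile(tree, "<filter>", "eval")
    return bool(eval(code, {"__builtins__": {}}, env))
```

With scores `[1.0, 0.0, 0.0, 1.0]`, the expression `pass@2 >= 0.5 and avg < 1` keeps the task, while anything outside the whitelist (attribute access, arbitrary calls) is rejected before evaluation.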
Rewrite SFT around a backend-agnostic spec + dispatcher, mirroring the RL
stack's AgentTrainer/launcher seam. Clean break: the old ad-hoc
AgentSFTTrainer._train_verl/_train_tinker and the deprecated Tinker SFT
trainer/dataset are removed; the new dispatcher keeps the AgentSFTTrainer name.
- rllm/trainer/sft/spec.py: SFTSpec (backend-agnostic; the only input).
- rllm/trainer/sft/backend.py: SFTBackend ABC (each backend owns its fit()),
SFTConfigError, validate_messages_dataset.
- rllm/trainer/sft/tinker_backend.py + tinker_dataset.py + config/tinker.yaml:
TinkerSFTBackend with the migrated tinker SFT loop; heavy imports lazy so the
dispatcher/CLI import without the tinker stack.
- rllm/trainer/agent_sft_trainer.py: AgentSFTTrainer is now the dispatcher
(SFTSpec + backend). tinker works; verl/fireworks report "milestone 4".
- rllm/cli/sft.py: `rllm sft` speaks SFTSpec; registered in main.py.
- Remove deprecated/tinker_sft_{trainer,dataset}.py and their re-export shims
(deprecated/__init__, tinker/__init__); update the archive example.
- tests: dispatcher dispatch + tinker build_config/validate (14), CLI
resolution/dispatch (4).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
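The spec + backend + dispatcher seam can be pictured roughly as below. This is an illustrative sketch only: the `SFTSpec` fields, the stand-in backends, the registry dicts, and the `SketchSFTTrainer` name are hypothetical, and the real backends defer their heavy imports rather than returning a dict.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class SFTConfigError(ValueError):
    """Raised for an invalid spec or a backend that is not available."""


@dataclass(frozen=True)
class SFTSpec:
    # Field names are assumptions for this sketch, not the real SFTSpec.
    train_dataset: str
    model: str = "Qwen/Qwen3.5-4B"
    backend: str = "tinker"


class SFTBackend(ABC):
    requires_distributed = False  # hosted backends need no torchrun launcher

    @abstractmethod
    def fit(self, spec: SFTSpec) -> dict:
        """Each backend owns its whole training loop."""


class StubTinkerBackend(SFTBackend):
    def fit(self, spec):
        return {"backend": "tinker", "model": spec.model}


class StubFireworksBackend(StubTinkerBackend):
    # Inherits shared helpers in the real design; only fit() is overridden.
    def fit(self, spec):
        return {"backend": "fireworks", "model": spec.model}


_IMPLEMENTED = {"tinker": StubTinkerBackend, "fireworks": StubFireworksBackend}
_PLANNED = {"verl"}


class SketchSFTTrainer:
    """Dispatcher: picks a backend from the spec and delegates to fit()."""

    def __init__(self, spec: SFTSpec):
        if spec.backend in _PLANNED:
            raise SFTConfigError(f"backend '{spec.backend}' is not wired yet")
        if spec.backend not in _IMPLEMENTED:
            raise SFTConfigError(f"unknown backend '{spec.backend}'")
        self.spec = spec
        self.backend = _IMPLEMENTED[spec.backend]()

    def train(self) -> dict:
        return self.backend.fit(self.spec)
```

Keeping the spec as the only input means the CLI and the dispatcher never need to know which SDK a backend talks to.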
Add FireworksSFTBackend, the second managed SFT backend, mirroring how the RL
stack's FireworksBackend extends TinkerBackend.
- rllm/trainer/sft/fireworks_backend.py: FireworksSFTBackend(TinkerSFTBackend)
reuses validate_spec/build_config/prepare_data and the shared data pipeline;
overrides fit() with a synchronous pipelined loop over Fireworks' SDK-managed
client (build_service_client -> create_training_client -> ReconnectableClient
-> TrainingCheckpoints). requires_distributed=False (hosted, like tinker).
Requires FIREWORKS_API_KEY; SDK imports deferred to fit().
- rllm/trainer/sft/tinker_backend.py: extract build_sft_data() + a
_config_template() hook so tinker and fireworks share the tinker-cookbook
renderer/dataset pipeline and the spec->config mapping.
- rllm/trainer/sft/config/fireworks.yaml: native template.
- rllm/trainer/agent_sft_trainer.py: dispatch 'fireworks' (now implemented;
only verl remains planned).
- Default managed-SFT model is now Qwen/Qwen3.5-4B (both backends +
SFTSpec/CLI).
- tests: fireworks dispatch + build_config/validation + default-model (7 new).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Fireworks SFT backend now provisions exactly like the RL FireworksBackend:
parse a `fireworks_infra` document with `trainers.policy.training_shape_id` and
call `init_fireworks_infra("sft", ...)`. Because the doc names a training shape,
the SDK takes the training-shape path and never falls back to the manual-infra
path that force-sends `skipValidations=true` (which standard accounts can't do).
The previous `build_service_client` + `create_training_client` route had no
shape, hit the manual path, and failed with HTTP 400 "Only superuser can skip
validations" — a symptom of the wrong provisioning route, not an account limit
(RL on the same account works).
- config/fireworks.yaml: FW model path + HF tokenizer + training shape
(qwen3p5-4b, following swe-rl's qwen3p5-9b pattern) + a fireworks_infra
provision doc (common/trainers.policy/recipe.sft).
- fireworks_backend: build provision doc -> load_yaml_provision("sft") ->
init_fireworks_infra("sft"); loop over infra.policy; checkpoint via
TrainingCheckpoints; infra.close() on exit. build_config keeps the FW model
path unless --model is itself a FW path.
- build_sft_data: tokenize from model.tokenizer_model when set (FW model.name
is a non-HF path).
- tests: offline provision-doc parse guard + FW model/override assertions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
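An offline guard in the spirit of the provision-doc parse test might look like the sketch below. It assumes the parsed document is a plain nested dict whose sections match the names in this commit (`common`, `trainers.policy`, `recipe.sft`); the function name is hypothetical and no Fireworks SDK call is involved.

```python
def require_training_shape(doc):
    """Return trainers.policy.training_shape_id or fail before provisioning.

    Without a shape the SDK would fall back to the manual-infra path, so the
    check runs locally on the parsed document (assumed to be a nested dict).
    """
    try:
        shape = doc["trainers"]["policy"]["training_shape_id"]
    except (KeyError, TypeError) as exc:
        raise ValueError(
            "provision doc has no trainers.policy.training_shape_id"
        ) from exc
    if not isinstance(shape, str) or not shape.strip():
        raise ValueError("training_shape_id must be a non-empty string")
    return shape


# Example document; the section layout is an assumption for illustration.
example_doc = {
    "common": {},
    "trainers": {
        "policy": {
            "training_shape_id": "accounts/fireworks/trainingShapes/qwen3p5-9b-256k"
        }
    },
    "recipe": {"sft": {}},
}
shape_id = require_training_shape(example_doc)
```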
Fireworks' catalog has no 3.5-4B training shape, so default the Fireworks SFT
backend to the 9B identifiers: accounts/fireworks/models/qwen3p5-9b +
Qwen/Qwen3.5-9B + accounts/fireworks/trainingShapes/qwen3p5-9b-256k. (Tinker
SFT keeps Qwen/Qwen3.5-4B.) Update the provision-doc parse test accordingly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-lora) The shape path now resolves (no more skipValidations), but the
non-`-lora` shape has no LoRA-validated version, so LORA_TRAINER creation 400s
with "no validated training shape exists". Switch the default to the `-lora`
shape (matches swe-rl's RL recipe, which trains LoRA rank 32 on the same model).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
len(TinkerSFTDataset) floors examples//batch_size, so a dataset smaller than
one batch yielded 0 batches and the SFT loop ran vacuously (exit 0, nothing
trained). Clamp n_batches to >=1 in both tinker and fireworks backends so the
final partial batch is trained.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
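The arithmetic behind the clamp, as a small sketch. `make_batches` is a hypothetical helper, not the backend code, and the empty-dataset check is an added assumption.

```python
def make_batches(examples, batch_size):
    """Split examples into batches, never returning zero batches.

    Floor division alone gives 5 // 8 == 0 batches for a 5-example dataset,
    so the loop would exit without training. Clamping to at least one batch
    lets the single partial batch run.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not examples:
        raise ValueError("cannot train on an empty dataset")
    n_batches = max(1, len(examples) // batch_size)
    return [
        examples[i * batch_size:(i + 1) * batch_size]
        for i in range(n_batches)
    ]
```

For datasets larger than one batch the count is unchanged: 20 examples at batch size 8 still yield 2 batches.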
Fireworks' forward_backward(loss_fn="cross_entropy") returns aggregate metrics
(loss:sum / response_tokens), not the per-token loss_fn_outputs["logprobs"] the
tinker SDK exposes. The copied-from-tinker logprobs path raised KeyError on the
first step. Compute train/val loss from result.metrics, matching the cookbook's
own sft_loop collect. (Tinker backend keeps logprobs — that's correct for it.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
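A sketch of deriving a per-token loss from aggregate metrics. It assumes each step's metrics behave like a dict with the keys `"loss:sum"` and `"response_tokens"`, which is a reading of this commit message rather than a confirmed SDK schema; `mean_token_loss` is a hypothetical helper.

```python
def mean_token_loss(step_metrics):
    """Token-weighted mean loss over one or more steps.

    Summing numerators and denominators separately weights every response
    token equally, instead of averaging per-step means.
    """
    total_loss = sum(m["loss:sum"] for m in step_metrics)
    total_tokens = sum(m["response_tokens"] for m in step_metrics)
    if total_tokens <= 0:
        raise ValueError("no response tokens to average over")
    return total_loss / total_tokens
```

Passing a single-element list gives the train loss for one step; passing all validation steps gives one validation loss.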
The panel printed the raw --model default (e.g. Qwen/Qwen3.5-4B) even when the
Fireworks backend resolves to a different FW model path + HF tokenizer. Add
AgentSFTTrainer.prepare() (build/configure the backend locally, cached, no
provisioning) and a backend.config property; the CLI renders the panel from the
resolved config (model name + tokenizer row) and reuses the same backend for
train().
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
42cb172 to
42bb4aa
Summary
Adds the eval → curate → SFT loop to the rLLM CLI: run an eval with `k` samples per task, curate the trajectories by aggregate metrics into an SFT dataset, and fine-tune on it, all from the CLI. Also refactors SFT around a backend-agnostic dispatcher with pluggable backends (tinker + fireworks; verl deferred).

Design doc: `design/sft-distillation.md`.

What's included

Curation (`rllm dataset from-eval`)
- `rllm/eval/curation.py`: load eval run dirs, pool attempts by stable `task_id` across runs, filter tasks, select trajectories (`correct`/`best`/`best-n`/`shortest`/`all`), lazy-load only chosen episodes, emit `{"messages": ...}` rows with provenance, register via `DatasetRegistry`.
- `rllm/eval/filter_dsl.py`: safe per-task boolean DSL over aggregates (`avg`, `best`, `worst`, `solved`, `n`, `n_correct`, and budget-aware `pass@k` via a `name@k` rewrite + AST node-whitelist). `avg@k` is treated as k-invariant.
- `--dry-run`, task-level train/val holdout.

Unified SFT trainer
- `rllm/trainer/sft/`: `SFTSpec` (backend-agnostic input), `SFTBackend` ABC (each backend owns its `fit()`), and `AgentSFTTrainer` as the dispatcher (mirrors the RL stack's `AgentTrainer`/launcher seam).
- `TinkerSFTBackend` (migrated loop) and `FireworksSFTBackend(TinkerSFTBackend)`: Fireworks provisions through the training-shape path (`init_fireworks_infra("sft", …)` with a `fireworks_infra` doc), shares the tinker-cookbook data pipeline, and runs a synchronous pipelined loop.
- `rllm sft` CLI (registered in `main.py`); panel shows the backend-resolved model.
- Removed the old ad-hoc `AgentSFTTrainer._train_verl`/`_train_tinker` and the deprecated Tinker SFT trainer/dataset; kept the `AgentSFTTrainer` name.

Verified
- … `forward_backward`/`optim_step`, loss from `result.metrics`), saves DCP checkpoints, tears down the trainer on exit.
- `ruff` clean.

Deferred (follow-ups)
- `VerlSFTBackend` + torchrun launcher (`--backend verl` currently returns a clear "not wired yet").
- … `--keep-trainer`/`--fireworks-job-id`; cross-run resume).

🤖 Generated with Claude Code