This repository contains the code for training and evaluating language models as calibrated stochastic generators.
The main paper workflow compares two variants of Calibration Fine-Tuning:
- Soft-target fine-tuning: discretize the target output distribution, build a prefix trie over tokenized canonical outputs, and train the model to match trie-induced next-token targets.
- Hard-target fine-tuning: sample canonical outputs from the same discretized target distribution and train on the sampled completions with standard next-token cross-entropy.
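To make the soft-target idea concrete, here is a minimal sketch of how a prefix trie over tokenized outputs can induce next-token targets. It is an illustration under stated assumptions, not the repository's implementation: the function names `trie_next_token_targets` and `soft_cross_entropy` are hypothetical, tokens are shown as strings rather than tokenizer ids, and each output is assumed to end with an end-of-sequence token so that the targets at every prefix sum to 1.

```python
import math
from collections import defaultdict


def trie_next_token_targets(sequences, probs):
    """Map each token prefix to a target distribution over its next token.

    sequences: list of token tuples, one per canonical output.
    probs: matching list of target probabilities (summing to 1).
    """
    # Mass of a prefix = total probability of the outputs that start with it.
    prefix_mass = defaultdict(float)
    for seq, p in zip(sequences, probs):
        for i in range(len(seq) + 1):
            prefix_mass[seq[:i]] += p

    targets = defaultdict(dict)
    for prefix, mass in prefix_mass.items():
        if not prefix or mass == 0.0:
            continue  # skip the root itself and unreachable branches
        parent = prefix[:-1]
        # P(next token | parent prefix) = mass(child) / mass(parent).
        targets[parent][prefix[-1]] = mass / prefix_mass[parent]
    return dict(targets)


def soft_cross_entropy(target, predicted):
    """Cross-entropy of a predicted next-token distribution against a soft target."""
    return -sum(p * math.log(predicted[tok]) for tok, p in target.items() if p > 0)


# Toy discretized distribution over three canonical outputs (hypothetical values).
outputs = [
    ("0", ".", "2", "<eos>"),
    ("0", ".", "7", "<eos>"),
    ("1", ".", "0", "<eos>"),
]
targets = trie_next_token_targets(outputs, [0.5, 0.3, 0.2])
root_target = targets[()]              # distribution over the first token
branch_target = targets[("0", ".")]    # distribution after the prefix "0."
loss_at_root = soft_cross_entropy(root_target, {"0": 0.5, "1": 0.5})
```

In the hard-target variant, by contrast, one would sample whole outputs from the same distribution and train on each sampled completion with one-hot next-token labels.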
The code also includes evaluation pipelines for structured numeric sampling, open-ended random generation, MCQ answer-position balance, NoveltyBench, PALOMA perplexity, and TinyBenchmarks retention.
Repository layout:
- src/random_steering/calibrate_sft/: soft-target data construction, loss, training, and structured-sampling evaluation.
- src/random_steering/hard_label_sft/: hard-target data construction, loss, and training.
- src/random_steering/inference/: shared Hugging Face/vLLM generation backends, chat formatting, and String Seed of Thought wrappers.
- src/random_steering/open_random_gen/: open-ended random-generation evaluation.
- src/random_steering/mcq_gen/: MCQ answer-position balance evaluation.
- src/random_steering/noveltybench/: NoveltyBench generation, partitioning, and scoring.
- src/random_steering/perplexity/: PALOMA-style teacher-forced perplexity evaluation.
- src/random_steering/retention/: TinyBenchmarks retention evaluation.
- conf/: Hydra configs for training and evaluation.
- benchmarks/: small benchmark assets used by the open-generation and NoveltyBench evaluations.
- tests/: lightweight unit and integration tests with mocked models where possible.
The internal Python package is still named random_steering for compatibility with the original experiment code. The artifact-level project name is Calibration Fine-Tuning.
Create an environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
For Hugging Face models and datasets, set cache locations as appropriate for your machine:
export HF_HOME=/path/to/hf_cache
If you use gated Hugging Face models or datasets, authenticate first:
huggingface-cli login
The code uses Hydra for configuration. Training logs are written under outputs/ by default. W&B logging is disabled in the artifact configs; enable it explicitly with train.wandb.enabled=true train.wandb.mode=online.
Most evaluation axes are either synthetic or ship with the repository. The only large external corpus used by the paper pipeline is PALOMA.
For structured numeric sampling, no external dataset is required. Prompts, target distributions, train/test splits, discretized output spaces, and logit/sample metrics are generated from the Hydra data configs under conf/data/.
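As an illustration of what a sample-based metric can look like, the sketch below computes the total variation distance between the empirical distribution of model samples and a discretized target. The source does not list which metrics the configs produce, so this is one common choice shown under that assumption, and `total_variation` is a hypothetical helper rather than a function from the package.

```python
from collections import Counter


def total_variation(samples, target):
    """Total variation distance between empirical samples and a target.

    samples: list of canonical output strings produced by the model.
    target: dict mapping canonical output string -> target probability.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    n = len(samples)
    # Include outputs that appear only in the samples or only in the target.
    support = set(counts) | set(target)
    return 0.5 * sum(abs(counts[x] / n - target.get(x, 0.0)) for x in support)


# Hypothetical two-outcome target and two sets of model samples.
target = {"0.00000": 0.5, "1.00000": 0.5}
tv_balanced = total_variation(["0.00000", "1.00000", "0.00000", "1.00000"], target)
tv_collapsed = total_variation(["0.00000"] * 4, target)
```

A value of 0 means the samples match the target exactly on this support, and a value near 1 means they share almost no mass with it.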
The main configs are:
- conf/data/calibrate_sft_final.yaml for soft-target fine-tuning.
- conf/data/calibrate_sft_hard_label_final.yaml for hard-target fine-tuning.
- conf/data/calibrate_sft_selected.yaml for structured-sampling evaluation.
The open-ended random-generation prompt set is included:
benchmarks/open_random_gen/prompts.json
The loader reads this path from conf/open_random_gen/open_random_gen.yaml. No download is needed.
The MCQ answer-position balance evaluation requires no external dataset. The benchmark uses a fixed medical-MCQ prompt defined in:
src/random_steering/mcq_gen/prompt.py
The model generates new MCQs, and the evaluator parses the declared Correct Answer: A/B/C/D field.
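The parsing step described above can be sketched as follows. This is a minimal illustration, assuming the answer appears literally as `Correct Answer: X` with X in A to D; the regular expression and the helper names `parse_correct_answer` and `position_balance` are assumptions, and the evaluator in src/random_steering/mcq_gen/ may parse and aggregate differently.

```python
import re
from collections import Counter

# Accepts "Correct Answer: B" and "Correct Answer: (B)"; the trailing \b
# rejects words such as "Correct Answer: Aspirin".
ANSWER_RE = re.compile(r"Correct Answer:\s*\(?([ABCD])\b")


def parse_correct_answer(text):
    """Return the declared answer letter, or None if no field is found."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


def position_balance(generations):
    """Return (frequency of each letter among parsed items, unparsed count)."""
    letters = [parse_correct_answer(g) for g in generations]
    parsed = [letter for letter in letters if letter is not None]
    counts = Counter(parsed)
    total = len(parsed)
    freqs = {k: (counts[k] / total if total else 0.0) for k in "ABCD"}
    return freqs, len(letters) - total


# Hypothetical generated MCQs, abbreviated to the relevant field.
generated = [
    "Question ...\nCorrect Answer: A",
    "Question ...\nCorrect Answer: (B)",
    "Question ...\nCorrect Answer: A",
    "Question with no declared answer",
]
freqs, n_unparsed = position_balance(generated)
```

A perfectly balanced generator would put a frequency of 0.25 on each of the four positions.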
The NoveltyBench prompt assets are included:
- benchmarks/noveltybench/curated.jsonl
- benchmarks/noveltybench/wildchat_1k.jsonl
The evaluator also uses the same scoring models as the NoveltyBench pipeline:
- Similarity classifier tokenizer: microsoft/deberta-v3-large
- Similarity classifier: yimingzhang/deberta-v3-large-generation-similarity
- Reward model: Skywork/Skywork-Reward-Gemma-2-27B-v0.2
These are downloaded automatically by Transformers unless already present in HF_HOME. To pre-download them:
huggingface-cli download microsoft/deberta-v3-large
huggingface-cli download yimingzhang/deberta-v3-large-generation-similarity
huggingface-cli download Skywork/Skywork-Reward-Gemma-2-27B-v0.2
The full NoveltyBench run is expensive because it generates responses, partitions them with the DeBERTa classifier, and scores representative outputs with the reward model. The staged entry points are also available:
PYTHONPATH=src python -m random_steering.noveltybench.generate
PYTHONPATH=src python -m random_steering.noveltybench.partition run_dir=/path/to/generated_run
PYTHONPATH=src python -m random_steering.noveltybench.score run_dir=/path/to/partitioned_run
The GP-IRT metadata artifact is included:
src/random_steering/retention/assets/tinyBenchmarks.pkl
The 100-example task datasets are loaded from Hugging Face through datasets.load_dataset:
- tinyBenchmarks/tinyMMLU
- tinyBenchmarks/tinyHellaswag
- tinyBenchmarks/tinyTruthfulQA
- tinyBenchmarks/tinyWinogrande
- tinyBenchmarks/tinyGSM8k
They download automatically into the Hugging Face datasets cache. To pre-cache them:
python - <<'PY'
from datasets import load_dataset
load_dataset("tinyBenchmarks/tinyMMLU", split="test")
load_dataset("tinyBenchmarks/tinyHellaswag", split="validation")
load_dataset("tinyBenchmarks/tinyTruthfulQA", "multiple_choice", split="validation")
load_dataset("tinyBenchmarks/tinyWinogrande", "winogrande_xl", split="validation")
load_dataset("tinyBenchmarks/tinyGSM8k", "main", split="test")
PY
PALOMA is not vendored because it is large and gated by the AI2 ImpACT license. The perplexity loader expects local gzip JSONL files with this layout:
datasets/paloma/<slice_name>/<split>/*.jsonl.gz
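For orientation, a reader for this layout might look like the sketch below. It is a sketch under assumptions, not the repository's loader: `iter_paloma_documents` is a hypothetical name, and the JSON field holding the document text is assumed to be `text`. The demo writes a tiny stand-in file into a temporary directory so that it runs without the real corpus.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_paloma_documents(root, slice_name, split, text_field="text"):
    """Yield document texts from <root>/<slice_name>/<split>/*.jsonl.gz."""
    split_dir = Path(root) / slice_name / split
    # Sort so that iteration order is stable across runs and machines.
    for path in sorted(split_dir.glob("*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines
                    yield json.loads(line)[text_field]


# Demo with a stand-in shard (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    shard_dir = Path(tmp) / "wikitext_103" / "test"
    shard_dir.mkdir(parents=True)
    with gzip.open(shard_dir / "part-0.jsonl.gz", "wt", encoding="utf-8") as out:
        out.write(json.dumps({"text": "hello"}) + "\n")
        out.write(json.dumps({"text": "world"}) + "\n")
    docs = list(iter_paloma_documents(tmp, "wikitext_103", "test"))
```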
For the paper-style multi-slice run, use conf/perplexity/paloma_full_stride_1024.yaml, which expects:
- wikitext_103
- c4_en
- dolma-v1_5
- mc4
- ptb
- redpajama
- falcon-refinedweb
After accepting the dataset terms on Hugging Face, the expected layout can be obtained with:
huggingface-cli download allenai/paloma \
--repo-type dataset \
--local-dir datasets/paloma
For a lightweight smoke run, the default conf/perplexity/standard_lm.yaml only evaluates wikitext_103.
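The name paloma_full_stride_1024 suggests a strided evaluation with a stride of 1024 tokens. The sketch below shows one common way to lay out sliding windows for teacher-forced perplexity on a long document, so that every token is scored exactly once while later windows reuse earlier tokens as context. The context length `max_length=2048` and the helper name `strided_windows` are assumptions for illustration; the repository's evaluator may window documents differently.

```python
def strided_windows(num_tokens, max_length, stride):
    """Return (begin, end, n_scored) windows covering num_tokens tokens.

    Each window feeds tokens[begin:end] to the model, but only the last
    n_scored positions contribute to the loss; the rest serve as context.
    """
    if not 0 < stride <= max_length:
        raise ValueError("stride must be in (0, max_length]")
    windows = []
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + max_length, num_tokens)
        windows.append((begin, end, end - prev_end))  # newly covered tokens
        prev_end = end
        if end == num_tokens:
            break
    return windows


# Hypothetical document of 5000 tokens, 2048-token context, stride 1024.
windows = strided_windows(num_tokens=5000, max_length=2048, stride=1024)
total_scored = sum(n for _, _, n in windows)
```

A smaller stride gives each scored token more context at the cost of more forward passes.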
Base models are selected by configs under conf/model/, for example Qwen/Qwen3-1.7B. Some model families used in the paper, such as Llama and Gemma, may require accepting model terms on Hugging Face before download.
Fine-tuned adapter checkpoints are not included. Train them with the commands below, or edit the template files under conf/eval_target/ to point to your local checkpoints.
Run the lightweight tests:
pytest tests
Some tests that require gated Hugging Face models, local tokens, or large-model inference are skipped automatically when the required environment is unavailable.
Soft-target fine-tuning:
PYTHONPATH=src python -m random_steering.train \
model=qwen3_1p7b \
data=calibrate_sft_final \
train=calibrate_sft_final \
experiment.name=qwen3_1p7b_soft_target
Hard-target fine-tuning:
PYTHONPATH=src python -m random_steering.train \
model=qwen3_1p7b \
data=calibrate_sft_hard_label_final \
train=hard_label_sft_final \
experiment.name=qwen3_1p7b_hard_target
For multi-GPU FSDP runs, use the corresponding *_fsdp train configs with torchrun.
Baseline evaluation:
PYTHONPATH=src python -m random_steering.calibrate_sft.eval \
model=qwen3_1p7b \
eval_target=calibrate_sft_qwen3_1p7b_baseline_paper
Fine-tuned checkpoint evaluation uses the provided template configs. Replace the placeholder checkpoint paths in conf/eval_target/calibrate_sft_qwen3_1p7b_soft_template.yaml or conf/eval_target/calibrate_sft_qwen3_1p7b_hard_template.yaml, then run:
PYTHONPATH=src python -m random_steering.calibrate_sft.eval \
model=qwen3_1p7b \
eval_target=calibrate_sft_qwen3_1p7b_soft_template
Open-ended random generation:
PYTHONPATH=src python -m random_steering.open_random_gen.eval \
model=qwen3_1p7b \
eval_target=open_random_gen_baseline
MCQ answer-position balance:
PYTHONPATH=src python -m random_steering.mcq_gen.eval \
model=qwen3_1p7b \
eval_target=mcq_gen_baseline
TinyBenchmarks retention:
PYTHONPATH=src python -m random_steering.retention.eval \
model=qwen3_1p7b \
eval_target=tinybenchmarks_baseline
PALOMA perplexity:
PYTHONPATH=src python -m random_steering.perplexity.eval \
model=qwen3_1p7b \
eval_target=calibrate_sft_qwen3_1p7b_baseline_paper
NoveltyBench:
PYTHONPATH=src python -m random_steering.noveltybench.eval \
model=qwen3_1p7b \
eval_target=noveltybench_baseline
- Final soft-target configs use five-decimal canonical outputs and max_bins=1001.
- Final hard-target configs use five-decimal canonical outputs and max_bins=16384.
- The default training prompt is:
Generate exactly ONE random number from a [distribution] distribution with parameters [params]. Output ONLY the number.
- Evaluation target configs in this artifact are intentionally templates. They avoid machine-specific absolute checkpoint paths and should be edited to point to locally trained checkpoints.
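To show how these settings could fit together, the sketch below discretizes a target density onto a grid of five-decimal canonical strings, samples hard targets from it, and fills in the default prompt. It rests on several assumptions that the source does not confirm: `max_bins` is treated as the number of equally spaced grid points, the support `[low, high]` is chosen by hand, the parameter string format is made up, and the helper names `discretize`, `sample_hard_targets`, and `format_prompt` are hypothetical.

```python
import math
import random

PROMPT_TEMPLATE = (
    "Generate exactly ONE random number from a [distribution] distribution "
    "with parameters [params]. Output ONLY the number."
)


def format_prompt(distribution, params):
    """Fill the two placeholders of the default training prompt."""
    return PROMPT_TEMPLATE.replace("[distribution]", distribution).replace(
        "[params]", params
    )


def discretize(pdf, low, high, max_bins, decimals=5):
    """Return {canonical_string: probability} on an equal-width grid."""
    if max_bins < 2:
        raise ValueError("max_bins must be at least 2")
    step = (high - low) / (max_bins - 1)
    weights = {}
    for i in range(max_bins):
        key = f"{low + i * step:.{decimals}f}"
        if float(key) == 0.0:
            key = f"{0.0:.{decimals}f}"  # avoid a "-0.00000" canonical output
        weights[key] = weights.get(key, 0.0) + pdf(low + i * step)
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def sample_hard_targets(target, n, seed=0):
    """Draw n canonical outputs from the discretized target distribution."""
    rng = random.Random(seed)
    keys = list(target)
    return rng.choices(keys, weights=[target[k] for k in keys], k=n)


def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


target = discretize(standard_normal_pdf, low=-4.0, high=4.0, max_bins=1001)
hard_targets = sample_hard_targets(target, n=8)
prompt = format_prompt("normal", "mean=0, std=1")
```

Under these assumptions, the soft-target variant would consume `target` directly, while the hard-target variant would train on completions such as those in `hard_targets`.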