ES-at-Scale is an open-source framework for fine-tuning large language models using Evolution Strategies (ES) — a fully backpropagation-free, massively parallelizable alternative to RL-based training methods like PPO and GRPO. The implementation is based on the paper "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (https://arxiv.org/abs/2509.24372).
ES-at-Scale performs direct optimization in the full parameter space:
- No backpropagation
- No optimizer states
- No activations stored
- No dimensionality reduction or low-rank adapters
Training is built on Ray for distributed execution and vLLM for high-throughput inference, enabling efficient multi-GPU rollout evaluation at scale.
Feel free to join the ES fine-tuning forum in Discussions.
For the older code that was used to generate the original experimental results in the paper, see /archive (the corresponding documentation is included there).
06/18/2026: 🔥🔥🔥 A new open-source library, ES-at-Scale, with an extensible interface and a more stable infrastructure is released! 🚀🚀🚀
10/27/2025: 🔥🔥🔥 An accelerated version with 10X+ speed-up in running time is added to the repo! 🚀🚀🚀
ES-at-Scale is free to use, modify, and build on.
- Massively parallel: each perturbation (population member) can be evaluated independently
- Works with non-differentiable rewards: symbolic graders, program execution, discrete checks
- Inference-only systems: leverage vLLM throughput instead of training-time backprop
A major design goal of this repository is model‑size flexibility — and, in turn, democratizing LLM fine‑tuning. Using this exact implementation we have successfully trained 0.5B, 3B, 7B, 14B, 32B, and 72B parameter models with no architectural changes.
If the model fits in your GPUs under vLLM, it can be fine‑tuned with this ES framework.
Because there are no gradients, optimizer states, or stored activations, the scaling challenge shifts from gradient computation to inference throughput. Smaller teams can fine‑tune large models on inference‑optimized clusters, with arbitrary (including non‑differentiable) reward functions.
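To make the inference-only loop concrete, here is a toy NumPy sketch of one ES iteration: perturb the parameters, score each perturbation, z-score the rewards, and step along the reward-weighted noise. It is an illustration under stated assumptions, not the repository's implementation: `es_step` and `toy_reward` are hypothetical names, the parameter vector has three entries instead of billions, and the values of `sigma` and the iteration count are chosen for the toy problem only.

```python
import numpy as np


def es_step(theta, reward_fn, sigma, alpha, population_size, rng):
    """One ES iteration: perturb, score, z-score the rewards, and update."""
    # Each population member is theta plus Gaussian noise scaled by sigma.
    noise = rng.standard_normal((population_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # z-score reward shaping; the small constant guards against zero variance.
    shaped = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Move theta along the reward-weighted average of the noise directions.
    return theta + alpha * (shaped @ noise) / population_size


def toy_reward(params):
    """Hypothetical stand-in for a rollout score: closer to the target is better."""
    target = np.array([1.0, -2.0, 0.5])
    return -float(np.sum((params - target) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(3)
sigma = 0.1          # toy value, not the repository default
alpha = sigma / 2    # mirrors the documented sigma/2 default for alpha
initial_reward = toy_reward(theta)
for _ in range(300):
    theta = es_step(theta, toy_reward, sigma, alpha, population_size=30, rng=rng)
final_reward = toy_reward(theta)
```

Only forward evaluations of `reward_fn` are needed, which is why the reward can be any black-box scorer and why each population member can be evaluated on a separate worker.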
- Python 3.12 is recommended. Other Python versions are untested and may not work
- CUDA-enabled GPUs (multi-GPU strongly recommended)
- Linux environment
- CUDA, PyTorch, Ray, and vLLM properly installed
Python 3.12 is supported and tested. Other versions are untested and may not work.
```bash
python3.12 -m venv es
source es/bin/activate
```

From the repository root:

```bash
pip install -e .
pip install wandb
pip install math-verify
pip install pylatexenc
pip install latex2sympy2_extended
```

Note: Ensure `nvidia-smi` works correctly and that Ray detects all available GPUs.
train.py is a task-specific example showing how to use EvolutionStrategiesTrainer. It bundles two ready-to-run tasks — countdown (the default) and math — selected via --task. Each task wires in its own reward function, prompt template, and on-disk HuggingFace datasets. To use ES on your own task, copy train.py and replace those three components — the trainer itself is fully task-agnostic.
--task selects the bundled configuration:
- `countdown` (default): the Countdown task with the `<think>`/`<answer>` format reward and a pass-through template (the dataset's `context` field already contains the full prompt).
- `math`: math reasoning with the boxed-answer reward and Qwen math chat template.
The example below fine-tunes Qwen/Qwen2.5-1.5B-Instruct on 8 GPUs against the Countdown task:
```bash
python es_at_scale/train.py \
  --task countdown \
  --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --sigma 0.001 \
  --population-size 30 \
  --n-iterations 500 \
  --eval-freq 5 \
  --train-dataset "datasets/train/countdown" \
  --eval-dataset "datasets/evaluation_suite/countdown" \
  --batch-size 200 \
  --mini-batch-size 200 \
  --max-tokens 512 \
  --n-vllm-engines 8 \
  --use-gpus "0,1,2,3,4,5,6,7" \
  --output-directory "./experiments/" \
  --experiment-name "my-first-countdown-run" \
  --wandb-project "es-finetuning" \
  --logging wandb
```

The example below fine-tunes Qwen/Qwen2.5-Math-7B on 8 GPUs against the MATH level 3–5 training set:
```bash
python es_at_scale/train.py \
  --task math \
  --model-name "Qwen/Qwen2.5-Math-7B" \
  --sigma 0.001 \
  --population-size 30 \
  --n-iterations 500 \
  --eval-freq 5 \
  --train-dataset "datasets/train/math_lvl3to5_8k" \
  --eval-dataset "datasets/evaluation_suite/math/" \
  --batch-size 1024 \
  --mini-batch-size 1024 \
  --max-tokens 3000 \
  --n-vllm-engines 8 \
  --use-gpus "0,1,2,3,4,5,6,7" \
  --output-directory "./experiments/" \
  --experiment-name "my-first-math-run" \
  --wandb-project "es-finetuning" \
  --logging wandb
```

| Argument | Default | Description |
|---|---|---|
| `--task` | `countdown` | Bundled task config: `countdown` or `math`. Selects reward function, prompt template, and dataset collate. |
| `--model-name` | `Qwen/Qwen2.5-1.5B-Instruct` | HuggingFace model ID or local path |
| `--checkpoint` | (none) | Path to a `.pth` ES checkpoint to resume from |
| `--sigma` | `0.001` | Noise scale for ES perturbations |
| `--alpha` | `sigma/2` | Learning rate (if not specified, it is auto-set to `sigma/2`) |
| `--reward-shaping` | `z-scores` | Reward normalization strategy |
| `--population-size` | `30` | Number of perturbations per iteration |
| `--n-iterations` | `300` | Total number of ES training iterations |
| `--eval-freq` | `5` | Run evaluation every N iterations |
| `--train-dataset` | `datasets/train/countdown` | Path to training `DatasetDict` on disk |
| `--eval-dataset` | `datasets/evaluation_suite/countdown` | Path to evaluation `DatasetDict` on disk |
| `--batch-size` | `512` | Number of prompts/training samples used to evaluate each perturbed model (population member) at one ES iteration |
| `--mini-batch-size` | `512` | How many prompts/training samples each vLLM engine processes at once. A memory/throughput knob only; it does not change the ES update. See note below. |
| `--max-tokens` | `512` | Maximum tokens per generated response |
| `--n-vllm-engines` | `8` | Number of vLLM engines (one per GPU recommended) |
| `--n-gpu-per-vllm-engine` | `1` | GPUs per vLLM engine |
| `--logging` | `wandb` | Logging backend (`wandb` or `none`) |
| `--seed` | `42` | Global random seed |
| `--use-gpus` | `0,1,2,3,4,5,6,7` | Comma-separated GPU indices to use |
| `--reward-function-timeout` | `10` | Timeout (seconds) for reward function calls |
| `--output-directory` | `./experiments/` | Root directory for checkpoints and logs |
| `--save-best-models` | `False` | Save a checkpoint each time eval score improves |
| `--experiment-name` | auto-generated | Name for this run (used in wandb and checkpoint paths) |
| `--wandb-project` | `es-finetuning` | Wandb project name |
--batch-size is the number of prompts/training samples used to evaluate each perturbed model (population member) at one ES iteration. Every population member sees the same prompts, and the average reward over these prompts/training samples is used as the final reward for each population member.
--mini-batch-size is a memory lever: it splits that fixed batch into sequential chunks run through each vLLM engine one at a time, so the full batch never has to fit in memory at once. Rewards are accumulated across chunks with size-weighting, so it does not change the result — only peak memory and speed. If rollout hits OOM (common with long --max-tokens or large models), lower it.
The batch is split into ceil(batch_size / mini_batch_size) chunks, processed one after another. There are three cases:
- `batch-size == mini-batch-size` (default): one chunk. The whole batch is processed in a single pass. Highest memory.
- `batch-size > mini-batch-size`: multiple chunks (e.g. batch 512, mini 128 → 4 passes of 128). Lower peak memory, more sequential passes.
- `batch-size < mini-batch-size`: one chunk. The mini-batch is capped at the batch size, so it behaves exactly like the default.
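The three cases above, and the claim that size-weighted accumulation leaves the result unchanged, can be sketched in a few lines of plain Python. This is an illustration only: `split_into_chunks` and `size_weighted_mean` are hypothetical helper names, not functions from the trainer.

```python
import math


def split_into_chunks(prompts, mini_batch_size):
    """Split a fixed batch into sequential chunks of at most mini_batch_size."""
    # Cap the chunk size at the batch size (the batch-size < mini-batch-size case).
    chunk_size = max(1, min(mini_batch_size, len(prompts)))
    return [prompts[i:i + chunk_size] for i in range(0, len(prompts), chunk_size)]


def size_weighted_mean(chunk_rewards):
    """Combine per-chunk mean rewards, weighting each chunk by its size."""
    total = sum(len(chunk) for chunk in chunk_rewards)
    weighted = sum(len(chunk) * (sum(chunk) / len(chunk)) for chunk in chunk_rewards)
    return weighted / total


batch = list(range(512))
chunks = split_into_chunks(batch, 128)       # batch 512, mini 128
single = split_into_chunks(batch, 512)       # batch == mini
capped = split_into_chunks(batch[:200], 512) # batch < mini
```

Because each chunk's mean is weighted by its length, an uneven final chunk does not bias the reward assigned to a population member.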
Evaluation mini-batch size. In train.py, evaluation uses the same mini-batch size as the --mini-batch-size in training, since --mini-batch-size should have already been tuned during training to the largest batch that fits in memory without an OOM.
The following are not configurable via CLI:
| Parameter | Value | Description |
|---|---|---|
| Training temperature | `0.0` | Greedy decoding during training rollouts |
| Training top-p | `1.0` | The entire token distribution is included; no tokens are filtered out |
| Eval temperature | `0.0` | Greedy decoding during evaluation |
| Eval top-p | `1.0` | The entire token distribution is included; no tokens are filtered out |
| Rollouts per prompt | `1` | Single sample per population member |
Currently, there is no separate evaluation entry point — evaluation is run through the same train.py. Setting --n-iterations 0 puts the trainer in eval-only mode: it runs a single evaluation pass on your unmodified model and exits, performing no training steps and saving no checkpoint.
- Set `--n-iterations 0` to evaluate only.
- Pass `--checkpoint` to evaluate a fine-tuned ES checkpoint, or omit it to evaluate the raw base model.
- Per-sample outputs are written to `experiments/<experiment-name>/eval-output/model_eval_task<name>_iteration0.json`, and pass@1 is printed to stdout.
- Evaluation runs on a single vLLM engine (`engines[0]`), so the examples below use `--n-vllm-engines 1`. The other engines are only used to parallelize the population during training.
```bash
python es_at_scale/train.py \
  --task countdown \
  --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --eval-dataset "datasets/evaluation_suite/countdown" \
  --max-tokens 512 \
  --n-iterations 0 \
  --n-vllm-engines 1 \
  --use-gpus "0" \
  --output-directory "./experiments/" \
  --experiment-name "eval-base-model" \
  --wandb-project "es-evaluation" \
  --logging "wandb"
```

```bash
python es_at_scale/train.py \
  --task countdown \
  --model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --checkpoint "experiments/<run>/checkpoint-es_fine_tuned_iteration_500/pytorch_model.pth" \
  --eval-dataset "datasets/evaluation_suite/countdown" \
  --max-tokens 512 \
  --n-iterations 0 \
  --n-vllm-engines 1 \
  --use-gpus "0" \
  --output-directory "./experiments/" \
  --experiment-name "eval-checkpoint" \
  --wandb-project "es-evaluation" \
  --logging "wandb"
```

- The number of GPUs required depends on your hardware and model size. The examples above use a single GPU (`--use-gpus "0"`, `--n-vllm-engines 1`), which is sufficient for smaller models. Larger models that do not fit on one GPU need tensor parallelism: increase `--n-gpu-per-vllm-engine` and list the corresponding GPUs in `--use-gpus` (e.g. `--n-gpu-per-vllm-engine 4 --use-gpus "0,1,2,3"`).
- Only greedy decoding is currently supported for evaluation (temperature `0.0`, top-p `1.0`, one rollout per prompt). These eval sampling parameters are fixed and not configurable via the CLI.
- In a future release we plan to abstract the evaluation step so users can customize eval parameters (e.g. sampling temperature, top-p, multiple rollouts per prompt) directly.
- For now, to run any evaluation beyond greedy decoding, override the current behavior by subclassing `EvolutionStrategiesTrainer` and reimplementing `eval_step()` (and/or the sampling parameters it builds) in your own trainer class. See the Customization Guide.
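As a sanity check on the GPU note above, the sketch below groups a `--use-gpus` string into per-engine GPU lists and fails early if too few GPUs are listed. It assumes engines receive contiguous slices of the list in order; that assignment rule and the helper name `group_gpus` are assumptions made for illustration, and the trainer's actual placement may differ.

```python
def group_gpus(use_gpus, n_engines, gpus_per_engine):
    """Group a comma-separated GPU list into one list of GPU ids per engine."""
    gpu_ids = [int(part) for part in use_gpus.split(",") if part.strip()]
    needed = n_engines * gpus_per_engine
    if len(gpu_ids) < needed:
        raise ValueError(
            f"{n_engines} engine(s) x {gpus_per_engine} GPU(s) need {needed} GPUs, "
            f"but only {len(gpu_ids)} were listed"
        )
    # Assumed rule: engine i takes the i-th contiguous slice of the list.
    return [
        gpu_ids[i * gpus_per_engine:(i + 1) * gpus_per_engine]
        for i in range(n_engines)
    ]


tensor_parallel = group_gpus("0,1,2,3", n_engines=1, gpus_per_engine=4)
one_per_gpu = group_gpus("0,1,2,3,4,5,6,7", n_engines=8, gpus_per_engine=1)
```

Running a check like this before launching avoids discovering a GPU-count mismatch only after the model has started loading.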
Each run produces a timestamped experiment directory:
```
experiments/
└── es-finetuned-.../
    ├── checkpoints/
    ├── eval-output/
    └── train-output/
```
- The final model weights are always saved at the end of training as `checkpoint-es_fine_tuned_iteration_<N>/pytorch_model.pth`.
- When `save_best_models=True`, a checkpoint is also saved each time a new best mean eval score is achieved during training. These are written to `checkpoints/<experiment_name>-mean<score>/pytorch_model.pth`. The default is `False` to save disk space.
- When `save_best_models=False`, only the final model is saved.
With --logging wandb, the following are tracked:
- Training reward statistics
- Evaluation pass@1 metrics
- ES hyperparameters
- Throughput and rollout diagnostics
This repo is designed so you can plug in your own task + evaluator without touching the ES core logic. At a minimum you provide:
- A dataset and a template function that turns a raw dataset question into the exact prompt string you want to feed the model.
- A reward function that scores each model output against the target answer (can be non-differentiable).
- (Optional) Your own trainer (subclass `EvolutionStrategiesTrainer`) if you want to change batching, logging, or rollout processing.
The trainer only requires PyTorch DataLoaders. It does not care how your data is stored — HuggingFace datasets, JSON files, CSV, databases, or anything else all work, as long as you wrap them in a DataLoader that yields the right format.
Each batch must return:
```
(list_of_prompts, list_of_targets)
```

That's the only requirement. The field names in your underlying data, the file format, and the storage location are entirely up to you.

```python
from torch.utils.data import Dataset, DataLoader


class SimpleMathDataset(Dataset):
    def __init__(self):
        self.rows = [
            {"problem": "What is 2+2?", "answer": "4"},
            {"problem": "What is 3+5?", "answer": "8"},
        ]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


def collate_fn(batch):
    prompts = [x["problem"] for x in batch]
    answers = [x["answer"] for x in batch]
    return prompts, answers


train_dataset = SimpleMathDataset()
eval_dataset = SimpleMathDataset()

train_dataloader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collate_fn,
)

eval_dataloader_dict = {
    "toy_eval": DataLoader(
        eval_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collate_fn,
    )
}
```

The structure below shows how the default train.py organises datasets on disk. This is the layout expected by the --train-dataset and --eval-dataset flags, which use HuggingFace load_from_disk as a convenience loader. If you load data a different way, this layout is not required: replace the loader in train.py with anything that produces a DataLoader yielding (list[prompt], list[target]) batches.
```
datasets/
├── train/
│   ├── math_lvl3to5_8k/
│   └── countdown/
└── evaluation_suite/
    ├── math/
    │   ├── amc/
    │   ├── aime/
    │   ├── math500/
    │   ├── minerva/
    │   └── olympiad_bench/
    └── countdown/
        └── countdown_eval/
```
This `DatasetDict` layout is only the convenience loader used by `train.py`; it is not a requirement of the framework. The trainer itself only consumes PyTorch DataLoaders that yield `(list[prompt], list[target])` batches (see Datasets). Any storage format (JSON, CSV, Parquet, a database, a remote API, an in-memory list) works just as well; you simply plug it in with your own `Dataset`/`DataLoader` code in place of the `load_from_disk` calls in `train.py`. The rest of this subsection applies only if you choose to use the built-in `load_from_disk` loader.
The convenience loader in train.py uses HuggingFace load_from_disk, which expects a DatasetDict saved to disk — a folder containing a dataset_dict.json file plus one subfolder per split. Always point these flags at the folder that contains dataset_dict.json, not at an individual split subfolder.
- `--train-dataset`: must be a `DatasetDict` folder. The current implementation assumes the training dataset has exactly one split, so the folder should contain a single split named `train`.
- `--eval-dataset`: must be a `DatasetDict` folder. Each split is evaluated separately and reported under its split name (e.g. `eval/<split>/pass@1/mean`). This is how the math suite reports `amc`, `aime`, `math500`, `minerva`, and `olympiad_bench` individually.
Create your own from raw data with save_to_disk:
```python
from datasets import Dataset, DatasetDict

rows = Dataset.from_list([
    {"problem": "What is 2+2?", "answer": "4"},
    {"problem": "What is 3+5?", "answer": "8"},
])

# Train: a single `train` split is conventional.
DatasetDict({"train": rows}).save_to_disk("datasets/train/my_task")

# Eval: each split becomes a separately-reported benchmark.
DatasetDict({
    "my_eval_a": rows,
    "my_eval_b": rows,
}).save_to_disk("datasets/evaluation_suite/my_task")
```

Then point the flags at the folders that contain `dataset_dict.json`:

```bash
--train-dataset "datasets/train/my_task" \
--eval-dataset "datasets/evaluation_suite/my_task"
```

The columns each row must contain are dictated by your collate function, e.g. `problem`/`answer` for the math task, `context`/`numbers`/`target` for countdown. See the Datasets requirements above.
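For data that does not live in a `DatasetDict`, the sketch below reads rows from CSV text with the standard library and reuses the same `(list_of_prompts, list_of_targets)` collate contract. The CSV content, the column names, and the helpers `load_rows` and `make_dataloader` are hypothetical; the PyTorch import is deferred so the parsing and collate logic can be checked without a GPU environment.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
CSV_TEXT = "problem,answer\nWhat is 2+2?,4\nWhat is 3+5?,8\n"


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def collate_fn(batch):
    """Turn a list of row dicts into (list_of_prompts, list_of_targets)."""
    prompts = [row["problem"] for row in batch]
    targets = [row["answer"] for row in batch]
    return prompts, targets


def make_dataloader(rows, batch_size, shuffle):
    """Wrap the rows in a DataLoader; a plain list works as a map-style dataset."""
    from torch.utils.data import DataLoader  # deferred import: needs PyTorch

    return DataLoader(rows, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)


rows = load_rows(CSV_TEXT)
prompts, targets = collate_fn(rows)
```

Passing the result of `make_dataloader(rows, batch_size=2, shuffle=True)` as `train_dataloader` would then replace the `load_from_disk` path for this data.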
Your template function must accept a single question: str and return a full prompt string. It is called here:
- training: `input_text = [self.template(i) for i in input_text]`
- eval: `prompts = [self.template(i) for i in input_text]`
Example (Qwen math chat template, from train.py):
```python
def apply_qwen_math_template(question: str) -> str:
    return (
        "<|im_start|>system\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
        "<|im_end|>\n"
        "<|im_start|>user\n"
        + question
        + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

Tips:
- Make sure the returned string includes the right “assistant start” marker for your model (if applicable).
- If you want to do few-shot prompting, add examples inside this function.
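As an illustration of the few-shot tip, the sketch below extends the same chat format with worked examples placed before the real question. The example pairs and the name `apply_few_shot_template` are hypothetical, and the turn markers are copied from the Qwen math template above; adapt them to your model's chat format.

```python
# Hypothetical worked examples shown to the model before the real question.
FEW_SHOT_EXAMPLES = [
    ("What is 2+2?", "2 + 2 = 4, so the answer is \\boxed{4}."),
    ("What is 3+5?", "3 + 5 = 8, so the answer is \\boxed{8}."),
]


def apply_few_shot_template(question: str) -> str:
    """Build a prompt with a system message, example turns, and the question."""
    parts = [
        "<|im_start|>system\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
        "<|im_end|>\n"
    ]
    for example_question, example_answer in FEW_SHOT_EXAMPLES:
        parts.append(f"<|im_start|>user\n{example_question}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{example_answer}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{question}<|im_end|>\n")
    # End with the assistant start marker so generation begins the answer.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = apply_few_shot_template("What is 7+6?")
```

Keep in mind that every example adds tokens to each rollout, so longer few-shot prompts raise memory use and may call for a smaller `--mini-batch-size`.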
Your reward function is invoked inside the trainer with:
```python
self.task = functools.partial(reward_function)
...
fmt, r = self.task(response_text, target_text)  # executed in a multiprocessing Pool with a timeout
```

So your function should have this signature:

```python
def my_reward_fn(model_output: str, target: str, fast: bool = False) -> tuple[str, float]:
    ...
```

And it should return:

- `fmt` (a short string label you can use for debugging, e.g. `"ok"`, `"timeout"`, `"bad_format"`)
- `reward` (a scalar float; higher is better)
Minimal example (exact-match):
```python
def exact_match_reward_fn(model_output: str, target: str, fast: bool = False):
    pred = model_output.strip()
    gold = target.strip()
    return ("exact_match", 1.0 if pred == gold else 0.0)
```

More realistic example (extract `\boxed{...}` final answer):

```python
import re

# Matches the first \boxed{...}; nested braces inside the box are not handled.
_BOX_RE = re.compile(r"\\boxed\{([^}]*)\}")


def boxed_final_answer_reward_fn(model_output: str, target: str, fast: bool = False):
    m = _BOX_RE.search(model_output)
    if m is None:
        return ("missing_box", 0.0)
    pred = m.group(1).strip()
    gold = target.strip()
    return ("boxed", 1.0 if pred == gold else 0.0)
```

Important notes:

- The reward function runs in a multiprocessing pool with a timeout in `_postprocess_outputs()`. If your grader might be slow (e.g., symbolic math, code execution), increase the timeout by increasing the `reward_function_timeout` parameter. It is set to 60 seconds by default (note that the `--reward-function-timeout` flag in `train.py` defaults to 10).
- If the reward function throws or times out, training currently assigns 0.0 reward for that rollout.
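The zero-reward fallback for a grader that raises can be sketched as a small wrapper. This is only an illustration of the exception path: it does not reproduce the trainer's multiprocessing pool or its timeout handling, and `score_with_fallback`, `fragile_numeric_reward`, and the `"error"` label are hypothetical.

```python
def score_with_fallback(reward_fn, model_output, target):
    """Call a reward function and fall back to 0.0 reward if it raises."""
    try:
        fmt, reward = reward_fn(model_output, target)
        return fmt, float(reward)
    except Exception:
        # Hypothetical label; a failed grader contributes no reward.
        return "error", 0.0


def fragile_numeric_reward(model_output: str, target: str, fast: bool = False):
    """Toy grader that raises ValueError when the output is not a number."""
    return ("numeric", 1.0 if float(model_output) == float(target) else 0.0)


good = score_with_fallback(fragile_numeric_reward, "4", "4")
bad = score_with_fallback(fragile_numeric_reward, "not a number", "4")
```

Since a crash is scored the same as a wrong answer, it helps to return distinct `fmt` labels from your own grader so failures remain visible in the logged outputs.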
In train.py, import and assign your functions, then pass them into the trainer:
```python
from es_at_scale.trainer.es_trainer import EvolutionStrategiesTrainer
from my_project.reward import my_reward_fn
from my_project.templates import my_template_fn

trainer = EvolutionStrategiesTrainer(
    model_name=args.model_name,
    checkpoint=args.checkpoint,
    sigma=args.sigma,
    alpha=alpha,
    population_size=args.population_size,
    reward_shaping=args.reward_shaping,
    num_iterations=args.n_iterations,
    max_tokens=args.max_tokens,
    batch_size=args.batch_size,
    mini_batch_size=args.mini_batch_size,
    reward_function=my_reward_fn,
    template_function=my_template_fn,
    train_dataloader=train_dataloader,
    eval_dataloader_dict=eval_dataloader_dict,
    eval_freq=args.eval_freq,
    n_vllm_engines=args.n_vllm_engines,
    n_gpu_per_vllm_engine=args.n_gpu_per_vllm_engine,
    logging=args.logging,
    global_seed=args.seed,
    use_gpus=args.use_gpus,
    experiment_name=experiment_name,
    wandb_project=args.wandb_project,
    reward_function_timeout=args.reward_function_timeout,
    save_best_models=args.save_best_models,
)
trainer.fit()
```

Most users do not need to change the trainer. But if you want custom behavior, the intended path is to subclass and override the parts you need:
Common extension points:
- `_postprocess_outputs()`: change reward aggregation, logging payloads, or what gets saved
- `evaluate_population_on_batch()`: change how you schedule seeds across engines
- `launch_engines()`: change vLLM / Ray actor configuration
Example skeleton:
```python
from es_at_scale.trainer.es_trainer import EvolutionStrategiesTrainer


class MyTrainer(EvolutionStrategiesTrainer):
    def eval_step(self, args):
        # Placeholder: replace the Ellipsis with your own evaluation logic.
        results = ...
        return results
```

Then instantiate MyTrainer instead of EvolutionStrategiesTrainer in train.py.
- ES scales nearly linearly with the number of GPUs (population parallelism)
- No synchronization barriers from backpropagation
- Ideal for single-node multi-GPU and distributed Ray clusters
- Tune `--mini-batch-size` to balance memory usage and throughput
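A rough way to reason about population parallelism is to count how many sequential evaluation waves one iteration needs. The sketch below assumes each engine evaluates one population member at a time and that members take similar time; both are simplifying assumptions, the real scheduler may behave differently, and the function names are hypothetical.

```python
import math


def rollout_waves(population_size, n_engines):
    """Minimum number of sequential waves needed to evaluate the population."""
    return math.ceil(population_size / n_engines)


def engine_utilization(population_size, n_engines):
    """Fraction of engine slots doing useful work across those waves."""
    waves = rollout_waves(population_size, n_engines)
    return population_size / (waves * n_engines)


waves_default = rollout_waves(30, 8)             # population 30 on 8 engines
utilization_default = engine_utilization(30, 8)
utilization_even = engine_utilization(32, 8)     # population divisible by engines
```

Under these assumptions, choosing a population size that is a multiple of the engine count avoids a partially idle final wave.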
If you use ES-at-Scale in your research, please cite our paper:
@article{Qiu2026EvolutionStrategies,
title={Evolution Strategies at Scale: {LLM} Fine-Tuning Beyond Reinforcement Learning},
author={Qiu, Xin and Gan, Yulu and Hayes, Conor F. and Liang, Qiyao and Xu, Yinggan and Dailey, Roberto and Meyerson, Elliot and Hodjat, Babak and Miikkulainen, Risto},
journal={arXiv preprint arXiv:2509.24372},
year={2026},
eprint={2509.24372},
archivePrefix={arXiv},
primaryClass={cs.LG},
doi={10.48550/arXiv.2509.24372}
}