115 changes: 115 additions & 0 deletions randopt/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# RandOpt

## Required `verl` Version

See [`REQUIRED_VERL.txt`](REQUIRED_VERL.txt) for the upstream repository, install mode, and copy-pastable `pip` instruction.

## Overview

RandOpt is an LLM post-training algorithm introduced in [**Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights**](https://arxiv.org/abs/2603.12228) (ICML 2026 Spotlight). It samples Gaussian perturbations around a pretrained model, evaluates the perturbed models in parallel with vLLM, selects the top-performing perturbations, and evaluates the selected models as a majority-vote ensemble.

Project page: <https://thickets.mit.edu>
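
The loop described in this overview can be sketched on a toy problem. This is a minimal illustration, not the recipe's implementation: the weights are a plain list of floats, `toy_score` stands in for the vLLM reward evaluation, and every function name below is hypothetical.

```python
import random
from collections import Counter


def sample_perturbation(weights, sigma, rng):
    """Add independent Gaussian noise N(0, sigma^2) to every weight."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def select_top_k(base_weights, score_fn, population_size, sigma, top_k, seed=0):
    """Sample a population around base_weights and keep the top_k by score."""
    rng = random.Random(seed)
    population = [
        sample_perturbation(base_weights, sigma, rng) for _ in range(population_size)
    ]
    return sorted(population, key=score_fn, reverse=True)[:top_k]


def majority_vote(answers):
    """Return the answer produced by the largest number of ensemble members."""
    return Counter(answers).most_common(1)[0][0]


def toy_score(weights):
    """Toy stand-in for a reward: weights closer to (1, 1) score higher."""
    return -sum((w - 1.0) ** 2 for w in weights)


experts = select_top_k([0.0, 0.0], toy_score, population_size=50, sigma=0.1, top_k=5)
# Each selected expert "answers" by reporting which coordinate is larger.
votes = ["x" if w[0] >= w[1] else "y" for w in experts]
print(majority_vote(votes))
```

In the actual recipe the scoring step is a batch of vLLM generations scored by the reward function, and the vote is taken over the model responses for each validation example.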

## Installation

Install the pinned `verl` version and runtime dependencies:

```bash
pip install verl==0.7.1
pip install vllm ray pandas pyarrow tqdm
```

If running from a `verl` checkout, initialize the recipe submodule first:

```bash
git submodule update --init --recursive recipe
```

## Run

Run the Countdown example from the `verl` repository root. The following command expects prepared Countdown parquet files:

```bash
python3 -m recipe.randopt.main_randopt \
    model.path=Qwen/Qwen2.5-3B-Instruct \
    data.task_type=countdown \
    data.train_files=data/countdown/train.parquet \
    data.val_files=data/countdown/test.parquet \
    randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension
```

For a quick test with generated toy Countdown data:

```bash
python3 -m recipe.randopt.run_countdown_example
```

Common overrides:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m recipe.randopt.main_randopt \
    model.path=Qwen/Qwen2.5-7B-Instruct \
    data.task_type=countdown \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/test.parquet \
    randopt.num_engines=4 \
    randopt.tensor_parallel_size=1 \
    "randopt.sigma_list=[0.0005,0.001,0.002]" \
    "randopt.top_k_ratios=[0.02,0.1]" \
    randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension \
    trainer.n_gpus_per_node=4
```
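
According to the comment in `config/randopt_trainer.yaml`, when `randopt.sigma_list` is set, each sampled perturbation draws one sigma uniformly from the list. A minimal sketch of that sampling rule follows; the helper name `draw_sigmas` and the way the seed is used are assumptions for illustration, not the recipe's code.

```python
import random


def draw_sigmas(sigma_list, population_size, seed=42):
    """Draw one sigma per perturbation, uniformly at random from sigma_list."""
    rng = random.Random(seed)  # seeding here is illustrative only
    return [rng.choice(sigma_list) for _ in range(population_size)]


sigmas = draw_sigmas([0.0005, 0.001, 0.002], population_size=8)
print(sigmas)
```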


If you are running from the standalone `verl-recipe` repository root instead of from `verl`, drop the `recipe.` prefix:

```bash
python3 -m randopt.run_countdown_example
python3 -m randopt.main_randopt ...
```

## Test Result

The following test was run on six H200 GPUs with `Qwen/Qwen2.5-1.5B-Instruct`, `population_size=500`, 20 toy Countdown training examples, 200 validation examples, and one RandOpt iteration:

```text
train/reward_mean: 0.1045
train/reward_std: 0.0664
train/reward_min: 0.0130
train/reward_max: 0.3820
ensemble/top_10_accuracy: 42.0%
ensemble/top_50_accuracy: 64.0%
```

With `top_k_ratios=[0.02,0.1]` and `population_size=500`, the run evaluates top-10 and top-50 majority-vote ensembles, since each ensemble size is `int(ratio * population_size)`. The base model may score poorly on the toy Countdown examples, but the ensemble metrics should still be emitted at the end of a successful smoke-test run.
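
The ensemble sizes follow the rule stated in the config comments: `int(ratio * population_size)` for each ratio, with `top_k_values` taking precedence when it is set. The helper below is a hypothetical sketch of that rule, not the trainer's code.

```python
def resolve_top_k(population_size, top_k_ratios=None, top_k_values=None):
    """Return the ensemble sizes K to evaluate."""
    if top_k_values is not None:
        # Absolute K values take precedence over ratios.
        return [int(k) for k in top_k_values]
    return [int(ratio * population_size) for ratio in (top_k_ratios or [])]


print(resolve_top_k(500, top_k_ratios=[0.02, 0.1]))
print(resolve_top_k(30, top_k_ratios=[0.02, 0.1]))
```

Note that `int()` truncates: with the default `population_size: 30`, the ratio `0.02` yields a size of 0. How the recipe treats a zero-sized ensemble is not shown here, so setting `top_k_values` explicitly may be the safer choice for small populations.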

## Custom Tasks

For a custom dataset, set `data.task_type=custom` and provide a reward function and, optionally, a prompt processor:

```bash
python3 -m recipe.randopt.main_randopt \
    data.task_type=custom \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/test.parquet \
    data.reward_fn_path=/path/to/reward.py \
    data.reward_fn_name=my_reward_fn \
    data.prompt_processor_path=/path/to/prompts.py \
    data.prompt_processor_name=my_prompt_processor
```

The reward function should accept `(response: str, task_data: dict)` and return either a float or a dict with a `reward` field.
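
A minimal sketch of a `reward.py` matching that signature is shown below. The `<answer>` tag parsing and the `target_answer` key in `task_data` are assumptions chosen for illustration; use whatever fields your own parquet rows provide.

```python
import re


def my_reward_fn(response: str, task_data: dict) -> dict:
    """Exact-match reward: 1.0 if the tagged answer equals the expected one."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    predicted = match.group(1).strip() if match else None
    # "target_answer" is a hypothetical field name, not one defined by the recipe.
    expected = str(task_data.get("target_answer", "")).strip()
    is_correct = predicted is not None and predicted == expected
    return {"reward": 1.0 if is_correct else 0.0}


print(my_reward_fn("<think>2 + 2</think><answer> 4 </answer>", {"target_answer": "4"}))
```

Returning a bare float such as `1.0` is also accepted according to the description above; the dict form only needs the `reward` field.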

## Citation

```bibtex
@misc{gan2026neuralthickets,
  title={Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights},
  author={Yulu Gan and Phillip Isola},
  year={2026},
  eprint={2603.12228},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.12228},
}
```
7 changes: 7 additions & 0 deletions randopt/REQUIRED_VERL.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# RandOpt recipe is prepared against the latest tested verl release line.
UPSTREAM=https://github.com/verl-project/verl.git
MODE=pinned_tag
TAG=v0.7.1
COMMIT=bec9ef74768dd201881cd4e54cd0385e87caae27
PIP_INSTALL=pip install verl==0.7.1

1 change: 1 addition & 0 deletions randopt/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
"""RandOpt recipe for zeroth-order post-training with verl and vLLM."""
66 changes: 66 additions & 0 deletions randopt/config/randopt_trainer.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# RandOpt recipe configuration.

randopt:
  sigma: 0.001
  # Optional list of perturbation scales. When set, each sampled perturbation
  # draws one sigma uniformly from this list, following the reference RandOpt code.
  sigma_list: null
  population_size: 30
  # Evaluate majority-vote ensembles for top int(ratio * population_size) perturbations.
  top_k_ratios:
    - 0.02
    - 0.1
  # Optional absolute K values. When set, this takes precedence over top_k_ratios.
  top_k_values: null
  num_engines: 4
  tensor_parallel_size: 1
  precision: bfloat16
  max_tokens: 1024
  temperature: 0.0
  gpu_memory_utilization: 0.85
  enable_prefix_caching: false
  enforce_eager: true
  worker_extension_cls: recipe.randopt.worker_extension.WorkerExtension
  global_seed: 42
  debug_print_samples: false
  debug_max_samples: 4

model:
  path: Qwen/Qwen2.5-3B-Instruct
  trust_remote_code: false

data:
  # Built-in options: countdown, parquet_prompt, custom.
  task_type: countdown
  train_files: data/countdown/train.parquet
  val_files: data/countdown/test.parquet
  train_max_samples: 200
  val_max_samples: -1
  system_message: "You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer."
  user_template: "Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. Return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>."
  response_prompt: "Let me solve this step by step.\n<think>"
  reward_fn_path: null
  reward_fn_name: null
  prompt_processor_path: null
  prompt_processor_name: null

trainer:
  project_name: randopt
  experiment_name: countdown-randopt
  logger:
    - console
  default_local_dir: /tmp/${oc.env:USER}/verl/randopt_checkpoints
  default_hdfs_dir: null
  device: cuda
  n_gpus_per_node: 4
  nnodes: 1
  total_epochs: null
  test_freq: null
  save_freq: -1
  npu_profile:
    enable: false

ray_kwargs:
  ray_init:
    runtime_env: {}

80 changes: 80 additions & 0 deletions randopt/main_randopt.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
import os
import socket
import tempfile
import time
from pprint import pprint

import hydra
import ray
from omegaconf import OmegaConf
from transformers import AutoTokenizer

from verl.utils.device import auto_set_device

# Support both layouts: inside a verl checkout (recipe.randopt) and the
# standalone verl-recipe repository root (randopt).
try:
    from recipe.randopt.randopt_ray_trainer import RandOptRayTrainer
    from recipe.randopt.task_utils import create_prompt_processor, create_reward_fn, create_vote_fns, load_data
except ModuleNotFoundError:
    from randopt.randopt_ray_trainer import RandOptRayTrainer
    from randopt.task_utils import create_prompt_processor, create_reward_fn, create_vote_fns, load_data


@hydra.main(config_path="config", config_name="randopt_trainer", version_base=None)
def main(config):
    auto_set_device(config)
    run_randopt(config)


def run_randopt(config) -> None:
    print(f"RandOpt hostname: {socket.gethostname()}, PID: {os.getpid()}")
    pprint(OmegaConf.to_container(config, resolve=True))
    OmegaConf.resolve(config)

    if not ray.is_initialized():
        ray_init_cfg = config.ray_kwargs.get("ray_init", {})
        if ray_init_cfg:
            # Convert the user-provided config node into plain kwargs for ray.init.
            ray_init_kwargs = OmegaConf.to_container(ray_init_cfg, resolve=True)
        else:
            # No overrides given: start an isolated local Ray instance. This is a
            # plain dict, which OmegaConf.to_container would reject, so it is
            # passed to ray.init directly.
            ray_init_kwargs = {
                "address": "local",
                "include_dashboard": False,
                "ignore_reinit_error": True,
                "_temp_dir": tempfile.mkdtemp(prefix=f"ray_randopt_{int(time.time())}_"),
            }
        ray.init(**ray_init_kwargs)

    data_config = OmegaConf.to_container(config.data, resolve=True)
    train_data = load_data(config.data.train_files)
    eval_data = load_data(config.data.val_files) if config.data.get("val_files") else []
    # A non-positive max_samples value (default -1) keeps the full split.
    train_max_samples = int(config.data.get("train_max_samples", -1))
    val_max_samples = int(config.data.get("val_max_samples", -1))
    if train_max_samples > 0:
        train_data = train_data[:train_max_samples]
    if val_max_samples > 0:
        eval_data = eval_data[:val_max_samples]

    tokenizer = AutoTokenizer.from_pretrained(
        config.model.path,
        trust_remote_code=config.model.get("trust_remote_code", False),
    )
    prompt_processor = create_prompt_processor(data_config)
    reward_fn = create_reward_fn(data_config)
    vote_answer_fn, vote_correct_fn = create_vote_fns(data_config)

    trainer = RandOptRayTrainer(
        config=config,
        tokenizer=tokenizer,
        reward_fn=reward_fn,
        train_data=train_data,
        eval_data=eval_data,
        prompt_processor=prompt_processor,
        vote_answer_fn=vote_answer_fn,
        vote_correct_fn=vote_correct_fn,
    )
    trainer.init_workers(config.model.path)
    trainer.fit()

    if ray.is_initialized():
        ray.shutdown()


if __name__ == "__main__":
    main()