verl-project · sunrainyg · May 6, 2026
diff --git a/randopt/README.md b/randopt/README.md
@@ -0,0 +1,115 @@
+# RandOpt
+
+## Required `verl` Version
+
+See [`REQUIRED_VERL.txt`](REQUIRED_VERL.txt) for the upstream repository, install mode, and copy-pastable `pip` instruction.
+
+## Overview
+
+RandOpt is an LLM post-training algorithm introduced in [**Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights**](https://arxiv.org/abs/2603.12228) (ICML 2026 Spotlight). It samples Gaussian perturbations around a pretrained model, evaluates the perturbed models in parallel with vLLM, selects the top-performing perturbations, and evaluates them with majority-vote ensembling.
+
+Project page: <https://thickets.mit.edu>
+
+## Installation
+
+Install the pinned `verl` version and runtime dependencies:
+
+```bash
+pip install verl==0.7.1
+pip install vllm ray pandas pyarrow tqdm
+```
+
+If running from a `verl` checkout, initialize the recipe submodule first:
+
+```bash
+git submodule update --init --recursive recipe
+```
+
+## Run
+
+Run the Countdown example from the `verl` repository root. The following command expects prepared Countdown parquet files:
+
+```bash
+python3 -m recipe.randopt.main_randopt \
+    model.path=Qwen/Qwen2.5-3B-Instruct \
+    data.task_type=countdown \
+    data.train_files=data/countdown/train.parquet \
+    data.val_files=data/countdown/test.parquet \
+    randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension
+```
+
+For a quick test with generated toy Countdown data:
+
+```bash
+python3 -m recipe.randopt.run_countdown_example
+```
+
+Common overrides:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m recipe.randopt.main_randopt \
+    model.path=Qwen/Qwen2.5-7B-Instruct \
+    data.task_type=countdown \
+    data.train_files=/path/to/train.parquet \
+    data.val_files=/path/to/test.parquet \
+    randopt.num_engines=4 \
+    randopt.tensor_parallel_size=1 \
+    "randopt.sigma_list=[0.0005,0.001,0.002]" \
+    "randopt.top_k_ratios=[0.02,0.1]" \
+    randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension \
+    trainer.n_gpus_per_node=4
+```
+
+
+If you are running from the standalone `verl-recipe` repository root instead of from `verl`, drop the `recipe.` prefix:
+
+```bash
+python3 -m randopt.run_countdown_example
+python3 -m randopt.main_randopt ...
+```
+
+## Test Result
+
+The following test was run on six H200 GPUs with `Qwen/Qwen2.5-1.5B-Instruct`, `population_size=500`, 20 toy Countdown training examples, 200 validation examples, and one RandOpt iteration:
+
+```text
+train/reward_mean: 0.1045
+train/reward_std: 0.0664
+train/reward_min: 0.0130
+train/reward_max: 0.3820
+ensemble/top_10_accuracy: 42.0%
+ensemble/top_50_accuracy: 64.0%
+```
+
+With `top_k_ratios=[0.02,0.1]` and `population_size=500`, the run evaluates top-10 and top-50 majority-vote ensembles (`int(0.02 * 500) = 10` and `int(0.1 * 500) = 50`). The base model may score poorly on the toy Countdown examples, but a successful smoke test should still emit the ensemble metrics at the end of the run.
+
+## Custom Tasks
+
+For a custom dataset, set `data.task_type=custom` and provide a reward function and optional prompt processor:
+
+```bash
+python3 -m recipe.randopt.main_randopt \
+    data.task_type=custom \
+    data.train_files=/path/to/train.parquet \
+    data.val_files=/path/to/test.parquet \
+    data.reward_fn_path=/path/to/reward.py \
+    data.reward_fn_name=my_reward_fn \
+    data.prompt_processor_path=/path/to/prompts.py \
+    data.prompt_processor_name=my_prompt_processor
+```
+
+The reward function should accept `(response: str, task_data: dict)` and return either a float or a dict with a `reward` field.
+
+## Citation
+
+```bibtex
+@misc{gan2026neuralthickets,
+      title={Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights},
+      author={Yulu Gan and Phillip Isola},
+      year={2026},
+      eprint={2603.12228},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2603.12228},
+}
+```
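
The Custom Tasks section of the README gives only the reward-function signature. A minimal sketch of a hypothetical `reward.py` is shown here. The `answer` key in `task_data` and the exact-match scoring are assumptions made for illustration, and the `<answer>` tag format is borrowed from the Countdown `user_template` in the recipe config.

```python
import re

# Hypothetical reward.py to point data.reward_fn_path at; not part of the recipe.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", flags=re.DOTALL)


def my_reward_fn(response: str, task_data: dict) -> dict:
    """Return reward 1.0 when the last <answer>...</answer> span matches the reference exactly."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        # No answer tags in the response: treat it as incorrect.
        return {"reward": 0.0}
    predicted = matches[-1].strip()
    # Assumption: the reference answer is stored under the "answer" key.
    expected = str(task_data["answer"]).strip()
    return {"reward": 1.0 if predicted == expected else 0.0}
```

Per the README, returning the bare float instead of a dict with a `reward` field should work as well.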
diff --git a/randopt/REQUIRED_VERL.txt b/randopt/REQUIRED_VERL.txt
@@ -0,0 +1,7 @@
+# RandOpt recipe is prepared against the latest tested verl release line.
+UPSTREAM=https://github.com/verl-project/verl.git
+MODE=pinned_tag
+TAG=v0.7.1
+COMMIT=bec9ef74768dd201881cd4e54cd0385e87caae27
+PIP_INSTALL=pip install verl==0.7.1
+
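
`REQUIRED_VERL.txt` is a flat `KEY=VALUE` file with `#` comments. A minimal sketch of reading it, for example to check the pin in CI, could look like the following. `parse_required_verl` is a hypothetical helper, not something the recipe ships.

```python
def parse_required_verl(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as 'verl==0.7.1' stay intact.
        key, sep, value = line.partition("=")
        if sep:
            pins[key.strip()] = value.strip()
    return pins


EXAMPLE = """\
# RandOpt recipe is prepared against the latest tested verl release line.
UPSTREAM=https://github.com/verl-project/verl.git
MODE=pinned_tag
TAG=v0.7.1
COMMIT=bec9ef74768dd201881cd4e54cd0385e87caae27
PIP_INSTALL=pip install verl==0.7.1
"""

pins = parse_required_verl(EXAMPLE)
print(pins["TAG"], "|", pins["PIP_INSTALL"])
```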
diff --git a/randopt/__init__.py b/randopt/__init__.py
@@ -0,0 +1 @@
+"""RandOpt recipe for zeroth-order post-training with verl and vLLM."""
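
The README Overview describes RandOpt as four steps: sample Gaussian perturbations, score them, keep the best, and majority-vote. The toy sketch here mirrors those steps on a linear classifier using only the standard library. It is not the recipe's implementation: there is no vLLM or Ray, and every name in it (`predict`, `randopt_toy`, and so on) is hypothetical.

```python
import random
from collections import Counter


def predict(weights, inputs):
    """Toy stand-in for a model: a linear binary classifier over feature rows."""
    return [int(sum(w * x for w, x in zip(weights, row)) > 0) for row in inputs]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def randopt_toy(base, train_x, train_y, eval_x, population_size=40, sigma=0.5, top_k=5, seed=0):
    rng = random.Random(seed)
    # 1. Sample Gaussian perturbations around the base weights.
    population = [[w + sigma * rng.gauss(0.0, 1.0) for w in base] for _ in range(population_size)]
    # 2. Score every perturbed model on the training examples (forward passes only, no gradients).
    rewards = [accuracy(predict(weights, train_x), train_y) for weights in population]
    # 3. Keep the top_k highest-reward perturbations.
    top_idx = sorted(range(population_size), key=lambda i: rewards[i], reverse=True)[:top_k]
    # 4. Majority vote across the selected models on held-out inputs.
    votes = [predict(population[i], eval_x) for i in top_idx]
    ensemble = [Counter(column).most_common(1)[0][0] for column in zip(*votes)]
    return rewards, top_idx, ensemble


data_rng = random.Random(1)
true_w = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
train_x = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
eval_x = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
train_y, eval_y = predict(true_w, train_x), predict(true_w, eval_x)
base = [w + data_rng.gauss(0.0, 1.0) for w in true_w]  # noisy stand-in for "pretrained" weights

rewards, top_idx, ensemble = randopt_toy(base, train_x, train_y, eval_x)
print(f"best train reward: {max(rewards):.3f}")
print(f"ensemble eval accuracy: {accuracy(ensemble, eval_y):.3f}")
```

An odd `top_k` avoids tied votes in this binary toy; how the recipe breaks ties for free-form answers is decided by its vote functions, which this diff does not show.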
diff --git a/randopt/config/randopt_trainer.yaml b/randopt/config/randopt_trainer.yaml
@@ -0,0 +1,66 @@
+# RandOpt recipe configuration.
+
+randopt:
+  sigma: 0.001
+  # Optional list of perturbation scales. When set, each sampled perturbation
+  # draws one sigma uniformly from this list, following the reference RandOpt code.
+  sigma_list: null
+  population_size: 30
+  # Evaluate majority-vote ensembles for top int(ratio * population_size) perturbations.
+  top_k_ratios:
+    - 0.02
+    - 0.1
+  # Optional absolute K values. When set, this takes precedence over top_k_ratios.
+  top_k_values: null
+  num_engines: 4
+  tensor_parallel_size: 1
+  precision: bfloat16
+  max_tokens: 1024
+  temperature: 0.0
+  gpu_memory_utilization: 0.85
+  enable_prefix_caching: false
+  enforce_eager: true
+  worker_extension_cls: recipe.randopt.worker_extension.WorkerExtension
+  global_seed: 42
+  debug_print_samples: false
+  debug_max_samples: 4
+
+model:
+  path: Qwen/Qwen2.5-3B-Instruct
+  trust_remote_code: false
+
+data:
+  # Built-in options: countdown, parquet_prompt, custom.
+  task_type: countdown
+  train_files: data/countdown/train.parquet
+  val_files: data/countdown/test.parquet
+  train_max_samples: 200
+  val_max_samples: -1
+  system_message: "You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer."
+  user_template: "Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. Return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>."
+  response_prompt: "Let me solve this step by step.\n<think>"
+  reward_fn_path: null
+  reward_fn_name: null
+  prompt_processor_path: null
+  prompt_processor_name: null
+
+trainer:
+  project_name: randopt
+  experiment_name: countdown-randopt
+  logger:
+    - console
+  default_local_dir: /tmp/${oc.env:USER}/verl/randopt_checkpoints
+  default_hdfs_dir: null
+  device: cuda
+  n_gpus_per_node: 4
+  nnodes: 1
+  total_epochs: null
+  test_freq: null
+  save_freq: -1
+  npu_profile:
+    enable: false
+
+ray_kwargs:
+  ray_init:
+    runtime_env: {}
+
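
The config comments describe two pieces of arithmetic: each perturbation draws its sigma uniformly from `sigma_list` when that list is set, and ensemble sizes are `int(ratio * population_size)` unless `top_k_values` overrides them. A minimal sketch of both rules, assuming the comments are read literally; `resolve_top_k` and `sample_sigma` are hypothetical helper names, and the trainer code that actually applies these settings is not part of this diff.

```python
import random


def resolve_top_k(population_size, top_k_ratios=None, top_k_values=None):
    """Absolute K values win when set; otherwise K = int(ratio * population_size)."""
    if top_k_values:
        return [int(k) for k in top_k_values]
    return [int(ratio * population_size) for ratio in (top_k_ratios or [])]


def sample_sigma(rng, sigma, sigma_list=None):
    """Draw one scale uniformly from sigma_list when set; otherwise use the scalar sigma."""
    return rng.choice(list(sigma_list)) if sigma_list else sigma


print(resolve_top_k(500, top_k_ratios=[0.02, 0.1]))  # [10, 50]
print(resolve_top_k(30, top_k_ratios=[0.02, 0.1]))   # [0, 3]
print(resolve_top_k(30, top_k_ratios=[0.02, 0.1], top_k_values=[2, 5]))  # [2, 5]

rng = random.Random(42)
print(sample_sigma(rng, 0.001))  # 0.001, since sigma_list is unset
print(sample_sigma(rng, 0.001, sigma_list=[0.0005, 0.001, 0.002]))  # one of the three listed scales
```

Under this arithmetic the default `population_size: 30` truncates the `0.02` ratio to K = 0. How the recipe treats a zero-sized ensemble is not visible here, so setting `top_k_values` or a larger population may be the safer choice for small runs.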
diff --git a/randopt/main_randopt.py b/randopt/main_randopt.py
@@ -0,0 +1,80 @@
+import os
+import socket
+import tempfile
+import time
+from pprint import pprint
+
+import hydra
+import ray
+from omegaconf import OmegaConf
+from transformers import AutoTokenizer
+
+from verl.utils.device import auto_set_device
+
+try:
+    from recipe.randopt.randopt_ray_trainer import RandOptRayTrainer
+    from recipe.randopt.task_utils import create_prompt_processor, create_reward_fn, create_vote_fns, load_data
+except ModuleNotFoundError:
+    from randopt.randopt_ray_trainer import RandOptRayTrainer
+    from randopt.task_utils import create_prompt_processor, create_reward_fn, create_vote_fns, load_data
+
+
+@hydra.main(config_path="config", config_name="randopt_trainer", version_base=None)
+def main(config):
+    auto_set_device(config)
+    run_randopt(config)
+
+
+def run_randopt(config) -> None:
+    print(f"RandOpt hostname: {socket.gethostname()}, PID: {os.getpid()}")
+    pprint(OmegaConf.to_container(config, resolve=True))
+    OmegaConf.resolve(config)
+
+    if not ray.is_initialized():
+        ray_init_kwargs = config.ray_kwargs.get("ray_init", {})
+        if not ray_init_kwargs:
+            ray_init_kwargs = {
+                "address": "local",
+                "include_dashboard": False,
+                "ignore_reinit_error": True,
+                "_temp_dir": tempfile.mkdtemp(prefix=f"ray_randopt_{int(time.time())}_"),
+            }
+        ray.init(**OmegaConf.to_container(OmegaConf.create(ray_init_kwargs), resolve=True))  # fallback is a plain dict
+
+    data_config = OmegaConf.to_container(config.data, resolve=True)
+    train_data = load_data(config.data.train_files)
+    eval_data = load_data(config.data.val_files) if config.data.get("val_files") else []
+    train_max_samples = int(config.data.get("train_max_samples", -1))
+    val_max_samples = int(config.data.get("val_max_samples", -1))
+    if train_max_samples > 0:
+        train_data = train_data[:train_max_samples]
+    if val_max_samples > 0:
+        eval_data = eval_data[:val_max_samples]
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        config.model.path,
+        trust_remote_code=config.model.get("trust_remote_code", False),
+    )
+    prompt_processor = create_prompt_processor(data_config)
+    reward_fn = create_reward_fn(data_config)
+    vote_answer_fn, vote_correct_fn = create_vote_fns(data_config)
+
+    trainer = RandOptRayTrainer(
+        config=config,
+        tokenizer=tokenizer,
+        reward_fn=reward_fn,
+        train_data=train_data,
+        eval_data=eval_data,
+        prompt_processor=prompt_processor,
+        vote_answer_fn=vote_answer_fn,
+        vote_correct_fn=vote_correct_fn,
+    )
+    trainer.init_workers(config.model.path)
+    trainer.fit()
+
+    if ray.is_initialized():
+        ray.shutdown()
+
+
+if __name__ == "__main__":
+    main()