feat: add randopt algorithm by sunrainyg · Pull Request #95 · verl-project/verl-recipe

sunrainyg · 2026-05-06T23:01:23Z

What does this PR do?

This PR adds the implementation of the paper "Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights" (ICML 2026 Spotlight, arXiv:2603.12228).

RandOpt in this implementation:

Samples Gaussian perturbations around a pretrained model.
Evaluates perturbed models in parallel with vLLM.
Selects top-performing perturbations by reward.
Reports majority-vote ensemble accuracy for selected top-k sets.
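
The four steps above can be sketched on a toy problem. This is a minimal illustration, not the PR's trainer: a two-weight linear classifier stands in for the LLM, Python's `random.Random` stands in for the GPU noise generator, and every function name (`perturb`, `reward`, `majority_vote`, `randopt`) and hyperparameter value below is hypothetical.

```python
import random
from collections import Counter


def perturb(weights, seed, sigma):
    """Return weights plus Gaussian noise drawn from one seed-specific stream."""
    rng = random.Random(seed)
    return [w + sigma * rng.gauss(0.0, 1.0) for w in weights]


def predict(weights, x):
    """Toy stand-in for a model: a linear classifier returning 0 or 1."""
    return int(sum(w * xi for w, xi in zip(weights, x)) > 0)


def reward(weights, dataset):
    """Mean accuracy over (x, label) pairs."""
    return sum(predict(weights, x) == y for x, y in dataset) / len(dataset)


def majority_vote(models, x):
    """Most common prediction across the ensemble (ties go to the first seen)."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


def randopt(base, train, population_size, top_k, sigma):
    """Score every perturbation seed and return the top_k seeds by reward."""
    scored = sorted(
        ((reward(perturb(base, s, sigma), train), s) for s in range(population_size)),
        reverse=True,
    )
    return [seed for _, seed in scored[:top_k]]


# Toy data: labels come from a hidden target classifier.
data_rng = random.Random(0)
target = [1.0, -1.0]
points = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
dataset = [(x, predict(target, x)) for x in points]
train, test = dataset[:100], dataset[100:]

base = [1.0, 0.0]  # the "pretrained" weights, deliberately imperfect
top_seeds = randopt(base, train, population_size=64, top_k=5, sigma=0.5)
ensemble = [perturb(base, s, sigma=0.5) for s in top_seeds]
accuracy = sum(majority_vote(ensemble, x) == y for x, y in test) / len(test)
print(f"top-{len(top_seeds)} ensemble accuracy: {accuracy:.2%}")
```

Storing only the seeds of the selected perturbations, and regenerating the noise on demand, keeps the memory cost of a population independent of model size in this sketch.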

Test

python3 -m randopt.run_countdown_example

Test result

train/reward_mean: 0.1045
train/reward_std: 0.0664
train/reward_min: 0.0130
train/reward_max: 0.3820
ensemble/top_10_accuracy: 42.0%
ensemble/top_50_accuracy: 64.0%

Usage from verl repo root

python3 -m recipe.randopt.main_randopt \
  model.path=Qwen/Qwen2.5-3B-Instruct \
  data.task_type=countdown \
  data.train_files=data/countdown/train.parquet \
  data.val_files=data/countdown/test.parquet \
  randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension

Quick local example

python3 -m recipe.randopt.run_countdown_example

Standalone verl-recipe repo usage

python3 -m randopt.run_countdown_example
python3 -m randopt.main_randopt ...

Design & Code Changes

High-level design

This PR adds a full RandOpt pipeline for perturbation-based policy optimization with parallel rollout/evaluation and top-k majority-vote ensemble reporting.

Main Files and Responsibilities

`randopt/randopt_ray_trainer.py`

Core RandOpt training/evaluation loop.
Parallel perturbation evaluation with multiple vLLM engines.
Top-k selection and ensemble metric computation.

`randopt/main_randopt.py`

Main entrypoint for launching RandOpt training with config.

`randopt/task_utils.py`

Task-specific utilities (Countdown/parquet/custom prompt+reward plumbing).

`randopt/worker_extension.py`

Worker extension hooks used by the trainer/runtime.

`randopt/config/randopt_trainer.yaml`

Default RandOpt configuration and trainer/runtime knobs.

`randopt/run_countdown_example.py`

One-command toy data generation + smoke test path.

`randopt/README.md`

Setup, run instructions, common overrides, custom task usage, and citation.

`randopt/REQUIRED_VERL.txt`

Pinned tested verl version metadata.

Add the RandOpt training recipe, configuration, and Countdown example to support zeroth-order post-training workflows with verl and vLLM.

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request introduces the RandOpt recipe, a zeroth-order post-training algorithm that samples Gaussian perturbations around pretrained models and evaluates them in parallel using Ray and vLLM. The implementation includes a dedicated Ray trainer, configuration files, and task utilities specifically for the 'countdown' task. Critical feedback was provided regarding Ray resource isolation, where the current implementation manually manipulates environment variables instead of requesting GPU resources through Ray. Additionally, a significant bug was identified in the weight perturbation logic where re-initializing random generators inside parameter loops leads to correlated noise across layers. Finally, a security concern was raised regarding the use of eval() for processing model-generated arithmetic expressions.

gemini-code-assist · 2026-05-06T23:04:04Z

+    def __init__(self, *args, **kwargs):
+        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
+        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+        super().__init__(*args, **kwargs)


Popping CUDA_VISIBLE_DEVICES from the environment bypasses Ray's resource isolation. When combined with num_gpus=0 in the actor definition (line 99), this causes all vLLM instances on the same node to see all available GPUs, likely leading to resource contention and multiple instances attempting to use GPU 0. Instead, the actor should request the appropriate number of GPUs, allowing Ray to set the environment variables correctly.

Suggested change

-    def __init__(self, *args, **kwargs):
-        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
-        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
-        super().__init__(*args, **kwargs)
+    def __init__(self, *args, **kwargs):
+        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+        super().__init__(*args, **kwargs)

gemini-code-assist · 2026-05-06T23:04:05Z

+        ]
+
+        self.engines = [
+            ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(


The Ray actor should request the GPUs it intends to use. By setting num_gpus=0, Ray does not set CUDA_VISIBLE_DEVICES for the actor, which necessitates the manual environment manipulation at line 28. Requesting tensor_parallel_size GPUs ensures proper isolation and environment setup by Ray.

Suggested change

-            ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(
+            ray.remote(num_cpus=1, num_gpus=tensor_parallel_size, scheduling_strategy=strategy)(RandOptLLM).remote(

gemini-code-assist · 2026-05-06T23:04:05Z

+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+            generator = torch.Generator(device=param.device)
+            generator.manual_seed(int(seed))
+            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)


Re-initializing the generator with the same seed for every parameter causes all parameters with the same shape to receive identical noise perturbations. This significantly reduces the diversity of the sampled models and does not represent a true Gaussian perturbation of the model weights. The generator should be initialized once outside the loop to ensure independent noise across layers.

Suggested change

-        for name, param in self.model_runner.model.named_parameters():
-            if not self._should_perturb(name):
-                continue
-            generator = torch.Generator(device=param.device)
-            generator.manual_seed(int(seed))
-            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+        generator = torch.Generator(device=self.device)
+        generator.manual_seed(int(seed))
+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
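
The effect described in this comment can be reproduced without a GPU. The sketch below is an analogy under stated assumptions, not the PR's code: Python's `random.Random` stands in for `torch.Generator`, flat lists stand in for parameter tensors, and the helper names `noise_reseeded` and `noise_shared` are hypothetical.

```python
import random


def noise_reseeded(sizes, seed):
    """Mimic the original loop: a fresh generator, same seed, for every parameter."""
    out = []
    for size in sizes:
        rng = random.Random(seed)  # re-created inside the loop
        out.append([rng.gauss(0.0, 1.0) for _ in range(size)])
    return out


def noise_shared(sizes, seed):
    """Mimic the suggested fix: one generator, seeded once, shared by all parameters."""
    rng = random.Random(seed)  # created once, outside the loop
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for size in sizes]


# Two "parameters" with the same number of elements.
correlated = noise_reseeded([4, 4], seed=123)
independent = noise_shared([4, 4], seed=123)

print("re-seeded, identical noise:", correlated[0] == correlated[1])
print("shared, identical noise:", independent[0] == independent[1])
```

With a shared generator the result is still reproducible from the seed alone, provided the parameters are visited in the same order each time.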

gemini-code-assist · 2026-05-06T23:04:05Z

+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+            generator = torch.Generator(device=param.device)
+            generator.manual_seed(int(seed))
+            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)


Similar to perturb_self_weights, the generator should be initialized once outside the loop to avoid correlated noise across parameters.

Suggested change

-        for name, param in self.model_runner.model.named_parameters():
-            if not self._should_perturb(name):
-                continue
-            generator = torch.Generator(device=param.device)
-            generator.manual_seed(int(seed))
-            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+        generator = torch.Generator(device=self.device)
+        generator.manual_seed(int(seed))
+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)

gemini-code-assist · 2026-05-06T23:04:05Z

+    def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
+        """Apply a normalized ES update on the current worker."""
+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+
+            update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
+            for seed, coeff in zip(seeds, coeffs, strict=False):
+                generator = torch.Generator(device=param.device)
+                generator.manual_seed(int(seed))
+                noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)


In update_weights_from_seeds, generators are re-initialized for every parameter and every seed, leading to identical noise across all parameters for a given seed. Generators should be pre-initialized once for each seed before the parameter loop.

Suggested change

-    def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
-        """Apply a normalized ES update on the current worker."""
-        for name, param in self.model_runner.model.named_parameters():
-            if not self._should_perturb(name):
-                continue
-            update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
-            for seed, coeff in zip(seeds, coeffs, strict=False):
-                generator = torch.Generator(device=param.device)
-                generator.manual_seed(int(seed))
-                noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+    def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
+        """Apply a normalized ES update on the current worker."""
+        generators = []
+        for s in seeds:
+            g = torch.Generator(device=self.device)
+            g.manual_seed(int(s))
+            generators.append(g)
+        for name, param in self.model_runner.model.named_parameters():
+            if not self._should_perturb(name):
+                continue
+            update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
+            for g, coeff in zip(generators, coeffs, strict=False):
+                noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=g)
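
One consequence of pre-initializing a generator per seed is that the update only replays the evaluated perturbations if it visits parameters in the same order as the perturbation step. The sketch below illustrates this with the standard library; it is not the PR's implementation. `random.Random` stands in for `torch.Generator`, parameters are plain lists, and the `alpha / population_size` scaling is an assumed form of the "normalized ES update", since the diff does not show the normalization.

```python
import random


def draw_noise(rng, size):
    """Draw `size` standard-normal samples from an existing generator."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def perturb(params, seed, sigma):
    """Return a perturbed copy of params using one generator for the whole model."""
    rng = random.Random(int(seed))
    return {
        name: [v + sigma * n for v, n in zip(values, draw_noise(rng, len(values)))]
        for name, values in params.items()
    }


def es_update(params, seeds, coeffs, alpha, population_size):
    """Update params in place by replaying each seed's noise stream."""
    generators = [random.Random(int(s)) for s in seeds]  # one stream per seed
    scale = alpha / population_size  # assumed normalization
    for name, values in params.items():  # same order as in perturb()
        update = [0.0] * len(values)
        for rng, coeff in zip(generators, coeffs):
            for i, n in enumerate(draw_noise(rng, len(values))):
                update[i] += coeff * n
        for i in range(len(values)):
            values[i] += scale * update[i]


base = {"layer_a": [0.0, 0.0, 0.0], "layer_b": [0.0, 0.0, 0.0]}
evaluated = perturb(base, seed=7, sigma=1.0)

replayed = {name: list(values) for name, values in base.items()}
es_update(replayed, seeds=[7], coeffs=[1.0], alpha=1.0, population_size=1)
print("update replays the evaluated perturbation:", replayed == evaluated)
```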

gemini-code-assist · 2026-05-06T23:04:05Z

+            "expected_numbers": numbers,
+        }
+    try:
+        result = eval(expression, {"__builtins__": None}, {})


Using eval() on model-generated strings is a security risk. While the regex check on line 162 provides some protection, it is better to use a dedicated math expression parser or a more restrictive evaluation method to prevent potential code injection if the regex is bypassed.
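
One possible shape for such a restrictive evaluator is an allow-list walk over the tree produced by Python's `ast` module. The sketch below is an illustration, not code from the PR: the function name `safe_eval_arithmetic` is hypothetical, and the set of permitted operators (`+`, `-`, `*`, `/`, unary sign) is an assumption about what the Countdown task needs.

```python
import ast
import operator

# Allow-list of binary operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval_arithmetic(expression: str):
    """Evaluate +, -, *, / over numeric literals; raise ValueError otherwise."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid expression: {expression!r}") from exc

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only plain int/float literals; bool, str, and complex are rejected.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.USub, ast.UAdd)):
            value = _eval(node.operand)
            return -value if isinstance(node.op, ast.USub) else value
        # Names, calls, attributes, power, etc. all end up here.
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return _eval(tree)


print(safe_eval_arithmetic("(3 + 5) * 2 - 4"))
```

Because names, calls, and attribute access are never evaluated, the result does not depend on a regex pre-check. Division by zero still raises `ZeroDivisionError`, so a caller would need to handle that alongside `ValueError`.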

feat: add randopt recipe implementation

e58460b


sunrainyg wants to merge 1 commit into verl-project:main from sunrainyg:add-randopt-recipe
