feat: add randopt algorithm #95

Open
sunrainyg wants to merge 1 commit into verl-project:main from sunrainyg:add-randopt-recipe

@sunrainyg

What does this PR do?

This PR adds the implementation of the paper "Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights" (ICML 2026 Spotlight, arXiv:2603.12228).

RandOpt in this implementation:

  1. Samples Gaussian perturbations around a pretrained model.
  2. Evaluates perturbed models in parallel with vLLM.
  3. Selects top-performing perturbations by reward.
  4. Reports majority-vote ensemble accuracy for selected top-k sets.
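
The four steps above can be sketched on a toy problem. This is a minimal illustration only, not the recipe's code: the "model" is a two-weight linear classifier rather than an LLM served by vLLM, evaluation is sequential rather than parallel, and every name (`perturb`, `accuracy`, `majority_vote`, `randopt`, `make_split`) is hypothetical.

```python
import random
from collections import Counter


def perturb(base, seed, sigma):
    """Step 1: return base + sigma * N(0, I), reproducible from the seed alone."""
    rng = random.Random(seed)
    return [w + sigma * rng.gauss(0.0, 1.0) for w in base]


def predict(weights, x):
    """Toy stand-in for a model: a linear classifier with labels 0/1."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def accuracy(weights, data):
    """Step 2: the reward of one perturbed model is its accuracy on `data`."""
    return sum(predict(weights, x) == y for x, y in data) / len(data)


def majority_vote(members, x):
    """Step 4: most common prediction; ties go to the first-seen label."""
    votes = Counter(predict(w, x) for w in members)
    return votes.most_common(1)[0][0]


def randopt(base, train, val, population=64, sigma=0.5, top_ks=(1, 5)):
    # Steps 1 and 2: one seed per perturbation, scored on the training split.
    scored = [(accuracy(perturb(base, seed, sigma), train), seed)
              for seed in range(population)]
    # Step 3: rank seeds by reward, best first.
    scored.sort(key=lambda item: item[0], reverse=True)
    # Step 4: rebuild the top-k models from their seeds and vote on val.
    report = {}
    for k in top_ks:
        members = [perturb(base, seed, sigma) for _, seed in scored[:k]]
        correct = sum(majority_vote(members, x) == y for x, y in val)
        report[k] = correct / len(val)
    return scored, report


def make_split(rng, n):
    """Toy data: label is 1 when the first coordinate exceeds the second."""
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [(x, 1 if x[0] - x[1] > 0 else 0) for x in points]


data_rng = random.Random(0)
train, val = make_split(data_rng, 50), make_split(data_rng, 50)
scored, report = randopt([0.0, 0.0], train, val)
print(report)
```

Only seeds are stored for the population; a perturbed model is rebuilt from its seed when it is needed for the ensemble, which is why reproducible noise generation matters for this kind of method.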

Test

python3 -m randopt.run_countdown_example

Test result

train/reward_mean: 0.1045
train/reward_std: 0.0664
train/reward_min: 0.0130
train/reward_max: 0.3820
ensemble/top_10_accuracy: 42.0%
ensemble/top_50_accuracy: 64.0%

Usage from verl repo root

python3 -m recipe.randopt.main_randopt \
  model.path=Qwen/Qwen2.5-3B-Instruct \
  data.task_type=countdown \
  data.train_files=data/countdown/train.parquet \
  data.val_files=data/countdown/test.parquet \
  randopt.worker_extension_cls=recipe.randopt.worker_extension.WorkerExtension

Quick local example

python3 -m recipe.randopt.run_countdown_example

Standalone verl-recipe repo usage

python3 -m randopt.run_countdown_example
python3 -m randopt.main_randopt ...

Design & Code Changes

High-level design

This PR adds a full RandOpt pipeline for perturbation-based policy optimization with parallel rollout/evaluation and top-k majority-vote ensemble reporting.

Main Files and Responsibilities

randopt/randopt_ray_trainer.py

  • Core RandOpt training/evaluation loop.
  • Parallel perturbation evaluation with multiple vLLM engines.
  • Top-k selection and ensemble metric computation.

randopt/main_randopt.py

  • Main entrypoint for launching RandOpt training with config.

randopt/task_utils.py

  • Task-specific utilities (Countdown/parquet/custom prompt+reward plumbing).

randopt/worker_extension.py

  • Worker extension hooks used by the trainer/runtime.

randopt/config/randopt_trainer.yaml

  • Default RandOpt configuration and trainer/runtime knobs.

randopt/run_countdown_example.py

  • One-command toy data generation + smoke test path.

randopt/README.md

  • Setup, run instructions, common overrides, custom task usage, and citation.

randopt/REQUIRED_VERL.txt

  • Pinned tested verl version metadata.

Add the RandOpt training recipe, configuration, and Countdown example to support zeroth-order post-training workflows with verl and vLLM.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the RandOpt recipe, a zeroth-order post-training algorithm that samples Gaussian perturbations around pretrained models and evaluates them in parallel using Ray and vLLM. The implementation includes a dedicated Ray trainer, configuration files, and task utilities specifically for the 'countdown' task. Critical feedback was provided regarding Ray resource isolation, where the current implementation manually manipulates environment variables instead of requesting GPU resources through Ray. Additionally, a significant bug was identified in the weight perturbation logic where re-initializing random generators inside parameter loops leads to correlated noise across layers. Finally, a security concern was raised regarding the use of eval() for processing model-generated arithmetic expressions.

Comment on lines +27 to +30
def __init__(self, *args, **kwargs):
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    super().__init__(*args, **kwargs)

high

Popping CUDA_VISIBLE_DEVICES from the environment bypasses Ray's resource isolation. When combined with num_gpus=0 in the actor definition (line 99), this causes all vLLM instances on the same node to see all available GPUs, likely leading to resource contention and multiple instances attempting to use GPU 0. Instead, the actor should request the appropriate number of GPUs, allowing Ray to set the environment variables correctly.

Suggested change
- def __init__(self, *args, **kwargs):
-     os.environ.pop("CUDA_VISIBLE_DEVICES", None)
-     os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
-     super().__init__(*args, **kwargs)
+ def __init__(self, *args, **kwargs):
+     os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+     super().__init__(*args, **kwargs)

]

self.engines = [
    ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(

high

The Ray actor should request the GPUs it intends to use. By setting num_gpus=0, Ray does not set CUDA_VISIBLE_DEVICES for the actor, which necessitates the manual environment manipulation at line 28. Requesting tensor_parallel_size GPUs ensures proper isolation and environment setup by Ray.

Suggested change
- ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(
+ ray.remote(num_cpus=1, num_gpus=tensor_parallel_size, scheduling_strategy=strategy)(RandOptLLM).remote(

Comment on lines +50 to +55
for name, param in self.model_runner.model.named_parameters():
    if not self._should_perturb(name):
        continue
    generator = torch.Generator(device=param.device)
    generator.manual_seed(int(seed))
    noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)

high

Re-initializing the generator with the same seed for every parameter causes all parameters with the same shape to receive identical noise perturbations. This significantly reduces the diversity of the sampled models and does not represent a true Gaussian perturbation of the model weights. The generator should be initialized once outside the loop to ensure independent noise across layers.

Suggested change
- for name, param in self.model_runner.model.named_parameters():
-     if not self._should_perturb(name):
-         continue
-     generator = torch.Generator(device=param.device)
-     generator.manual_seed(int(seed))
-     noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+ generator = torch.Generator(device=self.device)
+ generator.manual_seed(int(seed))
+ for name, param in self.model_runner.model.named_parameters():
+     if not self._should_perturb(name):
+         continue
+     noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
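
The effect of re-seeding inside the loop can be reproduced without a GPU. The sketch below uses `random.Random` from the standard library as a stand-in for `torch.Generator`; the parameter table `PARAM_SIZES` and both function names are hypothetical and only illustrate the seeding pattern, not the worker extension itself.

```python
import random

# Hypothetical toy "model": parameter name -> number of elements.
PARAM_SIZES = {"layer0.weight": 4, "layer1.weight": 4, "layer2.bias": 2}


def noise_reseeded_per_param(seed):
    """Pattern flagged in the review: a fresh generator, same seed, per parameter."""
    out = {}
    for name, size in PARAM_SIZES.items():
        rng = random.Random(seed)  # re-created inside the loop
        out[name] = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return out


def noise_seeded_once(seed):
    """Suggested pattern: one generator, advanced across all parameters."""
    rng = random.Random(seed)  # created once, before the loop
    out = {}
    for name, size in PARAM_SIZES.items():
        out[name] = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return out


buggy = noise_reseeded_per_param(123)
fixed = noise_seeded_once(123)

# Same-shaped parameters get identical noise when the generator is re-seeded.
print(buggy["layer0.weight"] == buggy["layer1.weight"])
# With a single generator the two layers draw from different parts of the stream.
print(fixed["layer0.weight"] == fixed["layer1.weight"])
```

Seeding once keeps the perturbation reproducible from the seed alone, provided the parameters are always visited in the same order.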

Comment on lines +69 to +74
for name, param in self.model_runner.model.named_parameters():
    if not self._should_perturb(name):
        continue
    generator = torch.Generator(device=param.device)
    generator.manual_seed(int(seed))
    noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)

high

Similar to perturb_self_weights, the generator should be initialized once outside the loop to avoid correlated noise across parameters.

Suggested change
- for name, param in self.model_runner.model.named_parameters():
-     if not self._should_perturb(name):
-         continue
-     generator = torch.Generator(device=param.device)
-     generator.manual_seed(int(seed))
-     noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+ generator = torch.Generator(device=self.device)
+ generator.manual_seed(int(seed))
+ for name, param in self.model_runner.model.named_parameters():
+     if not self._should_perturb(name):
+         continue
+     noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)

Comment on lines +83 to +93
def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
    """Apply a normalized ES update on the current worker."""
    for name, param in self.model_runner.model.named_parameters():
        if not self._should_perturb(name):
            continue

        update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
        for seed, coeff in zip(seeds, coeffs, strict=False):
            generator = torch.Generator(device=param.device)
            generator.manual_seed(int(seed))
            noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)

high

In update_weights_from_seeds, a generator is re-created and re-seeded for every (parameter, seed) pair, so for a given seed all parameters with the same shape receive identical noise. Generators should be initialized once per seed, before the parameter loop, and reused across parameters.

Suggested change
- def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
-     """Apply a normalized ES update on the current worker."""
-     for name, param in self.model_runner.model.named_parameters():
-         if not self._should_perturb(name):
-             continue
-         update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
-         for seed, coeff in zip(seeds, coeffs, strict=False):
-             generator = torch.Generator(device=param.device)
-             generator.manual_seed(int(seed))
-             noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+ def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
+     """Apply a normalized ES update on the current worker."""
+     generators = []
+     for s in seeds:
+         g = torch.Generator(device=self.device)
+         g.manual_seed(int(s))
+         generators.append(g)
+     for name, param in self.model_runner.model.named_parameters():
+         if not self._should_perturb(name):
+             continue
+         update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
+         for g, coeff in zip(generators, coeffs, strict=False):
+             noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=g)
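
One consequence of the per-seed generator pattern is that the update only reconstructs the noise used during perturbation if both code paths visit the parameters in the same order. The sketch below checks that property on toy data, again with `random.Random` standing in for `torch.Generator`. The names `PARAMS`, `sample_noise`, and `es_update` are hypothetical, and the scaling `alpha / len(seeds)` is an assumption made for illustration; the normalization used by the recipe is not shown in this thread.

```python
import random

# Hypothetical toy parameters; a real worker would iterate named_parameters().
PARAMS = {"w1": [0.0, 0.0, 0.0], "w2": [0.0, 0.0, 0.0]}


def sample_noise(seed, params):
    """Perturbation path: one generator per seed, advanced in parameter order."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in values]
            for name, values in params.items()}


def es_update(params, seeds, coeffs, alpha):
    """Update path: param += alpha / len(seeds) * sum_i coeff_i * noise_i (assumed form)."""
    # Generators are created once per seed, before the parameter loop.
    generators = [random.Random(s) for s in seeds]
    updated = {}
    for name, values in params.items():
        acc = [0.0] * len(values)
        for rng, coeff in zip(generators, coeffs):
            for j in range(len(values)):
                acc[j] += coeff * rng.gauss(0.0, 1.0)
        updated[name] = [v + alpha / len(seeds) * a for v, a in zip(values, acc)]
    return updated


new_params = es_update(PARAMS, seeds=[1, 2], coeffs=[1.0, -1.0], alpha=0.5)
print(new_params)
```

Because each seed owns its own generator, interleaving the seeds inside the parameter loop does not disturb any individual noise stream.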

Comment thread randopt/task_utils.py
    "expected_numbers": numbers,
}
try:
    result = eval(expression, {"__builtins__": None}, {})

medium

Using eval() on model-generated strings is a security risk. While the regex check on line 162 provides some protection, it is better to use a dedicated math expression parser or a more restrictive evaluation method to prevent potential code injection if the regex is bypassed.
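
One possible shape for a more restrictive evaluator is a small walker over Python's `ast` module that accepts only numeric literals and the four basic arithmetic operators. This is a sketch, not the code in `randopt/task_utils.py`; the function names are hypothetical, and the set of allowed operators is an assumption about what Countdown expressions need.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, attributes, **) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arithmetic_eval(expression):
    """Evaluate +, -, *, / over numeric literals; raise ValueError otherwise."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid expression: {expression!r}") from exc
    return _eval_node(tree.body)


def _eval_node(node):
    # Numeric literals only; bool is excluded by the exact type check.
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


print(safe_arithmetic_eval("(3 + 5) * 2 - 4"))
```

Division by zero still raises `ZeroDivisionError`, so a caller would catch it alongside `ValueError`, much as the existing `try` block presumably does around `eval()`.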
