feat: add randopt algorithm #95
Add the RandOpt training recipe, configuration, and Countdown example to support zeroth-order post-training workflows with verl and vLLM. Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review
This pull request introduces the RandOpt recipe, a zeroth-order post-training algorithm that samples Gaussian perturbations around pretrained models and evaluates them in parallel using Ray and vLLM. The implementation includes a dedicated Ray trainer, configuration files, and task utilities specifically for the 'countdown' task. Critical feedback was provided regarding Ray resource isolation, where the current implementation manually manipulates environment variables instead of requesting GPU resources through Ray. Additionally, a significant bug was identified in the weight perturbation logic where re-initializing random generators inside parameter loops leads to correlated noise across layers. Finally, a security concern was raised regarding the use of eval() for processing model-generated arithmetic expressions.
    def __init__(self, *args, **kwargs):
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
        super().__init__(*args, **kwargs)
Popping CUDA_VISIBLE_DEVICES from the environment bypasses Ray's resource isolation. When combined with num_gpus=0 in the actor definition (line 99), this causes all vLLM instances on the same node to see all available GPUs, likely leading to resource contention and multiple instances attempting to use GPU 0. Instead, the actor should request the appropriate number of GPUs, allowing Ray to set the environment variables correctly.
Suggested change:
-   def __init__(self, *args, **kwargs):
-       os.environ.pop("CUDA_VISIBLE_DEVICES", None)
-       os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
-       super().__init__(*args, **kwargs)
+   def __init__(self, *args, **kwargs):
+       os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+       super().__init__(*args, **kwargs)
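To make the isolation concern concrete, here is a small stdlib-only sketch of how a CUDA process interprets `CUDA_VISIBLE_DEVICES`. The helper `visible_gpus`, the four-GPU node, and the per-actor assignments are all hypothetical; the model only captures the usual CUDA convention (unset means every device is visible, a comma-separated list restricts them) and says nothing about Ray or vLLM internals.

```python
def visible_gpus(env: dict, physical_gpu_count: int) -> list:
    """Return the physical GPU indices a CUDA process would see (toy model)."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Variable unset: every physical GPU on the node is visible.
        return list(range(physical_gpu_count))
    if value.strip() == "":
        # Variable set but empty: no GPU is visible.
        return []
    return [int(token) for token in value.split(",")]


# Two hypothetical actors on a 4-GPU node, each assigned one GPU by the scheduler.
actor_envs = [{"CUDA_VISIBLE_DEVICES": "0"}, {"CUDA_VISIBLE_DEVICES": "1"}]
isolated = [visible_gpus(env, 4) for env in actor_envs]

# Mimic the reviewed __init__: each actor pops the variable before starting.
for env in actor_envs:
    env.pop("CUDA_VISIBLE_DEVICES", None)
after_pop = [visible_gpus(env, 4) for env in actor_envs]

print(isolated)   # [[0], [1]]
print(after_pop)  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```

After the pop, both toy actors see the same four devices, which is the contention scenario the comment describes.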
    ]

    self.engines = [
        ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(
The Ray actor should request the GPUs it intends to use. By setting num_gpus=0, Ray does not set CUDA_VISIBLE_DEVICES for the actor, which necessitates the manual environment manipulation at line 28. Requesting tensor_parallel_size GPUs ensures proper isolation and environment setup by Ray.
Suggested change:
-   ray.remote(num_cpus=0, num_gpus=0, scheduling_strategy=strategy)(RandOptLLM).remote(
+   ray.remote(num_cpus=1, num_gpus=tensor_parallel_size, scheduling_strategy=strategy)(RandOptLLM).remote(
    for name, param in self.model_runner.model.named_parameters():
        if not self._should_perturb(name):
            continue
        generator = torch.Generator(device=param.device)
        generator.manual_seed(int(seed))
        noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
Re-initializing the generator with the same seed for every parameter causes all parameters with the same shape to receive identical noise perturbations. This significantly reduces the diversity of the sampled models and does not represent a true Gaussian perturbation of the model weights. The generator should be initialized once outside the loop to ensure independent noise across layers.
Suggested change:
-   for name, param in self.model_runner.model.named_parameters():
-       if not self._should_perturb(name):
-           continue
-       generator = torch.Generator(device=param.device)
-       generator.manual_seed(int(seed))
-       noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+   generator = torch.Generator(device=self.device)
+   generator.manual_seed(int(seed))
+   for name, param in self.model_runner.model.named_parameters():
+       if not self._should_perturb(name):
+           continue
+       noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
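The effect of re-seeding inside the loop can be reproduced without a GPU. The sketch below uses Python's `random.Random` as a stand-in for `torch.Generator`; the parameter names and sizes are hypothetical, and the point is only the seeding pattern, not the actual perturbation code in this PR.

```python
import random

# Hypothetical stand-ins for model parameters: name -> number of elements.
param_sizes = {"layer0.weight": 4, "layer1.weight": 4, "layer2.weight": 4}


def noise_reseeded(seed: int) -> dict:
    """Reviewed pattern: a fresh generator with the same seed for each parameter."""
    out = {}
    for name, size in param_sizes.items():
        rng = random.Random(seed)  # re-created on every iteration
        out[name] = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return out


def noise_shared(seed: int) -> dict:
    """Suggested pattern: one generator, seeded once, advanced across parameters."""
    rng = random.Random(seed)
    out = {}
    for name, size in param_sizes.items():
        out[name] = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return out


buggy = noise_reseeded(seed=123)
fixed = noise_shared(seed=123)

print(buggy["layer0.weight"] == buggy["layer1.weight"])  # True: identical noise
print(fixed["layer0.weight"] == fixed["layer1.weight"])  # False: independent draws
print(noise_shared(123) == fixed)                        # True: still reproducible
```

Seeding once keeps the perturbation reproducible from the seed alone, provided the parameters are always visited in the same order.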
    for name, param in self.model_runner.model.named_parameters():
        if not self._should_perturb(name):
            continue
        generator = torch.Generator(device=param.device)
        generator.manual_seed(int(seed))
        noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
Similar to perturb_self_weights, the generator should be initialized once outside the loop to avoid correlated noise across parameters.
Suggested change:
-   for name, param in self.model_runner.model.named_parameters():
-       if not self._should_perturb(name):
-           continue
-       generator = torch.Generator(device=param.device)
-       generator.manual_seed(int(seed))
-       noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+   generator = torch.Generator(device=self.device)
+   generator.manual_seed(int(seed))
+   for name, param in self.model_runner.model.named_parameters():
+       if not self._should_perturb(name):
+           continue
+       noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
    def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
        """Apply a normalized ES update on the current worker."""
        for name, param in self.model_runner.model.named_parameters():
            if not self._should_perturb(name):
                continue

            update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
            for seed, coeff in zip(seeds, coeffs, strict=False):
                generator = torch.Generator(device=param.device)
                generator.manual_seed(int(seed))
                noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
In update_weights_from_seeds, generators are re-initialized for every parameter and every seed, leading to identical noise across all parameters for a given seed. Generators should be pre-initialized once for each seed before the parameter loop.
Suggested change:
-   def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
-       """Apply a normalized ES update on the current worker."""
-       for name, param in self.model_runner.model.named_parameters():
-           if not self._should_perturb(name):
-               continue
-           update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
-           for seed, coeff in zip(seeds, coeffs, strict=False):
-               generator = torch.Generator(device=param.device)
-               generator.manual_seed(int(seed))
-               noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=generator)
+   def update_weights_from_seeds(self, seeds, coeffs, alpha, population_size):
+       """Apply a normalized ES update on the current worker."""
+       generators = []
+       for s in seeds:
+           g = torch.Generator(device=self.device)
+           g.manual_seed(int(s))
+           generators.append(g)
+       for name, param in self.model_runner.model.named_parameters():
+           if not self._should_perturb(name):
+               continue
+           update_accumulator = torch.zeros_like(param.data, dtype=torch.float32)
+           for g, coeff in zip(generators, coeffs, strict=False):
+               noise = torch.randn(param.shape, dtype=param.dtype, device=param.device, generator=g)
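The suggestion only works if the update regenerates exactly the noise each population member was perturbed with. The stdlib sketch below checks that property, again with `random.Random` standing in for `torch.Generator`. The parameter dictionary, the helper names, and the `alpha / population_size` scaling are assumptions made for illustration; the PR's actual normalization is not shown in the excerpt above.

```python
import random

# Hypothetical parameters as flat lists; iteration order must stay fixed.
params = {"w0": [0.0, 0.0, 0.0], "w1": [0.0, 0.0]}


def sample_noise(seed: int) -> dict:
    """Noise for one population member: one generator advanced over all parameters."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in values] for name, values in params.items()}


def update_from_seeds(seeds, coeffs, alpha, population_size) -> dict:
    """Regenerate each member's noise from its seed and apply a weighted update."""
    generators = [random.Random(s) for s in seeds]  # one generator per seed, built once
    regenerated = {s: {} for s in seeds}
    for name, values in params.items():
        accumulator = [0.0] * len(values)
        for seed, rng, coeff in zip(seeds, generators, coeffs):
            noise = [rng.gauss(0.0, 1.0) for _ in values]
            regenerated[seed][name] = noise
            accumulator = [a + coeff * n for a, n in zip(accumulator, noise)]
        scale = alpha / population_size  # assumed normalization, for illustration only
        params[name] = [v + scale * a for v, a in zip(values, accumulator)]
    return regenerated


seeds, coeffs = [1, 2], [1.0, -1.0]
regenerated = update_from_seeds(seeds, coeffs, alpha=0.1, population_size=2)

# The update saw the same per-parameter noise that perturbation would sample.
print(all(regenerated[s] == sample_noise(s) for s in seeds))  # True
```

Because each per-seed generator is advanced through the parameters in the same order as in `sample_noise`, the regenerated noise matches parameter by parameter.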
        "expected_numbers": numbers,
    }
    try:
        result = eval(expression, {"__builtins__": None}, {})
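The review summary flags this `eval()` call as a security concern. One common alternative for arithmetic-only input is to walk the parsed AST and allow a fixed set of node types. The sketch below is a minimal illustration of that idea, not the fix adopted in this PR; `safe_arithmetic_eval` and the operator whitelist are hypothetical.

```python
import ast
import operator

# Whitelisted binary operators; anything else (e.g. ** or attribute access) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_arithmetic_eval(expression: str):
    """Evaluate +, -, *, / over numeric literals; raise ValueError on anything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
            value = _eval(node.operand)
            return value if isinstance(node.op, ast.UAdd) else -value
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_arithmetic_eval("(3 + 5) * 2 - 4"))  # 12
try:
    safe_arithmetic_eval("().__class__.__bases__")
except ValueError as exc:
    print("rejected:", exc)  # rejected: unsupported syntax: Attribute
```

Callers would still need to handle `SyntaxError` for malformed input and `ZeroDivisionError` for division by zero, much as the existing `try` block presumably does for `eval()`.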
What does this PR do?
This PR adds the implementation of the paper "Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights" (ICML 2026 Spotlight, arXiv:2603.12228).
RandOpt in this implementation:
Test
Test result
Usage from verl repo root
Quick local example
Design & Code Changes
High-level design
This PR adds a full RandOpt pipeline for perturbation-based policy optimization with parallel rollout/evaluation and top-k majority-vote ensemble reporting.
Main Files and Responsibilities
randopt/randopt_ray_trainer.py
randopt/main_randopt.py
randopt/task_utils.py
randopt/worker_extension.py
randopt/config/randopt_trainer.yaml
randopt/run_countdown_example.py
randopt/README.md
randopt/REQUIRED_VERL.txt: verl version metadata.