[feat] Decoupled Inference and Reward Computation via Reward Queue by Silypie · Pull Request #113 · verl-project/verl-recipe

Silypie · 2026-06-17T09:47:57Z

Motivation

In the fully asynchronous training pipeline of VERL (verl.experimental.fully_async_policy), inference (generation) and reward computation are traditionally tightly coupled in a sequential pipeline:

Generation → Wait for Completion → Reward Computation → Training

┌─────────────────────────────────────────────────┐
│  For each Batch:                                │
│       Generation + Reward Computation           │
│            ↓ (Wait for batch completion)        │
│  Training                                       │
└─────────────────────────────────────────────────┘

This coupling creates a critical performance bottleneck: when reward computation is slow (e.g., due to external LLM-based judges, complex scoring functions, or network latency), the GPU sits idle waiting for scores to be ready, wasting expensive compute resources.

Decoupling inference from reward computation lets the two stages overlap so that each hides the other's latency. In the ideal case, where both stages take about the same time, total time approaches half of the original.

The Reward Queue feature decouples inference from reward computation by introducing an intermediate queue between the two stages, enabling concurrent execution:

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Generation    │────▶│ RewardQueue  │────▶│ Reward Compute  │
│   (async)       │     │              │     │ (concurrent)    │
└─────────────────┘     └──────────────┘     └─────────────────┘

This allows generation and scoring to overlap in time, maximizing GPU utilization and throughput.
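
The "half of the original" figure can be checked with a simple timing model. The sketch below is a back-of-the-envelope estimate, not a measurement from this PR: it assumes fixed per-batch generation and reward times and a two-stage pipeline, and the function names are made up for illustration.

```python
def sequential_time(n_batches: int, gen_s: float, reward_s: float) -> float:
    """Coupled pipeline: every batch pays generation plus reward, back to back."""
    return n_batches * (gen_s + reward_s)


def overlapped_time(n_batches: int, gen_s: float, reward_s: float) -> float:
    """Two-stage pipeline: after the first batch fills the pipe, the slower
    stage sets the pace, and the last batch still has to drain."""
    if n_batches == 0:
        return 0.0
    return gen_s + reward_s + (n_batches - 1) * max(gen_s, reward_s)


# Equal stage times: 100 batches at 2 s generation and 2 s reward.
coupled = sequential_time(100, 2.0, 2.0)      # 400.0 s
decoupled = overlapped_time(100, 2.0, 2.0)    # 202.0 s
ratio = decoupled / coupled                   # about 0.505
```

When one stage dominates, the gain shrinks: the overlapped time is bounded below by the slower stage, so the 2x figure is a best case.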

Proposed Design

Design Overview

This architecture introduces a RewardQueue as a central decoupling mechanism between inference and reward computation, enabling fully asynchronous processing across the reinforcement learning pipeline.

Core Pipeline

Sampling & Inference — The rollouter continuously feeds batches from the DataLoader. For each batch, sub-items are dispatched for async generation and immediately buffered into the RewardQueue without waiting for reward results.

Reward Computation — A dedicated consumer worker pulls sub-items from the RewardQueue and distributes them across a pool of reward workers for concurrent scoring. Backpressure is applied when needed to prevent resource exhaustion.

Aggregation & Training — Scored sub-items flow back to the aggregator, which assembles complete samples once all sub-items arrive. These are then published to a MessageQueue for the trainer to consume and process.

Key Design Points

- Temporal decoupling: Inference output and reward computation run at their own pace via the queue buffer
- Concurrent scoring: Multiple reward workers score sub-items in parallel, throttled by a concurrency limit
- Backpressure control: The consumer can pause/resume scoring based on system load
- Clean handoff boundary: MessageQueue separates rollouter and trainer execution domains

This design eliminates the traditional blocking pattern where inference waits for reward computation to complete, significantly improving pipeline throughput.
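
The design points above can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not the PR's implementation: asyncio.Queue stands in for the Ray-based RewardQueue, the generate and score coroutines are placeholders, and a full queue blocks the producer here, whereas the real queue may instead drop samples on overflow.

```python
import asyncio


async def run_pipeline(prompts, max_concurrent_rewards=2, queue_size=4):
    """Generate and score concurrently, joined only by a bounded queue."""
    queue = asyncio.Queue(maxsize=queue_size)          # hypothetical stand-in for RewardQueue
    limit = asyncio.Semaphore(max_concurrent_rewards)  # concurrency limit on scoring
    scored = {}

    async def generate(prompt):
        await asyncio.sleep(0)  # placeholder for rollout generation
        return prompt.upper()

    async def score(item_id, response):
        async with limit:
            await asyncio.sleep(0)  # placeholder for a slow reward call
            scored[item_id] = float(len(response))

    async def producer():
        for item_id, prompt in enumerate(prompts):
            response = await generate(prompt)
            # Hand off and move on; generation never waits for a score.
            await queue.put((item_id, response))
        await queue.put(None)  # sentinel: no more items

    async def consumer():
        tasks = []
        while True:
            item = await queue.get()
            if item is None:
                break
            tasks.append(asyncio.create_task(score(*item)))
        await asyncio.gather(*tasks)

    await asyncio.gather(producer(), consumer())
    return scored


result = asyncio.run(run_pipeline(["a", "bcd"]))
```

The producer and consumer share nothing but the queue, which is what lets each side run at its own pace.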

Data Flow

Phase 1: Sample Feeding

FullyAsyncRollouter._feed_samples() iterates over the DataLoader.
It creates a RolloutSample for each batch and puts it into pending_queue.

Phase 2: Inference and Queue Production

_processor_worker() processes samples from pending_queue.
When enable_reward_queue=True, it calls _process_sample_with_reward_queue(), which:
- For each sub-item in the batch, launches async generation via generate_single_for_reward_queue()
- Creates a SubRewardDataItem with inference timing metadata
- Puts it into the RewardQueue via reward_queue_client.put_sample()

Phase 3: Reward Computation (Consumer)

_reward_consumer_worker() continuously:
- Checks if scoring should pause (_should_pause_scoring())
- Gets SubRewardDataItem from reward_queue_client.get_sample()
- Submits reward computation via reward_loop_worker.compute_score.remote()
- Limits concurrent reward tasks via max_concurrent_rewards
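
The consumer loop described above might look roughly like the following. It is a sketch only: the pause check is modeled as an asyncio.Event, score_fn stands in for the remote compute_score call, and a None sentinel ends the loop, none of which is confirmed by the PR.

```python
import asyncio


async def consume(queue, score_fn, resume_event, max_concurrent_rewards):
    """Pull items, pause on demand, and cap the number of in-flight scores."""
    in_flight = set()
    results = []
    while True:
        await resume_event.wait()      # blocks while scoring is paused
        item = await queue.get()
        if item is None:               # assumed sentinel: producer is done
            break
        in_flight.add(asyncio.create_task(score_fn(item)))
        if len(in_flight) >= max_concurrent_rewards:
            # Wait for at least one score before accepting more work.
            done, in_flight = await asyncio.wait(
                in_flight, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
    if in_flight:
        done, _ = await asyncio.wait(in_flight)
        results.extend(task.result() for task in done)
    return results
```

Clearing resume_event pauses the loop before the next item is taken; setting it again resumes scoring without losing queued items.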

Phase 4: Aggregation and Finalization

SampleAggregator.add_scored_item() accumulates scored sub-items.
When all sub-items for a sample are collected, _finalize_sample():
- Builds the rm_scores tensor with each score at the last valid position
- Adds reward timing metadata to the batch
- Creates a RolloutSample and puts it into the MessageQueue (for the Trainer)
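
Placing a scalar score at the last valid position can be illustrated without tensors. The sketch below uses nested lists and assumes a right-padded response_mask whose ones are contiguous from position 0; the helper name is hypothetical and the PR's own code operates on torch tensors.

```python
def build_rm_scores(response_mask, scores):
    """Return a zero matrix shaped like response_mask, with each row's score
    written at the index of that row's last valid (mask == 1) token."""
    rm_scores = [[0.0] * len(row) for row in response_mask]
    for row_idx, (mask_row, score) in enumerate(zip(response_mask, scores)):
        valid_len = sum(mask_row)      # number of real response tokens
        if valid_len > 0:              # skip fully padded rows
            rm_scores[row_idx][valid_len - 1] = float(score)
    return rm_scores


# Row 0 has two valid tokens, row 1 has three.
rm = build_rm_scores([[1, 1, 0], [1, 1, 1]], [0.5, 2.0])
# rm == [[0.0, 0.5, 0.0], [0.0, 0.0, 2.0]]
```

Note the index is valid_len - 1, not valid_len; an off-by-one here would write into padding or past the end of the row.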

Phase 5: Training

FullyAsyncTrainer._get_samples_from_queue() retrieves samples from the MessageQueue.
The trainer then calls assemble_batch_from_rollout_samples() with enable_reward_queue=True.
Finally, it processes reward timing metadata for metrics collection.

Key Data Structures

RewardQueue (reuses MessageQueue)

def create_reward_queue(config: DictConfig, max_queue_size: int = 1000):
    return MessageQueue.remote(config, max_queue_size, name="RewardQueue")

SampleAggregator

class SampleAggregator:
    def add_scored_item(self, sample_id, total_count, epoch, scored_item) -> bool:
        """Add a scored sub-item; return True once all sub-items for the sample are collected."""
        ...

    def get_and_remove(self, sample_id) -> "_AggregationGroup":
        """Return the complete aggregation group and remove it from the aggregator."""
        ...
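
A minimal in-memory version of this interface could look like the following. It is an illustrative sketch, not the PR's class: MiniSampleAggregator and AggregationGroup are hypothetical names, and the real aggregator likely tracks more state (timing metadata, ordering of sub-items).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AggregationGroup:
    """All scored sub-items collected so far for one sample."""
    total_count: int
    epoch: int
    items: List[Any] = field(default_factory=list)


class MiniSampleAggregator:
    def __init__(self) -> None:
        self._groups: Dict[str, AggregationGroup] = {}

    def add_scored_item(self, sample_id, total_count, epoch, scored_item) -> bool:
        """Store one scored sub-item; True means the sample is complete."""
        group = self._groups.setdefault(sample_id, AggregationGroup(total_count, epoch))
        group.items.append(scored_item)
        return len(group.items) >= group.total_count

    def get_and_remove(self, sample_id) -> AggregationGroup:
        """Hand back the finished group and forget it."""
        return self._groups.pop(sample_id)

    @property
    def pending_groups_count(self) -> int:
        """Number of samples still waiting for sub-items."""
        return len(self._groups)


agg = MiniSampleAggregator()
agg.add_scored_item("s0", total_count=2, epoch=0, scored_item="a")  # False
agg.add_scored_item("s0", total_count=2, epoch=0, scored_item="b")  # True
group = agg.get_and_remove("s0")
```

Callers are expected to call get_and_remove only after add_scored_item returns True; otherwise they receive a partial group.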

Configuration

async_training:
  # Enable reward queue feature (decouples inference from reward computation)
  enable_reward_queue: false

  # Maximum reward queue size (only effective when enable_reward_queue=true)
  # null means use default: max_required_samples * rollout_n
  reward_queue_size: null

Where:

max_required_samples = ppo_mini_batch_size * require_batches * (1 + staleness_threshold) * trigger_parameter_sync_step
rollout_n = actor_rollout_ref.rollout.n (number of responses per prompt)
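
As a worked example of the default size, the formula can be evaluated directly. The helper below is hypothetical and the parameter values are made up; how the PR rounds a fractional result is not stated, so truncation to int is an assumption.

```python
def default_reward_queue_size(
    ppo_mini_batch_size: int,
    require_batches: int,
    staleness_threshold: float,
    trigger_parameter_sync_step: int,
    rollout_n: int,
) -> int:
    """Default used when reward_queue_size is null: max_required_samples * rollout_n."""
    max_required_samples = (
        ppo_mini_batch_size
        * require_batches
        * (1 + staleness_threshold)
        * trigger_parameter_sync_step
    )
    return int(max_required_samples * rollout_n)  # truncation is an assumption


# 32 * 1 * (1 + 0.5) * 4 = 192 samples, times 8 responses per prompt = 1536 sub-items.
size = default_reward_queue_size(32, 1, 0.5, 4, 8)
```

The queue is sized in sub-items (one per response), which is why rollout_n multiplies the sample count.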

Monitoring Metrics

The reward queue exports the following metrics:

| Metric | Description |
| --- | --- |
| `monitor/queue/reward_queue_size` | Current reward queue size |
| `reward_queue/total_produced` | Total items produced to queue |
| `reward_queue/total_consumed` | Total items consumed from queue |
| `reward_queue/dropped_samples` | Samples dropped due to queue overflow |
| `static/max_reward_queue_size` | Maximum configured queue size |
| `timing_s/reward_compute/mean` | Mean reward computation time |
| `timing_s/reward_compute/max` | Max reward computation time |
| `timing_s/reward_compute/tp95` | 95th percentile reward computation time |
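
For reference, the three timing metrics can be derived from a list of per-item reward durations as sketched below. The nearest-rank percentile used here is one common definition and an assumption; the PR may compute tp95 differently (for example with interpolation).

```python
def summarize_reward_timings(timings_s):
    """Return mean, max, and nearest-rank 95th percentile of reward durations."""
    if not timings_s:
        return {}
    ordered = sorted(timings_s)
    n = len(ordered)
    rank = -(-95 * n // 100)  # ceil(0.95 * n) using integer arithmetic
    return {
        "timing_s/reward_compute/mean": sum(ordered) / n,
        "timing_s/reward_compute/max": ordered[-1],
        "timing_s/reward_compute/tp95": ordered[rank - 1],
    }


# Twenty items taking 1 s through 20 s.
stats = summarize_reward_timings([float(t) for t in range(1, 21)])
# mean 10.5, max 20.0, tp95 19.0
```

A large gap between tp95 and mean is the "variable reward latency" case, where the queue helps most.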

Use Cases

- External LLM Judges: When reward computation involves calling external LLM APIs (e.g., for LLM-as-a-Judge scoring), network latency can be significant. The reward queue allows generation to continue while waiting for API responses.
- Complex Scoring Functions: Multi-step reward computation pipelines with multiple model calls benefit from overlapping generation with scoring.
- Variable Reward Latency: When reward computation time varies significantly across samples, the queue buffers fast results while waiting for slow ones.
- Throughput Optimization: GPU utilization stays high because either generation or scoring remains active even when the other is blocked.

gemini-code-assist

Code Review

This pull request introduces the 'Reward Queue' feature to decouple inference from reward computation in VERL's fully asynchronous training pipeline, adding several components including a custom rollouter, trainer, agent loop worker, and sample aggregator. The code review identified several critical issues, including potential AttributeError crashes due to uninitialized managers or clients, underestimation of staleness samples by ignoring the reward queue, and a severe concurrency bottleneck caused by holding a lock during an asynchronous wait. Additionally, the feedback highlights potential device mismatches on GPU, deprecated asyncio.wait usage, and several potential KeyError or IndexError exceptions when handling empty collections or missing keys.

gemini-code-assist · 2026-06-17T09:50:31Z

+            self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
+            self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)


At the time set_max_required_samples is called in main.py, self.llm_server_manager has not yet been initialized (it is created asynchronously in _init_async_rollout_manager which is called during fit). Accessing self.llm_server_manager.get_replicas() here will raise an AttributeError and crash the application on startup. We should guard this call and defer the computation of max_concurrent_samples until llm_server_manager is actually initialized.

Suggested change

-            self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
-            self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)
+            if hasattr(self, "llm_server_manager") and self.llm_server_manager is not None:
+                self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
+                self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)
+            else:
+                self.max_concurrent_samples = self.max_required_samples

gemini-code-assist · 2026-06-17T09:50:31Z

+            self._resume_event.set()
+            self._scoring_resume_event.set()
+            # every time param change, reset staleness_samples
+            self.staleness_samples = len(self.active_tasks) + await self.message_queue_client.get_queue_size()


When enable_reward_queue is enabled, the staleness calculation completely ignores the samples currently in the reward queue and those currently being aggregated. This leads to an underestimation of staleness_samples, which can violate the PPO staleness threshold and cause training instability. We should include the in-flight samples from the reward queue and aggregator in the staleness count.

            in_flight_samples = len(self.active_tasks)
            if self.enable_reward_queue and self.reward_queue_client:
                rq_size = await self.reward_queue_client.get_queue_size()
                in_flight_samples += int(rq_size / (self.rollout_n or 1))
                in_flight_samples += self.sample_aggregator.pending_groups_count
            self.staleness_samples = in_flight_samples + await self.message_queue_client.get_queue_size()
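
The arithmetic behind this count can be checked in isolation. The function below is a hypothetical, synchronous restatement of the suggestion with plain integers in place of queue calls. It assumes each sample contributes rollout_n sub-items to the reward queue, and it may double count a sample whose sub-items are split between the queue and the aggregator.

```python
def estimate_staleness_samples(
    active_tasks: int,
    reward_queue_items: int,
    rollout_n: int,
    pending_groups: int,
    message_queue_size: int,
) -> int:
    """Count samples generated under old parameters that the trainer has not consumed."""
    per_sample = rollout_n or 1                        # guard against rollout_n == 0 or None
    queued_samples = reward_queue_items // per_sample  # sub-items converted to whole samples
    return active_tasks + queued_samples + pending_groups + message_queue_size


# 3 active tasks, 16 queued sub-items at 8 per sample, 2 partial groups, 5 finished samples.
count = estimate_staleness_samples(3, 16, 8, 2, 5)  # 3 + 2 + 2 + 5 = 12
```

Without the two middle terms, the same state would be reported as 8 stale samples instead of 12.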

gemini-code-assist · 2026-06-17T09:50:31Z

+            while len(self.active_tasks) >= self.max_concurrent_samples:
+                async with self.lock:
+                    if self.active_tasks:
+                        done_tasks, self.active_tasks = await asyncio.wait(
+                            self.active_tasks, return_when=asyncio.FIRST_COMPLETED
+                        )
+                        for task in done_tasks:
+                            await task
+


Holding self.lock while awaiting asyncio.wait is a highly problematic practice in asyncio. Since asyncio.wait blocks until a long-running generation task completes, holding the lock during this time will block other critical operations (such as reset_staleness called by the trainer) from acquiring the lock, causing a severe performance bottleneck or potential hangs. We should await the tasks outside the lock, and let the task completion callbacks handle cleanup naturally.

            while len(self.active_tasks) >= self.max_concurrent_samples:
                await asyncio.wait(self.active_tasks, return_when=asyncio.FIRST_COMPLETED)
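
The pattern of waiting outside the lock and letting completion callbacks prune the task set can be sketched as follows. TaskLimiter is a hypothetical class for illustration; the PR's rollouter has more state, and this sketch assumes the done callback is the only place tasks are removed.

```python
import asyncio


class TaskLimiter:
    """Cap in-flight tasks without holding a lock across an await."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent
        self.active_tasks = set()

    async def submit(self, coro):
        # No lock is held here, so other coroutines can still inspect or
        # reset shared state while this one waits for a free slot.
        while len(self.active_tasks) >= self.max_concurrent:
            await asyncio.wait(self.active_tasks, return_when=asyncio.FIRST_COMPLETED)
        task = asyncio.create_task(coro)
        self.active_tasks.add(task)
        # The callback removes the task on completion, so the wait loop
        # above sees a smaller set on its next check.
        task.add_done_callback(self.active_tasks.discard)
        return task


async def demo():
    limiter = TaskLimiter(max_concurrent=2)
    running = 0
    peak = 0

    async def job(value):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return value

    tasks = [await limiter.submit(job(i)) for i in range(5)]
    return await asyncio.gather(*tasks), peak
```

Running demo() should never observe more than two jobs at once. If cleanup were left to the waiting coroutine instead of a callback, the set returned by asyncio.wait would have to be assigned back to active_tasks.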

gemini-code-assist · 2026-06-17T09:50:32Z

+        rm_scores = torch.zeros_like(response_mask, dtype=torch.float32)
+        rm_scores[torch.arange(response_mask.size(0)), valid_response_length] = torch.tensor(
+            scores, dtype=torch.float32
+        )


If response_mask is on a GPU device, torch.arange and torch.tensor (which default to CPU) will cause a device mismatch error or slow CPU-GPU synchronization when indexing and assigning values. We should explicitly specify the device of the source tensors to match response_mask.device.

Suggested change

-        rm_scores = torch.zeros_like(response_mask, dtype=torch.float32)
-        rm_scores[torch.arange(response_mask.size(0)), valid_response_length] = torch.tensor(
-            scores, dtype=torch.float32
-        )
+        device = response_mask.device
+        rm_scores = torch.zeros_like(response_mask, dtype=torch.float32, device=device)
+        rm_scores[torch.arange(response_mask.size(0), device=device), valid_response_length] = torch.tensor(
+            scores, dtype=torch.float32, device=device
+        )

gemini-code-assist · 2026-06-17T09:50:32Z

+            done, pending = await asyncio.wait(
+                tasks_to_wait, return_when=asyncio.FIRST_COMPLETED
+            )


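asyncio.wait accepts any iterable of tasks or futures, so passing a list of Task objects is valid. What was deprecated in Python 3.8 and removed in Python 3.11 is passing bare coroutine objects, which now raises a TypeError. If tasks_to_wait can contain coroutines, they should be wrapped with asyncio.create_task first. Converting tasks_to_wait to a set is harmless and matches the set type that asyncio.wait returns, but it does not by itself guard against that case.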

Suggested change

-            done, pending = await asyncio.wait(
-                tasks_to_wait, return_when=asyncio.FIRST_COMPLETED
-            )
+            done, pending = await asyncio.wait(
+                set(tasks_to_wait), return_when=asyncio.FIRST_COMPLETED
+            )
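
For context, asyncio.wait takes an iterable of tasks or futures and returns two sets, done and pending. The sketch below wraps each coroutine in a task before waiting, which is the form that works across Python versions; the helper name and delays are made up for illustration.

```python
import asyncio


async def first_completed(delays):
    """Start one sleeper per delay and return whichever finishes first."""
    # Wrap coroutines in tasks explicitly; bare coroutines are rejected
    # by asyncio.wait in Python 3.11 and later.
    tasks = [asyncio.create_task(asyncio.sleep(d, result=d)) for d in delays]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # do not leave stragglers running
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted(task.result() for task in done), len(pending)


finished, still_pending = asyncio.run(first_completed([0.01, 0.2, 0.3]))
```

Both a list and a set of tasks are accepted; asyncio.wait raises ValueError only when the collection is empty.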

gemini-code-assist · 2026-06-17T09:50:32Z

+                self._scoring_resume_event.set()
+
+    async def _should_pause_reward_queue(self) -> bool:
+        reward_queue_stats = await self.reward_queue_client.get_statistics()


If self.reward_queue_client is not yet initialized or is None during startup, calling get_statistics() directly will raise an AttributeError. We should add a safety check to return False if the client is not available.

Suggested change

-        reward_queue_stats = await self.reward_queue_client.get_statistics()
+        if not self.reward_queue_client:
+            return False
+        reward_queue_stats = await self.reward_queue_client.get_statistics()

gemini-code-assist · 2026-06-17T09:50:32Z

+
+def addition_process(output: DataProto, enable_reward_queue: bool = False):
+    """collect metirics"""
+    metrics = output.meta_info.pop("metrics")  # List[Dict[str, str]]


Using pop("metrics") without a default value will raise a KeyError if the "metrics" key is missing from output.meta_info (e.g., during validation or custom evaluation runs). We should use pop("metrics", None) to handle this gracefully.

Suggested change

-    metrics = output.meta_info.pop("metrics")  # List[Dict[str, str]]
+    metrics = output.meta_info.pop("metrics", None)  # List[Dict[str, str]]
+    if metrics is None:
+        return output

gemini-code-assist · 2026-06-17T09:50:32Z

+        elif arr.ndim == 2:
+            for i in range(batch_size):
+                new_arr[i] = arr[i]
+        else:
+            for i in range(batch_size):
+                new_arr[i] = arr[i]


The elif arr.ndim == 2 block and the else block are identical, which is redundant and reduces code readability. We can simplify this by combining them into a single else block.

        else:
            for i in range(batch_size):
                new_arr[i] = arr[i]

gemini-code-assist · 2026-06-17T09:50:32Z

+            index_val = batch.non_tensor_batch["index"]
+            index = [index_val[0] if isinstance(index_val, (list, np.ndarray)) else index_val]


If index_val is an empty list or numpy array, accessing index_val[0] will raise an IndexError. We should check the length of the list/array before accessing its first element.

Suggested change

-            index_val = batch.non_tensor_batch["index"]
-            index = [index_val[0] if isinstance(index_val, (list, np.ndarray)) else index_val]
+            index_val = batch.non_tensor_batch["index"]
+            if isinstance(index_val, (list, np.ndarray)):
+                index = [index_val[0]] if len(index_val) > 0 else [0]
+            else:
+                index = [index_val]

gemini-code-assist · 2026-06-17T09:50:32Z

+            batch.meta_info.get("global_steps", -1), index, batch.meta_info.get("validate", False)
+        )
+
+        kwargs = {k: v[0] for k, v in batch.non_tensor_batch.items()}


If any value v in batch.non_tensor_batch is an empty list or array, accessing v[0] will raise an IndexError. We should add a safety check to handle empty lists/arrays or non-iterable values.

Suggested change

-        kwargs = {k: v[0] for k, v in batch.non_tensor_batch.items()}
+        kwargs = {k: (v[0] if isinstance(v, (list, np.ndarray)) and len(v) > 0 else v) for k, v in batch.non_tensor_batch.items()}

feat: reward queue

31e46d3

gemini-code-assist Bot reviewed Jun 17, 2026

Silypie changed the title ~~[RFC] Decoupled Inference and Reward Computation via Reward Queue~~ [feat] Decoupled Inference and Reward Computation via Reward Queue Jun 17, 2026

Silypie wants to merge 1 commit into verl-project:main from Silypie:main
