[feat] Decoupled Inference and Reward Computation via Reward Queue #113

Open
Silypie wants to merge 1 commit into verl-project:main from Silypie:main
@Silypie Silypie commented Jun 17, 2026

Motivation

In the fully asynchronous training pipeline of VERL (verl.experimental.fully_async_policy), inference (generation) and reward computation are traditionally tightly coupled in a sequential pipeline:

Generation → Wait for Completion → Reward Computation → Training

┌─────────────────────────────────────────────────┐
│  For each Batch:                                │
│       Generation + Reward Computation           │
│            ↓ (Wait for batch completion)        │
│  Training                                       │
└─────────────────────────────────────────────────┘

This coupling creates a critical performance bottleneck: when reward computation is slow (e.g., due to external LLM-based judges, complex scoring functions, or network latency), the GPU sits idle waiting for scores, wasting expensive compute resources.

Decoupling inference from reward computation lets the two stages overlap so that each hides the other's latency. In the ideal case, where both stages take about the same time, the total time approaches half of the original.

The Reward Queue feature decouples inference from reward computation by introducing an intermediate queue between the two stages, enabling concurrent execution:

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Generation    │────▶│ RewardQueue  │────▶│ Reward Compute  │
│   (async)       │     │              │     │ (concurrent)    │
└─────────────────┘     └──────────────┘     └─────────────────┘

This allows generation and scoring to overlap in time, maximizing GPU utilization and throughput.
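As a rough illustration of the "half of the original" estimate, here is a minimal timing model. It assumes constant per-item generation and reward times, a single generation stream, a single scorer, and an unbounded queue; none of these simplifications come from the PR itself, and the function names are hypothetical.

```python
def sequential_time(n_items: int, gen_s: float, reward_s: float) -> float:
    """Coupled pipeline: each item is generated, then scored, before the next starts."""
    return n_items * (gen_s + reward_s)


def pipelined_time(n_items: int, gen_s: float, reward_s: float) -> float:
    """Two-stage pipeline with a queue between generation and scoring.

    The first item still pays gen_s + reward_s. Every later item costs only
    the slower of the two stages, because the faster stage is hidden behind it.
    """
    if n_items == 0:
        return 0.0
    return gen_s + reward_s + (n_items - 1) * max(gen_s, reward_s)


if __name__ == "__main__":
    n, g, r = 100, 1.0, 1.0
    print(sequential_time(n, g, r))  # 200.0
    print(pipelined_time(n, g, r))   # 101.0
```

When one stage is much slower than the other, the pipelined time is dominated by the slower stage, so the saving is smaller than one half.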


Proposed Design

Design Overview

This architecture introduces a RewardQueue as a central decoupling mechanism between inference and reward computation, enabling fully asynchronous processing across the reinforcement learning pipeline.

[Figure: reward_queue_architecture]

Core Pipeline

Sampling & Inference — The rollouter continuously feeds batches from the DataLoader. For each batch, sub-items are dispatched for async generation and immediately buffered into the RewardQueue without waiting for reward results.

Reward Computation — A dedicated consumer worker pulls sub-items from the RewardQueue and distributes them across a pool of reward workers for concurrent scoring. Backpressure is applied when needed to prevent resource exhaustion.

Aggregation & Training — Scored sub-items flow back to the aggregator, which assembles complete samples once all sub-items arrive. These are then published to a MessageQueue for the trainer to consume and process.

Key Design Points

- Temporal decoupling: Inference output and reward computation run at their own pace via the queue buffer
- Concurrent scoring: Multiple reward workers score sub-items in parallel, throttled by a concurrency limit
- Backpressure control: The consumer can pause/resume scoring based on system load
- Clean handoff boundary: MessageQueue separates rollouter and trainer execution domains

This design eliminates the traditional blocking pattern where inference waits for reward computation to complete, significantly improving pipeline throughput.
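The pipeline and design points above can be sketched with plain asyncio. This is only an illustration, not the PR's code: `generate`, `score`, `producer`, `consumer`, and `run_pipeline` are hypothetical stand-ins, an `asyncio.Queue` replaces the Ray-based RewardQueue, and a dict replaces the aggregator.

```python
import asyncio
from collections import defaultdict


async def generate(sample_id: int, sub_id: int) -> dict:
    """Stand-in for async generation of one sub-item (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"sample_id": sample_id, "sub_id": sub_id, "text": f"resp-{sample_id}-{sub_id}"}


async def score(item: dict) -> float:
    """Stand-in for a slow reward function (hypothetical)."""
    await asyncio.sleep(0.02)
    return float(len(item["text"]))


async def producer(queue: asyncio.Queue, n_samples: int, rollout_n: int) -> None:
    # Enqueue each generated sub-item without waiting for its score.
    for sample_id in range(n_samples):
        for sub_id in range(rollout_n):
            await queue.put(await generate(sample_id, sub_id))
    await queue.put(None)  # sentinel: no more items


async def consumer(queue: asyncio.Queue, max_concurrent: int, rollout_n: int) -> dict:
    limit = asyncio.Semaphore(max_concurrent)  # concurrency limit for scoring
    groups = defaultdict(dict)                 # sample_id -> {sub_id: score}
    finished = {}                              # sample_id -> ordered scores

    async def score_one(item: dict) -> None:
        try:
            value = await score(item)
        finally:
            limit.release()
        group = groups[item["sample_id"]]
        group[item["sub_id"]] = value
        if len(group) == rollout_n:            # all sub-items have arrived
            finished[item["sample_id"]] = [group[i] for i in range(rollout_n)]
            del groups[item["sample_id"]]

    tasks = []
    while True:
        item = await queue.get()
        if item is None:
            break
        await limit.acquire()                  # throttle before launching
        tasks.append(asyncio.create_task(score_one(item)))
    await asyncio.gather(*tasks)
    return finished


async def run_pipeline(n_samples: int = 3, rollout_n: int = 2, max_concurrent: int = 2) -> dict:
    queue = asyncio.Queue(maxsize=4)           # bounded queue: producer blocks when full
    _, finished = await asyncio.gather(
        producer(queue, n_samples, rollout_n),
        consumer(queue, max_concurrent, rollout_n),
    )
    return finished


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

In this sketch the bounded queue throttles the producer and the semaphore throttles scoring; the PR describes pausing the consumer based on system load, which is a different mechanism.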

Data Flow

Phase 1: Sample Feeding

  1. FullyAsyncRollouter._feed_samples() iterates over the DataLoader
  2. Creates a RolloutSample for each batch and puts it into pending_queue

Phase 2: Inference and Queue Production

  1. _processor_worker() processes samples from pending_queue
  2. When enable_reward_queue=True, calls _process_sample_with_reward_queue():
    • For each sub-item in the batch, launches async generation via generate_single_for_reward_queue()
    • Creates a SubRewardDataItem with inference timing metadata
    • Puts it into the RewardQueue via reward_queue_client.put_sample()

Phase 3: Reward Computation (Consumer)

  1. _reward_consumer_worker() continuously:
    • Checks if scoring should pause (_should_pause_scoring())
    • Gets SubRewardDataItem from reward_queue_client.get_sample()
    • Submits reward computation via reward_loop_worker.compute_score.remote()
    • Limits concurrent reward tasks via max_concurrent_rewards
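A minimal sketch of such a consumer loop, assuming an `asyncio.Event` models the pause check and a task set models the `max_concurrent_rewards` cap. The names `reward_consumer` and `demo` are hypothetical, and the real worker submits Ray remote calls rather than local coroutines.

```python
import asyncio


async def reward_consumer(queue, compute_score, resume_event, max_concurrent_rewards):
    """Pull items, respect the pause flag, and cap in-flight scoring tasks."""
    in_flight = set()
    results = []

    while True:
        await resume_event.wait()        # blocks while scoring is paused
        item = await queue.get()
        if item is None:                 # sentinel ends the loop
            break
        # At the cap: wait for at least one task to finish before launching more.
        while len(in_flight) >= max_concurrent_rewards:
            done, in_flight = await asyncio.wait(in_flight, return_when=asyncio.FIRST_COMPLETED)
            results.extend(t.result() for t in done)
        in_flight.add(asyncio.create_task(compute_score(item)))

    if in_flight:                        # drain whatever is still running
        done, _ = await asyncio.wait(in_flight)
        results.extend(t.result() for t in done)
    return results


async def demo():
    peak = 0
    running = 0

    async def compute_score(item):
        nonlocal peak, running
        running += 1
        peak = max(peak, running)        # track the highest observed concurrency
        await asyncio.sleep(0.01)
        running -= 1
        return item * 2

    queue = asyncio.Queue()
    for i in range(6):
        queue.put_nowait(i)
    queue.put_nowait(None)

    resume_event = asyncio.Event()       # starts cleared, i.e. paused
    worker = asyncio.create_task(reward_consumer(queue, compute_score, resume_event, 2))
    await asyncio.sleep(0.02)
    paused_size = queue.qsize()          # nothing is consumed while paused
    resume_event.set()
    results = await worker
    return paused_size, peak, sorted(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```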

Phase 4: Aggregation and Finalization

  1. SampleAggregator.add_scored_item() accumulates scored sub-items
  2. When all sub-items for a sample are collected, _finalize_sample():
    • Builds rm_scores tensor with scores at the last valid position
    • Adds reward timing metadata to the batch
    • Creates a RolloutSample and puts it into the MessageQueue (for the Trainer)
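To make the rm_scores step concrete, here is a sketch using plain Python lists in place of torch tensors. It assumes "last valid position" means the last index where the response mask is 1; the helper name `build_rm_scores` is hypothetical.

```python
def build_rm_scores(response_mask: list[list[int]], scores: list[float]) -> list[list[float]]:
    """Place each sequence-level score at the last valid token of its row.

    All other positions stay 0.0, so the result has the same shape as the mask.
    """
    rm_scores = []
    for mask_row, score in zip(response_mask, scores):
        row = [0.0] * len(mask_row)
        valid_positions = [i for i, m in enumerate(mask_row) if m]
        if valid_positions:              # rows with no valid token stay all-zero
            row[valid_positions[-1]] = score
        rm_scores.append(row)
    return rm_scores


if __name__ == "__main__":
    mask = [[1, 1, 1, 0], [1, 0, 0, 0]]
    print(build_rm_scores(mask, [0.5, -1.0]))
```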

Phase 5: Training

  1. FullyAsyncTrainer._get_samples_from_queue() retrieves samples
  2. Calls assemble_batch_from_rollout_samples() with enable_reward_queue=True
  3. Processes reward timing metadata for metrics collection

Key Data Structures

RewardQueue (reuses MessageQueue)

def create_reward_queue(config: DictConfig, max_queue_size: int = 1000):
    return MessageQueue.remote(config, max_queue_size, name="RewardQueue")

SampleAggregator

class SampleAggregator:
    def add_scored_item(self, sample_id, total_count, epoch, scored_item) -> bool:
        """Return True once all sub-items for a sample have been collected."""
        ...

    def get_and_remove(self, sample_id) -> _AggregationGroup:
        """Return the complete aggregation group and remove it from the aggregator."""
        ...
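One way the interface above could behave, as a dict-based sketch. `SketchAggregator` and `AggregationGroup` are hypothetical names chosen to avoid implying this is the PR's implementation; `pending_groups_count` mirrors the attribute referenced in the review comments.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AggregationGroup:
    total_count: int                     # number of sub-items expected
    epoch: int
    items: list[Any] = field(default_factory=list)


class SketchAggregator:
    def __init__(self) -> None:
        self._groups: dict[str, AggregationGroup] = {}

    def add_scored_item(self, sample_id, total_count, epoch, scored_item) -> bool:
        # Create the group on first sight of this sample, then append the item.
        group = self._groups.setdefault(sample_id, AggregationGroup(total_count, epoch))
        group.items.append(scored_item)
        return len(group.items) >= group.total_count

    def get_and_remove(self, sample_id) -> AggregationGroup:
        # Raises KeyError if the sample was never added or was already removed.
        return self._groups.pop(sample_id)

    @property
    def pending_groups_count(self) -> int:
        return len(self._groups)


if __name__ == "__main__":
    agg = SketchAggregator()
    print(agg.add_scored_item("s1", 2, 0, "a"))  # False
    print(agg.add_scored_item("s1", 2, 0, "b"))  # True
    print(agg.get_and_remove("s1").items)        # ['a', 'b']
```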

Configuration

async_training:
  # Enable reward queue feature (decouples inference from reward computation)
  enable_reward_queue: false

  # Maximum reward queue size (only effective when enable_reward_queue=true)
  # null means use default: max_required_samples * rollout_n
  reward_queue_size: null

Where:

  • max_required_samples = ppo_mini_batch_size * require_batches * (1 + staleness_threshold) * trigger_parameter_sync_step
  • rollout_n = actor_rollout_ref.rollout.n (number of responses per prompt)
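The default size can be worked through with a small helper. The function name is hypothetical, and rounding the sample count up to an integer is an assumption; the PR does not say how fractional values are handled.

```python
import math


def default_reward_queue_size(
    ppo_mini_batch_size: int,
    require_batches: int,
    staleness_threshold: float,
    trigger_parameter_sync_step: int,
    rollout_n: int,
) -> int:
    """Default queue size: max_required_samples * rollout_n."""
    max_required_samples = (
        ppo_mini_batch_size
        * require_batches
        * (1 + staleness_threshold)
        * trigger_parameter_sync_step
    )
    # Assumption: round up so the queue is never smaller than required.
    return math.ceil(max_required_samples) * rollout_n


if __name__ == "__main__":
    # 32 * 1 * 1.5 * 4 = 192 samples, times 8 responses per prompt
    print(default_reward_queue_size(32, 1, 0.5, 4, 8))  # 1536
```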

Monitoring Metrics

The reward queue exports the following metrics:

| Metric | Description |
| --- | --- |
| monitor/queue/reward_queue_size | Current reward queue size |
| reward_queue/total_produced | Total items produced to queue |
| reward_queue/total_consumed | Total items consumed from queue |
| reward_queue/dropped_samples | Samples dropped due to queue overflow |
| static/max_reward_queue_size | Maximum configured queue size |
| timing_s/reward_compute/mean | Mean reward computation time |
| timing_s/reward_compute/max | Max reward computation time |
| timing_s/reward_compute/tp95 | 95th percentile reward computation time |
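For the three timing metrics, a sketch of how they could be derived from per-item reward durations. The nearest-rank percentile used here is an assumption; the PR does not state which percentile method it uses, and `summarize_timings` is a hypothetical name.

```python
from statistics import mean


def summarize_timings(timings_s: list[float]) -> dict[str, float]:
    """Compute mean, max, and 95th percentile (nearest-rank) of reward timings."""
    if not timings_s:
        raise ValueError("no timings recorded")
    ordered = sorted(timings_s)
    # Nearest-rank: smallest 1-based rank covering at least 95% of the values.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "timing_s/reward_compute/mean": mean(ordered),
        "timing_s/reward_compute/max": ordered[-1],
        "timing_s/reward_compute/tp95": ordered[rank - 1],
    }


if __name__ == "__main__":
    print(summarize_timings([float(i) for i in range(1, 101)]))
```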

Use Cases

  1. External LLM Judges: When reward computation involves calling external LLM APIs (e.g., for LLM-as-a-Judge scoring), network latency can be significant. Reward queue allows generation to continue while waiting for API responses.

  2. Complex Scoring Functions: Multi-step reward computation pipelines with multiple model calls benefit from overlapping generation with scoring.

  3. Variable Reward Latency: When reward computation time varies significantly across samples, the queue buffers fast results while waiting for slow ones.

  4. Throughput Optimization: Maximizing GPU utilization by keeping either generation or scoring always active, even when the other is blocked.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the 'Reward Queue' feature to decouple inference from reward computation in VERL's fully asynchronous training pipeline, adding several components including a custom rollouter, trainer, agent loop worker, and sample aggregator. The code review identified several critical issues, including potential AttributeError crashes due to uninitialized managers or clients, underestimation of staleness samples by ignoring the reward queue, and a severe concurrency bottleneck caused by holding a lock during an asynchronous wait. Additionally, the feedback highlights potential device mismatches on GPU, deprecated asyncio.wait usage, and several potential KeyError or IndexError exceptions when handling empty collections or missing keys.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread reward_queue/rollouter.py
Comment on lines +116 to +117
self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)

critical

At the time set_max_required_samples is called in main.py, self.llm_server_manager has not yet been initialized (it is created asynchronously in _init_async_rollout_manager which is called during fit). Accessing self.llm_server_manager.get_replicas() here will raise an AttributeError and crash the application on startup. We should guard this call and defer the computation of max_concurrent_samples until llm_server_manager is actually initialized.

Suggested change
- self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
- self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)
+ if hasattr(self, "llm_server_manager") and self.llm_server_manager is not None:
+     self.max_concurrent_samples = len(self.llm_server_manager.get_replicas()) * 16
+     self.max_concurrent_samples = min(self.max_concurrent_samples, self.max_required_samples)
+ else:
+     self.max_concurrent_samples = self.max_required_samples

Comment thread reward_queue/rollouter.py
self._resume_event.set()
self._scoring_resume_event.set()
# every time param change, reset staleness_samples
self.staleness_samples = len(self.active_tasks) + await self.message_queue_client.get_queue_size()

high

When enable_reward_queue is enabled, the staleness calculation completely ignores the samples currently in the reward queue and those currently being aggregated. This leads to an underestimation of staleness_samples, which can violate the PPO staleness threshold and cause training instability. We should include the in-flight samples from the reward queue and aggregator in the staleness count.

            in_flight_samples = len(self.active_tasks)
            if self.enable_reward_queue and self.reward_queue_client:
                rq_size = await self.reward_queue_client.get_queue_size()
                in_flight_samples += int(rq_size / (self.rollout_n or 1))
                in_flight_samples += self.sample_aggregator.pending_groups_count
            self.staleness_samples = in_flight_samples + await self.message_queue_client.get_queue_size()

Comment thread reward_queue/rollouter.py
Comment on lines +280 to +288
while len(self.active_tasks) >= self.max_concurrent_samples:
async with self.lock:
if self.active_tasks:
done_tasks, self.active_tasks = await asyncio.wait(
self.active_tasks, return_when=asyncio.FIRST_COMPLETED
)
for task in done_tasks:
await task

high

Holding self.lock while awaiting asyncio.wait is a highly problematic practice in asyncio. Since asyncio.wait blocks until a long-running generation task completes, holding the lock during this time will block other critical operations (such as reset_staleness called by the trainer) from acquiring the lock, causing a severe performance bottleneck or potential hangs. We should await the tasks outside the lock, and let the task completion callbacks handle cleanup naturally.

            while len(self.active_tasks) >= self.max_concurrent_samples:
                await asyncio.wait(self.active_tasks, return_when=asyncio.FIRST_COMPLETED)
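A self-contained sketch of the pattern suggested here, assuming each task removes itself from the active set through a done-callback. The helpers `wait_for_slot`, `launch`, and `demo` are hypothetical and not part of the PR. The demo shows that a coroutine needing the lock (standing in for reset_staleness) gets it while the waiter is still blocked on generation.

```python
import asyncio


async def wait_for_slot(active_tasks: set, max_concurrent: int) -> None:
    """Block until fewer than max_concurrent tasks are active, holding no lock."""
    while len(active_tasks) >= max_concurrent:
        # Done-callbacks registered in launch() shrink active_tasks, so the loop ends.
        await asyncio.wait(active_tasks, return_when=asyncio.FIRST_COMPLETED)


def launch(active_tasks: set, coro) -> asyncio.Task:
    task = asyncio.create_task(coro)
    active_tasks.add(task)
    task.add_done_callback(active_tasks.discard)  # cleanup without any lock
    return task


async def demo() -> list:
    events = []
    lock = asyncio.Lock()
    active_tasks = set()

    async def slow_generation():
        await asyncio.sleep(0.05)
        events.append("generation_done")

    async def reset_staleness():
        async with lock:                 # acquired at once: the waiter holds no lock
            events.append("reset_staleness")

    launch(active_tasks, slow_generation())
    waiter = asyncio.create_task(wait_for_slot(active_tasks, max_concurrent=1))
    await asyncio.sleep(0.01)            # let the waiter start blocking
    await reset_staleness()
    await waiter
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the done-callback, the bare loop would spin forever once a task finished, because nothing would remove finished tasks from the set.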

Comment thread reward_queue/rollouter.py
Comment on lines +520 to +523
rm_scores = torch.zeros_like(response_mask, dtype=torch.float32)
rm_scores[torch.arange(response_mask.size(0)), valid_response_length] = torch.tensor(
    scores, dtype=torch.float32
)

medium

If response_mask is on a GPU device, torch.arange and torch.tensor (which default to CPU) will cause a device mismatch error or slow CPU-GPU synchronization when indexing and assigning values. We should explicitly specify the device of the source tensors to match response_mask.device.

Suggested change
- rm_scores = torch.zeros_like(response_mask, dtype=torch.float32)
- rm_scores[torch.arange(response_mask.size(0)), valid_response_length] = torch.tensor(
-     scores, dtype=torch.float32
- )
+ device = response_mask.device
+ rm_scores = torch.zeros_like(response_mask, dtype=torch.float32, device=device)
+ rm_scores[torch.arange(response_mask.size(0), device=device), valid_response_length] = torch.tensor(
+     scores, dtype=torch.float32, device=device
+ )

Comment thread reward_queue/rollouter.py
Comment on lines +618 to +620
done, pending = await asyncio.wait(
    tasks_to_wait, return_when=asyncio.FIRST_COMPLETED
)

medium

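asyncio.wait accepts any iterable of tasks or futures, so passing a list is not itself an error. What was deprecated in Python 3.8 and removed in 3.11 is passing bare coroutine objects; if tasks_to_wait can contain coroutines, they should be wrapped with asyncio.create_task() first. Converting tasks_to_wait to a set before the call is harmless and removes duplicate entries.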

Suggested change
- done, pending = await asyncio.wait(
-     tasks_to_wait, return_when=asyncio.FIRST_COMPLETED
- )
+ done, pending = await asyncio.wait(
+     set(tasks_to_wait), return_when=asyncio.FIRST_COMPLETED
+ )

Comment thread reward_queue/rollouter.py
self._scoring_resume_event.set()

async def _should_pause_reward_queue(self) -> bool:
    reward_queue_stats = await self.reward_queue_client.get_statistics()

medium

If self.reward_queue_client is not yet initialized or is None during startup, calling get_statistics() directly will raise an AttributeError. We should add a safety check to return False if the client is not available.

Suggested change
- reward_queue_stats = await self.reward_queue_client.get_statistics()
+ if not self.reward_queue_client:
+     return False
+ reward_queue_stats = await self.reward_queue_client.get_statistics()

Comment thread reward_queue/utils.py

def addition_process(output: DataProto, enable_reward_queue: bool = False):
    """collect metrics"""
    metrics = output.meta_info.pop("metrics")  # List[Dict[str, str]]

medium

Using pop("metrics") without a default value will raise a KeyError if the "metrics" key is missing from output.meta_info (e.g., during validation or custom evaluation runs). We should use pop("metrics", None) to handle this gracefully.

Suggested change
- metrics = output.meta_info.pop("metrics")  # List[Dict[str, str]]
+ metrics = output.meta_info.pop("metrics", None)  # List[Dict[str, str]]
+ if metrics is None:
+     return output

Comment thread reward_queue/utils.py
Comment on lines +179 to +184
elif arr.ndim == 2:
    for i in range(batch_size):
        new_arr[i] = arr[i]
else:
    for i in range(batch_size):
        new_arr[i] = arr[i]

medium

The elif arr.ndim == 2 block and the else block are identical, which is redundant and reduces code readability. We can simplify this by combining them into a single else block.

        else:
            for i in range(batch_size):
                new_arr[i] = arr[i]

Comment on lines +223 to +224
index_val = batch.non_tensor_batch["index"]
index = [index_val[0] if isinstance(index_val, (list, np.ndarray)) else index_val]

medium

If index_val is an empty list or numpy array, accessing index_val[0] will raise an IndexError. We should check the length of the list/array before accessing its first element.

Suggested change
- index_val = batch.non_tensor_batch["index"]
- index = [index_val[0] if isinstance(index_val, (list, np.ndarray)) else index_val]
+ index_val = batch.non_tensor_batch["index"]
+ if isinstance(index_val, (list, np.ndarray)):
+     index = [index_val[0]] if len(index_val) > 0 else [0]
+ else:
+     index = [index_val]

batch.meta_info.get("global_steps", -1), index, batch.meta_info.get("validate", False)
)

kwargs = {k: v[0] for k, v in batch.non_tensor_batch.items()}

medium

If any value v in batch.non_tensor_batch is an empty list or array, accessing v[0] will raise an IndexError. We should add a safety check to handle empty lists/arrays or non-iterable values.

Suggested change
- kwargs = {k: v[0] for k, v in batch.non_tensor_batch.items()}
+ kwargs = {k: (v[0] if isinstance(v, (list, np.ndarray)) and len(v) > 0 else v) for k, v in batch.non_tensor_batch.items()}

@Silypie Silypie changed the title [RFC] Decoupled Inference and Reward Computation via Reward Queue [feat] Decoupled Inference and Reward Computation via Reward Queue Jun 17, 2026