[algo] fix: remove dead code in GPG advantage estimator by EazyReal · Pull Request #6803 · verl-project/verl

EazyReal · 2026-06-21T09:24:41Z

What

Removes dead code in compute_gpg_outcome_advantage (verl/trainer/ppo/core_algos.py), addressing Item B of #6478, plus an adjacent device-mismatch fix.

- id2std is never read. The function builds an id2std dict and populates it per group, but normalization divides centered scores by f_norm, not by the group std. Pure dead code.
- The alpha parameter is a no-op. alpha (default 1.0) is unconditionally overwritten by alpha = bsz / m.clamp(min=1) before its first use, so any caller-supplied value is silently discarded. No caller passes it (ray_trainer.py builds adv_kwargs without alpha, and the function keeps **kwargs), so removing the parameter is non-breaking.
- Singleton-group device mismatch. The singleton branch set id2mean[idx] = torch.tensor(0.0) (always CPU), while the multi-response branch uses torch.mean(...) on the scores' device. On GPU this mismatches in the advantage loop. Use a plain 0.0 (device-agnostic; identical on CPU).
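To make the "no-op parameter" point concrete, here is a minimal pure-Python sketch of the pattern. `scaled_score` and its arguments are hypothetical stand-ins, not the verl function; the only thing carried over from the description is that `alpha` is reassigned before it is ever read.

```python
def scaled_score(score, batch_size, num_nonzero, alpha=1.0):
    """Hypothetical stand-in that mirrors the overwritten-parameter pattern."""
    # The argument is replaced here, before any read, so the caller's value is lost.
    alpha = batch_size / max(num_nonzero, 1)
    return alpha * score


# Passing alpha explicitly changes nothing: both calls use 4 / 2 = 2.0.
default_result = scaled_score(3.0, batch_size=4, num_nonzero=2)
custom_result = scaled_score(3.0, batch_size=4, num_nonzero=2, alpha=10.0)
```

Both calls return the same value, which is why dropping the parameter should not change results for any caller.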

Change

- Delete the id2std dict and the two lines that populate it.
- Remove the dead alpha parameter (and its docstring entry).
- Set the singleton mean to 0.0 instead of a CPU tensor.

The live local alpha = bsz / count_nonzero(scores), with the count clamped to at least 1, and its use in the scaling line are kept, since that is the actual GPG scaling. Behavior is unchanged on CPU.
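For readers without the diff open, the following is a rough sketch of what the function could look like after the three changes. It is reconstructed from the description above, not copied from verl: the name `gpg_outcome_advantage_sketch` and the trimmed signature are assumptions, and it builds a new tensor instead of editing the scores in place.

```python
from collections import defaultdict

import torch


def gpg_outcome_advantage_sketch(token_level_rewards, response_mask, index, f_norm=1.0):
    """Illustrative reconstruction from the PR description, not the verl source."""
    # One scalar outcome score per response: the sum of its token-level rewards.
    scores = token_level_rewards.sum(dim=-1)
    id2score = defaultdict(list)
    id2mean = {}

    with torch.no_grad():
        bsz = scores.shape[0]
        # GPG scaling: batch size over the number of non-zero scores (at least 1).
        m = torch.count_nonzero(scores)
        alpha = bsz / m.clamp(min=1)

        for i in range(bsz):
            id2score[index[i]].append(scores[i])

        for idx, group in id2score.items():
            if len(group) == 1:
                # A plain float carries no device, so it works with CPU or GPU scores.
                id2mean[idx] = 0.0
            else:
                id2mean[idx] = torch.mean(torch.stack(group))

        # Center each score on its group mean, then apply the scaling and f_norm.
        advantages = torch.stack(
            [alpha * (scores[i] - id2mean[index[i]]) / f_norm for i in range(bsz)]
        )
        # Broadcast each response's scalar advantage over its valid tokens.
        advantages = advantages.unsqueeze(-1) * response_mask

    return advantages, advantages
```

Note there is no id2std and no alpha argument left; the only per-group statistic the sketch keeps is the mean.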

Scope

Only Item B of #6478. Item A (the negative-approx-kl clamp) is handled separately in #6538 and is not touched here.

Test

Adds tests/trainer/ppo/test_gpg_advantage_on_cpu.py (CPU-only; a new file to avoid conflicts with #6538 and #4677, which also edit test_core_algos_on_cpu.py):

- test_gpg_singleton_group_returns_raw_score: a single non-zero-scored response gets advantage == raw masked score.
- test_gpg_applies_n_over_nonzero_scaling: a 2-response group (scores 4, 0) yields alpha * (score - mean) = 2 * (score - 2).
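The expected values in those two tests can be checked by hand. The sketch below redoes the arithmetic in plain Python on per-response scalar scores, before they are broadcast over the response mask; `expected_gpg_advantages` is a hypothetical helper written for this illustration and is not part of the test file.

```python
def expected_gpg_advantages(scores, group_ids, f_norm=1.0):
    """Hand-rolled reference for the per-response GPG advantage (illustrative only)."""
    # alpha = N / N_nonzero, with the non-zero count clamped to at least 1.
    n_nonzero = sum(1 for s in scores if s != 0)
    alpha = len(scores) / max(n_nonzero, 1)

    # Collect the scores of each prompt group.
    groups = {}
    for score, gid in zip(scores, group_ids):
        groups.setdefault(gid, []).append(score)

    # Singleton groups use a mean of 0.0; larger groups use the arithmetic mean.
    means = {
        gid: (sum(vals) / len(vals) if len(vals) > 1 else 0.0)
        for gid, vals in groups.items()
    }
    return [alpha * (s - means[g]) / f_norm for s, g in zip(scores, group_ids)]


# Two responses to one prompt, scores 4 and 0: alpha = 2 / 1 = 2, mean = 2.
pair = expected_gpg_advantages([4.0, 0.0], ["p", "p"])  # [4.0, -4.0]

# One response with a non-zero score: alpha = 1, mean = 0, so the score passes through.
single = expected_gpg_advantages([3.0], ["q"])  # [3.0]
```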

CLAassistant · 2026-06-21T09:24:48Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request removes the unused alpha parameter and the unused id2std dictionary from the compute_gpg_outcome_advantage function, and introduces unit tests to verify the GPG outcome advantage calculations. The review feedback correctly identifies a potential device mismatch issue where id2mean[idx] is initialized with a CPU tensor (torch.tensor(0.0)), which can lead to runtime errors if the input scores are on a GPU. It is recommended to use a device-agnostic Python float 0.0 or explicitly match the device and dtype of the input scores.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

`compute_gpg_outcome_advantage` builds an `id2std` dict (and the per-group std/constant assignments that populate it) that is never read: the normalization on the final line divides by `f_norm`, not by the group std.

It also accepts an `alpha` parameter that is unconditionally overwritten by `alpha = bsz / count_nonzero(scores)` before its first use, so the caller-supplied value can never take effect. No caller passes `alpha` (the advantage dispatch in ray_trainer.py builds kwargs without it).

Remove the unused `id2std` dict and the no-op `alpha` parameter, and fix an adjacent device mismatch: the singleton-group mean was `torch.tensor(0.0)` (CPU), which mismatches GPU scores in the advantage loop; use a plain `0.0`.

Behavior is unchanged on CPU; the function still applies the documented N/N_nonzero scaling to group-centered scores. Add a CPU unit test covering the scaling and the singleton-group case.

Addresses issue verl-project#6478 (Item B). Item A (clamp) is handled separately in verl-project#6538.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal · 2026-06-21T09:48:27Z

Good catch — fixed. The singleton-group mean is now a plain 0.0 instead of torch.tensor(0.0), so it no longer mismatches GPU scores in the advantage loop (identical on CPU). Folded it into this cleanup since it's the same function. (The same idiom appears in a few sibling estimators; happy to do those in a separate focused PR.)

EazyReal requested review from PeterSH6, eric-haibin-lin, tongyx361 and vermouth1992 as code owners June 21, 2026 09:24

gemini-code-assist Bot reviewed Jun 21, 2026

Comment thread verl/trainer/ppo/core_algos.py

EazyReal force-pushed the fix/gpg-remove-dead-id2std branch from 441f5a3 to a03eae5 on June 21, 2026 09:48

EazyReal wants to merge 1 commit into verl-project:main from EazyReal:fix/gpg-remove-dead-id2std


2 participants
