
[algo] fix: remove dead code in GPG advantage estimator #6803

Open
EazyReal wants to merge 1 commit into verl-project:main from EazyReal:fix/gpg-remove-dead-id2std


EazyReal commented Jun 21, 2026

What

Removes dead code in compute_gpg_outcome_advantage (verl/trainer/ppo/core_algos.py), addressing Item B of #6478, plus an adjacent device-mismatch fix.

  1. id2std is never read. The function builds an id2std dict and populates it per group, but normalization divides centered scores by f_norm, not by the group std. Pure dead code.

  2. The alpha parameter is a no-op. alpha (default 1.0) is unconditionally overwritten by alpha = bsz / m.clamp(min=1), where m = count_nonzero(scores), before its first use, so any caller-supplied value is silently discarded. No caller passes it (ray_trainer.py builds adv_kwargs without alpha, and the function keeps **kwargs), so removing the parameter is non-breaking.

  3. Singleton-group device mismatch. The singleton branch set id2mean[idx] = torch.tensor(0.0) (always CPU), while the multi-response branch uses torch.mean(...) on the scores' device. On GPU this mismatches in the advantage loop. Use a plain 0.0 (device-agnostic; identical on CPU).

Change

  • Delete the id2std dict and the two lines that populate it.
  • Remove the dead alpha parameter (and its docstring entry).
  • Set the singleton mean to 0.0 instead of a CPU tensor.

The live local alpha = bsz / count_nonzero(scores) (with the denominator clamped to a minimum of 1) and its use in the scaling line are kept, since that is the actual GPG scaling. Behavior is unchanged on CPU.
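For orientation, here is a rough sketch of what the function looks like after the three changes. It is reconstructed from the description above, not copied from the diff: the name gpg_outcome_advantage_sketch, the trimmed signature, and the (advantages, returns) return pair are assumptions made for illustration.

```python
from collections import defaultdict

import torch


def gpg_outcome_advantage_sketch(token_level_rewards, response_mask, index, f_norm=1.0):
    """Illustrative reconstruction of the cleaned-up estimator (not the merged code)."""
    scores = token_level_rewards.sum(dim=-1)  # one scalar score per response

    id2score = defaultdict(list)
    id2mean = {}  # note: no id2std any more

    with torch.no_grad():
        bsz = scores.shape[0]
        m = torch.count_nonzero(scores)
        alpha = bsz / m.clamp(min=1)  # live GPG scaling: N / N_nonzero

        for i in range(bsz):
            id2score[index[i]].append(scores[i])

        for idx, group in id2score.items():
            if len(group) == 1:
                id2mean[idx] = 0.0  # plain float, carries no device
            else:
                id2mean[idx] = torch.mean(torch.stack(group))

        # Center each score on its group mean, then apply alpha and f_norm.
        for i in range(bsz):
            scores[i] = alpha * (scores[i] - id2mean[index[i]]) / f_norm

        scores = scores.unsqueeze(-1) * response_mask

    # Assumption: advantages and returns are the same tensor.
    return scores, scores
```

The only knobs left are f_norm and the inputs themselves; alpha is now purely a local derived from the batch.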

Scope

Only Item B of #6478. Item A (the negative-approx-kl clamp) is handled separately in #6538 and is not touched here.

Test

Adds tests/trainer/ppo/test_gpg_advantage_on_cpu.py (CPU-only; a new file to avoid conflicts with #6538 and #4677, which also edit test_core_algos_on_cpu.py):

  • test_gpg_singleton_group_returns_raw_score — a single non-zero-scored response gets advantage == raw masked score.
  • test_gpg_applies_n_over_nonzero_scaling — a 2-response group (scores 4, 0) yields alpha * (score - mean) = 2 * (score - 2).
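The expected values in both tests can be checked by hand. The snippet below redoes the arithmetic on plain floats; gpg_scaled_advantages is a hypothetical helper written for this check, not part of verl, and it works at the per-response level only (no response mask, no tensors).

```python
def gpg_scaled_advantages(scores, groups, f_norm=1.0):
    """Hand-check helper (hypothetical): per-response GPG advantage on plain floats."""
    n = len(scores)
    n_nonzero = sum(1 for s in scores if s != 0)
    alpha = n / max(n_nonzero, 1)  # N / N_nonzero, denominator clamped to 1

    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)

    # Singleton groups use a mean of 0.0, so the raw score passes through.
    means = {
        g: (0.0 if len(v) == 1 else sum(v) / len(v))
        for g, v in by_group.items()
    }
    return [alpha * (s - means[g]) / f_norm for s, g in zip(scores, groups)]


# Singleton group: alpha = 1 / 1, mean = 0.0, so the advantage is the raw score.
print(gpg_scaled_advantages([3.0], ["a"]))            # [3.0]

# Scores (4, 0) in one group: alpha = 2 / 1 = 2, mean = 2.
print(gpg_scaled_advantages([4.0, 0.0], ["a", "a"]))  # [4.0, -4.0]
```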

CLAassistant commented Jun 21, 2026

CLA assistant check
All committers have signed the CLA.

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request removes the unused alpha parameter and the unused id2std dictionary from the compute_gpg_outcome_advantage function, and introduces unit tests to verify the GPG outcome advantage calculations. The review feedback correctly identifies a potential device mismatch issue where id2mean[idx] is initialized with a CPU tensor (torch.tensor(0.0)), which can lead to runtime errors if the input scores are on a GPU. It is recommended to use a device-agnostic Python float 0.0 or explicitly match the device and dtype of the input scores.
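A small sketch of the two options named above, assuming PyTorch's default device has not been changed. singleton_mean_options is a hypothetical helper used only to put the three candidate values side by side; it does not appear in core_algos.py.

```python
import torch


def singleton_mean_options(scores):
    """Return three candidate 'zero mean' values for a singleton group."""
    cpu_tensor = torch.tensor(0.0)  # allocated on the default device (CPU)
    plain_float = 0.0               # Python float, carries no device
    matched = torch.zeros((), device=scores.device, dtype=scores.dtype)
    return cpu_tensor, plain_float, matched


device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.tensor([4.0, 0.0], device=device)
cpu_tensor, plain_float, matched = singleton_mean_options(scores)

print(cpu_tensor.device)  # cpu, regardless of where scores lives
print((scores[0] - plain_float).device == scores.device)  # True
print(matched.device == scores.device)                    # True
```

The plain float is the smaller change and needs no knowledge of the scores tensor. The explicitly matched tensor is the alternative when downstream code expects every group mean to be a tensor.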

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread verl/trainer/ppo/core_algos.py
`compute_gpg_outcome_advantage` builds an `id2std` dict (and the per-group
std/constant assignments that populate it) that is never read: the
normalization on the final line divides by `f_norm`, not by the group std.
It also accepts an `alpha` parameter that is unconditionally overwritten by
`alpha = bsz / count_nonzero(scores)` before its first use, so the
caller-supplied value can never take effect. No caller passes `alpha` (the
advantage dispatch in ray_trainer.py builds kwargs without it).

Remove the unused `id2std` dict and the no-op `alpha` parameter, and fix an
adjacent device mismatch: the singleton-group mean was `torch.tensor(0.0)`
(CPU), which mismatches GPU scores in the advantage loop; use a plain `0.0`.
Behavior is unchanged on CPU; the function still applies the documented
N/N_nonzero scaling to group-centered scores. Add a CPU unit test covering the
scaling and the singleton-group case.

Addresses issue verl-project#6478 (Item B). Item A (clamp) is handled separately in verl-project#6538.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal EazyReal force-pushed the fix/gpg-remove-dead-id2std branch from 441f5a3 to a03eae5 Compare June 21, 2026 09:48
EazyReal (Author)

Good catch — fixed. The singleton-group mean is now a plain 0.0 instead of torch.tensor(0.0), so it no longer mismatches GPU scores in the advantage loop (identical on CPU). Folded it into this cleanup since it's the same function. (The same idiom appears in a few sibling estimators; happy to do those in a separate focused PR.)


2 participants