[algo] fix: remove dead code in GPG advantage estimator#6803
Conversation
There was a problem hiding this comment.
Code Review
This pull request removes the unused alpha parameter and the unused id2std dictionary from the compute_gpg_outcome_advantage function, and introduces unit tests to verify the GPG outcome advantage calculations. The review feedback correctly identifies a potential device mismatch issue where id2mean[idx] is initialized with a CPU tensor (torch.tensor(0.0)), which can lead to runtime errors if the input scores are on a GPU. It is recommended to use a device-agnostic Python float 0.0 or explicitly match the device and dtype of the input scores.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
`compute_gpg_outcome_advantage` builds an `id2std` dict (and the per-group std/constant assignments that populate it) that is never read: the normalization on the final line divides by `f_norm`, not by the group std. It also accepts an `alpha` parameter that is unconditionally overwritten by `alpha = bsz / count_nonzero(scores)` before its first use, so the caller-supplied value can never take effect. No caller passes `alpha` (the advantage dispatch in ray_trainer.py builds kwargs without it). Remove the unused `id2std` dict and the no-op `alpha` parameter, and fix an adjacent device mismatch: the singleton-group mean was `torch.tensor(0.0)` (CPU), which mismatches GPU scores in the advantage loop; use a plain `0.0`. Behavior is unchanged on CPU; the function still applies the documented N/N_nonzero scaling to group-centered scores. Add a CPU unit test covering the scaling and the singleton-group case. Addresses issue verl-project#6478 (Item B). Item A (clamp) is handled separately in verl-project#6538. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
441f5a3 to
a03eae5
Compare
|
Good catch — fixed. The singleton-group mean is now a plain |
What
Removes dead code in
compute_gpg_outcome_advantage(verl/trainer/ppo/core_algos.py), addressing Item B of #6478, plus an adjacent device-mismatch fix.id2stdis never read. The function builds anid2stddict and populates it per group, but normalization divides centered scores byf_norm, not by the group std. Pure dead code.The
alphaparameter is a no-op.alpha(default1.0) is unconditionally overwritten byalpha = bsz / m.clamp(min=1)before its first use, so any caller-supplied value is silently discarded. No caller passes it (ray_trainer.pybuildsadv_kwargswithoutalpha, and the function keeps**kwargs), so removing the parameter is non-breaking.Singleton-group device mismatch. The singleton branch set
id2mean[idx] = torch.tensor(0.0)(always CPU), while the multi-response branch usestorch.mean(...)on the scores' device. On GPU this mismatches in the advantage loop. Use a plain0.0(device-agnostic; identical on CPU).Change
id2stddict and the two lines that populate it.alphaparameter (and its docstring entry).0.0instead of a CPU tensor.The live local
alpha = bsz / count_nonzero(scores)and its use in the scaling line are kept — that is the actual GPG scaling. Behavior is unchanged on CPU.Scope
Only Item B of #6478. Item A (the negative-approx-kl clamp) is handled separately in #6538 and is not touched here.
Test
Adds
tests/trainer/ppo/test_gpg_advantage_on_cpu.py(CPU-only, a new file to avoid conflicts with #6538/#4677 which also edittest_core_algos_on_cpu.py):test_gpg_singleton_group_returns_raw_score— a single non-zero-scored response gets advantage == raw masked score.test_gpg_applies_n_over_nonzero_scaling— a 2-response group (scores 4, 0) yieldsalpha * (score - mean) = 2 * (score - 2).