Skip to content

[fsdp, veomni] fix: wire fused top-k distillation outputs#6737

Open
zhangxin81 wants to merge 2 commits into
verl-project:mainfrom
zhangxin81:feat/veomni-topk-distill-integration
Open

[fsdp, veomni] fix: wire fused top-k distillation outputs#6737
zhangxin81 wants to merge 2 commits into
verl-project:mainfrom
zhangxin81:feat/veomni-topk-distill-integration

Conversation

@zhangxin81

@zhangxin81 zhangxin81 commented Jun 15, 2026

Copy link
Copy Markdown

What does this PR do?

This PR fixes the VeOmni fused-kernel path for top-k forward-KL distillation.

When distillation_use_topk=True with use_fused_kernels=True, VeOmni's patched causal LM loss can compute per-token top-k distillation auxiliary outputs through chunk_topk_distill_function. This PR wires the missing pieces so those outputs are correctly produced and consumed by verl:

  • forwards teacher_ids / teacher_logprobs into VeOmni model forward as teacher_topk_ids / teacher_topk_log_probs;
  • supports both jagged NestedTensor teacher tensors and already-rmpad teacher tensors;
  • forwards optional log_prob_min_clamp;
  • fails closed if fused top-k distillation is requested but fused_linear_aux does not contain the expected distillation outputs;
  • adds CPU regression coverage for the fused aux-output path and VeOmni teacher top-k input passthrough.

Duplicate-work check:

  • gh pr list --repo verl-project/verl --state open --search "veomni topk distillation" returns only this PR (#6737).
  • gh pr list --repo verl-project/verl --state open --search "VeOmni fused top-k distillation" returns only this PR (#6737).
  • gh pr list --repo verl-project/verl --state open --search "teacher_topk_log_probs" returns only this PR (#6737).

AI assistance was used to prepare this change. I reviewed the changed lines and ran the checks listed below.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+is%3Aopen+veomni+topk+distillation
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Passed:

/private/tmp/uv-cache/archive-v0/XkW9QpETjgABHVawsIv1F/ruff-0.13.2.data/scripts/ruff check \
  tests/workers/test_distillation_topk_symmetry_on_cpu.py \
  tests/workers/test_router_replay_engine_helpers_on_cpu.py \
  verl/workers/engine/fsdp/transformer_impl.py \
  verl/workers/engine/veomni/transformer_impl.py
/private/tmp/uv-cache/archive-v0/XkW9QpETjgABHVawsIv1F/ruff-0.13.2.data/scripts/ruff format --check \
  tests/workers/test_distillation_topk_symmetry_on_cpu.py \
  tests/workers/test_router_replay_engine_helpers_on_cpu.py \
  verl/workers/engine/fsdp/transformer_impl.py \
  verl/workers/engine/veomni/transformer_impl.py
PYTHONPYCACHEPREFIX=/private/tmp/verl-pycache python3 -m py_compile \
  tests/workers/test_distillation_topk_symmetry_on_cpu.py \
  tests/workers/test_router_replay_engine_helpers_on_cpu.py \
  verl/workers/engine/fsdp/transformer_impl.py \
  verl/workers/engine/veomni/transformer_impl.py
/Users/bytedance/Documents/VeOmni/.venv/bin/python /private/tmp/test_veomni_topk_integration.py

The targeted stub test passed 4 cases:

  • fused distillation aux outputs are emitted as nested outputs;
  • missing fused aux outputs raise a clear assertion;
  • VeOmni forwards nested teacher top-k tensors and clamp config;
  • VeOmni accepts already-rmpad teacher tensors.

Attempted but blocked by local environment/network:

PRE_COMMIT_HOME=/private/tmp/pre-commit-cache ./.venv/bin/pre-commit run --all-files --show-diff-on-failure --color=always

This failed during hook environment installation while downloading ruff==0.12.2 with repeated ConnectionResetError. To partially cover this, I ran cached ruff 0.13.2 on the changed files as shown above.

./.venv/bin/python -m pytest \
  tests/workers/test_distillation_topk_symmetry_on_cpu.py \
  tests/workers/test_router_replay_engine_helpers_on_cpu.py -q

This failed during collection because the ad-hoc local .venv lacks ray. Installing the full dependency set was also interrupted by network ConnectionResetError. The PR adds CPU regression tests that should run in the normal verl CI environment.

API and Usage Example

No public API change.

This change affects the existing internal distillation path when the batch metadata enables:

distillation_use_topk = True
use_fused_kernels = True
use_remove_padding = True

and the micro-batch contains:

teacher_ids
teacher_logprobs
# optional:
log_prob_min_clamp

Design & Code Changes

  • verl/workers/engine/veomni/transformer_impl.py

    • Converts teacher_ids / teacher_logprobs into VeOmni kernel argument names:
      • teacher_topk_ids
      • teacher_topk_log_probs
    • Handles both jagged NestedTensor and pre-rmpad tensor layouts.
    • Preserves existing Ulysses SP slicing behavior.
    • Passes optional log_prob_min_clamp.
  • verl/workers/engine/fsdp/transformer_impl.py

    • In the fused-kernel top-k distillation branch, requires output.fused_linear_aux.distillation_losses to exist.
    • Converts fused auxiliary outputs into nested model outputs:
      • distillation_losses
      • student_mass
      • teacher_mass
  • Tests

    • Adds FSDP fused-output regression coverage to tests/workers/test_distillation_topk_symmetry_on_cpu.py.
    • Adds VeOmni teacher top-k passthrough coverage to tests/workers/test_router_replay_engine_helpers_on_cpu.py.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
    • Attempted, but local hook environment installation repeatedly failed while downloading ruff==0.12.2 with ConnectionResetError. Changed files were checked with cached ruff 0.13.2 and py_compile.
  • Add / Update the documentation.
    • Not applicable: this is an internal engine wiring fix with no public API or user-facing documentation change.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: added CPU regression tests in existing worker test files; no workflow changes needed because these files are under the existing test tree.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
  • If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.
    • Not applicable: this PR does not touch recipe.

Forward top-k distillation teacher tensors and clamp settings into VeOmni fused log-prob forwards, and fail closed when fused auxiliary distillation outputs are missing.

Tests: PYTHONPYCACHEPREFIX=/private/tmp/verl-pycache python3 -m py_compile verl/workers/engine/fsdp/transformer_impl.py verl/workers/engine/veomni/transformer_impl.py

Tests: /Users/bytedance/Documents/VeOmni/.venv/bin/python /private/tmp/test_veomni_topk_integration.py

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for non-nested teacher tensors in the VeOmni transformer implementation, forwards the 'log_prob_min_clamp' configuration from the micro-batch to the model inputs, and introduces an assertion in the FSDP transformer implementation to ensure distillation outputs are populated when 'distillation_use_topk' is enabled. Feedback is provided regarding a potential bug where 2D teacher tensors are not unsqueezed to 3D, which would cause sequence parallel slicing to slice along the wrong dimension. A code suggestion is provided to safely handle 2D tensors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +881 to +890
if teacher_ids.is_nested:
teacher_topk_ids = teacher_ids.values().unsqueeze(0)
teacher_topk_log_probs = teacher_logprobs.values().unsqueeze(0)
else:
# Tensors may already be in the rmpad [1, total_nnz, K]
# layout expected by VeOmni (for example when the caller has
# preprocessed the distillation batch). Avoid assuming a
# NestedTensor-only representation.
teacher_topk_ids = teacher_ids
teacher_topk_log_probs = teacher_logprobs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If teacher_ids is a 2D tensor of shape [total_nnz, K] (instead of a 3D tensor of shape [1, total_nnz, K]), assigning it directly to teacher_topk_ids without unsqueezing will cause sequence parallel slicing (slice_input_tensor(..., dim=1)) to slice along the K (top-k) dimension instead of the sequence dimension (total_nnz). To prevent this critical correctness bug, ensure that 2D tensors are unsqueezed to 3D.

Suggested change
if teacher_ids.is_nested:
teacher_topk_ids = teacher_ids.values().unsqueeze(0)
teacher_topk_log_probs = teacher_logprobs.values().unsqueeze(0)
else:
# Tensors may already be in the rmpad [1, total_nnz, K]
# layout expected by VeOmni (for example when the caller has
# preprocessed the distillation batch). Avoid assuming a
# NestedTensor-only representation.
teacher_topk_ids = teacher_ids
teacher_topk_log_probs = teacher_logprobs
if teacher_ids.is_nested:
teacher_topk_ids = teacher_ids.values().unsqueeze(0)
teacher_topk_log_probs = teacher_logprobs.values().unsqueeze(0)
else:
# Tensors may already be in the rmpad [1, total_nnz, K]
# layout expected by VeOmni (for example when the caller has
# preprocessed the distillation batch). Avoid assuming a
# NestedTensor-only representation.
teacher_topk_ids = teacher_ids.unsqueeze(0) if teacher_ids.dim() == 2 else teacher_ids
teacher_topk_log_probs = teacher_logprobs.unsqueeze(0) if teacher_logprobs.dim() == 2 else teacher_logprobs

@Luosuu Luosuu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether this is necessary as if the user wants to use this feature then the model engine backend should be veomni?

Cover fused distillation auxiliary outputs and VeOmni teacher top-k passthrough for both nested and pre-rmpad tensor layouts.

Tests: /private/tmp/uv-cache/archive-v0/XkW9QpETjgABHVawsIv1F/ruff-0.13.2.data/scripts/ruff check tests/workers/test_distillation_topk_symmetry_on_cpu.py tests/workers/test_router_replay_engine_helpers_on_cpu.py verl/workers/engine/fsdp/transformer_impl.py verl/workers/engine/veomni/transformer_impl.py

Tests: /private/tmp/uv-cache/archive-v0/XkW9QpETjgABHVawsIv1F/ruff-0.13.2.data/scripts/ruff format --check tests/workers/test_distillation_topk_symmetry_on_cpu.py tests/workers/test_router_replay_engine_helpers_on_cpu.py verl/workers/engine/fsdp/transformer_impl.py verl/workers/engine/veomni/transformer_impl.py

Tests: PYTHONPYCACHEPREFIX=/private/tmp/verl-pycache python3 -m py_compile tests/workers/test_distillation_topk_symmetry_on_cpu.py tests/workers/test_router_replay_engine_helpers_on_cpu.py verl/workers/engine/fsdp/transformer_impl.py verl/workers/engine/veomni/transformer_impl.py

Tests: /Users/bytedance/Documents/VeOmni/.venv/bin/python /private/tmp/test_veomni_topk_integration.py

Note: git commit hook was bypassed because pre-commit hook environment installation repeatedly failed while downloading ruff==0.12.2 with ConnectionResetError.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
@zhangxin81 zhangxin81 changed the title Integrate VeOmni fused top-k distillation [fsdp, veomni] fix: wire fused top-k distillation outputs Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants