[recipe] add routing-aware replay utilities for MoE RL #114

Open
kaining-never-stop wants to merge 2 commits into verl-project:main from kaining-never-stop:recipe/routing-aware-replay

kaining-never-stop commented Jun 21, 2026

This is a small draft PR for the RFC here:

verl-project/verl#6805

The idea is to keep the first version deliberately lightweight: a self-contained routing_aware_replay recipe for comparing replay masks in MoE RL, without touching verl core.

What is included:

  • Fisher-weighted replay mask construction
  • budget-matched uniform/random controls
  • compact replay diagnostics
  • a CPU-only synthetic example
  • unit tests for the mask and diagnostics behavior
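For readers less familiar with budget-matched controls, the idea can be sketched as follows. This is an illustration only, not the code in this PR: given any 0/1 replay mask, a random control replays the same number of positions, chosen uniformly at random, so that differences between conditions cannot be explained by replay volume alone. The function name `budget_matched_random_mask` and the flat-list mask layout are assumptions made for this example.

```python
import random


def budget_matched_random_mask(reference_mask, seed=0):
    """Return a random 0/1 mask with the same number of ones as reference_mask.

    Hypothetical helper for illustration; the recipe's actual API may differ.
    """
    # The "budget" is how many positions the reference mask replays.
    budget = sum(1 for m in reference_mask if m)
    rng = random.Random(seed)
    # Pick exactly `budget` distinct positions uniformly at random.
    chosen = set(rng.sample(range(len(reference_mask)), budget))
    return [1 if i in chosen else 0 for i in range(len(reference_mask))]


reference = [1, 0, 0, 1, 1, 0, 0, 0]
control = budget_matched_random_mask(reference, seed=42)
# The control has the same length and the same replay budget as the reference.
assert len(control) == len(reference)
assert sum(control) == sum(reference)
```

Seeding a local `random.Random` keeps the control reproducible without touching global random state.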

This is not meant to be a full training recipe yet. I kept it CPU-testable so it is easier to review first; if the direction looks useful, I can add a small MoE training config or align the schema with an upstream router replay output format in a follow-up.

I ran:

python examples/synthetic_router_replay_demo.py
python -m unittest discover -s tests
pytest tests

and the ruff hooks on the new Python files:

pre-commit run ruff --files <new python files>
pre-commit run ruff-format --files <new python files>

gemini-code-assist bot left a comment

Contributor

Code Review

This pull request introduces the routing_aware_replay utility, which provides tools for studying routing-aware replay policies in MoE RL post-training, including Fisher-weighted replay masks, budget-matched baselines, and diagnostics. The review feedback highlights a few improvement opportunities: handling identical scores during min-max normalization to avoid discarding high identical scores, validating that tau lies between theta_low and theta_high in the configuration schema, and adding corresponding unit tests to verify the behavior of identical scores.

Comment thread routing_aware_replay/routing_aware_replay/fisher_mask.py Outdated
Comment thread routing_aware_replay/routing_aware_replay/schema.py
Comment thread routing_aware_replay/tests/test_fisher_mask.py
kaining-never-stop commented Jun 21, 2026

Author

Thanks, these edge cases make sense. I pushed 7eb7c50 to handle them:

  • identical positive scores now keep the replay mask on instead of being normalized away;
  • identical zero scores still produce a zero mask;
  • tau now has to stay between theta_low and theta_high;
  • added tests for those cases.
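The edge cases listed above can be sketched roughly like this. It is a minimal illustration of the described behavior, not the code from 7eb7c50: the helper names `normalize_scores` and `validate_thresholds` are hypothetical, the input is assumed to be a non-empty list of non-negative scores, and treating the `tau` bounds as inclusive is also an assumption.

```python
def normalize_scores(scores):
    """Min-max normalize scores to [0, 1], handling the all-identical case.

    Hypothetical helper; assumes a non-empty list of non-negative scores.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate range: plain min-max would divide by zero here.
        # Identical positive scores keep the mask on; identical zeros keep it off.
        fill = 1.0 if hi > 0 else 0.0
        return [fill] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def validate_thresholds(theta_low, tau, theta_high):
    """Raise ValueError unless theta_low <= tau <= theta_high (inclusive bounds assumed)."""
    if not theta_low <= tau <= theta_high:
        raise ValueError(
            f"tau={tau} must lie between theta_low={theta_low} and theta_high={theta_high}"
        )


assert normalize_scores([2.0, 2.0, 2.0]) == [1.0, 1.0, 1.0]
assert normalize_scores([0.0, 0.0]) == [0.0, 0.0]
assert normalize_scores([1.0, 3.0, 5.0]) == [0.0, 0.5, 1.0]
validate_thresholds(0.1, 0.5, 0.9)  # passes silently
```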

I re-ran the demo, unittest, pytest, ruff, and the ruff pre-commit hooks on the new Python files.

@kaining-never-stop kaining-never-stop marked this pull request as ready for review June 21, 2026 12:04
@kaining-never-stop

Author

Just checking in. The bot review comments have been addressed, and the PR is ready for maintainer review. Happy to rename the recipe or trim the scope if that would make the first version easier to review.
