verl-project · OliverLeeXZ · May 27, 2026 · May 27, 2026
diff --git a/dmpo/README.md b/dmpo/README.md
@@ -0,0 +1,43 @@
+# Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
+
+See [`REQUIRED_VERL.txt`](REQUIRED_VERL.txt) for the upstream repository, install mode (rolling `main`, pinned release tag, or pinned git commit), and copy-pastable `pip` / `git` instructions where they exist.
+
+
+This repository hosts the community implementation for the paper [Beyond Mode Collapse: Distribution Matching for Diverse Reasoning](https://arxiv.org/pdf/2605.19461). 
+
+DMPO adds a group-wise distribution-matching objective over rollouts that share the same prompt `uid`.
+
+The default implementation is `grpo_dmpo`, which combines the standard GRPO policy loss with the DMPO
+distribution-matching loss. The recipe also provides these variants:
+
+- `grpo_dmpo_zero`: skips zero-advantage groups during training, so groups without a useful advantage signal do not
+  contribute to the DMPO term.
+- `grpo_dmpo_js`: computes the gap between the current distribution and the target distribution with
+  Jensen-Shannon divergence instead of the default MSE objective.
+- `pure_dmpo`: updates only with the DMPO objective and does not include the GRPO policy loss.
+
+## Usage
+
+Run from a verl checkout that has this repository mounted as the `recipe` submodule:
+
+```bash
+bash recipe/dmpo/run_qwen2.5-7b_math_grpo_dmpo_zero.sh
+```
+
+
+## 🖊️ Citation
+
+If you find this work helpful, please consider giving this repo a **star🌟** and citing the paper. Thanks for your support!
+
+```bib
+@misc{li2026modecollapsedistributionmatching,
+      title={Beyond Mode Collapse: Distribution Matching for Diverse Reasoning}, 
+      author={Xiaozhe Li and Yang Li and Xinyu Fang and Shengyuan Ding and Peiji Li and Yongkang Chen and Yichuan Ma and Tianyi Lyu and Linyang Li and Dahua Lin and Qipeng Guo and Qingwen Liu and Kai Chen},
+      year={2026},
+      eprint={2605.19461},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2605.19461}, 
+}
+```
+
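Reviewer sketch, not part of the patch: the README says `grpo_dmpo_js` measures the gap between the current and target distributions with Jensen-Shannon divergence instead of MSE. The snippet below compares the two measures on toy discrete distributions. How DMPO actually builds those distributions is not shown in this diff, so `current`, `target`, and the helper names are hypothetical.

```python
import math


def mse(p, q):
    """Mean squared error between two equal-length probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy distributions over three outcomes (illustrative values only).
current = [0.7, 0.2, 0.1]
target = [0.4, 0.4, 0.2]
print(mse(current, target), js_divergence(current, target))
```

Unlike MSE, the JS divergence is symmetric in its arguments and bounded above by ln 2, which it reaches only when the two distributions have disjoint support.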
diff --git a/dmpo/REQUIRED_VERL.txt b/dmpo/REQUIRED_VERL.txt
@@ -0,0 +1,13 @@
+# dmpo — rolling; exact commits refreshed from this workspace
+UPSTREAM=https://github.com/verl-project/verl.git
+MODE=rolling
+BRANCH=main
+# Exact upstream verl commit this file was refreshed against (core library).
+VERL_COMMIT=bcb638649a50e58494a8ddd92085ad1174f674b8
+PIP_INSTALL=pip install verl@git+https://github.com/verl-project/verl.git@bcb638649a50e58494a8ddd92085ad1174f674b8
+GIT_SETUP=git clone https://github.com/verl-project/verl.git && cd verl && git checkout bcb638649a50e58494a8ddd92085ad1174f674b8 && git submodule update --init --recursive recipe
+# Recipe submodule snapshot at the same verl checkout (see `git ls-tree HEAD recipe` in verl).
+RECIPE_SUBMODULE_COMMIT=ba246418f4de12b845a09bba975f1a5242adc898
+RECIPE_FOLDER=dmpo
+NOTES=DMPO relies on the model-engine PPO path and patches the actor loss wrapper to pass prompt uid groups to the registered policy loss.
+REFRESH=Recompute: (cd verl && git rev-parse HEAD); (cd verl/recipe && git rev-parse HEAD); (cd verl/recipe && git log -1 --format=%H -- dmpo)
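Reviewer sketch, not part of the patch: the NOTES line says the patched loss wrapper passes prompt `uid` groups to the policy loss, and the README says `grpo_dmpo_zero` skips zero-advantage groups. Assuming the usual GRPO convention that advantages are rewards centered within each group, a group has zero advantage everywhere when all of its rollouts received the same reward. The function name and the plain-list data layout are hypothetical; the recipe's own code is not shown in this part of the diff.

```python
from collections import defaultdict


def nonzero_advantage_groups(uids, rewards, tol=1e-8):
    """Group rollout indices by prompt uid and keep groups with a nonzero advantage."""
    groups = defaultdict(list)
    for idx, uid in enumerate(uids):
        groups[uid].append(idx)

    kept = {}
    for uid, idxs in groups.items():
        mean = sum(rewards[i] for i in idxs) / len(idxs)
        # Centered rewards stand in for GRPO advantages (assumption).
        advantages = [rewards[i] - mean for i in idxs]
        if any(abs(a) > tol for a in advantages):
            kept[uid] = idxs
    return kept


uids = ["a", "a", "a", "b", "b", "b"]
rewards = [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(nonzero_advantage_groups(uids, rewards))  # {'a': [0, 1, 2]}
```

Group `b` is dropped because every rollout scored the same, so it carries no relative signal.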
diff --git a/dmpo/__init__.py b/dmpo/__init__.py
@@ -0,0 +1,15 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""DMPO recipe."""
diff --git a/dmpo/config.py b/dmpo/config.py
@@ -0,0 +1,32 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass, field
+
+from verl.workers.config import FSDPActorConfig, PolicyLossConfig
+
+
+@dataclass
+class DMPOPolicyLossConfig(PolicyLossConfig):
+    """Policy loss config with DMPO-specific hyperparameters."""
+
+    dmpo_beta: float = 1.0
+    dmpo_temperature: float = 1.0 / 15.0  # ~0.0667; same value as in config/dmpo_trainer.yaml
+
+
+@dataclass
+class DMPOActorConfig(FSDPActorConfig):
+    """Actor config that accepts DMPO policy loss fields."""
+
+    policy_loss: DMPOPolicyLossConfig = field(default_factory=DMPOPolicyLossConfig)
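Reviewer sketch, not part of the patch: the dataclass default `1.0 / 15.0` and the literal `0.06666666666666667` in `config/dmpo_trainer.yaml` should be the same number. The check below uses stand-in base classes so it runs without verl installed. `StubPolicyLossConfig`, `StubActorConfig`, and the `loss_mode` default are hypothetical and only mirror the shape of the real classes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StubPolicyLossConfig:
    """Hypothetical stand-in for verl's PolicyLossConfig."""

    loss_mode: str = "vanilla"


@dataclass
class DMPOPolicyLossConfig(StubPolicyLossConfig):
    """Mirrors the DMPO fields added in this diff."""

    dmpo_beta: float = 1.0
    dmpo_temperature: float = 1.0 / 15.0


@dataclass
class StubActorConfig:
    """Hypothetical stand-in for the actor config."""

    # default_factory gives each actor config its own policy_loss instance.
    policy_loss: DMPOPolicyLossConfig = field(default_factory=DMPOPolicyLossConfig)


yaml_temperature = 0.06666666666666667  # literal copied from dmpo_trainer.yaml
cfg = StubActorConfig()
print(math.isclose(cfg.policy_loss.dmpo_temperature, yaml_temperature, rel_tol=1e-12))
```

Keeping the Python default and the YAML override numerically identical means the behavior does not change depending on whether the YAML value is applied.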
diff --git a/dmpo/config/dmpo_trainer.yaml b/dmpo/config/dmpo_trainer.yaml
@@ -0,0 +1,25 @@
+# DMPO config overrides for verl PPO trainer.
+
+hydra:
+  searchpath:
+    - file://verl/trainer/config
+
+defaults:
+  - ppo_trainer
+  - _self_
+
+actor_rollout_ref:
+  model:
+    external_lib: recipe.dmpo.dmpo_patch
+
+  actor:
+    _target_: recipe.dmpo.config.DMPOActorConfig
+
+    policy_loss:
+      _target_: recipe.dmpo.config.DMPOPolicyLossConfig
+      loss_mode: grpo_dmpo_zero
+      dmpo_beta: 1.0
+      dmpo_temperature: 0.06666666666666667
+
+algorithm:
+  adv_estimator: grpo