43 changes: 43 additions & 0 deletions dmpo/README.md
@@ -0,0 +1,43 @@
# Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

See [`REQUIRED_VERL.txt`](REQUIRED_VERL.txt) for the upstream repository, install mode (rolling `main`, pinned release tag, or pinned git commit), and copy-pastable `pip` / `git` instructions where they exist.


This repository hosts the community implementation for the paper [Beyond Mode Collapse: Distribution Matching for Diverse Reasoning](https://arxiv.org/pdf/2605.19461).

DMPO adds a group-wise distribution-matching objective over rollouts that share the same prompt `uid`.

The default implementation is `grpo_dmpo`, which combines the standard GRPO policy loss with the DMPO
distribution-matching loss. The recipe also provides these variants:

- `grpo_dmpo_zero`: skips zero-advantage groups during training, so groups without a useful advantage signal do not
  contribute to the DMPO term.
- `grpo_dmpo_js`: computes the gap between the current distribution and the target distribution with
  Jensen-Shannon divergence instead of the default MSE objective.
- `pure_dmpo`: updates only with the DMPO objective and does not include the GRPO policy loss.
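
As a rough illustration of what a group-wise distribution-matching term can look like, the sketch below groups rollouts by `uid`, builds a target distribution from rewards with a temperature-scaled softmax, and compares it with a distribution derived from per-rollout sequence log-probabilities using either MSE or Jensen-Shannon divergence. This is not the recipe's implementation: the function names, the softmax-over-rewards target, and the way the current distribution is formed are all assumptions made for illustration. Only the `uid` grouping and the MSE/JS choice come from the description above, and the 1/15 temperature simply mirrors the `dmpo_temperature` default in this recipe's config.

```python
import math
from collections import defaultdict


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse(p, q):
    """Mean squared error between two equal-length distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def group_matching_loss(uids, seq_logprobs, rewards, temperature=1.0 / 15.0, mode="mse"):
    """Average the per-group gap between a current and a target distribution.

    Hypothetical helper: the target (softmax of rewards / temperature) and the
    current distribution (softmax of sequence log-probs) are assumptions.
    """
    groups = defaultdict(list)
    for i, uid in enumerate(uids):
        groups[uid].append(i)  # rollouts sharing a prompt uid form one group

    gap = mse if mode == "mse" else js_divergence
    losses = []
    for idx in groups.values():
        current = softmax([seq_logprobs[i] for i in idx])
        target = softmax([rewards[i] for i in idx], temperature)
        losses.append(gap(current, target))
    return sum(losses) / len(losses)


# Two prompts with two rollouts each; equal rewards and equal log-probs within
# a group give identical uniform distributions, so the gap is zero.
matched = group_matching_loss(["a", "a", "b", "b"], [0.0, 0.0, -1.0, -1.0], [1.0, 1.0, 0.0, 0.0])
```

Under these assumptions, a group whose rewards are all equal has a uniform target, which is one way to see why a variant such as `grpo_dmpo_zero` might skip groups that carry no advantage signal.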

## Usage

Run from a verl checkout that has this repository mounted as the `recipe` submodule:

```bash
bash recipe/dmpo/run_qwen2.5-7b_math_grpo_dmpo_zero.sh
```


## 🖊️ Citation

If you find this work helpful, please consider **starring 🌟** this repo and citing the paper. Thanks for your support!

```bib
@misc{li2026modecollapsedistributionmatching,
title={Beyond Mode Collapse: Distribution Matching for Diverse Reasoning},
author={Xiaozhe Li and Yang Li and Xinyu Fang and Shengyuan Ding and Peiji Li and Yongkang Chen and Yichuan Ma and Tianyi Lyu and Linyang Li and Dahua Lin and Qipeng Guo and Qingwen Liu and Kai Chen},
year={2026},
eprint={2605.19461},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.19461},
}
```

13 changes: 13 additions & 0 deletions dmpo/REQUIRED_VERL.txt
@@ -0,0 +1,13 @@
# dmpo — rolling; exact commits refreshed from this workspace
UPSTREAM=https://github.com/verl-project/verl.git
MODE=rolling
BRANCH=main
# Exact upstream verl commit this file was refreshed against (core library).
VERL_COMMIT=bcb638649a50e58494a8ddd92085ad1174f674b8
PIP_INSTALL=pip install verl@git+https://github.com/verl-project/verl.git@bcb638649a50e58494a8ddd92085ad1174f674b8
GIT_SETUP=git clone https://github.com/verl-project/verl.git && cd verl && git checkout bcb638649a50e58494a8ddd92085ad1174f674b8 && git submodule update --init --recursive recipe
# Recipe submodule snapshot at the same verl checkout (see `git ls-tree HEAD recipe` in verl).
RECIPE_SUBMODULE_COMMIT=ba246418f4de12b845a09bba975f1a5242adc898
RECIPE_FOLDER=dmpo
NOTES=DMPO relies on the model-engine PPO path and patches the actor loss wrapper to pass prompt uid groups to the registered policy loss.
REFRESH=Recompute: (cd verl && git rev-parse HEAD); (cd verl/recipe && git rev-parse HEAD); (cd verl/recipe && git log -1 --format=%H -- dmpo)
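
The `KEY=VALUE` layout above is simple enough to check mechanically. The following is a minimal sketch, not part of the recipe: `parse_required_verl` is a hypothetical helper, and the sample text copies only a few entries from the file. It verifies that the install commands reference the same commit as `VERL_COMMIT`.

```python
def parse_required_verl(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        entries[key] = value
    return entries


# Subset of the entries shown above, reproduced for the check.
sample = """\
# Exact upstream verl commit this file was refreshed against (core library).
VERL_COMMIT=bcb638649a50e58494a8ddd92085ad1174f674b8
PIP_INSTALL=pip install verl@git+https://github.com/verl-project/verl.git@bcb638649a50e58494a8ddd92085ad1174f674b8
GIT_SETUP=git clone https://github.com/verl-project/verl.git && cd verl && git checkout bcb638649a50e58494a8ddd92085ad1174f674b8 && git submodule update --init --recursive recipe
"""

entries = parse_required_verl(sample)
# Both install paths should pin the commit recorded in VERL_COMMIT.
pins_agree = all(entries["VERL_COMMIT"] in entries[k] for k in ("PIP_INSTALL", "GIT_SETUP"))
```

A check like this could run whenever the file is refreshed, so that the commit hash is never updated in one entry and left stale in another.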
15 changes: 15 additions & 0 deletions dmpo/__init__.py
@@ -0,0 +1,15 @@
# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""DMPO recipe."""
32 changes: 32 additions & 0 deletions dmpo/config.py
@@ -0,0 +1,32 @@
# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass, field

from verl.workers.config import FSDPActorConfig, PolicyLossConfig


@dataclass
class DMPOPolicyLossConfig(PolicyLossConfig):
    """Policy loss config with DMPO-specific hyperparameters."""

    dmpo_beta: float = 1.0
    dmpo_temperature: float = 1.0 / 15.0


@dataclass
class DMPOActorConfig(FSDPActorConfig):
    """Actor config that accepts DMPO policy loss fields."""

    policy_loss: DMPOPolicyLossConfig = field(default_factory=DMPOPolicyLossConfig)
25 changes: 25 additions & 0 deletions dmpo/config/dmpo_trainer.yaml
@@ -0,0 +1,25 @@
# DMPO config overrides for verl PPO trainer.

hydra:
  searchpath:
    - file://verl/trainer/config

defaults:
  - ppo_trainer
  - _self_

actor_rollout_ref:
  model:
    external_lib: recipe.dmpo.dmpo_patch

  actor:
    _target_: recipe.dmpo.config.DMPOActorConfig

    policy_loss:
      _target_: recipe.dmpo.config.DMPOPolicyLossConfig
      loss_mode: grpo_dmpo_zero
      dmpo_beta: 1.0
      dmpo_temperature: 0.06666666666666667

algorithm:
  adv_estimator: grpo
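
The `policy_loss` values in this YAML are meant to line up with the dataclass defaults in `dmpo/config.py`. The sketch below checks that correspondence without importing verl. `PolicyLossConfigStub` and `ActorConfigStub` are hypothetical stand-ins for verl's `PolicyLossConfig` and `FSDPActorConfig`, whose real fields are not reproduced here, and the plain-dict construction only approximates what Hydra's `_target_` instantiation would do.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyLossConfigStub:
    # Hypothetical stand-in for verl's PolicyLossConfig (real fields omitted).
    loss_mode: str = "vanilla"


@dataclass
class DMPOPolicyLossConfig(PolicyLossConfigStub):
    dmpo_beta: float = 1.0
    dmpo_temperature: float = 1.0 / 15.0


@dataclass
class ActorConfigStub:
    # Hypothetical stand-in for verl's FSDPActorConfig (real fields omitted).
    policy_loss: PolicyLossConfigStub = field(default_factory=PolicyLossConfigStub)


@dataclass
class DMPOActorConfig(ActorConfigStub):
    # default_factory gives every actor config its own policy-loss instance.
    policy_loss: DMPOPolicyLossConfig = field(default_factory=DMPOPolicyLossConfig)


# Values copied from the policy_loss block of the YAML above.
yaml_overrides = {
    "loss_mode": "grpo_dmpo_zero",
    "dmpo_beta": 1.0,
    "dmpo_temperature": 0.06666666666666667,
}
actor = DMPOActorConfig(policy_loss=DMPOPolicyLossConfig(**yaml_overrides))

# The YAML literal and the Python expression 1.0 / 15.0 denote the same float.
same_temperature = actor.policy_loss.dmpo_temperature == DMPOPolicyLossConfig().dmpo_temperature
```

Subclassing the policy-loss config is what lets the extra `dmpo_beta` and `dmpo_temperature` keys pass through a typed config instead of being rejected as unknown fields.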