## Description

The current RLVR stage uses GRPO only. Add DPO and ORPO as alternative alignment strategies.

## Tasks

- [ ] Generate preference pairs from trajectory data (chosen = verified solution, rejected = failed attempt)
- [ ] Implement DPO trainer using TRL's `DPOTrainer`
- [ ] Implement ORPO trainer as an alternative (combines SFT + preference in single stage)
- [ ] Add training stage selector in config: `rlvr_method: grpo | dpo | orpo`
- [ ] Benchmark comparison: GRPO vs DPO vs ORPO on SWE-bench Verified
- [ ] Update notebook `02_Training.ipynb` with stage selection

## References

- DPO: Rafailov et al., 2023
- ORPO: Hong et al., 2024
- VeRPO dense rewards should inform preference pair construction
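
For the preference-pair task, a minimal sketch of one possible construction is shown here. It assumes a hypothetical `Trajectory` record with `task_id`, `prompt`, `completion`, and `verified` fields (none of these names come from the existing codebase), and it emits rows with `prompt`, `chosen`, and `rejected` keys, which is the explicit-prompt preference format that TRL's `DPOTrainer` accepts. The `max_pairs_per_task` cap is also an assumption, added so that tasks with many attempts do not dominate the dataset. The sketch uses only the binary verified/failed signal and does not yet incorporate dense rewards.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Trajectory:
    """Hypothetical record for one attempt at a task (field names are assumptions)."""
    task_id: str
    prompt: str
    completion: str
    verified: bool  # True if the attempt passed verification


def build_preference_pairs(trajectories, max_pairs_per_task=4):
    """Pair verified solutions (chosen) with failed attempts (rejected), per task."""
    # Group attempts by task so pairs never mix completions from different tasks.
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t.task_id, []).append(t)

    pairs = []
    for attempts in by_task.values():
        chosen = [t for t in attempts if t.verified]
        rejected = [t for t in attempts if not t.verified]
        # Tasks with only successes or only failures produce no pairs.
        for n, (good, bad) in enumerate(product(chosen, rejected)):
            if n >= max_pairs_per_task:
                break
            pairs.append(
                {
                    "prompt": good.prompt,
                    "chosen": good.completion,
                    "rejected": bad.completion,
                }
            )
    return pairs
```

The resulting list of dicts could then be wrapped with `datasets.Dataset.from_list(pairs)` before being handed to a trainer. Tasks where every attempt failed (or every attempt passed) are silently skipped, which may be worth logging in a real implementation.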
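
For the `rlvr_method: grpo | dpo | orpo` selector, validation could look roughly like the sketch below. It assumes the config is available as a plain dict and that a missing key should fall back to `grpo` to preserve current behavior. Both are assumptions, and the helper names `resolve_rlvr_method` and `needs_preference_pairs` are hypothetical.

```python
VALID_RLVR_METHODS = ("grpo", "dpo", "orpo")


def resolve_rlvr_method(config):
    """Return the validated rlvr_method from a config dict (defaults to 'grpo')."""
    # Assumption: a missing key keeps the existing GRPO-only behavior.
    method = str(config.get("rlvr_method", "grpo")).lower()
    if method not in VALID_RLVR_METHODS:
        raise ValueError(
            f"rlvr_method must be one of {VALID_RLVR_METHODS}, got {method!r}"
        )
    return method


def needs_preference_pairs(method):
    """DPO and ORPO train on chosen/rejected pairs; GRPO does not."""
    return method in ("dpo", "orpo")
```

Failing fast on an unknown value keeps a typo in the config from silently falling through to the default method.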