Implement DPO/ORPO training stage alongside GRPO #4

Description

@Edmon02

The current RLVR stage uses GRPO only. Add DPO and ORPO as alternative alignment strategies.

Tasks

  • Generate preference pairs from trajectory data (chosen = verified solution, rejected = failed attempt)
  • Implement DPO trainer using TRL's DPOTrainer
  • Implement ORPO trainer as an alternative (combines SFT and preference optimization in a single stage)
  • Add training stage selector in config: rlvr_method: grpo | dpo | orpo
  • Benchmark GRPO vs DPO vs ORPO on SWE-bench Verified and report the comparison
  • Update notebook 02_Training.ipynb with stage selection
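
A rough sketch of the first and fourth tasks, to make the intended data shape concrete. The `Trajectory` fields, the `max_pairs_per_task` cap, and the fallback to `grpo` when `rlvr_method` is missing are all assumptions for illustration, not the repo's actual schema. The output dicts use `prompt` / `chosen` / `rejected` keys, which is the preference format TRL's `DPOTrainer` accepts.

```python
from dataclasses import dataclass
from itertools import product
from typing import Literal, get_args


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record; field names are assumptions, not the repo's schema.
    task_id: str
    prompt: str
    completion: str
    verified: bool  # True if the solution passed verification


def build_preference_pairs(trajectories, max_pairs_per_task=4):
    """Pair verified solutions (chosen) with failed attempts (rejected) per task."""
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t.task_id, []).append(t)

    pairs = []
    for task_trajs in by_task.values():
        chosen = [t for t in task_trajs if t.verified]
        rejected = [t for t in task_trajs if not t.verified]
        # Tasks with only successes or only failures produce no pairs.
        for c, r in list(product(chosen, rejected))[:max_pairs_per_task]:
            pairs.append(
                {"prompt": c.prompt, "chosen": c.completion, "rejected": r.completion}
            )
    return pairs


RLVRMethod = Literal["grpo", "dpo", "orpo"]


def resolve_rlvr_method(config: dict) -> str:
    """Validate the rlvr_method key; falling back to grpo is an assumption."""
    method = config.get("rlvr_method", "grpo")
    if method not in get_args(RLVRMethod):
        raise ValueError(
            f"rlvr_method must be one of {get_args(RLVRMethod)}, got {method!r}"
        )
    return method
```

The resolved method string could then be used to dispatch to the matching trainer. Rejecting unknown values early keeps a typo in the config from silently running the wrong stage.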

References

  • DPO: Rafailov et al., 2023
  • ORPO: Hong et al., 2024
  • VeRPO dense rewards should inform preference pair construction
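
One possible reading of the last point, sketched under heavy assumptions: if each attempt carries a scalar dense reward, pairs could be formed from any two attempts whose rewards differ by at least some margin, rather than only from a binary pass/fail split. The function name, the `(completion, reward)` tuple shape, and the `min_margin` threshold are hypothetical, and nothing here reflects how VeRPO actually computes its rewards.

```python
from itertools import combinations


def pairs_from_dense_rewards(prompt, scored_completions, min_margin=0.2):
    """Build preference pairs from (completion, reward) tuples for one prompt.

    A pair is kept only when the reward gap is at least min_margin, so
    near-ties do not become noisy preferences.
    """
    pairs = []
    for (a, reward_a), (b, reward_b) in combinations(scored_completions, 2):
        margin = abs(reward_a - reward_b)
        if margin < min_margin:
            continue  # too close to call; skip this pair
        chosen, rejected = (a, b) if reward_a > reward_b else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen, "rejected": rejected, "margin": margin}
        )
    return pairs
```

The `margin` key is kept only for inspection or later filtering. The right threshold would need to be tuned against the actual reward distribution.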
