Implement DPO/ORPO training stage alongside GRPO #4

## Description

The current RLVR stage uses GRPO only. Add DPO and ORPO as alternative alignment strategies.

## Tasks

- [ ] Generate preference pairs from trajectory data (chosen = verified solution, rejected = failed attempt)
- [ ] Implement DPO trainer using TRL's `DPOTrainer`
- [ ] Implement ORPO trainer as an alternative (combines SFT and preference optimization in a single stage)
- [ ] Add training stage selector in config: `rlvr_method: grpo | dpo | orpo`
- [ ] Benchmark comparison: GRPO vs DPO vs ORPO on SWE-bench Verified
- [ ] Update notebook `02_Training.ipynb` with stage selection
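
A minimal sketch of the first and fourth tasks, to make the intended data shape concrete. It assumes trajectories carry a prompt, a completion, and a verification flag; the `Trajectory` class, `build_preference_pairs`, `select_method`, and the `max_pairs_per_prompt` cap are hypothetical names, not existing project code. The output rows use the `prompt` / `chosen` / `rejected` keys that TRL's preference trainers document as their dataset format. Defaulting to `grpo` when the key is missing is also an assumption, based on GRPO being the current behaviour.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

# Values allowed for the proposed `rlvr_method` config key.
VALID_METHODS = ("grpo", "dpo", "orpo")


@dataclass
class Trajectory:
    """Hypothetical record for one attempt at a task."""
    prompt: str      # task description shown to the model
    completion: str  # the model's attempted solution
    verified: bool   # True if the attempt passed verification


def build_preference_pairs(
    trajectories: List[Trajectory],
    max_pairs_per_prompt: int = 4,  # assumed cap to limit pair explosion
) -> List[Dict[str, str]]:
    """Pair each verified solution with failed attempts on the same prompt."""
    by_prompt: Dict[str, List[Trajectory]] = {}
    for t in trajectories:
        by_prompt.setdefault(t.prompt, []).append(t)

    pairs: List[Dict[str, str]] = []
    for prompt, group in by_prompt.items():
        chosen = [t.completion for t in group if t.verified]
        rejected = [t.completion for t in group if not t.verified]
        # Prompts with only successes or only failures yield no pairs.
        for c, r in list(product(chosen, rejected))[:max_pairs_per_prompt]:
            pairs.append({"prompt": prompt, "chosen": c, "rejected": r})
    return pairs


def select_method(config: Dict[str, str]) -> str:
    """Read and validate `rlvr_method`; falls back to GRPO (assumption)."""
    method = str(config.get("rlvr_method", "grpo")).lower()
    if method not in VALID_METHODS:
        raise ValueError(
            f"rlvr_method must be one of {VALID_METHODS}, got {method!r}"
        )
    return method
```

Prompts that have no verified solution, or no failed attempt, contribute nothing to the DPO/ORPO dataset under this scheme, so the pair count should be checked before benchmarking against GRPO.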

## References

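- DPO: Rafailov et al., 2023, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
- ORPO: Hong et al., 2024, "ORPO: Monolithic Preference Optimization without Reference Model"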
- VeRPO dense rewards should inform preference pair construction
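
One way the last point could work is sketched below: rather than a binary verified/failed split, attempts are ranked by a scalar reward and paired only when the reward gap is large enough. This makes no claim about how VeRPO actually computes its rewards. The function name `pairs_from_dense_rewards`, the `(completion, reward)` input shape, the `[0, 1]` reward scale, and the `min_margin` threshold are all assumptions for illustration.

```python
from typing import Dict, List, Tuple, Union


def pairs_from_dense_rewards(
    prompt: str,
    scored_attempts: List[Tuple[str, float]],  # (completion, reward in [0, 1]), assumed shape
    min_margin: float = 0.5,                   # assumed threshold, to be tuned
) -> List[Dict[str, Union[str, float]]]:
    """Build (chosen, rejected) pairs whose reward gap is at least `min_margin`."""
    # Highest reward first, so every later attempt is a candidate "rejected".
    ranked = sorted(scored_attempts, key=lambda a: a[1], reverse=True)

    pairs: List[Dict[str, Union[str, float]]] = []
    for i, (better, r_hi) in enumerate(ranked):
        for worse, r_lo in ranked[i + 1:]:
            margin = r_hi - r_lo
            if margin >= min_margin:
                pairs.append({
                    "prompt": prompt,
                    "chosen": better,
                    "rejected": worse,
                    "margin": margin,  # kept for filtering or weighting later
                })
    return pairs
```

Compared with the binary scheme, this can produce pairs for prompts where no attempt fully passes verification, at the cost of noisier preferences when the margin is small.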
