Your current environment
vllm version: 0.20.2
🐛 Describe the bug
In the launch script, +trainer.val_only=True and trainer.val_before_train=True were set so that the run performs evaluation only. The evaluation scores on the same dataset (geo3k) were 0.49 and 0.42, respectively. The relevant trainer options were:
trainer.project_name='verl_grpo_example_geo3k' \
trainer.experiment_name='qwen3_vl_30b_megatron' \
trainer.n_gpus_per_node=16 \
trainer.device=npu \
trainer.nnodes=1 \
+trainer.val_only=True \
trainer.val_before_train=True \
trainer.resume_mode=disable \
trainer.default_local_dir='/home/l00937981/qwen3vl_vllm0202/ckpt' \
trainer.save_freq=5 \
trainer.test_freq=5 \
The training accuracy curves are also completely misaligned. In particular, the rollout_probs_diff_mean metric differs significantly: it averages 0.02 with graph mode enabled but only 0.007 with graph mode disabled.
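For readers unfamiliar with the metric, here is a minimal sketch of how a quantity like rollout_probs_diff_mean could be computed. It assumes the metric is the mean absolute difference between per-token probabilities reported by the rollout engine and those recomputed by the training side, averaged over response tokens. The function name and arguments are hypothetical and are not taken from the verl source.

```python
import math


def probs_diff_mean(rollout_logprobs, actor_logprobs, response_mask):
    """Mean absolute per-token probability difference between two engines.

    Hypothetical helper, not the verl implementation. All three arguments
    are flat sequences of equal length; response_mask marks the tokens
    that count towards the average.
    """
    total = 0.0
    count = 0
    for rollout_lp, actor_lp, keep in zip(rollout_logprobs, actor_logprobs, response_mask):
        if keep:
            # Compare probabilities rather than log-probabilities.
            total += abs(math.exp(rollout_lp) - math.exp(actor_lp))
            count += 1
    # Return 0.0 when no token is selected, to avoid dividing by zero.
    return total / count if count else 0.0


# Toy example: the two engines disagree on the first token only.
rollout = [math.log(0.5), math.log(0.2)]
actor = [math.log(0.4), math.log(0.2)]
print(probs_diff_mean(rollout, actor, [1, 1]))
```

Under this assumed definition, a value of 0.02 would mean that the two sides disagree by about two percentage points of probability per token on average, roughly three times the 0.007 seen with graph mode disabled.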
Before submitting a new issue...