[Bug]: Qwen3-VL accuracy differs between graph mode enabled and disabled on vLLM 0.20.2

### Your current environment

vllm version: 0.20.2

### 🐛 Describe the bug

To evaluate inference only, the launch script sets `+trainer.val_only=True` and `trainer.val_before_train=True`. On the same dataset (geo3k), the evaluation score differs between the run with graph mode enabled and the run with graph mode disabled: one run scores 0.49 and the other 0.42.
```bash
    # Trainer overrides excerpted from the launch script (other arguments omitted)
    trainer.project_name='verl_grpo_example_geo3k' \
    trainer.experiment_name='qwen3_vl_30b_megatron' \
    trainer.n_gpus_per_node=16 \
    trainer.device=npu \
    trainer.nnodes=1 \
    +trainer.val_only=True \
    trainer.val_before_train=True \
    trainer.resume_mode=disable \
    trainer.default_local_dir='/home/l00937981/qwen3vl_vllm0202/ckpt' \
    trainer.save_freq=5 \
    trainer.test_freq=5 \
```
The training accuracy curves of the two configurations also do not line up at all. The clearest discrepancy is in the `rollout_probs_diff_mean` metric: it averages 0.02 with graph mode enabled but only 0.007 with graph mode disabled, nearly a threefold difference.
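
For anyone trying to reproduce the gap outside the training loop, here is a minimal sketch of how a metric like `rollout_probs_diff_mean` can be computed from two sets of per-token log-probabilities. It assumes the metric is the mean absolute difference between the rollout-side and trainer-side token probabilities; that definition, the helper name `probs_diff_stats`, and the sample values are illustrative assumptions and are not taken from the verl or vLLM source.

```python
import math


def probs_diff_stats(rollout_logprobs, trainer_logprobs):
    """Return (mean, max) of |p_rollout - p_trainer| over aligned tokens.

    Hypothetical helper: both inputs are per-token log-probabilities of the
    same sampled tokens, one list from the inference engine and one from
    the training-side forward pass.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not rollout_logprobs:
        raise ValueError("log-prob sequences must not be empty")
    # Compare in probability space, so convert log-probs with exp().
    diffs = [
        abs(math.exp(r) - math.exp(t))
        for r, t in zip(rollout_logprobs, trainer_logprobs)
    ]
    return sum(diffs) / len(diffs), max(diffs)


# Made-up values, for illustration only (not measurements from this issue).
rollout = [math.log(0.50), math.log(0.25)]
trainer = [math.log(0.50), math.log(0.75)]
mean_diff, max_diff = probs_diff_stats(rollout, trainer)
print(f"mean={mean_diff:.4f} max={max_diff:.4f}")
```

Running this on the same prompts and sampled tokens, once with graph mode enabled and once with it disabled, would show whether the difference comes from the inference engine itself rather than from the training pipeline.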

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
