Loss scale mismatch when enabling sequence parallelism #72

@DSTTSD

Reminder

  • I have read the README and searched the existing issues.

System Info

When sequence parallelism (SP) is enabled, the reported training loss is significantly larger than in the non-SP setup (roughly 5 versus 1.x). This is likely because the loss is not normalized over the correct global number of tokens across SP ranks. I am using transformers==4.51.3.
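
To make the suspected cause concrete, here is a minimal sketch in plain Python. It only illustrates the arithmetic and is not taken from the training code: the function names and the toy numbers are hypothetical, and it assumes (unconfirmed) that each SP rank averages the loss over its own token shard and that these local means are then summed across ranks instead of being normalized by the global token count.

```python
# Hypothetical illustration only; none of these names come from the framework.

def global_token_mean(loss_sums, token_counts):
    """Sum per-token losses over all SP ranks, divide by the global token count."""
    return sum(loss_sums) / sum(token_counts)


def sum_of_local_means(loss_sums, token_counts):
    """Assumed faulty path: each rank divides by its own token count,
    then the per-rank means are summed across ranks."""
    return sum(s / n for s, n in zip(loss_sums, token_counts) if n > 0)


# Toy setup: one 4000-token sequence split evenly over 4 SP ranks,
# with a per-token loss of 1.25 on every token.
token_counts = [1000, 1000, 1000, 1000]
loss_sums = [1.25 * n for n in token_counts]

correct = global_token_mean(loss_sums, token_counts)
inflated = sum_of_local_means(loss_sums, token_counts)

print(f"global token mean:  {correct}")
print(f"sum of local means: {inflated}")
```

Under these assumptions the second value is larger than the first by a factor equal to the SP size, which would be in the same range as the gap reported here (~5 with SP=4 versus ~1.x without SP).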

Reproduction

### model
model_name_or_path: 

### method
stage: sft
do_train: true
finetuning_type: full
# lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: ccaicorpus_hard
template: qwen
cutoff_len: 4000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 20


### output
output_dir: 
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
# resume_from_checkpoint: true
# enable_thinking: false
flash_attn: fa2
# neat_packing: true
sequence_parallel_size: 4

report_to: none

Expected behavior

Without SP:

[Image: reported loss without SP]

With SP=4:

[Image: reported loss with SP=4]

Others

No response
