Error when loading dataset #76

Description

@Zheng-Liming

Reminder

  • I have read the README and searched the existing issues.

System Info

Hello, I am trying to train on custom data. During dataset loading, the preprocess_sp_dataset function raises the following error:

len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 1, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/01986958_fortwaltonmetalrecycling.com/image_inputs_anyres/start0_0_1280_1418_crop.png']
len(seq_ids) = 2, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/08713441_www.raysahelian.com_water.html/image_inputs_anyres/start0_0_1280_1768_crop.png', '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/07875683_www.locktoncharity.org.uk_post_edinburg/image_inputs_anyres/start0_0_1280_1051_crop.png']

[rank1]: multiprocess.pool.RemoteTraceback: 
[rank1]: """
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank1]:     result = (True, func(*args, **kwds))
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank1]:     for i, result in enumerate(func(**kwargs)):
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
[rank1]:     batch = apply_function_on_filtered_inputs(
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs
[rank1]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/processors/sequence_parallel.py", line 46, in sp_split
[rank1]:     preprocess_sp_dataset(row, model_args.sequence_parallel_size, model_args.sequence_parallel_mode)
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/data_utils.py", line 103, in preprocess_sp_dataset
[rank1]:     value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]
[rank1]: ValueError: range() arg 3 must not be zero
[rank1]: """

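The exact location of the error is shown below. To make troubleshooting easier, seq_ids and related information are printed in the log above: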

def preprocess_sp_dataset(seq_ids, world_size, sequence_parallel_mode):
    if sequence_parallel_mode == "zigzag-ring":
        step = len(seq_ids) // (2 * world_size)
        value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]
        ...
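
The failure can be reproduced in isolation with the values from the log above. This is a minimal sketch, not the project's code: `chunk_zigzag` is a hypothetical helper that copies only the two quoted lines, and the one-element list stands in for the image path list that the log shows with len(seq_ids) = 1. With world_size = 2 the integer division 1 // 4 gives step = 0, which range() rejects.

```python
def chunk_zigzag(seq_ids, world_size):
    """Hypothetical helper: copies only the two quoted lines of the zigzag-ring branch."""
    step = len(seq_ids) // (2 * world_size)
    return [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]


# A sequence padded to cutoff_len splits cleanly: 131072 // (2 * 2) = 32768 per chunk.
chunks = chunk_zigzag(list(range(131072)), world_size=2)
num_chunks = len(chunks)

# A one-element list (like the single image path in the log) gives step == 0,
# so range(0, 1, 0) raises ValueError.
error_message = None
try:
    chunk_zigzag(["/aaa/bbb.png"], world_size=2)
except ValueError as exc:
    error_message = str(exc)
```

Judging from the log, the short inputs appear to be the lists of image paths rather than token sequences, so any column with fewer than 2 * world_size elements would hit this error.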

My dataset is in OpenAI format, as follows:

[
  {
    "messages": [{"role": "user", "content": "<image>xxx"}, {"role": "assistant", "content": "xxx"}],
    "images": ["/aaa/bbb.png"]
  },
  ...
]

What causes this problem, and how can it be resolved?

Reproduction

NNODES=1
GPUS_PER_NODE=4
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=12345
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --flash_attn fa2 \
    --sequence_parallel_size 2 \
    --sequence_parallel_mode zigzag-ring \
    --stage sft \
    --do_train \
    --model_name_or_path ${PRETRAINED} \
    --dataset $DATA \
    --dataset_dir ${DATASET_DIR} \
    --template qwen2_vl_orig \
    --finetuning_type full \
    --output_dir ${output_root}/${exp_name} \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --gradient_accumulation_steps 1 \
    --ddp_timeout 500000 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 131072 \
    --save_steps 4000 \
    --save_total_limit 10 \
    --plot_loss \
    --overwrite_cache \
    --num_train_epochs ${num_train_epochs} \
    --bf16 \
    --preprocessing_num_workers 8 \
    --preprocessing_batch_size 8 \
    --packing True \
    --tf32 True \
    --cache_dir /home/user/360-LLaMA_Factory/exps/cache 

Expected behavior

No response

Others

No response
