Streaming load fails because extra arguments are passed: TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc' #57

Description

@CxsGhost

Reminder

  • I have read the README and searched the existing issues.

System Info

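With streaming enabled, loading fails with TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
The failing call is here: https://github.com/Qihoo360/360-LLaMA-Factory/blob/3bc07289eefcf8c8ea05f553e4ef0b82008419e4/src/llamafactory/data/loader.py#L224
After checking the Datasets library, I found that the map function of IterableDataset does not accept the three arguments passed in kwargs: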

    kwargs = dict(
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=(not data_args.overwrite_cache) or (training_args.local_process_index != 0),
        desc="Running sequence parallel split on dataset",
    )

Reproduction

Enabling streaming is enough to trigger it: --streaming True

Expected behavior

The map function of a regular Dataset accepts these arguments:
Image
The map function of a streaming IterableDataset:
Image

Fix: adding an extra check in _get_sequence_parallel_dataset is enough. With that change it runs fine for me locally.
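A minimal sketch of what such a check could look like, assuming the streaming flag is available where the kwargs are built. The helper name `build_map_kwargs` and its parameters are hypothetical and only stand in for the values read from `data_args` and `training_args` in the snippet above. This is not necessarily the change that was applied in the repository.

```python
from typing import Any, Dict, Optional


def build_map_kwargs(
    streaming: bool,
    preprocessing_num_workers: Optional[int],
    overwrite_cache: bool,
    local_process_index: int,
) -> Dict[str, Any]:
    """Build the extra keyword arguments forwarded to dataset.map().

    Hypothetical helper: in streaming mode the dataset is an IterableDataset,
    whose map() rejected num_proc in this report, so nothing extra is passed.
    """
    if streaming:
        # Streaming datasets: forward no extra arguments.
        return {}

    # Regular (map-style) Dataset: same three arguments as the original code.
    return dict(
        num_proc=preprocessing_num_workers,
        load_from_cache_file=(not overwrite_cache) or (local_process_index != 0),
        desc="Running sequence parallel split on dataset",
    )


# Usage (split_fn is a placeholder for the actual preprocessing function):
#     kwargs = build_map_kwargs(data_args.streaming, ...)
#     dataset = dataset.map(split_fn, **kwargs)
```

Keying the check on the streaming flag keeps the non-streaming path unchanged, so caching and multiprocessing behave as before for regular datasets.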

Others

No response
