load_dataset fails on distributed recipes for datasets with remote code #1178

Open
pbontrager opened this issue Jul 15, 2024 · 2 comments

@pbontrager
Contributor

pbontrager commented Jul 15, 2024

Description

When you load a dataset from HF with remote code, the load_dataset function prompts the user for permission to run that code. This prompt only appears the first time the user downloads the dataset, but it will cause a crash if that first download happens inside a distributed recipe. The crash likely occurs because load_dataset is called on all distributed processes, while the user is only prompted for permission on rank 0. The user can answer there, but all the other ranks hang waiting for a response, which causes torch.distributed to crash.
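
For intuition, here is a minimal standalone sketch of the deadlock pattern (this is not torchtune code; it assumes a process group launched with torchrun, and the plain input() call is a stand-in for the prompt inside load_dataset):

import torch.distributed as dist

dist.init_process_group(backend="gloo")

# Every rank reaches the permission prompt, but only rank 0 has a
# usable stdin under torchrun; the other ranks block (or fail) waiting
# for an answer, mirroring the hang described above.
answer = input("Do you wish to run custom code? [y/N] ")

# Ranks stuck above never reach the next collective, so the process
# group eventually times out and torch.distributed aborts.
dist.barrier()
dist.destroy_process_group()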

Reproduce

This can be reproduced with cnn_dailymail_articles_dataset, which runs remote code.

First, ensure that you don't have the dataset cached (this will remove all cached datasets, but it's the only way to guarantee the cached files are removed):

rm -r ~/.cache/huggingface/datasets
rm -r ~/.cache/huggingface/modules/datasets_modules/datasets

Then run any distributed recipe with this dataset. For example

tune run --nproc-per-node=2 lora_finetune_distributed --config llama2/7B_lora dataset._component_=torchtune.datasets.cnn_dailymail_articles_dataset

This should reach dataset initialization and then ask for permission to run remote code. Whichever response you provide, the recipe will crash, since the processes will be out of sync afterward.

Possible Solutions

  • We could explore making load_dataset run on only one rank per node, but this breaks assumptions around packed datasets and would only work in single-node settings.
  • We could hard-code trust_remote_code=True for datasets we know and support, like cnn_dailymail_articles_dataset, but we risk making users vulnerable to changes on the HF Hub.
  • We could capture this question ourselves in cnn_dailymail_articles_dataset on rank 0, and then broadcast the answer to all the other ranks via a torch tensor of size 1 (see the sketch below).
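
A rough sketch of that last option, assuming the process group is already initialized (prompt_trust_remote_code is a hypothetical helper name, and the prompt text is a stand-in for the real datasets prompt):

import torch
import torch.distributed as dist

def prompt_trust_remote_code() -> bool:
    # Ask on rank 0 only, then broadcast the answer to every other
    # rank via a size-1 tensor so all processes stay in sync.
    answer = torch.zeros(1, dtype=torch.uint8)
    if dist.get_rank() == 0:
        reply = input("Run remote dataset code? [y/N] ")
        answer[0] = 1 if reply.strip().lower() == "y" else 0
    # With the NCCL backend this tensor would need to live on the
    # rank's GPU; gloo is fine with CPU tensors.
    dist.broadcast(answer, src=0)
    return bool(answer.item())

The result could then be passed as trust_remote_code to load_dataset so no rank ever sees the interactive prompt.
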
@joecummings joecummings added the bug Something isn't working label Jul 15, 2024
@RdoubleA
Contributor

Thanks for pointing this out @pbontrager. I've also found this is an issue with grammar_dataset. HF datasets that run remote code may be more prevalent than we thought, so we should prioritize this in the near future. I can look into it.

@RdoubleA RdoubleA self-assigned this Jul 15, 2024
@pbontrager
Contributor Author

It might be worth checking what setting num_proc in load_dataset would do. Documentation
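
For reference, a minimal call using that parameter, assuming the current datasets API (num_proc parallelizes download and preparation within a single call, so it is unrelated to torch.distributed ranks and may not change which process gets prompted):

from datasets import load_dataset

# Spawns 4 local worker processes for download/preparation of this call.
ds = load_dataset("cnn_dailymail", "3.0.0", num_proc=4)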
