load_dataset fails on distributed recipes for datasets with remote code #1178

Open
pbontrager opened this issue Jul 15, 2024 · 2 comments

@pbontrager
Contributor

pbontrager commented Jul 15, 2024

Description

When you load a dataset from HF with remote code, the load_dataset function prompts the user for permission to run that code. This prompt only appears the first time the user downloads the dataset, but it will cause a crash if that first download happens inside a distributed recipe. The crash likely occurs because load_dataset is called on all distributed processes, while the user is only prompted for permission on rank 0. The user can answer there, but all the other ranks hang waiting for a response, which causes torch.distributed to crash.
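
For intuition, here is a minimal standalone sketch of the deadlock pattern (this is not torchtune code; it assumes a process group launched with torchrun, and the plain input() call is a stand-in for the prompt inside load_dataset):

import torch.distributed as dist

dist.init_process_group(backend="gloo")

# Every rank reaches the permission prompt, but only rank 0 has a
# usable stdin under torchrun; the other ranks block (or fail) waiting
# for an answer, mirroring the hang described above.
answer = input("Do you wish to run custom code? [y/N] ")

# Ranks stuck above never reach the next collective, so the process
# group eventually times out and torch.distributed aborts.
dist.barrier()
dist.destroy_process_group()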

Reproduce

This can be reproduced with cnn_dailymail_articles_dataset, which runs remote code.

First, ensure that you don't have the dataset cached (this will remove all cached datasets, but it's the only way to guarantee the cached files are removed):

rm -r ~/.cache/huggingface/datasets
rm -r ~/.cache/huggingface/modules/datasets_modules/datasets

Then run any distributed recipe with this dataset. For example

tune run --nproc-per-node=2 lora_finetune_distributed --config llama2/7B_lora dataset._component_=torchtune.datasets.cnn_dailymail_articles_dataset

This should reach dataset initialization and then ask for permission to run remote code. Whichever response you provide, the recipe will crash, since the processes will be out of sync afterward.

Possible Solutions

  • We could explore making load_dataset run on only one rank per node, but this breaks assumptions around packed datasets and would only work in single-node settings.
  • We could hard-code trust_remote_code=True for datasets we know and support, like cnn_dailymail_articles_dataset, but we risk making users vulnerable to changes on the HF Hub.
  • We could capture this question ourselves in cnn_dailymail_articles_dataset on rank 0, and then broadcast the answer to all the other ranks via a torch tensor of size 1 (see the sketch below).
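
A rough sketch of that last option, assuming the process group is already initialized (prompt_trust_remote_code is a hypothetical helper name, and the prompt text is a stand-in for the real datasets prompt):

import torch
import torch.distributed as dist

def prompt_trust_remote_code() -> bool:
    # Ask on rank 0 only, then broadcast the answer to every other
    # rank via a size-1 tensor so all processes stay in sync.
    answer = torch.zeros(1, dtype=torch.uint8)
    if dist.get_rank() == 0:
        reply = input("Run remote dataset code? [y/N] ")
        answer[0] = 1 if reply.strip().lower() == "y" else 0
    # With the NCCL backend this tensor would need to live on the
    # rank's GPU; gloo is fine with CPU tensors.
    dist.broadcast(answer, src=0)
    return bool(answer.item())

The result could then be passed as trust_remote_code to load_dataset so no rank ever sees the interactive prompt.
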
@joecummings joecummings added the bug Something isn't working label Jul 15, 2024
@RdoubleA
Contributor

Thanks for pointing this out @pbontrager. I've also found this is an issue with grammar_dataset. HF datasets that run remote code may be more prevalent than we thought, so we should prioritize this in the near future. I can look into it.

@RdoubleA RdoubleA self-assigned this Jul 15, 2024
@pbontrager
Contributor Author

It might be worth checking what setting num_proc in load_dataset would do. Documentation
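
For reference, a minimal call using that parameter, assuming the current datasets API (num_proc parallelizes download and preparation within a single call, so it is unrelated to torch.distributed ranks and may not change which process gets prompted):

from datasets import load_dataset

# Spawns 4 local worker processes for download/preparation of this call.
ds = load_dataset("cnn_dailymail", "3.0.0", num_proc=4)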
