Add improved data downloading class / pipeline #39

sdtblck · 2021-01-05T12:12:18Z

Run prepare_dataset.py with your model args to download the required dataset. (If you run it with deepspeed, it downloads once for every rank.)
To add a new dataset, create a class that inherits from DataDownloader and add it to the DATA_DOWNLOADERS registry.
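
A minimal sketch of what registering a new dataset could look like. Only DataDownloader and DATA_DOWNLOADERS are named in this PR; the method names (exists, _download, prepare), the ToyDownloader class and the prepare_data helper are illustrative assumptions, not the actual code.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class DataDownloader(ABC):
    """Hypothetical base class; the real interface in this PR may differ."""

    name: str = ""  # subclasses set the dataset name

    def __init__(self, data_dir: str = "./data"):
        self.base_dir = Path(data_dir) / self.name

    def exists(self) -> bool:
        # Check for the extracted data itself, not a leftover archive.
        return self.base_dir.is_dir() and any(self.base_dir.iterdir())

    @abstractmethod
    def _download(self) -> None:
        """Fetch and extract the dataset into self.base_dir."""

    def prepare(self) -> bool:
        """Download only if the data is missing; return True if a download ran."""
        if self.exists():
            return False
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self._download()
        return True


class ToyDownloader(DataDownloader):
    """Stand-in dataset that writes a local file instead of hitting the network."""

    name = "toy"

    def _download(self) -> None:
        (self.base_dir / "train.txt").write_text("hello world\n")


# Registry mapping dataset names to downloader classes.
DATA_DOWNLOADERS = {"toy": ToyDownloader}


def prepare_data(dataset_name: str, data_dir: str = "./data") -> bool:
    """Look up the downloader by name and run it."""
    try:
        downloader_cls = DATA_DOWNLOADERS[dataset_name]
    except KeyError:
        raise NotImplementedError(f"no downloader registered for {dataset_name!r}") from None
    return downloader_cls(data_dir).prepare()
```

With this shape, adding a dataset means writing one subclass and one registry entry; the train script only needs the dataset name.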

sdtblck · 2021-01-05T12:52:28Z

Have now added the dataset downloading to the main train scripts, so you don't have to run a separate Python file.
IIRC torch.distributed.barrier() has a timeout of 30 mins, so we need to figure out how to increase that in case the data takes longer to download.
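
One possible way around the barrier timeout, sketched under assumptions: let rank 0 do the download and have the other ranks poll for a marker file with their own, longer deadline. This is a swapped-in technique, not what this PR does; download_once, the marker file and the explicit rank argument are all hypothetical. (Separately, torch.distributed.init_process_group accepts a timeout argument taking a datetime.timedelta, which may be the more direct knob.)

```python
import time
from pathlib import Path
from typing import Callable


def download_once(
    rank: int,
    download_fn: Callable[[], None],
    marker: Path,
    timeout_s: float = 3600.0,
    poll_s: float = 1.0,
) -> None:
    """Rank 0 downloads; every other rank waits for a marker file.

    Assumes all ranks see the same filesystem path for `marker`.
    """
    if rank == 0:
        if not marker.exists():
            download_fn()
            marker.parent.mkdir(parents=True, exist_ok=True)
            marker.touch()  # written only after the download has finished
        return

    # Non-zero ranks: poll until the marker appears or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while not marker.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"rank {rank}: {marker} did not appear within {timeout_s}s")
        time.sleep(poll_s)
```

The deadline here is an ordinary function argument, so it can be set as long as the slowest expected download without touching the process group settings.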

sdtblck · 2021-01-05T13:30:56Z

Also removed the Adam optimizer when using deepspeed: passing in an optimizer overrides deepspeed's ZeRO optimizer, which we want to use.

sdtblck and others added 3 commits January 5, 2021 02:31

add class for automatically downloading datasets

7e3a048

add class for automatically downloading datasets

9cd776f

cleanup

07b37d2

sdtblck requested a review from a team as a code owner January 5, 2021 12:12

sdtblck requested review from StellaAthena and leogao2 January 5, 2021 12:12

add dataset preparation to main train scripts

f636b2d

fix train data path

ee7aad0

StellaAthena linked an issue Jan 5, 2021 that may be closed by this pull request

Dataset downloads <number of GPUs> times when running deepspeed train.py #37

Closed

fix enwik8 errors

a38537e

This was linked to issues Jan 5, 2021

Openwebtext2 dataset checks for presence of tar.gz file to assess whether to auto-download rather than extracted dataset #38

Closed

Hardcoded paths in gpt3_small.json #26

Closed

lucidrains merged commit a38537e into main Jan 5, 2021

lucidrains deleted the data_downloading branch January 5, 2021 17:54
