Add more details for reproducing training runs #79

Closed
zplizzi opened this issue Mar 21, 2023 · 5 comments

@zplizzi

zplizzi commented Mar 21, 2023

In the readme you have a note

TODO: forthcoming: more information on how to replicate + relaunch the Pythia training runs, once the data is actually downloaded.

We are trying to reproduce your results and it would be awesome to get some more details here. One thing in particular that would help is a pointer to a version/commit of gpt-neox similar to the one you used for these training runs, unless you're confident that the newest version of that library will still closely reproduce these results.

@haileyschoelkopf
Collaborator

haileyschoelkopf commented Mar 25, 2023

Hi, sorry for being slow to respond on this, was on vacation for the past week!

v1.0 of GPT-NeoX is what you want! I believe the exact commit I used for the most recent runs is here, but v1.0 should only differ in some features not enabled in the training runs. The major breaking change in v2.0 is the switch of DeepSpeed versions, from DeeperSpeed (forked from DeepSpeed around 0.3.15) to DeepSpeed's most recent version as of v2.0.

Note that in roughly the next few days we'll be sharing a very slightly updated version of all models. The old models will remain available under the name pythia-v0-* on Huggingface, and the new models will have their NeoX configs uploaded to the repo. The changes are:


- All model sizes are now trained with a uniform batch size of 2M tokens. Previously, the 160M, 410M, and 1.4B models were trained with a batch size of 4M tokens; in the course of training the initial suite we found it was feasible to train all models with a uniform batch size, which, based on prior literature, we had not been certain of before running our own batch-size experiments.
- We additionally save checkpoints at initialization (step 0) and at steps 1, 2, 4, 8, 16, 32, 64, 128, and 256.
- Flash attention is enabled.
- In the original suite, all models of 2.8B parameters or smaller used a learning rate (LR) schedule that decayed to a minimum of 10% of the starting LR, while the 6.9B and 12B models used an LR schedule that decayed to a minimum LR of 0. In the redone training runs, all models are trained with the LR decaying to a minimum of $0.1 \times$ their maximum LR (see the sketch after this list).
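
To make that last point concrete, here's a minimal sketch of a warmup + cosine schedule that bottoms out at 10% of the maximum LR. The argument values and even the exact schedule shape are placeholders for illustration; the uploaded NeoX configs are authoritative for the per-model settings.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    """Cosine schedule with linear warmup that decays to min_lr_ratio * max_lr.

    Illustrative only: the real per-model max LR, warmup, and total steps
    live in the released NeoX config files.
    """
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)            # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = min_lr_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine                 # reaches min_lr at the end
```

For example, `lr_at_step(143000, max_lr=1.6e-4, warmup_steps=1430, total_steps=143000)` returns 1.6e-5, i.e. exactly 10% of the (hypothetical) maximum LR.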

Hope this helps!

Are there any other particular details you'd want to have for replicating our runs? Happy to share, e.g. more on the hardware setup and library versions used for these runs!

Additionally, it would be useful to know what your goal in reproducing the models is. I'm curious because if you want precisely identical checkpoints it may be a little difficult: I've observed (what I think is caused by) CUDA / torch version differences introducing very slight floating point error when attempting to exactly replicate the first 1k steps of an older training run. In practice this small error shouldn't affect things downstream, as far as I'm aware.
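
If it helps for debugging that kind of drift, this is roughly how I'd check whether two checkpoints differ only by floating-point noise rather than a real divergence. It's a minimal sketch that assumes both files are plain PyTorch state dicts with matching keys; NeoX/DeepSpeed checkpoints would need to be converted (e.g. to the HF format) first.

```python
import torch

def compare_state_dicts(path_a, path_b, rtol=1e-5, atol=1e-6):
    """Print the largest absolute difference for each parameter tensor."""
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    for name, pa in a.items():
        pb = b[name]
        max_diff = (pa.float() - pb.float()).abs().max().item()
        close = torch.allclose(pa.float(), pb.float(), rtol=rtol, atol=atol)
        print(f"{name}: max_abs_diff={max_diff:.3e} allclose={close}")
```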

@zplizzi
Author

zplizzi commented Mar 26, 2023

Thank you, the new models sound great! Are those still trained with v1.0, or are they now using v2.0? We had been regenerating step-0 checkpoints ourselves, so that's really helpful too.

We (Generally Intelligent) are trying to replicate your results in a different LLM library (a modified version of MosaicML's) since Pythia seems to be the best-documented, decent-performing medium-sized LLM around, so it's a really useful baseline to verify our implementation. Honestly impressed with the quality of all y'all's stuff. We aren't trying to get identical checkpoints, just matching loss/eval curves over a training run. Currently things look great when we use your tokenized dataset + dataloader with our model/training setup, so the model/training side seems to be replicating well, but we're still having trouble getting our version of the dataset + dataloader to match. Most likely the problem is on our side, but that's why I was asking for the exact details of how you generated the deduped tokenized dataset in the other thread: we're hoping to deterministically match it so we can compare the tokenized datasets token-for-token.
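
For reference, the token-for-token comparison we have in mind is roughly the following. It's only a sketch: the paths and dtype are placeholders, and it treats each file as a raw flat array of token ids rather than parsing Megatron's actual .bin/.idx format.

```python
import numpy as np

def first_token_mismatch(path_a, path_b, dtype=np.uint16, chunk=1_000_000):
    """Return the index of the first differing token, or None if the streams match."""
    a = np.memmap(path_a, dtype=dtype, mode="r")
    b = np.memmap(path_b, dtype=dtype, mode="r")
    if len(a) != len(b):
        print(f"length mismatch: {len(a)} vs {len(b)} tokens")
    n = min(len(a), len(b))
    for start in range(0, n, chunk):        # compare in chunks to keep memory use flat
        end = min(start + chunk, n)
        bad = np.nonzero(a[start:end] != b[start:end])[0]
        if bad.size:
            return start + int(bad[0])
    return None
```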

@haileyschoelkopf
Collaborator

We trained with v1.0 still for these!

Got it, I see. Is your fork of Composer / MosaicML/examples public? As I'm parsing it, it sounds like you've added the Megatron dataset/dataloader code to a Composer training script and are working on using perhaps Mosaic's streaming library to get the exact same inputs/ordering in that codebase(?)

If this is the case, then getting all the concatenation + shuffling to play out precisely as Megatron does it will be a huge headache. Megatron does a lot of shuffling and sampling-ratio handling under the hood. Some of it, like the BlendableDataset code, can be ignored in the single-datasource case (as here, with just the deduped Pile), but plenty of the rest would more or less require reimplementing the Megatron data code in your codebase verbatim.
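
To give a sense of the shape of that logic, a heavily simplified sketch is below. This is not the actual Megatron/NeoX implementation (which builds doc_idx / sample_idx / shuffle_idx arrays per epoch, partly in compiled helpers, and handles the final partial epoch specially), so it won't reproduce the exact ordering; it's only meant to convey why matching it verbatim is fiddly.

```python
import numpy as np

def rough_sample_order(doc_lengths, seq_len, seed=1234):
    """Shuffle documents, concatenate, cut into fixed-length samples, shuffle samples."""
    rng = np.random.RandomState(seed)
    doc_idx = rng.permutation(len(doc_lengths))          # shuffled document order
    total_tokens = int(np.asarray(doc_lengths).sum())
    num_samples = total_tokens // seq_len                # fixed-length training samples
    shuffle_idx = rng.permutation(num_samples)           # shuffled order over samples
    return doc_idx, shuffle_idx
```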

Maybe I've misunderstood the scenario you're describing though!

@zplizzi
Author

zplizzi commented Mar 28, 2023

I think #81 was the bulk of my confusion: I had accidentally unsharded the non-deduped tokenized dataset and was thinking it was the deduped one. I also discovered that the way Mosaic's StreamingDataset shuffling works isn't ideal: it picks a random shard, shuffles the sequences in that shard, and then yields all the sequences from the shard before continuing. That seems to introduce autocorrelations between batches that cause worse performance than a full global shuffle of tokenized sequences. This is most problematic when using only one GPU/dataloader worker, since you're yielding every batch from the same shard.
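
To illustrate the difference, here's a toy sketch (not Mosaic's actual implementation) of shard-at-a-time shuffling versus a full global shuffle:

```python
import random

def shard_local_order(shards, seed=0):
    """Pick shards in random order, shuffle within a shard, emit the whole shard."""
    rng = random.Random(seed)
    for s in rng.sample(range(len(shards)), len(shards)):
        seqs = list(shards[s])
        rng.shuffle(seqs)
        yield from seqs

def global_order(shards, seed=0):
    """Shuffle all sequences from all shards together, for comparison."""
    rng = random.Random(seed)
    all_seqs = [seq for shard in shards for seq in shard]
    rng.shuffle(all_seqs)
    yield from all_seqs
```

With a single dataloader worker, consecutive batches from shard_local_order all come from the same shard until it's exhausted, which is exactly where the autocorrelation shows up.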

@StellaAthena
Member

@zplizzi is any follow-up required, or can we close this issue?

zplizzi closed this as completed Apr 18, 2023