Single node Pythia 14M training on ngc pytorch 24.02 container #1170
Swapped the raw CUDA container for the NGC PyTorch 24.02 container.
This one has OpenMPI, PyTorch and Apex already installed, so those setup steps can be removed. `flash-attn` was recently hotfixed for that container version, so 2.5.6 is a hard requirement.

To validate the setup, I ran a few training steps of Pythia 14M on 8xA100, and of a 22B model on 8xA100 and 8xH100 with TP8.
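For anyone reproducing this setup, a minimal sketch of a sanity check for the `flash-attn` requirement is shown below. It is not part of this PR. It assumes the package is installed under the distribution name `flash-attn` and that 2.5.6 can be treated as a minimum version (the requirement may equally be read as an exact pin); the helper names `parse_version`, `meets_minimum` and `installed_flash_attn_version` are hypothetical.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '2.5.6' into (2, 5, 6), ignoring suffixes such as '+cu122' or '.post1'."""
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, required: str = "2.5.6") -> bool:
    """Assumption: 2.5.6 is treated as a lower bound, not an exact pin."""
    return parse_version(installed) >= parse_version(required)


def installed_flash_attn_version():
    """Return the installed version string, or None if flash-attn is absent."""
    try:
        # Assumption: the distribution is registered as 'flash-attn'.
        return metadata.version("flash-attn")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_flash_attn_version()
    if found is None:
        print("flash-attn is not installed")
    elif meets_minimum(found):
        print(f"flash-attn {found} satisfies the 2.5.6 requirement")
    else:
        print(f"flash-attn {found} is older than 2.5.6")
```

Running the script inside the container reports whether the installed build satisfies the requirement. If the requirement turns out to be an exact pin, the comparison in `meets_minimum` would need to be `==` instead of `>=`.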
The apt and pip install lists and requirements files could be cleaned up. E.g. triton is already present and gets downgraded. I will leave it like this for now in order to keep this PR small.
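As a starting point for that later cleanup, the sketch below lists which entries of a requirements file are already installed in the current environment, which is how a case like triton could be spotted. It is only an illustration under assumptions: the helper names are hypothetical, the requirement parsing is deliberately simplistic (it reads the leading package name and skips option lines such as `-r other.txt`), and the file path in the usage block is a placeholder rather than a path taken from this repository.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a package name so 'Flash_Attn' and 'flash-attn' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str):
    """Extract the package name from one requirements line, or None."""
    line = line.split("#", 1)[0].strip()  # drop comments
    if not line or line.startswith("-"):  # skip blanks and pip options
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def installed_packages() -> dict:
    """Map normalized distribution names to installed versions."""
    result = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            result[normalize(name)] = dist.version
    return result


def already_present(requirement_lines, installed) -> dict:
    """Return {name: installed_version} for requirements that are preinstalled."""
    hits = {}
    for line in requirement_lines:
        name = requirement_name(line)
        if name in installed:
            hits[name] = installed[name]
    return hits


if __name__ == "__main__":
    # Placeholder path; point this at the actual requirements file.
    with open("requirements.txt") as handle:
        for name, version in already_present(handle, installed_packages()).items():
            print(f"{name} is already installed at {version}")
```

Run inside the container before the pip install step, this prints the packages that the image already ships. Comparing those versions with the pins in the requirements file would show which installs cause a downgrade.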
For multi-node tests I will have to adapt the launcher first.