Single node Pythia 14M training on ngc pytorch 24.02 container #1170

Merged: 2 commits, Mar 4, 2024

Contributor

@tf-nv tf-nv commented Mar 4, 2024

Swapped the raw CUDA container for the NGC PyTorch 24.02 container.

This one has OpenMPI, PyTorch and Apex already installed, so those setup steps can be removed.

flash-attn was recently hotfixed for that container version, so 2.5.6 is a hard requirement.
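
A quick way to confirm these assumptions inside the container can be sketched as follows. This is a minimal sketch and not part of the PR: the helper names `installed_version` and `check_environment` are hypothetical, and the distribution names `torch`, `apex` and `flash-attn` are assumed to match what pip reports in the image.

```python
from importlib import metadata

# Assumed distribution names; adjust if pip reports them differently in the image.
REQUIRED_PRESENT = ("torch", "apex")
# The exact flash-attn version named as a hard requirement in this PR.
REQUIRED_PINNED = {"flash-attn": "2.5.6"}


def installed_version(dist_name, lookup=metadata.version):
    """Return the installed version string, or None if the package is missing."""
    try:
        return lookup(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(lookup=metadata.version):
    """Collect human-readable problems; an empty list means the check passed."""
    problems = []
    for name in REQUIRED_PRESENT:
        if installed_version(name, lookup) is None:
            problems.append(f"{name} is not installed")
    for name, wanted in REQUIRED_PINNED.items():
        found = installed_version(name, lookup)
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent"]:
        print(line)
```

The comparison is deliberately an exact string match, so a build that carries a local version suffix would be reported as a mismatch and would need a looser check.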

To validate the setups, I ran a few training steps of Pythia 14M on 8xA100, and of a 22B model on 8xA100 and 8xH100 with TP8 (tensor-parallel size 8).
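
For reference, the arithmetic behind a TP8 run on a single 8-GPU node: the GPU count is the product of the tensor-, pipeline- and data-parallel sizes, so TP8 on 8 GPUs leaves exactly one data-parallel replica. A small sketch, where the function name is hypothetical and a pipeline-parallel size of 1 is an assumption (the description does not state it):

```python
def data_parallel_size(num_gpus, tensor_parallel, pipeline_parallel=1):
    """Number of data-parallel replicas left after tensor and pipeline parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split evenly into groups of {model_parallel}"
        )
    return num_gpus // model_parallel


# One node with 8 GPUs and tensor-parallel size 8 (pipeline-parallel size assumed 1).
print(data_parallel_size(num_gpus=8, tensor_parallel=8))  # 1
```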

The apt and pip install lists and the requirements files could be cleaned up. For example, triton is already present in the container and gets downgraded during installation. I will leave this as is for now to keep this PR small.
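
One way to find such downgrades before a cleanup is to compare the `==` pins in a requirements file against what the container already ships. The sketch below is only an illustration: the function names are hypothetical, the version numbers in the example are made up rather than taken from the container, and only plain dotted numeric versions are handled.

```python
import re


def parse_version(text):
    """Turn '2.1.0' into (2, 1); trailing zeros are dropped so '2.1' == '2.1.0'."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def find_downgrades(requirement_lines, installed):
    """Return (name, installed_version, pinned_version) for pins below the installed version."""
    downgrades = []
    for raw in requirement_lines:
        # Drop comments and environment markers, keep only 'name==version'.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, pinned = (part.strip() for part in line.split("==", 1))
        current = installed.get(name.lower())
        if current is not None and parse_version(pinned) < parse_version(current):
            downgrades.append((name, current, pinned))
    return downgrades


# Illustrative versions only, not the actual contents of the container.
example_requirements = ["triton==2.1.0", "numpy>=1.22", "# a comment"]
example_installed = {"triton": "2.2.0", "numpy": "1.24.4"}
print(find_downgrades(example_requirements, example_installed))
```

In practice the `installed` mapping could be filled from `importlib.metadata.distributions()` inside the container; that step is left out here to keep the sketch self-contained.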

For multi-node tests, I will have to adapt the launcher first.

@CLAassistant

CLAassistant commented Mar 4, 2024

CLA assistant check
All committers have signed the CLA.

@Quentin-Anthony
Member

Yep, this LGTM. Agree on the requirements cleanup, but let's leave that for a follow-up.

@Quentin-Anthony Quentin-Anthony merged commit 119950c into EleutherAI:main Mar 4, 2024
2 of 5 checks passed
3 participants