
Migrate tensor parallelism code to use OSLO #578

Open
sdtblck opened this issue Mar 1, 2022 · 7 comments
Labels
feature request New feature or request oslo issues relating to refactoring NeoX to use OSLO

Comments

@sdtblck
Contributor

sdtblck commented Mar 1, 2022

Is your feature request related to a problem? Please describe.
It would be good to remove the Megatron tensor parallelism code from NeoX. OSLO currently supports tensor parallelism and offers a slightly nicer interface.

Describe the solution you'd like

Steps:

  • Rewrite all current modules as plain PyTorch implementations, removing the mpu dependency from internal code as much as possible (so anything that is currently an mpu.ColumnParallelLinear, mpu.RowParallelLinear, or mpu.VocabParallelEmbedding should be replaced with its plain PyTorch equivalent: nn.Linear or nn.Embedding, respectively).
  • Write a mapping for NeoX modules that OSLO uses to handle parallelization.
  • Ensure backwards compatibility.
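The mapping step above could be sketched as follows. Note this is a hypothetical illustration, not OSLO's actual API: the attribute names and parallel-style strings are assumptions, and the real mapping would be consumed by the parallelization engine.

```python
# Hypothetical sketch of a tensor-parallel mapping for plain-PyTorch NeoX
# modules. The keys (module attribute paths) and style names are illustrative
# assumptions; OSLO's real mapping format may differ.
NEOX_TP_MAPPING = {
    # attribute path -> how the plain nn.Linear / nn.Embedding is split
    "attention.query_key_value": "column",  # was mpu.ColumnParallelLinear
    "attention.dense": "row",               # was mpu.RowParallelLinear
    "mlp.dense_h_to_4h": "column",
    "mlp.dense_4h_to_h": "row",
    "word_embeddings": "vocab",             # was mpu.VocabParallelEmbedding
}

def shard_dim(style: str) -> int:
    """Weight dimension to shard for each style, assuming torch's
    nn.Linear weight layout of [out_features, in_features]:
    column-parallel splits outputs (dim 0), row-parallel splits
    inputs (dim 1), vocab-parallel splits the embedding rows (dim 0)."""
    return {"column": 0, "row": 1, "vocab": 0}[style]
```

A table like this is all the parallelization engine needs to shard the plain modules at load time, which is what lets the internal code stay free of mpu-specific classes.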
@sdtblck sdtblck added feature request New feature or request oslo issues relating to refactoring NeoX to use OSLO labels Mar 1, 2022
@sdtblck sdtblck added this to To do in OSLO refactor via automation Mar 1, 2022
@hyunwoongko
Member

I will actively support this work.

@hyunwoongko
Member

hyunwoongko commented Mar 1, 2022

The main problem is that the model is currently loaded on the CPU and then moved to the GPU. OSLO was originally designed for transformers, and there was no way to pass downloaded checkpoints directly to the GPU in transformers. (At least that was the case while I was developing it, so I didn't worry about this.) But we need to implement something like deepspeed.zero.Init internally so that parameters are allocated on the GPU from the start. I will try this starting tomorrow.
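The idea of allocating parameters on the target device from the start, rather than CPU-then-move, can be illustrated with a torch-free sketch. This is only an analogy: the real deepspeed.zero.Init patches tensor construction inside torch, and the names below (on_device, Param) are invented for illustration.

```python
# Minimal torch-free sketch of allocate-on-device-from-scratch.
# Instead of building on CPU and moving, a context manager changes
# where new "parameters" are created. Names here are hypothetical.
import contextlib

_DEFAULT_DEVICE = "cpu"

@contextlib.contextmanager
def on_device(device):
    """Temporarily redirect new parameter allocation to `device`."""
    global _DEFAULT_DEVICE
    prev, _DEFAULT_DEVICE = _DEFAULT_DEVICE, device
    try:
        yield
    finally:
        _DEFAULT_DEVICE = prev

class Param:
    """Stand-in for a tensor: records the device it was created on."""
    def __init__(self, shape):
        self.shape = shape
        self.device = _DEFAULT_DEVICE  # allocated directly on the active device

# Usage: anything constructed inside the context lands on the GPU directly.
with on_device("cuda:0"):
    p = Param((1024, 1024))
```

The real mechanism would intercept torch's parameter construction (as deepspeed.zero.Init does) so that checkpoint weights never materialize on the CPU at all.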

@sdtblck
Contributor Author

sdtblck commented Mar 1, 2022

@hyunwoongko Actually, in NeoX we also load onto the CPU and then move to the GPU, so I'm not sure this is a problem.

@StellaAthena
Member

> The main problem is that currently the model is loaded on the CPU and then moved to the GPU. OSLO was originally designed for transformers, and there was no way to pass downloaded checkpoints directly to the GPU in the transformers. (At least when I'm developing, so I didn't care about this) But we need to implement something like deepspeed.ZeroInit internally so that it's allocated to the GPU from scratch. I will try this right from tomorrow.

This is actually something we have a workaround for. I don't know if Transformers ever got around to merging it, though.

@hyunwoongko
Member

@sdtblck Please check my branch: https://github.com/EleutherAI/gpt-neox/tree/kevin_new
I am restructuring our code based on plain torch.

@hyunwoongko
Member

@sdtblck Did you check my branch?

@Quentin-Anthony
Member

@hyunwoongko -- Would you like to restart this effort?

Projects
Development

No branches or pull requests

4 participants