Integrate distilling #293
Is just moving the same code over enough, or should there be any changes to it?
Much of the code can be copied and pasted from
The distillation code only supports data parallelism. You are aware of that, right? Or do we need to use the model from the framework?
@preethamgali and I discussed this on Discord. We do want to use the GPT-NeoX modeling framework and to capture as many of the optimizations our code provides as possible. What I had in mind was to make the student model the last stage(s) in a pipeline, so that instead of having [...]

@sdtblck suggested running something like:

```python
teacher = GPT2ModelPipe(**kwargs)
student = GPT2ModelPipe(**student_kwargs)
...

teacher, _, _, _ = deepspeed.initialize(
    model=teacher,
    optimizer=optimizer,
    ...)

student, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=student,
    optimizer=optimizer,
    ...)
```

However, I worry that DeepSpeed will not play nicely with multiple models. This is purely conjecture, though, and Sid's suggestion is absolutely worth trying, or more realistically, worth asking the DeepSpeed people about.
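If two DeepSpeed engines can in fact coexist, a training step would presumably look roughly like the sketch below. This is only a minimal sketch of standard knowledge distillation: `data_loader`, `temperature`, and `kd_loss` are hypothetical names, and it ignores pipeline-parallel execution (where the engines would be driven via `train_batch`/`eval_batch` rather than plain forward calls), which is exactly the part that is uncertain.

```python
import torch
import torch.nn.functional as F

temperature = 2.0  # illustrative value

for batch in data_loader:  # hypothetical loader yielding token batches
    tokens = batch["text"].cuda()

    # Teacher runs inference only; no gradients needed.
    with torch.no_grad():
        teacher_logits = teacher(tokens)

    student_logits = student(tokens)

    # Soft-target KL divergence between temperature-scaled distributions
    # (the usual distillation loss, not anything NeoX-specific).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # DeepSpeed engines own backward() and step().
    student.backward(kd_loss)
    student.step()
```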
The cross-entropy loss function used in the framework is mpu.vocab_parallel_cross_entropy, which is implemented to work with 3D parallelism. But for distillation we need KLDivLoss, MSELoss, and CosineEmbeddingLoss, so we need parallel implementations of those losses as well. @StellaAthena please raise the feature request.
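For reference, the non-parallel versions of these losses (as used in DistilBERT-style distillation) look roughly like the sketch below. The parallel counterparts would additionally have to reduce over the vocab partitions held by each tensor-parallel rank, analogously to vocab_parallel_cross_entropy; the function name and temperature here are illustrative, not part of the framework.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_logits, teacher_logits,
                        student_hidden, teacher_hidden, temperature=2.0):
    """Plain (non-model-parallel) distillation losses; assumes the vocab
    dimension of the logits is whole, not split across tensor-parallel ranks."""
    # Soft-target loss on the output distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hidden-state matching losses (between corresponding layers).
    mse = F.mse_loss(student_hidden, teacher_hidden)

    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    ones = torch.ones(s.size(0), device=s.device)
    cos = F.cosine_embedding_loss(s, t, ones)

    return kl, mse, cos
```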
We are abandoning this effort as unsuccessful.
@preethamgali wrote a model distillation framework here, which we should aim to integrate into GPT-NeoX.