
Integrate distilling #293

Closed
StellaAthena opened this issue May 4, 2021 · 6 comments
Labels: feature request (New feature or request)

@StellaAthena (Member)

@preethamgali wrote a model distilling framework here, which we should aim to integrate into GPT-NeoX.

@StellaAthena added the "feature request" label on May 4, 2021
@preethamgali (Member) commented May 9, 2021

Is just moving the same code over enough, or should there be any changes to it?

@StellaAthena (Member, Author)

> Is just moving the same code over enough, or should there be any changes to it?

Much of the code can be copied and pasted from distiller.py; it just needs to be modified to work with the GPT-NeoX framework.

@preethamgali (Member) commented May 10, 2021

The distillation code only supports data parallelism. Are you aware of that, or should we use the model from the framework?

@StellaAthena (Member, Author) commented May 13, 2021

@preethamgali and I discussed this on Discord. We do want to use the GPT-NeoX modeling framework and to capture as many of the optimizations our code provides as possible.

What I had in mind was to make the student model the last stage(s) in a pipeline, so that instead of having T1 -> T2 -> T3 -> T4 and S1 -> S2 you have a single model T1 -> T2 -> T3 -> T4 -> S1 -> S2. Then, when you do backprop, you just stop after finishing the student model. The teacher and the student models will likely have different widths though, and I'm not sure if that'll do anything wonky.
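
A rough sketch of that layout (purely illustrative, using hypothetical stand-in modules rather than actual GPT-NeoX stages) might look like:

    import torch.nn as nn

    # Hypothetical stand-ins for the pipeline stages; not GPT-NeoX modules.
    hidden = 1024
    teacher_stages = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(4))  # T1 -> T4
    student_stages = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(2))  # S1 -> S2

    # Freeze the teacher so that backprop effectively stops once the student
    # stages have been traversed.
    for stage in teacher_stages:
        stage.requires_grad_(False)

    # One combined model: T1 -> T2 -> T3 -> T4 -> S1 -> S2
    combined = nn.Sequential(*teacher_stages, *student_stages)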

@sdtblck suggested running something like

    # Build two separate pipeline models: a (frozen) teacher and a student.
    teacher = GPT2ModelPipe(**kwargs)
    student = GPT2ModelPipe(**student_kwargs)
    ...

    # Wrap each model in its own DeepSpeed engine. The teacher is only used
    # for inference, so its returned optimizer / lr scheduler are discarded.
    teacher, _, _, _ = deepspeed.initialize(
            model=teacher,
            optimizer=optimizer,
            ...)
    # The student engine is the one that actually trains.
    student, optimizer, _, lr_scheduler = deepspeed.initialize(
            model=student,
            optimizer=optimizer,
            ...)

However, I worry that DS will not play nicely with multiple models. This is purely conjecture though, and Sid's suggestion is absolutely worth trying, or, more realistically, worth asking the DS people about.

@StellaAthena added this to "To do" in 1T or BUST via automation on May 14, 2021
@StellaAthena moved this from "To do" to "In progress" in 1T or BUST on May 14, 2021
@preethamgali (Member) commented May 14, 2021

The cross_entropy loss function used in the framework is mpu.vocab_parallel_cross_entropy, which has been implemented to work with 3D parallelism. But for distillation we need the KLDivLoss, MSELoss, and CosineEmbeddingLoss loss functions, so we need parallel implementations of those as well. @StellaAthena please raise the feature request.
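
For reference, a rough non-parallel sketch of the kind of combined loss we would need (plain torch.nn.functional calls on full logits and hidden states, with shapes assumed to match; the vocab-parallel versions analogous to mpu.vocab_parallel_cross_entropy would still have to be written):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden, temperature=2.0):
        # Soft-target KL divergence between the teacher and student distributions.
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

        # Hidden-state matching terms (shapes assumed identical for illustration).
        mse = F.mse_loss(student_hidden, teacher_hidden)
        target = torch.ones(student_hidden.size(0), device=student_hidden.device)
        cos = F.cosine_embedding_loss(
            student_hidden.reshape(student_hidden.size(0), -1),
            teacher_hidden.reshape(teacher_hidden.size(0), -1),
            target,
        )
        return kl + mse + cos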

@StellaAthena self-assigned this on Jun 20, 2021
@StellaAthena (Member, Author)

We are abandoning this effort as unsuccessful.

1T or BUST automation moved this from "In progress" to "Done" on Sep 18, 2022