
[bug-fix] enable finetuning option (set optimizer params correctly) #927

Merged
merged 2 commits into EleutherAI:main
May 9, 2023

Conversation

taegyeongeo
Contributor

@taegyeongeo commented May 6, 2023

The neox repo has a problem with the "finetune" option (#767): this option resets the hyperparams in the optimizer/lr_scheduler, but it doesn't set the model parameters correctly.

If you use finetune on the main branch, the module is not synced with the optimizer params.

To sync the module with the optimizer params, only one parameter needs to change:

        checkpoint_name, state_dict = model.load_checkpoint(
            neox_args.load,
            load_optimizer_states=load_optim_and_scheduler,
            load_lr_scheduler_states=load_optim_and_scheduler,
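            # when optimizer/scheduler states are skipped (i.e. finetuning),
            # load only the module weights and let DeepSpeed refresh the
            # optimizer's fp32 master params from them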
            load_module_only=not load_optim_and_scheduler,
            tag=tag,
        )

I added one line of code to use the load_module_only option in DeepSpeed's load_checkpoint function:

        if load_module_only:
            deepspeed_states = ['module']
            if self.optimizer is not None and self.fp16_enabled():
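                # copy the just-loaded fp16 module weights into the optimizer's
                # fp32 master copies, so training resumes from the loaded weights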
                self.optimizer.refresh_fp32_params()

In DeepSpeed's load_checkpoint function, load_module_only enables syncing the optimizer params with the module params.

I checked this code with a 6B model and found that finetuning works correctly (verifying output response quality and valid_loss).

I'm submitting this to contribute to the project; please review my code.
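
For context, a minimal sketch of the relevant config keys when finetuning (the checkpoint path below is a placeholder; all other settings come from your usual training config):

        {
          # placeholder path to the pretrained checkpoint directory
          "load": "/path/to/pretrained/checkpoints",
          # load module weights only and reset optimizer/lr-scheduler state
          "finetune": true,
        }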

@taegyeongeo requested a review from a team as a code owner May 6, 2023 00:36
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


logan.eo seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@StellaAthena
Member

It looks like you've only changed the example config file, not fixed the actual bug?

@taegyeongeo
Contributor Author

@StellaAthena
Sorry, I didn't commit "megatron/checkpoint.py". I've pushed it now.

@StellaAthena
Member

@taegyeongeo who or what is “logan.eo” and why are they an author of this PR?

@taegyeongeo
Contributor Author

taegyeongeo commented May 8, 2023

@StellaAthena
Oh, it conflicted with my GitHub Enterprise account. logan.eo is me.

@StellaAthena
Member

@taegyeongeo Apologies that we haven't merged this yet... we've been very busy of late and haven't been able to test it rigorously.

You say

I added one line of code to use the load_module_only option in DeepSpeed's load_checkpoint function

To be clear, does this mean that you also need to update DeeperSpeed to use this?

@Quentin-Anthony
Member

To be clear, does this mean that you also need to update DeeperSpeed to use this?

No, @taegyeongeo is just saying that we're using DeepSpeed's existing load_module_only here: https://github.com/microsoft/DeepSpeed/blob/58c4d230920f10b9a0c33891b6cb88afc1a6a5f4/deepspeed/runtime/engine.py#L2542

This doesn't require DeeperSpeed changes.
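
For reference, a minimal sketch of calling DeepSpeed's load_checkpoint directly with this flag (the engine setup and paths here are assumptions, not from this thread):

        # `engine` is an already-initialized DeepSpeed engine (from deepspeed.initialize)
        load_path, client_state = engine.load_checkpoint(
            "/path/to/checkpoints",          # checkpoint root directory (placeholder)
            load_optimizer_states=False,     # skip optimizer state, as when finetuning
            load_lr_scheduler_states=False,  # skip LR scheduler state as well
            load_module_only=True,           # weights only; fp32 master params get refreshed
        )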

@Quentin-Anthony
Member

@Quentin-Anthony left a comment

LGTM. Thanks for this fix!

@Quentin-Anthony merged commit befd133 into EleutherAI:main May 9, 2023
0 of 3 checks passed
@WaveLi123

Following your approach to finetune the original 20B checkpoint, I still got the error "Empty ds_version in checkpoint", as in #767.

@taegyeongeo
Contributor Author

taegyeongeo commented May 10, 2023

@WaveLi123
I know about this problem. For my test, I disabled the "ds_version" check lines in DeepSpeed. If you want to test without editing the DeepSpeed code, you can just add "ds_version" to the 20B checkpoint. It's not an issue with this code; the checkpoint just isn't compatible with the current DeepSpeed version.

@WaveLi123

Yes, it is a DeepSpeed version mismatch. How do I add "ds_version" to the 20B checkpoint? Any code reference? Thanks a lot.

@StellaAthena
Member

@WaveLi123 Just add it to 20B.yml

@WaveLi123

WaveLi123 commented May 12, 2023

@WaveLi123 Just add it to 20B.yml

Simply adding "ds_version": "0.3.15" to 20B.yml does not work. I got this error: TypeError: __init__() got an unexpected keyword argument 'ds_version'

@Quentin-Anthony
Member

@WaveLi123 Just add it to 20B.yml

This won't work. You need to add it to the model's state_dict object itself.

@taegyeongeo -- Do you have a snippet you can share? I'd love to be able to point gpt-neox 1.0 users to a small snippet that just loads the checkpoint and adds the ds_version attribute (arbitrarily based on an arg).
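
A minimal sketch of what such a snippet might look like, assuming the ds_version field is read from the mp_rank_*_model_states.pt shards under the checkpoint tag directory (the argument names and the default version string are arbitrary placeholders):

        import argparse
        import glob
        import os

        import torch

        parser = argparse.ArgumentParser()
        parser.add_argument("--checkpoint_dir", required=True,
                            help="tag directory containing mp_rank_*_model_states.pt")
        parser.add_argument("--ds_version", default="0.3.15",
                            help="version string to stamp into the checkpoint")
        args = parser.parse_args()

        # patch every model-states shard that is missing a ds_version field
        for path in sorted(glob.glob(os.path.join(args.checkpoint_dir, "mp_rank_*_model_states.pt"))):
            state_dict = torch.load(path, map_location="cpu")
            if not state_dict.get("ds_version"):
                state_dict["ds_version"] = args.ds_version
                torch.save(state_dict, path)
                print(f"added ds_version={args.ds_version} to {path}")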
