
Implement DeepSpeed Main autotuning for NeoX #739

Merged: 101 commits, Mar 9, 2023
Conversation

@dashstander (Contributor) commented on Dec 9, 2022

Implements the autotuning feature from DeepSpeed, which programmatically explores different combinations of micro-batch size, gradient accumulation steps, and ZeRO sharding options given your model and compute setup.
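For context on what the autotuner consumes, below is a minimal sketch of a DeepSpeed-style config with an "autotuning" section. It is based on DeepSpeed's general autotuning documentation, not on this PR: the specific keys and values shown (fast, metric, results_dir, etc.) are illustrative assumptions.

```python
# Hypothetical sketch of a DeepSpeed config dict with an "autotuning" section.
# Key names follow DeepSpeed's autotuning docs; the values shown are
# illustrative assumptions, not settings taken from this PR.
ds_config = {
    # Parameters set to "auto" are left for the autotuner to explore.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": "auto"},
    "autotuning": {
        "enabled": True,
        "fast": True,               # fast mode explores a smaller search space
        "metric": "throughput",     # optimize throughput rather than latency or FLOPS
        "results_dir": "autotuning_results",
        "exps_dir": "autotuning_exps",
        "overwrite": False,
    },
}
```

On the launcher side, DeepSpeed exposes an --autotuning flag (e.g. --autotuning tune or --autotuning run) that drives the experiments; how NeoX surfaces this through its own config and launch scripts is what this PR wires up, and is not shown in the sketch above.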

Dashiell Stander and others added 30 commits September 21, 2022 20:33
@dashstander (Contributor, Author) commented:

Ok, should be ready for review @Quentin-Anthony

@Quentin-Anthony Quentin-Anthony changed the base branch from main to deepspeed_main February 14, 2023 22:02
Base automatically changed from deepspeed_main to main March 9, 2023 16:55
4 participants