
How PP and ZeRO stage 2+ work together? #682

Closed
hyunwoongko opened this issue Apr 6, 2022 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

hyunwoongko commented Apr 6, 2022

At least as far as I know, ZeRO-2 partitions gradients across data-parallel ranks, while PP accumulates gradients across micro-batches, so there is no real performance benefit when these two mechanisms are combined.

Related issues:

But #477 makes PP and ZeRO stage 2+ work together. How do they work together?
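The tension described above can be sketched with a toy model (all names and numbers here are illustrative, not taken from either library): plain data parallelism with gradient accumulation reduces gradients once per optimizer step, while ZeRO-2 must reduce-scatter its partitioned gradients after every backward pass, i.e. once per micro-batch under PP.

```python
# Toy model (hypothetical names) of gradient-communication volume per step.
# Plain DP + gradient accumulation: gradients are accumulated locally over
# all micro-batches and reduced once per optimizer step.
# ZeRO-2 partitions gradients, so each micro-batch's backward pass must
# reduce-scatter its gradients eagerly instead of accumulating them.

def comm_volume_dp(grad_bytes: int, micro_batches: int) -> int:
    """One collective over the full gradient per step, regardless of
    how many micro-batches PP accumulates."""
    return grad_bytes

def comm_volume_zero2_pp(grad_bytes: int, micro_batches: int) -> int:
    """A reduce-scatter of the full gradient after every micro-batch."""
    return grad_bytes * micro_batches

print(comm_volume_dp(100, 8))        # 100
print(comm_volume_zero2_pp(100, 8))  # 800: 8x the gradient traffic
```

This ignores constant factors (an all-reduce moves roughly twice the bytes of a reduce-scatter), but it shows why the per-micro-batch reduction scales with the micro-batch count while accumulation does not.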

feifeibear added the documentation label Apr 7, 2022
ver217 (Member) commented Apr 7, 2022

We currently just want to support more parallel training methods. Since ZeRO and PP are both important and useful parallel training methods, we think users may want to combine them in some special cases. For example, when a layer cannot fit in memory, as in GPT-3, we must partition it. Of course, we can use TP+PP, but if GPU memory is still not enough for TP, ZeRO+PP may be another solution. As for efficiency, we can optimize it in the future.

ver217 (Member) commented Apr 7, 2022

Here is an example of training GPT, and this is an example config file of using ZeRO, TP and PP together.
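The linked example and config file are not reproduced in the thread, but a Colossal-AI-era config combining the three methods looked roughly like the sketch below. All field names and values here are illustrative assumptions for the reader, not the contents of the actual linked file.

```python
# Illustrative sketch of a Colossal-AI-style config combining ZeRO, TP and PP.
# Field names and values are assumptions, not the linked example file.
parallel = dict(
    pipeline=2,                      # 2 pipeline stages
    tensor=dict(size=2, mode='1d'),  # 2-way 1D tensor parallelism
)

zero = dict(
    # ZeRO settings (e.g. which optimizer/gradient states to shard,
    # offloading options) would go here.
)
```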

hyunwoongko (Author) commented Apr 7, 2022

Combining ZeRO-2 with PP is not mechanistically efficient. ZeRO-2 has to partition the gradients, while PP has to accumulate them across micro-batches, so there is no real performance boost; it is actually slower and less memory-efficient than plain ZeRO-2 (without PP). And if a user runs out of memory, they can use ZeRO-3 instead of ZeRO-2 + PP.

I'm saying it is mechanistically inefficient to use these two algorithms together. I have been testing this for a year, and it won't be easy to improve the efficiency of the combination. To do that, one of the two mechanisms would have to be redesigned in a completely new way.

That's why DeepSpeed prevents PP and ZeRO-2 from being used together: not because they can't implement it, but because it is inefficient to do so. https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/engine.py#L72 If you want to allow ZeRO + PP, I think it is reasonable to limit it to stage 1.
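The guard linked above can be paraphrased as a simple constructor-time check. This is a hedged sketch of the idea, not DeepSpeed's actual code; the real class, attribute names, and error message differ.

```python
# Sketch of the kind of guard DeepSpeed's pipeline engine applies
# (paraphrased; the real class and attribute names differ).
class PipelineEngine:
    def __init__(self, zero_stage: int):
        # Pipeline parallelism is only allowed with ZeRO disabled or stage 1.
        assert zero_stage <= 1, (
            "ZeRO stage 2+ is incompatible with pipeline parallelism: "
            "PP accumulates gradients across micro-batches, while ZeRO-2 "
            "partitions and reduces them eagerly."
        )
        self.zero_stage = zero_stage

# Stage 1 is accepted; stage 2 is rejected at construction time.
PipelineEngine(zero_stage=1)
try:
    PipelineEngine(zero_stage=2)
except AssertionError as e:
    print("rejected:", e)
```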

hyunwoongko (Author) commented Apr 7, 2022

I also don't think the name ShardedModel makes sense for Colossal AI, because for ZeRO-1 and ZeRO-2 the model itself is not actually sharded; the name only seems reasonable for ZeRO-3. What do you think? Of course, all of this is just advice, so feel free to do whatever you guys want :)

feifeibear (Contributor) commented

> I also don't think it makes sense for Colossal AI to use the name of ShardedModel. Because for ZeRO1 and 2 we don't actually split the model. This name only seems reasonable on ZeRO3. What do you think? Of course, all of this is just advice, so feel free to do whatever you want :)

Thanks for your advice; maybe ShardableModel is more appropriate.

Dounm commented Jul 27, 2022

> Combining ZeRO2 with PP is not mechanistically efficient. ZeRO2 has to split the gradients, but PP has to accumulate the gradients, so there's no real performance boost, it's actually slower and memory inefficient than normal ZeRO2 (without PP).

@hyunwoongko Hi, I was wondering if it's possible to implement PP with ZeRO-2/3.

Say we have a 48-layer model (48L, where L stands for layers) trained with 2-way DP × 4-way PP, which means each PP stage gets 2 devices with 12L replicated across them. Now I want to use torch FSDP (which, as I understand it, is equivalent to ZeRO-3) to implement the 2-way DP by splitting those 12L into 3 FSDP units (each wrapping 4L).

With the placement above, when dev0 runs the forward pass, it will all-gather the parameters of the 1st FSDP unit (4L), compute, discard the parameters, and then move on to the 2nd FSDP unit (4L). So at any point in time, dev0 materializes parameters/grads for only 4L instead of 12L, reducing peak memory usage.
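The peak-memory claim in this placement can be checked with a toy walk-through. The layer counts come from the example above; the simulation itself is purely illustrative, not FSDP code.

```python
# Toy walk-through of the placement above: one PP stage holds 12 layers,
# wrapped as 3 FSDP units of 4 layers each. FSDP all-gathers one unit's
# full parameters, computes with them, then frees them before gathering
# the next unit, so only one unit is ever fully materialized.
LAYERS_PER_STAGE = 12
UNITS = 3
LAYERS_PER_UNIT = LAYERS_PER_STAGE // UNITS  # 4

def peak_materialized_layers(fsdp: bool) -> int:
    peak = 0
    if fsdp:
        for _ in range(UNITS):
            live = LAYERS_PER_UNIT   # all-gather one unit's parameters
            peak = max(peak, live)   # compute while it is materialized
            live = 0                 # discard non-owned shards after use
    else:
        peak = LAYERS_PER_STAGE      # whole stage resident at once
    return peak

print(peak_materialized_layers(fsdp=False))  # 12
print(peak_materialized_layers(fsdp=True))   # 4
```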

Also, for every micro-batch computed (fw & bw) by dev0, it will all-gather the weights and all-reduce the grads, which leads to many replicated communications (one set for every micro-batch of a batch in an iteration). But all of these communications can be overlapped with computation if we schedule them carefully.

Of course, the scheduling algorithm would be much more complicated for pipeline with FSDP (ZeRO-3), but as a result of using FSDP×PP we reduce memory usage without communication overhead (the communication can be overlapped), compared with TP×PP (where the communication cannot be overlapped).
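One way to see how many communications this hypothetical schedule replicates is to count the collectives issued per training iteration. The counting rule follows the comment above (re-gather each unit's weights in forward and backward, reduce each unit's grads in backward); the numbers are illustrative assumptions.

```python
# Count collectives per iteration for the hypothetical FSDP x PP schedule
# described above: every micro-batch re-gathers each unit's parameters
# (once in forward, once in backward) and reduce-scatters each unit's
# gradients in backward. All of these can in principle be overlapped
# with the computation of neighbouring units.

def collectives_per_iteration(units: int, micro_batches: int) -> dict:
    allgathers = units * micro_batches * 2   # forward + backward re-gather
    reduce_scatters = units * micro_batches  # grads reduced every backward
    return {"allgather": allgathers, "reduce_scatter": reduce_scatters}

print(collectives_per_iteration(units=3, micro_batches=8))
# {'allgather': 48, 'reduce_scatter': 24}
```

The count grows linearly with the micro-batch count, which is exactly the replication being discussed; the argument is that overlap, not volume reduction, is what keeps it affordable.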

binmakeswell (Member) commented
We have updated a lot. This issue was closed due to inactivity. Thanks.
