
Remove PP Grad Tail Check #2538

Merged
merged 14 commits into from Oct 26, 2023

Conversation

Quentin-Anthony
Contributor

After the patch in #1400 for BigScience, the final element of the inputs tuple is included conditionally, based on whether its grad is null (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L995).

This will always fail when elt.grad is None in (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1026) with an IndexError: tuple index out of range, because the index_grad_tail in (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1006) doesn't exist.

We're hitting this issue intermittently whenever the grad tail is entirely NaN. I propose we just always include the grad tail.
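
For illustration, here's a minimal self-contained sketch of the mismatch (not the actual engine code; the zero-filled stand-in for a missing grad is an assumption of this sketch, not necessarily what this PR does):

```python
import torch

def pack_grads_conditional(inputs):
    # Sketch of the behaviour described above: the grad tail is only appended
    # when its .grad is populated, so the packed tuple can end up one element short.
    grads = [elt.grad for elt in inputs[:-1]]
    if inputs[-1].grad is not None:
        grads.append(inputs[-1].grad)
    return tuple(grads)

def pack_grads_always(inputs):
    # "Always include the grad tail" in this sketch; substituting zeros for a
    # missing grad is an assumption here, not necessarily this PR's code.
    return tuple(elt.grad if elt.grad is not None else torch.zeros_like(elt)
                 for elt in inputs)

def unpack_tail(grad_outputs, num_inputs):
    # The receiving side always indexes the tail slot, so a short tuple raises
    # IndexError: tuple index out of range.
    return grad_outputs[num_inputs - 1]

# Tiny demonstration: the tail tensor never receives a gradient.
x = torch.randn(4, requires_grad=True)
tail = torch.randn(4, requires_grad=True)   # .grad stays None
x.sum().backward()
inputs = (x, tail)

try:
    unpack_tail(pack_grads_conditional(inputs), len(inputs))
except IndexError as err:
    print("conditional packing:", err)       # tuple index out of range

print("always packing:", unpack_tail(pack_grads_always(inputs), len(inputs)))
```

With the conditional packing, the tuple the sender builds and the tail index the receiver uses disagree whenever the tail grad is missing, which matches the intermittent failure above; always packing the tail keeps the two sides in agreement.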

@dashstander @jeffra @ShadenSmith

Quentin-Anthony added a commit to EleutherAI/DeeperSpeed that referenced this pull request Mar 9, 2023
Remove PP Grad Tail Check (until microsoft#2538 is merged to upstream)
@loadams loadams self-assigned this Aug 22, 2023
@loadams
Contributor

loadams commented Sep 1, 2023

Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

@Quentin-Anthony
Contributor Author

> Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

Yes I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

@loadams
Contributor

loadams commented Sep 1, 2023

> > Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.
>
> Yes I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

Thanks! Other than one currently known failure, it is passing all tests; we just need to figure out who would be best to review it.

@RUAN-ZX
Contributor

RUAN-ZX commented Oct 26, 2023

Hi, I've run into the same problem as well. I wonder if we can move forward and merge this solution ASAP?

@loadams loadams added this pull request to the merge queue Oct 26, 2023
Merged via the queue into microsoft:master with commit 3589cad Oct 26, 2023
15 checks passed
delock pushed a commit to delock/DeepSpeedSYCLSupport that referenced this pull request Oct 30, 2023
* Remove PP Grad Tail Check (microsoft#2538)

* Only communicate grad tail if it exists

Co-authored-by: Dashiell Stander <[email protected]>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>

* Added __HIP_PLATFORM_AMD__=1 (microsoft#4570)

* fix multiple definition while building evoformer (microsoft#4556)

The current builder for evoformer uses the same basename for `attention.cpp` and
`attention.cu`, leading to the same intermediate filename `attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem 
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
      tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
      /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```

I used the following to reproduce and confirm my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```

![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

Co-authored-by: Conglong Li <[email protected]>

* Update ccl.py

---------

Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Ramya Ramineni <[email protected]>
Co-authored-by: Xie Zejian <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
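
For context on the evoformer multiple-definition note above, here is a minimal sketch (hypothetical paths and extension name, not the actual DeepSpeed builder) of how two sources sharing a basename collide at link time; renaming one of the files is a common remedy, though not necessarily the exact fix applied in microsoft#4556.

```python
# Illustrative only: the extension build derives object names from the source
# basename, so attention.cpp and attention.cu are both emitted as .../attention.o
# and the linker then sees duplicate symbols (as in the log quoted above).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ext = CUDAExtension(
    name="evoformer_attn_demo",               # hypothetical extension name
    sources=[
        "csrc/evoformer_attn/attention.cpp",  # -> build/.../attention.o
        "csrc/evoformer_attn/attention.cu",   # -> build/.../attention.o (collision)
        # Giving the .cu file a distinct basename, e.g. attention_cuda.cu,
        # would produce separate object files and avoid the duplicate symbols.
    ],
)

setup(
    name="evoformer-attn-demo",
    ext_modules=[ext],
    cmdclass={"build_ext": BuildExtension},
)
```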
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024