Update deeperspeed final #46

Merged
merged 914 commits into from
Mar 9, 2023

Conversation

Quentin-Anthony
Member

No description provided.

mrwyattii and others added 30 commits September 14, 2022 01:11
* add quant unit test

* add codeowner

* format fix

* fix undefined symbol: curandSetPseudoRandomGeneratorSeed

* modify ref fn name and add comment

* add comments

* add 4bit quant 16groups

* fix

* modify groups in ref code

* parameterize tensor shape

* single param

* detach tensor

* remove -lcurand flag

* add back -lcurand flag

Co-authored-by: Ammar Ahmad Awan <[email protected]>
MOE residual matmul unit tests

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
* Fix formatting

* Remove redundant variable
* mem access for quantize kernel

* format

* format fp32

* modify quant kernel

* modify quant kernel2

* modify format

* format

* fix comments in pytest

* fix comments in pytest

* format

* rerun

Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
* Unify macro definitions and constants in a single file

* Conversion utility implementation.

* Fix reversion from formatting

* Bugfixes after testing with correct DeepSpeed

* Inline markers are available on both HIP + CUDA
Co-authored-by: Saeyeol Lee <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
* Collect error messages in results.csv

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* batch of refactored tests

* more test refactoring

* fp16 test refactor

* more refactors

* added DistributedFixture class

* applied DistributedFixture to first batch of tests as a trial

* added DistributedFixture test and documentation

* last tests

* fixes for refactored tests

* remove subdirs in workflow files

* fix pytest syntax error

* fix another syntax error

* update imports

* use DistFixture with elastic checkpoint test

* missing import

* update to shared class tmpdir for elastic test

* moved test files

* avoid duplicate test file name

* last refactor and moving test files

* formatting

* fix broken import

* testing forked AMD tests

* update abstract method

* use blob storage for accelerate and transformers tests

* upgrade torch for accelerate CI

Co-authored-by: Olatunji Ruwase <[email protected]>
molly-smith and others added 27 commits February 21, 2023 11:52
* data efficiency library update

* data efficiency library update

* data efficiency update

* data efficiency update
* Make z3 respect comm dtype

* Support fp32 comm dtype

* Remove obsolete assert

* Code cleanup
* Modify table for compatible web format

* Add tutorial links to navigation

* Add news bit to main readme

* Update docs/_tutorials/automatic-tensor-parallelism.md

Co-authored-by: Michael Wyatt <[email protected]>

---------

Co-authored-by: Michael Wyatt <[email protected]>
* Check device count before running dist tests

* fixing format for "Check device count before running dist tests"

* Check device count against max world size

* Check GPU count before launching dist tests

* double-check GPU actually exists

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
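
A minimal sketch of the device-count check described in the commits above. The helper names `has_enough_devices` and `skip_reason` are hypothetical, and plain integers stand in for a real accelerator query; this is an illustration of the idea, not the code from the DeepSpeed test suite.

```python
def has_enough_devices(requested_world_sizes, available_devices):
    """Return True when every requested world size fits on this machine.

    Hypothetical helper for illustration; a real test harness would obtain
    `available_devices` from the accelerator runtime.
    """
    if available_devices <= 0:
        # No usable device at all: distributed tests cannot launch.
        return False
    # Compare against the largest world size, since it is the binding one.
    return max(requested_world_sizes) <= available_devices


def skip_reason(requested_world_sizes, available_devices):
    """Build a human-readable skip message, or None if the test can run."""
    if has_enough_devices(requested_world_sizes, available_devices):
        return None
    needed = max(requested_world_sizes)
    return f"needs {needed} device(s), found {available_devices}"
```

Checking against the maximum world size before launching lets a test be skipped up front instead of failing partway through process startup.
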
* Remove deprecated `torch._six` imports

Closes microsoft#2845.

* Support older versions of PyTorch as well.

---------

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
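
One common way to drop a `torch._six` import while still supporting older PyTorch is a guarded import. The sketch below is an assumption about the general pattern, not the exact change in this commit; `resolve_inf` is a hypothetical name, and the `math.inf` fallback exists only so the snippet runs where PyTorch is not installed.

```python
import math


def resolve_inf():
    """Return an `inf` constant without hard-depending on `torch._six`."""
    try:
        # Preferred: the public top-level name in recent PyTorch releases.
        from torch import inf
        return inf
    except ImportError:
        pass
    try:
        # Older PyTorch releases that still ship the private module.
        from torch._six import inf
        return inf
    except ImportError:
        # Assumption for this sketch only: no PyTorch available at all.
        return math.inf
```
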
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* Enable tensor fragments for zero 2

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Support offload

* Support multi-gpu

* Cleanup

* WIP

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* Support padding

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* z3 optimizer state support; aligned api

* Support frozen z3 params

* Unit tests

* Check NVMe offload capability

* Formatting

* Docs

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* Support unsharded fp32 grad

* Remove debug prints

* Fix off-by-one detection of empty grads

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* Fix off-by-one error

* Skip ranks with no gradient data

* Formatting

* Add license

* Fix license

---------

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
This PR updates the replace_fn function used when loading inference checkpoints. The container is now passed to load_model_with_checkpoint() so that load_params() can be called from there. load_params() has also been updated to access the variables in the policy.
* microsoft#1213: Fix CPUAdam for when `vendor_id_raw` is not provided

* formatting (yapf) fix

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
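
A minimal sketch of tolerating a missing `vendor_id_raw` entry, assuming the CPU information arrives as a plain dictionary. The helper name `cpu_vendor` and the `"unknown"` default are hypothetical and are not taken from the CPUAdam builder.

```python
def cpu_vendor(cpu_info, default="unknown"):
    """Return a lowercase vendor string from a cpuinfo-style dict.

    `cpu_info` is assumed to be a mapping that may or may not contain the
    `vendor_id_raw` key; a missing or empty value yields `default`.
    """
    vendor = cpu_info.get("vendor_id_raw")
    if not vendor:
        # Key absent or empty: avoid a KeyError and use the fallback.
        return default
    return vendor.lower()
```

Using `.get` with a fallback keeps the lookup from raising on platforms where the field is not reported.
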
Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs

Relevant issue:
microsoft#2853

Co-authored-by: Olatunji Ruwase <[email protected]>
* MPICH support

* MPICH changes

* MPICH changes

* MPICH changes

* MPICH changes

* accelerator runtime modifications

* Accelerator runtime changes

* Accelerator runtime modifications

* Remove redundant print from single node

* Move hostfile to tmp

* Code cleanup for MPICH class

* Code cleanup, rm whitespace

* Removing mpiexec environment check details

* Tmp hostfile not needed, as it is passed directly

* Remove debugging comments

* rm print statement

* Revert comm changes as WA not needed

* Use MPICHRunner name for class

* Use MPICHRunner as class name

* No need to use args.force_multi and args.launcher.

This should be set in the DeepSpeedExamples gpt-3.6b.sh script as:
launcher=MPICH
run_cmd="deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"

* Adhere to code pattern

* Rm empty lines in MPICHRunner class

* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh

* pass MPICH hostfile through launcher_args in gpt-3.6b.sh

* Clean code and remove args hostfile

* fix merge

* fix merge

---------

Co-authored-by: Abhilash Majumder <[email protected]>

* clean up and fix format

* add ut

---------

Co-authored-by: Abhilash Majumder <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* check kernel injection supported models

* Clarify why user should use kernel injection
@Quentin-Anthony Quentin-Anthony merged commit fdfb825 into main Mar 9, 2023