ZeRO-Inference refresh #4197

Merged
merged 113 commits into master on Sep 11, 2023

Conversation

tjruwase
Contributor

Two new optimizations

  • Weight quantization
  • KV cache offloading to CPU memory
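
A minimal sketch of enabling the new weight quantization path, pieced together from the usage discussed later in this thread. The import path and the weight_quantization config keys are assumptions, not a finalized API (the interface is still experimental; see the discussion below):

    import torch
    import deepspeed
    from transformers import AutoModel
    # assumed import path for the experimental interface discussed in this thread
    from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

    # assumed config keys sketching group-wise INT4 weight quantization
    ds_config = {
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},
        "weight_quantization": {
            "quantized_initialization": {"num_bits": 4, "group_size": 64}
        },
    }

    model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16).eval()
    model = _init_group_wise_weight_quantization(model, ds_config)
    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]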

donglinz and others added 30 commits May 5, 2023 11:16
* INT4 weight only quantization

* pre commit

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* add zero3 test

* quantize small weight first to prevent oom

* fold quantization config into ds_config

* Fix license & refactor ds_config & rebase master

* fix UT
… kernel (#522)

* Add experimental int4 dequantize kernel

* move quantization into post_init_method

* fix
* Move int4 code to deepspeed/inference

* fix

* fix

* fix
* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
@delock
Contributor

delock commented Aug 24, 2023

@tjruwase, are there examples using weight quantization and kv_cache offloading? I wonder what it would take to support int4 on other accelerators as well.

@tjruwase
Contributor Author

@tjruwase, are there examples using weight quantization and kv_cache offloading? I wonder what it would take to support int4 on other accelerators as well.

Sorry, the examples release won't be ready until next week. It would be great to get your feedback then.

@tjruwase
Contributor Author

FYI, @donglinz @cli99

@ftian1
Contributor

ftian1 commented Sep 11, 2023

@tjruwase May I get some input from your side about contributing more weight-only quantization algorithms, such as RTN/AWQ/GPTQ, to DeepSpeed inference? Is it OK for DeepSpeed inference to accept compression algorithms other than ZeRO? Some of these algorithms require calibration on datasets, which may impact the inference API. Do you think this is valuable to add, or should we stick with data-free compression at first?
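
For context, a minimal, illustrative sketch of the data-free RTN approach mentioned above: group-wise round-to-nearest INT4 quantize/dequantize in PyTorch. The function names and group size are hypothetical, not DeepSpeed's API; calibration-based methods like GPTQ/AWQ would additionally need a pass over a dataset:

    import torch

    def rtn_quantize(w: torch.Tensor, group_size: int = 64, bits: int = 4):
        # one (scale, zero-point) pair per group of `group_size` weights
        g = w.reshape(-1, group_size)
        w_min = g.min(dim=1, keepdim=True).values
        w_max = g.max(dim=1, keepdim=True).values
        qmax = 2 ** bits - 1
        scale = (w_max - w_min).clamp(min=1e-8) / qmax
        zero = torch.round(-w_min / scale)
        q = torch.clamp(torch.round(g / scale) + zero, 0, qmax).to(torch.uint8)
        return q, scale, zero

    def rtn_dequantize(q, scale, zero, shape):
        return ((q.float() - zero) * scale).reshape(shape)

    # data-free: no calibration set needed, unlike GPTQ/AWQ
    w = torch.randn(128, 256)
    q, s, z = rtn_quantize(w)
    err = (w - rtn_dequantize(q, s, z, w.shape)).abs().max()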

@ftian1
Contributor

ftian1 commented Sep 11, 2023

and some questions/comments about the current PR:

  1. Will _init_group_wise_weight_quantization() be the final interface exposed to users to enable weight quantization? It seems a little odd, as its naming convention looks like an internal function and the user experience is not great.

  2. Will this feature be enabled in the DeepSpeed Inference path? That is, can users enable it by invoking deepspeed.init_inference() rather than the code below?

        model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16)
        model = model.eval()

        model = _init_group_wise_weight_quantization(model, ds_config)
        ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]

  3. May I know when this PR will be merged? Or are there specific concerns about not merging this feature into DeepSpeed?

@tjruwase
Contributor Author

@tjruwase May I get some input from your side about contributing more weight-only quantization algorithms, such as RTN/AWQ/GPTQ, to DeepSpeed inference? Is it OK for DeepSpeed inference to accept compression algorithms other than ZeRO? Some of these algorithms require calibration on datasets, which may impact the inference API. Do you think this is valuable to add, or should we stick with data-free compression at first?

@ftian1, thanks for the questions. We will share a response asap.

@tjruwase
Contributor Author

and some questions/comments about the current PR:

  1. Will _init_group_wise_weight_quantization() be the final interface exposed to users to enable weight quantization? It seems a little odd, as its naming convention looks like an internal function and the user experience is not great.

_init_group_wise_weight_quantization() is still experimental. We hope to finalize the interface soon.

  2. Will this feature be enabled in the DeepSpeed Inference path? That is, can users enable it by invoking deepspeed.init_inference() rather than the code below?

        model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16)
        model = model.eval()

        model = _init_group_wise_weight_quantization(model, ds_config)
        ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]

Yes, this is the plan.
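
A hedged sketch of what that single-call flow might look like: deepspeed.init_inference() exists today, but the weight_quantization key shown in the config is an assumption about a future interface, not this PR's API:

    import torch
    import deepspeed
    from transformers import AutoModel

    model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16).eval()

    # hypothetical: quantization driven by the inference config alone,
    # with no separate _init_group_wise_weight_quantization() call
    engine = deepspeed.init_inference(
        model,
        config={"dtype": torch.float16,
                "weight_quantization": {"num_bits": 4, "group_size": 64}},
    )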

  3. May I know when this PR will be merged? Or are there specific concerns about not merging this feature into DeepSpeed?

This week.

@tjruwase tjruwase added this pull request to the merge queue Sep 11, 2023
Merged via the queue into master with commit aa4a740 Sep 11, 2023
16 checks passed
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request Sep 15, 2023
* origin/master: (48 commits)
  Fix autotune to support Triton 2.1  (microsoft#4340)
  Fix skipped inference tests (microsoft#4336)
  Suppress noise (microsoft#4310)
  Fix a bug in the implementation of dequantization for inference (microsoft#3433)
  DS-Chat BLOOM: Fix Attention mask (microsoft#4338)
  clear redundant timers (microsoft#4308)
  Add release version checking (microsoft#4328)
  Fix Zero3 contiguous grads, reduce scatter false  accuracy issue (microsoft#4321)
  Clean up modeling code (microsoft#4320)
  Handle empty parameter groups (microsoft#4277)
  Update README.md (microsoft#4316)
  README update (microsoft#4303)
  Update release and bump patch versioning flow (microsoft#4286)
  added a bert-model check for triton (microsoft#4266)
  ZeRO-Inference v2 release
  bump to 0.10.4
  Update index.md (microsoft#4297)
  fix user args parsing of string with spaces on runner (microsoft#4265)
  ZeRO-Inference refresh (microsoft#4197)
  AMD Kernel Compatibility Fixes (microsoft#3180)
  ...