ZeRO-Inference refresh #4197

Merged
merged 113 commits into master on Sep 11, 2023

Conversation

tjruwase
Contributor

Two new optimizations

  • Weight quantization
  • KV cache offloading to CPU memory
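
A minimal sketch of enabling the new weight quantization path, pieced together from the usage discussed later in this thread. The import path and the weight_quantization config keys are assumptions, not a finalized API (the interface is still experimental; see the discussion below):

    import torch
    import deepspeed
    from transformers import AutoModel
    # assumed import path for the experimental interface discussed in this thread
    from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

    # assumed config keys sketching group-wise INT4 weight quantization
    ds_config = {
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},
        "weight_quantization": {
            "quantized_initialization": {"num_bits": 4, "group_size": 64}
        },
    }

    model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16).eval()
    model = _init_group_wise_weight_quantization(model, ds_config)
    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]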

donglinz and others added 30 commits May 5, 2023 11:16
* INT4 weight only quantization

* pre commit

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* add zero3 test

* quantize small weight first to prevent oom

* fold quantization config into ds_config

* Fix license & refactor ds_config & rebase master

* fix UT
… kernel (#522)

* Add experimental int4 dequantize kernel

* move quantization into post_init_method

* fix
* Move int4 code to deepspeed/inference

* fix

* fix

* fix
* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
@delock
Contributor

delock commented Aug 24, 2023

@tjruwase, are there examples using weight quantization and kv_cache offloading? I wonder what it would take to support int4 on other accelerators as well.

@tjruwase
Contributor Author

@tjruwase, are there examples using weight quantization and kv_cache offloading? I wonder what it would take to support int4 on other accelerators as well.

Sorry, the examples release won't be ready until next week. It would be great to get your feedback then.

@tjruwase
Contributor Author

FYI, @donglinz @cli99

@ftian1
Contributor

ftian1 commented Sep 11, 2023

@tjruwase May I get some input from your side about contributing more weight-only quantization algorithms, such as RTN/AWQ/GPTQ, to DeepSpeed inference? Is it OK for DeepSpeed inference to accept compression algorithms other than ZeRO? Some of these algorithms require calibration on datasets, which may impact the inference API. Do you think this is valuable to add, or should we stick with data-free compression at first?
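
For context, a minimal, illustrative sketch of the data-free RTN approach mentioned above: group-wise round-to-nearest INT4 quantize/dequantize in PyTorch. The function names and group size are hypothetical, not DeepSpeed's API; calibration-based methods like GPTQ/AWQ would additionally need a pass over a dataset:

    import torch

    def rtn_quantize(w: torch.Tensor, group_size: int = 64, bits: int = 4):
        # one (scale, zero-point) pair per group of `group_size` weights
        g = w.reshape(-1, group_size)
        w_min = g.min(dim=1, keepdim=True).values
        w_max = g.max(dim=1, keepdim=True).values
        qmax = 2 ** bits - 1
        scale = (w_max - w_min).clamp(min=1e-8) / qmax
        zero = torch.round(-w_min / scale)
        q = torch.clamp(torch.round(g / scale) + zero, 0, qmax).to(torch.uint8)
        return q, scale, zero

    def rtn_dequantize(q, scale, zero, shape):
        return ((q.float() - zero) * scale).reshape(shape)

    # data-free: no calibration set needed, unlike GPTQ/AWQ
    w = torch.randn(128, 256)
    q, s, z = rtn_quantize(w)
    err = (w - rtn_dequantize(q, s, z, w.shape)).abs().max()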

@ftian1
Contributor

ftian1 commented Sep 11, 2023

and some questions/comments about the current PR:

  1. Will _init_group_wise_weight_quantization() be the final interface exposed to users to enable weight quantization? It seems a little odd, as its naming convention looks like an internal function and the user experience is not great.

  2. Will this feature be enabled in the DeepSpeed Inference path? That is, can users enable it by invoking deepspeed.init_inference() rather than the code below?

        model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16)
        model = model.eval()

        model = _init_group_wise_weight_quantization(model, ds_config)
        ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]

  3. May I know when this PR will be merged? Or are there specific concerns about not merging this feature into DeepSpeed?

@tjruwase
Contributor Author

@tjruwase May I get some input from your side about contributing more weight-only quantization algorithms, such as RTN/AWQ/GPTQ, to DeepSpeed inference? Is it OK for DeepSpeed inference to accept compression algorithms other than ZeRO? Some of these algorithms require calibration on datasets, which may impact the inference API. Do you think this is valuable to add, or should we stick with data-free compression at first?

@ftian1, thanks for the questions. We will share a response asap.

@tjruwase
Contributor Author

and some questions/comments about the current PR:

  1. Will _init_group_wise_weight_quantization() be the final interface exposed to users to enable weight quantization? It seems a little odd, as its naming convention looks like an internal function and the user experience is not great.

_init_group_wise_weight_quantization() is still experimental. We hope to finalize the interface soon.

  2. Will this feature be enabled in the DeepSpeed Inference path? That is, can users enable it by invoking deepspeed.init_inference() rather than the code below?

        model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16)
        model = model.eval()

        model = _init_group_wise_weight_quantization(model, ds_config)
        ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]

Yes, this is the plan.
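
A hedged sketch of what that single-call flow might look like: deepspeed.init_inference() exists today, but the weight_quantization key shown in the config is an assumption about a future interface, not this PR's API:

    import torch
    import deepspeed
    from transformers import AutoModel

    model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16).eval()

    # hypothetical: quantization driven by the inference config alone,
    # with no separate _init_group_wise_weight_quantization() call
    engine = deepspeed.init_inference(
        model,
        config={"dtype": torch.float16,
                "weight_quantization": {"num_bits": 4, "group_size": 64}},
    )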

  3. May I know when this PR will be merged? Or are there specific concerns about not merging this feature into DeepSpeed?

This week.

@tjruwase tjruwase added this pull request to the merge queue Sep 11, 2023
Merged via the queue into master with commit aa4a740 Sep 11, 2023
16 checks passed
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request Sep 15, 2023
* origin/master: (48 commits)
  Fix autotune to support Triton 2.1  (microsoft#4340)
  Fix skipped inference tests (microsoft#4336)
  Suppress noise (microsoft#4310)
  Fix a bug in the implementation of dequantization for inference (microsoft#3433)
  DS-Chat BLOOM: Fix Attention mask (microsoft#4338)
  clear redundant timers (microsoft#4308)
  Add release version checking (microsoft#4328)
  Fix Zero3 contiguous grads, reduce scatter false  accuracy issue (microsoft#4321)
  Clean up modeling code (microsoft#4320)
  Handle empty parameter groups (microsoft#4277)
  Update README.md (microsoft#4316)
  README update (microsoft#4303)
  Update release and bump patch versioning flow (microsoft#4286)
  added a bert-model check for triton (microsoft#4266)
  ZeRO-Inference v2 release
  bump to 0.10.4
  Update index.md (microsoft#4297)
  fix user args parsing of string with spaces on runner (microsoft#4265)
  ZeRO-Inference refresh (microsoft#4197)
  AMD Kernel Compatibility Fixes (microsoft#3180)
  ...