ZeRO-Inference refresh #4197
Conversation
* INT4 weight-only quantization
* pre-commit
* fix UT
* fix UT
* fix UT
* fix UT
* fix UT
* fix UT
* fix UT
* add zero3 test
* quantize small weight first to prevent OOM
* fold quantization config into ds_config
* Fix license & refactor ds_config & rebase master
* fix UT
… kernel (#522)
* Add experimental int4 dequantize kernel
* move quantization into post_init_method
* fix
* Move int4 code to deepspeed/inference
* fix
* fix
* fix
* fix conv_flops_compute when padding is a str and stride=1
* fix error
* change type of paddings to tuple
* fix padding calculation
* apply formatting check
---------
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
* Update profiler.py
* pre-commit run --all-files
* Delete .DS_Store
* Delete .DS_Store
* Delete .DS_Store
---------
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
* zeropp chinese blog
* try better quality images
* make title larger
* even larger...
* various fixes
* center captions
* more fixes
* fix format
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Ethan Doe <[email protected]>
Co-authored-by: yidoe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: GuanhuaWang <[email protected]>
Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
* zeropp chinese blog
* try better quality images
* make title larger
* even larger...
* various fixes
* center captions
* more fixes
* fix format
* add ZeRO++ Japanese blog
* add links
---------
Co-authored-by: HeyangQin <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
* fix autotuner when backward is not called
* fix format
---------
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
* Bug fix
* Fixed formatting error
---------
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Stephen Youn <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
@tjruwase are there examples using weight quantization and kv_cache offloading? I also wonder what it would take to support int4 on other accelerators.
Sorry, that release won't be ready until next week. It will be great to get your feedback.
@tjruwase may I get some input from your side on contributing more weight-only quantization algorithms, such as RTN/AWQ/GPTQ, to DeepSpeed inference? Is it OK for DeepSpeed inference to accept compression algorithms other than ZeRO? Some of these algorithms require calibration on a dataset, which may affect the inference API. Do you think they are valuable to add, or should we stick with data-free compression at first?
Also, some questions/comments about the current PR:
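For context, the "data-free" RTN (round-to-nearest) approach mentioned above can be sketched in a few lines. This is a minimal illustration of group-wise weight-only quantization, not DeepSpeed's implementation; the 4-bit width and group size of 64 are assumptions chosen for the example:

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=64):
    """Group-wise round-to-nearest (RTN) weight quantization sketch.

    Splits `w` into groups, scales each group into the signed integer
    range for `bits`, and rounds to the nearest level. Returns int
    codes plus per-group scales for dequantization. No calibration
    data is needed, which is what makes RTN "data-free".
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for int4
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def rtn_dequantize(q, scales, shape):
    # Recover an approximate fp32 weight from codes and per-group scales
    return (q.astype(np.float32) * scales).reshape(shape)

# Round-trip a random weight matrix and measure the reconstruction error
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, s = rtn_quantize(w, bits=4, group_size=64)
w_hat = rtn_dequantize(q, s, w.shape)
max_err = np.abs(w - w_hat).max()
```

AWQ and GPTQ improve on this baseline by using calibration data to choose scales (or update remaining weights) that minimize output error, which is why they would touch the inference API in ways plain RTN does not.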
import torch
import deepspeed
from transformers import AutoModel

model = AutoModel.from_pretrained('facebook/opt-125m', torch_dtype=torch.float16)
model = model.eval()
# _init_group_wise_weight_quantization is the quantization helper added by this PR
model = _init_group_wise_weight_quantization(model, ds_config)
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
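The `ds_config` passed in the snippet above might look roughly like the following. This is a hypothetical sketch: the ZeRO-3 and fp16 sections follow the usual DeepSpeed config schema, but the `weight_quantization` section's key names and values are assumptions for illustration of how the PR folds quantization settings into `ds_config`; consult the DeepSpeed documentation for the exact schema.

```python
# Hypothetical DeepSpeed config sketch for ZeRO-Inference with
# group-wise INT4 weight quantization folded in. Key names under
# "weight_quantization" are assumed, not confirmed by this PR.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # ZeRO-Inference builds on stage 3
        "offload_param": {"device": "cpu"},   # keep weights in host memory
    },
    "train_micro_batch_size_per_gpu": 1,
    "weight_quantization": {                  # assumed section name
        "quantized_initialization": {         # quantize weights at load time
            "num_bits": 4,                    # INT4 weight-only quantization
            "group_size": 64,                 # per-group scaling granularity
        },
    },
}
```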
@ftian1, thanks for the questions. We will share a response asap.
Yes, this is the plan.
This week.
* origin/master: (48 commits)
  Fix autotune to support Triton 2.1 (microsoft#4340)
  Fix skipped inference tests (microsoft#4336)
  Suppress noise (microsoft#4310)
  Fix a bug in the implementation of dequantization for inference (microsoft#3433)
  DS-Chat BLOOM: Fix Attention mask (microsoft#4338)
  clear redundant timers (microsoft#4308)
  Add release version checking (microsoft#4328)
  Fix Zero3 contiguous grads, reduce scatter false accuracy issue (microsoft#4321)
  Clean up modeling code (microsoft#4320)
  Handle empty parameter groups (microsoft#4277)
  Update README.md (microsoft#4316)
  README update (microsoft#4303)
  Update release and bump patch versioning flow (microsoft#4286)
  added a bert-model check for triton (microsoft#4266)
  ZeRO-Inference v2 release
  bump to 0.10.4
  Update index.md (microsoft#4297)
  fix user args parsing of string with spaces on runner (microsoft#4265)
  ZeRO-Inference refresh (microsoft#4197)
  AMD Kernel Compatibility Fixes (microsoft#3180)
  ...
Two new optimizations