
llm int8 integration into Megatron part of metaseq #10

Open · wants to merge 4 commits into base: fairseq_v3
Conversation


@erichan1 erichan1 commented Jan 27, 2023

Add LLM int8 to Megatron.
Steps to use

  1. Flip on QUANTIZED_INFERENCE in layers.py (see the config sketch after this list).
  2. Comment out _log_weight_stats here. Not sure whether this is simply buggy or whether the weight stats are unhappy when the weights aren't initialized; potentially a divide-by-zero issue.
  3. Run through metaseq (or whatever runner you choose).
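
For reference, a minimal sketch of what step 1 looks like in layers.py. QUANTIZED_INFERENCE is the flag named in step 1, and QUANTIZATION_LEVEL / QUANTIZATION_IS_LOAD_STATE_DICT appear in the diff further down; the specific values shown here are just one example of enabling LLM.int8, not the PR's defaults.

```python
# layers.py (sketch) -- module-level switches for the int8 path.
QUANTIZED_INFERENCE = True  # step 1: enable the quantized inference path
QUANTIZATION_LEVEL = 1      # 0 == None, 1 == LLMint8, 2 == Smoothquant W8A16, 3 == Smoothquant W8A8
QUANTIZATION_IS_LOAD_STATE_DICT = True  # only flip to False for benchmarking without loading a checkpoint
```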

Why use hooks for load_state_dict?

  • tl;dr: it's basically a way to stream weights linear-by-linear from CPU to GPU.
  • I had a problem where I started with an fp16 checkpoint but wanted to load it onto fewer GPUs than it was trained on, and the fp16 weights wouldn't fit in GPU memory. I tried fiddling with offline conversion, but it was messy. So fundamentally, we want to create the fp16 weights on CPU and only convert them to int8 as we send them to the GPU.
  • The pre-hook lets us avoid blowing up CPU memory by creating all weights at once (e.g. during initial model creation). We only create a CPU-side weight when a particular linear has load_state_dict called on it, and that CPU weight is destroyed immediately after we send it to the GPU.
  • After the fp16 weight is loaded onto CPU during load_state_dict, the post-hook takes care of converting it to int8 and sending it to the GPU (see the sketch after this list).
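
A minimal sketch of the pattern, not the PR's actual code: it assumes a recent PyTorch (with `_register_load_state_dict_pre_hook` and `register_load_state_dict_post_hook`), and `StreamedInt8Linear` / the per-row absmax `quantize_int8` below are stand-ins for the real LLM.int8 kernels (e.g. bitsandbytes).

```python
import torch
import torch.nn as nn


def quantize_int8(weight_fp16):
    # Toy per-row absmax quantization; the real path would call into the
    # LLM.int8 kernels rather than doing plain round-to-nearest.
    scale = weight_fp16.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(weight_fp16 / scale).to(torch.int8)
    return q, scale.to(torch.float16)


class StreamedInt8Linear(nn.Module):
    def __init__(self, in_features, out_features, device="cuda"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.target_device = device
        # No fp16 weight is allocated up front; the int8 weight and its scale
        # only appear after load_state_dict has run for this module.
        self.register_buffer("weight_int8", None)
        self.register_buffer("weight_scale", None)
        self._register_load_state_dict_pre_hook(self._pre_hook)
        self.register_load_state_dict_post_hook(self._post_hook)

    def _pre_hook(self, state_dict, prefix, *args):
        # Materialize a CPU fp16 staging tensor only when this linear's shard
        # of the checkpoint is actually being loaded into it.
        if prefix + "weight" in state_dict:
            self.weight = nn.Parameter(
                torch.empty(self.out_features, self.in_features, dtype=torch.float16),
                requires_grad=False,
            )

    @staticmethod
    def _post_hook(module, incompatible_keys):
        # Quantize the freshly loaded CPU fp16 weight, ship the int8 copy to
        # the GPU, and drop the CPU tensor so peak host memory stays at
        # roughly one linear's worth of fp16 weights.
        if getattr(module, "weight", None) is not None:
            q, scale = quantize_int8(module.weight.data)
            module.weight_int8 = q.to(module.target_device)
            module.weight_scale = scale.to(module.target_device)
            del module.weight

    def forward(self, x):
        # Dequantize-on-the-fly matmul; a real int8 path would use fused kernels.
        w = self.weight_int8.to(torch.float16) * self.weight_scale
        return x @ w.t()
```

Because load_state_dict post-hooks fire per-module as the checkpoint is consumed, each linear is quantized and moved to the GPU before the next one stages its fp16 weight on CPU, which is what keeps both CPU and GPU peak memory bounded.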

@erichan1 changed the title from "experimental first version of llm int8 integration into metaseq" to "llm int8 integration into Megatron part of metaseq" on Feb 10, 2023
Comment on lines +3 to +4
QUANTIZATION_LEVEL = 0 # 0 == None, 1 == LLMint8, 2 == Smoothquant W8A16, 3 == Smoothquant W8A8
QUANTIZATION_IS_LOAD_STATE_DICT = True # Only flip to False for benchmarking purposes if not loading state dict
erichan1 (Author)

Set QUANTIZATION_LEVEL here for different quantization types
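
A hypothetical sketch of how this flag could gate which linear implementation gets built; the class names Int8Linear and SmoothQuantLinear are placeholders, not names from this PR.

```python
# Hypothetical dispatch on QUANTIZATION_LEVEL (class names are placeholders).
if QUANTIZATION_LEVEL == 1:
    linear_cls = Int8Linear            # LLM.int8
elif QUANTIZATION_LEVEL in (2, 3):
    linear_cls = SmoothQuantLinear     # SmoothQuant W8A16 / W8A8
else:
    linear_cls = torch.nn.Linear       # no quantization
```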
