
llm int8 integration into Megatron part of metaseq #10

Open · wants to merge 4 commits into base: fairseq_v3
Conversation


@erichan1 erichan1 commented Jan 27, 2023

Add LLM int8 to Megatron.
Steps to use

  1. Flip on QUANTIZED_INFERENCE in layers.py (see the config sketch after this list).
  2. Comment out _log_weight_stats here. Not sure whether this is simply buggy or whether the weight stats are unhappy when the weights aren't initialized; potentially a divide-by-zero issue.
  3. Run through metaseq (or whatever runner you choose).
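
For reference, a minimal sketch of what step 1 looks like in layers.py. QUANTIZED_INFERENCE is the flag named in step 1, and QUANTIZATION_LEVEL / QUANTIZATION_IS_LOAD_STATE_DICT appear in the diff further down; the specific values shown here are just one example of enabling LLM.int8, not the PR's defaults.

```python
# layers.py (sketch) -- module-level switches for the int8 path.
QUANTIZED_INFERENCE = True  # step 1: enable the quantized inference path
QUANTIZATION_LEVEL = 1      # 0 == None, 1 == LLMint8, 2 == Smoothquant W8A16, 3 == Smoothquant W8A8
QUANTIZATION_IS_LOAD_STATE_DICT = True  # only flip to False for benchmarking without loading a checkpoint
```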

Why use hooks for load_state_dict?

  • tl;dr: it's basically a way to stream weights linear-by-linear from CPU to GPU.
  • I had a problem where I started with an fp16 checkpoint but wanted to load it onto fewer GPUs than it was trained on, and the fp16 weights wouldn't fit in GPU memory. I tried fiddling with offline conversion, but it was messy. So fundamentally, we want to create the fp16 weights on CPU and only convert them to int8 as we send them to the GPU.
  • The pre-hook lets us avoid blowing up CPU memory by creating all weights at once (e.g. during initial model creation). We only create a CPU-side weight when a particular linear has load_state_dict called on it, and that CPU weight is destroyed immediately after we send it to the GPU.
  • After the fp16 weight is loaded onto CPU during load_state_dict, the post-hook takes care of converting it to int8 and sending it to the GPU (see the sketch after this list).
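
A minimal sketch of the pattern, not the PR's actual code: it assumes a recent PyTorch (with `_register_load_state_dict_pre_hook` and `register_load_state_dict_post_hook`), and `StreamedInt8Linear` / the per-row absmax `quantize_int8` below are stand-ins for the real LLM.int8 kernels (e.g. bitsandbytes).

```python
import torch
import torch.nn as nn


def quantize_int8(weight_fp16):
    # Toy per-row absmax quantization; the real path would call into the
    # LLM.int8 kernels rather than doing plain round-to-nearest.
    scale = weight_fp16.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(weight_fp16 / scale).to(torch.int8)
    return q, scale.to(torch.float16)


class StreamedInt8Linear(nn.Module):
    def __init__(self, in_features, out_features, device="cuda"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.target_device = device
        # No fp16 weight is allocated up front; the int8 weight and its scale
        # only appear after load_state_dict has run for this module.
        self.register_buffer("weight_int8", None)
        self.register_buffer("weight_scale", None)
        self._register_load_state_dict_pre_hook(self._pre_hook)
        self.register_load_state_dict_post_hook(self._post_hook)

    def _pre_hook(self, state_dict, prefix, *args):
        # Materialize a CPU fp16 staging tensor only when this linear's shard
        # of the checkpoint is actually being loaded into it.
        if prefix + "weight" in state_dict:
            self.weight = nn.Parameter(
                torch.empty(self.out_features, self.in_features, dtype=torch.float16),
                requires_grad=False,
            )

    @staticmethod
    def _post_hook(module, incompatible_keys):
        # Quantize the freshly loaded CPU fp16 weight, ship the int8 copy to
        # the GPU, and drop the CPU tensor so peak host memory stays at
        # roughly one linear's worth of fp16 weights.
        if getattr(module, "weight", None) is not None:
            q, scale = quantize_int8(module.weight.data)
            module.weight_int8 = q.to(module.target_device)
            module.weight_scale = scale.to(module.target_device)
            del module.weight

    def forward(self, x):
        # Dequantize-on-the-fly matmul; a real int8 path would use fused kernels.
        w = self.weight_int8.to(torch.float16) * self.weight_scale
        return x @ w.t()
```

Because load_state_dict post-hooks fire per-module as the checkpoint is consumed, each linear is quantized and moved to the GPU before the next one stages its fp16 weight on CPU, which is what keeps both CPU and GPU peak memory bounded.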

@erichan1 changed the title from "experimental first version of llm int8 integration into metaseq" to "llm int8 integration into Megatron part of metaseq" on Feb 10, 2023
Comment on lines +3 to +4
QUANTIZATION_LEVEL = 0 # 0 == None, 1 == LLMint8, 2 == Smoothquant W8A16, 3 == Smoothquant W8A8
QUANTIZATION_IS_LOAD_STATE_DICT = True # Only flip to False for benchmarking purposes if not loading state dict
erichan1 (Author)

Set QUANTIZATION_LEVEL here for different quantization types
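
A hypothetical sketch of how this flag could gate which linear implementation gets built; the class names Int8Linear and SmoothQuantLinear are placeholders, not names from this PR.

```python
# Hypothetical dispatch on QUANTIZATION_LEVEL (class names are placeholders).
if QUANTIZATION_LEVEL == 1:
    linear_cls = Int8Linear            # LLM.int8
elif QUANTIZATION_LEVEL in (2, 3):
    linear_cls = SmoothQuantLinear     # SmoothQuant W8A16 / W8A8
else:
    linear_cls = torch.nn.Linear       # no quantization
```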
