AutoAWQForCausalLM requires the download of pile-val-backup #2458

Closed
e576082c opened this issue Jan 16, 2024 · 6 comments

@e576082c

I installed vllm to automatically run some tests on a bunch of Mistral-7B models that I cooked up locally and do NOT want to upload to Hugging Face before properly testing them. The plan is to:

  1. Convert the fp16 safetensors model to AWQ with AutoAWQ, but keep it in memory instead of saving it. I don't have any free space left on my HDD, and I don't want to litter it with a bunch of experimental AWQ-quantized models, failed or not. Each fp16 safetensors model on disk is around 14.5 GB, and the quantized models should be much smaller (judging by TheBloke's uploads on HF, around 4.15 GB in safetensors format). Both the quantized and the unquantized models would fit into my 64 GB of system RAM, even at the same time.
  2. Run the tests (call it "inference" if you like) on the quantized model (taken from RAM, loaded into VRAM) with vllm. vllm is GPU-only, and one AWQ-quantized 7B model would easily fit into my 12 GB of VRAM (NVIDIA card with working CUDA); after all, I have no problems running Q8 GGUF 7B Mistral models fully offloaded to my GPU with kobold.cpp. The point is that vllm should be even faster with a smaller 4-bit AWQ 7B model, and unlike kobold.cpp, I can write a Python program with vllm that iterates over every model I want to test and runs all the tests in a batch quickly.
  3. Before over-complicating things with testing logic and my spaghetti code looping over multiple model folders, the first logical step is to check whether my AutoAWQ+vllm idea works at all (can it spit out some text from a known-good model or not?).
  4. Here's the problem: AutoAWQForCausalLM wants to download mit-han-lab/pile-val-backup for no apparent reason, without any explanation or warning. My disk is full, so it wouldn't fit there anyway. I could mount a tmpfs (RAM disk), put the "pile-val-backup" files in it, and bind-mount it wherever AutoAWQForCausalLM expects them, but this download looks suspicious. Why is this dataset required, and why is there no statement whatsoever about its license? The dataset card only says "Please respect the original license of the dataset.", but the license is not stated anywhere! The "backup" in its name is another warning sign that it might be something rogue that was taken down and then re-uploaded by someone random on the net.
  5. So of course I checked around, and I found only this dismissed, ignored issue in the AutoAWQ repo. The "solution" says: "Perhaps your network is blocking Huggingface?" Well, you don't say. Why would I want to download this dataset in the first place? Issue closed and ignored; everybody is happy, just download it without thinking, right? Nope, I don't think so. (If the dataset really is only used as calibration data, supplying my own samples should work instead; see the sketch right after this list.)
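AutoAWQ's quantize() appears to accept a calib_data argument (a dataset name or a plain list of strings). I'm not certain the version I have installed supports it, but if it does, something like this sketch should skip the pile-val-backup download entirely; the sample strings below are just made-up placeholders:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Hypothetical local calibration samples: any list of reasonably long text
# snippets in the languages/domains the model should handle well.
calib_samples = [
    "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world.",
    "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)",
    # ...a few hundred more samples of the text you actually care about
]

# Passing our own list (instead of the default pile-val-backup download)
# should avoid any Hugging Face Hub access during quantization.
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)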

So back to the issue about vllm, and how all of this might be related to vllm:

For quick testing, I copy-pasted and modified some code from docs.vllm.ai/en/latest/quantization/auto_awq.html. My code isn't much different from the one in the official vllm docs, and this particular code triggers the download of "pile-val-backup".

Perhaps I messed up something in the code, but I honestly don't think so. Please have a look at it:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model=model, quantization="AWQ")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

I suppose my code is quite trivial, almost the same as in the docs.
Ah, and before I forget:

$ cd "/mnt/AI/models/safetensors/loyal-piano-m7"
$ ls
.
..
config.json
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
$ python3 --version
Python 3.11.2
$ pip show vllm
Name: vllm
Version: 0.2.7
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License: Apache 2.0
Location: /mnt/AI/runner/venv/lib/python3.11/site-packages
Requires: aioprometheus, fastapi, ninja, numpy, psutil, pydantic, ray, sentencepiece, torch, transformers, uvicorn, xformers
Required-by: 
$ pip show torch
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /mnt/AI/runner/venv/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autoawq, lm_eval, peft, torchaudio, torchvision, vllm, xformers

The error I get is:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.46it/s]
Traceback (most recent call last):
  File "/mnt/AI/runner/vllm_autoAWQ_runner.py", line 12, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/models/base.py", line 89, in quantize
    quantizer = AwqQuantizer(
                ^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 36, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
                                                  ^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 320, in init_quant
    samples = get_calib_dataset(
              ^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/utils/calib_data.py", line 11, in get_calib_dataset
    dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 2195, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 1838, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'mit-han-lab/pile-val-backup': Offline mode is enabled.
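
As the last line shows, the call only fails here because offline mode is enabled; with network access it would simply download the dataset. If I were willing to fetch it at all, I could at least keep it off the full HDD by pointing the datasets cache at a tmpfs. A minimal sketch of that idea (the RAM-disk path is made up; the environment variables are the standard Hugging Face datasets ones):

import os

# Keep the datasets cache on an already-mounted RAM disk (hypothetical path)
# so the download never touches the full HDD.
os.environ["HF_DATASETS_CACHE"] = "/mnt/ramdisk/hf_datasets"
os.environ["HF_DATASETS_OFFLINE"] = "0"  # allow a one-time fetch

from datasets import load_dataset

# One-time fetch of the split AutoAWQ asks for; later runs reuse the cached
# copy, so offline mode can be switched back on afterwards.
load_dataset("mit-han-lab/pile-val-backup", split="validation")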

And finally, here is some generic code that currently works, but it's slow, so I'm not happy with it and would like to use vllm instead:

import torch
from transformers import LlamaTokenizer, MistralForCausalLM, BitsAndBytesConfig, pipeline
from transformers import set_seed

set_seed(1)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = False,
    bnb_4bit_compute_dtype=torch.float16
)

model_name = "/mnt/AI/models/safetensors/loyal-piano-m7"

tokenizer = LlamaTokenizer.from_pretrained(model_name)

model = MistralForCausalLM.from_pretrained(
        model_name,
        load_in_4bit=True,
        quantization_config = bnb_config,
        torch_dtype = torch.float16,
        device_map = "auto",
        trust_remote_code = False,
        low_cpu_mem_usage=True
    )


pipe = pipeline(
    "text-generation", 
    model = model, 
    tokenizer = tokenizer, 
    torch_dtype = torch.float16, 
    device_map = "auto"
)


prompt = "\nWikipedia is "

sequences = pipe(
    prompt,
    do_sample = True,
    max_new_tokens = 512,
    temperature = 1.0, 
    top_k = 1, 
    top_p = 1.0,
    typical_p = 1.0,
    repetition_penalty = 1.0,
    epsilon_cutoff = 0.0,
    eta_cutoff = 0.0,
    diversity_penalty = 0.0,
    length_penalty = 1.0,
    return_full_text=True,
    use_cache = False,
    num_return_sequences = 1
)

print(sequences[0]['generated_text'])

So... vllm doesn't work, while the generic code I put together from Huggingface docs does work, but it's too slow.

I would really like to try out vllm, but I won't download a random shady dataset (pile-val-backup) that AWQ requires for whatever reason.

Please remove the dependency on "pile-val-backup".

@MeJerry215

Sorry buddy. AutoAWQ is a separate project, not maintained by vLLM. You should ask in the AutoAWQ repo, not here.

@wasertech

wasertech commented Jan 18, 2024

@e576082c you should really save the AWQ quantization, and then you can reload it from the save dir (or even a repo if you push it manually). The dataset is used to calibrate the quantization, from what I gather, but @casper-hansen should correct me on this one.

The proper way to infer with your AWQ model is as follows.

Quantize and save

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_path = f"{model_path}-awq"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
# This takes a while so you really want to save the result 
# to make inference loading faster
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
# The whole point of awq format is to have a smaller footprint
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Load your AWQ model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7-awq"

# Load quantized model
model = AutoAWQForCausalLM.from_quantized(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
...

Also, I'm pretty sure AWQ needs the full-precision model in pytorch format, as I don't think it supports the safetensors format. At least I didn't have any luck with it.
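
Once the quantized model is saved, the directory can also be handed straight to vLLM instead of an in-memory model object. A minimal sketch along the lines of vLLM's AWQ docs (the path just reuses the quant_path from above):

from vllm import LLM, SamplingParams

quant_path = "/mnt/AI/models/safetensors/loyal-piano-m7-awq"

# vLLM expects a path (or HF repo id) to a saved AWQ checkpoint,
# not a Python model object.
llm = LLM(model=quant_path, quantization="AWQ")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)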

@casper-hansen
Contributor

We need a calibration dataset for quantization and use the Pile validation split for that. It works as intended; I would suggest you increase your disk capacity or clean up the disk.

You also need to save the model; that's just a basic step and what you should expect. If that's a problem, I recommend renting a machine that meets the requirements for quantizing.

@e576082c
Author

Thanks for the code @wasertech, and thanks for the advice, everybody! I didn't know that AWQ quantization requires a calibration dataset (so far I've only used GGUF, which just works).

Regardless, at least adding a warning to the docs about the dataset download would be nice (because of its unknown license).

I am skeptical about quantization methods that rely on a calibration dataset (is the result deterministic? what about multilingual tests? what's the point of a calibration dataset anyway?). The model should be able to write program code in 4 programming languages and be fluent in 6 human languages, so 'calibrating' on an English dataset seems pointless to me. I haven't looked into the dataset, so I don't have a clue what might be inside it; sorry if I guessed wrong.

Anyway, I chose to use GGUF (the older version, not the new one) and made a simple bash script for llama.cpp to iterate over my test prompts (using the --file, --logdir, and --model arguments of llama.cpp).
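
The batch loop itself is nothing fancy; a rough Python equivalent of that bash script would look something like this (all paths are made up, and the flags are just the --model, --file, and --logdir options of llama.cpp's main binary mentioned above):

import subprocess
from pathlib import Path

# Hypothetical locations of the llama.cpp binary, the GGUF models, and the test prompts.
LLAMA_MAIN = "/mnt/AI/llama.cpp/main"
MODELS_DIR = Path("/mnt/AI/models/gguf")
PROMPT_FILES = sorted(Path("/mnt/AI/tests/prompts").glob("*.txt"))
LOG_DIR = "/mnt/AI/tests/logs"

for model in sorted(MODELS_DIR.glob("*.gguf")):
    for prompt_file in PROMPT_FILES:
        # Each run reads one prompt file and writes its run log under LOG_DIR.
        subprocess.run(
            [LLAMA_MAIN, "--model", str(model), "--file", str(prompt_file), "--logdir", LOG_DIR],
            check=True,
        )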

Saving quantized models to disk is not something I would do; even where the code requires writing the file out to disk (probably cleaner that way, due to memory-management issues?), I saved the quantized file to a tmpfs anyway. (I have a lot of unused RAM, so why not use it?) My WD Red SMR disks (lol, 'NAS optimized') are slow, so I prefer to avoid using them. Even if I had managed to free up some space on my disks, rewriting a 4-7 GB file for each test model would have taken a prohibitively long time.

Anyway, thanks again for the help. My problem is solved, I guess (llama.cpp is now chewing through my tests at 20 T/s. xD It's going to take a few more days to complete.) But I'll leave this issue open regarding the unknown license of the pile-val-backup dataset, and the absence of warnings before its download.

@wasertech

My problem is solved, ... But I'll leave this issue open regarding the unknown license of the pile-val-backup dataset, and the absence of warnings before its download.

@e576082c please close this issue here and open a new one in the AutoAWQ repo with a link back; you are in the vLLM repo at the moment, so this isn't the right place to raise it 😅

@hmellor
Collaborator

hmellor commented Mar 6, 2024

Closing as this issue should be raised with AutoAWQ instead.

@hmellor closed this as completed Mar 6, 2024