
Issue running inference on HuggingFace's GPT-J 4-bit model #539

Open
webpolis opened this issue Sep 25, 2023 · 1 comment

This is a follow-up to #371 (comment)

After converting a GPT-J 4-bit model to ggml using the convert-h5-to-ggml.py script, inference fails with the following:

main: seed = 1695659205
gptj_model_load: loading model from 'ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: ftype   = 1
gptj_model_load: qntvr   = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load: 

Apparently, it's doubling the tensor's size. I added some verbosity here:

https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L333

The original tensor has 8388608 elements while ggml expects 16777216:

gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file (16777216, 8388608)
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load:

I assume this might be related to the model being 4-bit, but I'm not yet sure what to change.
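
For what it's worth, the factor of two lines up with 4-bit packing (rough arithmetic, assuming k_proj is a plain n_embd x n_embd linear layer, which matches the hparams printed by the loader):

# 4096 x 4096 weight, per the n_embd printed above
n_embd = 4096
logical_elements = n_embd * n_embd          # 16777216, what ggml derives from the model shape
packed_nf4_entries = logical_elements // 2  # 8388608, bitsandbytes packs two 4-bit values per byte
print(logical_elements, packed_nf4_entries)

So it looks like the converter is writing the packed storage shape instead of the logical weight shape.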

webpolis reopened this Sep 27, 2023

webpolis commented Oct 1, 2023

I partially solved this, but now it's generating a bunch of A's:

❯ ./build/bin/gpt-j -m ./ggml-model-f16.bin -p "A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n" -n 512 --top_p 0.8 --temp 0.2
main: seed = 1696200850
gptj_model_load: loading model from './ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: ftype   = 1
gptj_model_load: qntvr   = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ............................ done
gptj_model_load: model size = 11540.60 MB / num tensors = 229
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 68

A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n
A
A
A
A
A
A
A
A
A
A

My current implementation:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load quantized version of https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    low_cpu_mem_usage=True,
    device_map=device_map, # split between 2 GPUs
    torch_dtype='auto',
    quantization_config=bnb_config,
    use_cache=False
)
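
As a quick sanity check (hypothetical, not part of my script; the module path assumes the stock GPT-J layout), the packed parameter should report roughly half the logical element count, since bitsandbytes keeps NF4 weights as uint8 with two 4-bit values per byte:

# hypothetical check after loading; this is the same tensor the ggml loader complained about
w = model.transformer.h[0].attn.k_proj.weight
print(w.dtype, w.shape, w.numel())  # expecting a packed uint8 tensor with ~n_embd*n_embd/2 entries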

Then proceed to dequantization and export:

import copy
import struct
import sys

import numpy as np
import torch
import bitsandbytes as bnb
import bitsandbytes.functional as F

# `fout`, `ftype` and `list_vars` are defined earlier in the convert script
# (presumably the output file handle, the target ftype and the model's state_dict)
cls = bnb.nn.Linear4bit


def write_data(name, d):
    orig_data_shape = d.shape
    orig_data_size = sys.getsizeof(d)
    d = d.to(torch.float16).squeeze().to('cpu').numpy()
    n_dims = len(d.shape)

    print("Writting: " + name + " with shape: ", d.shape)
    print('Original shape: ', orig_data_shape)
    print((orig_data_size, sys.getsizeof(d)))

    ftype_cur = 0
    if ftype != 0:
        if name[-7:] == ".weight" and n_dims == 2:
            print("  Converting to float16")
            d = d.astype(np.float16)
            ftype_cur = 1
        else:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    else:
        if d.dtype != np.float32:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0

    # header
    str = name.encode('utf-8')
    fout.write(struct.pack("iii", n_dims, len(str), ftype_cur))
    for i in range(n_dims):
        fout.write(struct.pack("i", d.shape[n_dims - 1 - i]))
    fout.write(str)

    # write file
    d.tofile(fout)

# dequantize (if required) and export modules
with torch.no_grad():
    for orig_name, module in model.named_modules():
        if orig_name.endswith("attn.masked_bias") or orig_name.endswith(".attn.bias"):
            print("  Skipping variable: " + orig_name)
            continue

        if isinstance(module, cls):
            name = f'{orig_name}.weight'
            print(f"Dequantizing `{orig_name}`...")

            quant_state = copy.deepcopy(module.weight.quant_state)
            # quant_state.dtype = torch.bfloat16
            weight_deq = F.dequantize_4bit(
                module.weight.data, quant_state=quant_state, quant_type="nf4").to(torch.bfloat16)

            # note: only the weight is exported in this branch; a bias on a
            # Linear4bit module never gets written
            write_data(name, weight_deq)
        elif f'{orig_name}.weight' in list_vars or \
                f'{orig_name}.bias' in list_vars:
            if hasattr(module, 'weight'):
                name = f'{orig_name}.weight'
                data = module.weight.data

                write_data(name, data)

            if hasattr(module, 'bias'):
                name = f'{orig_name}.bias'
                data = module.bias.data

                write_data(name, data)

fout.close()

Something in the embeddings or the tokenization is messed up, and I can't find the reason.
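
One way I plan to narrow it down: dequantize a single layer in-process and diff it against the unquantized fp16 checkpoint, to see whether the export drifts beyond NF4's expected quantization error. This is a rough sketch I haven't fully run yet, assuming there is enough RAM to hold the fp16 model on CPU alongside the quantized one:

import copy
import torch
import bitsandbytes.functional as F
from transformers import AutoModelForCausalLM

# hypothetical check: load an unquantized fp16 reference on CPU
ref_fp16 = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# dequantize the same tensor the ggml loader complained about
q = model.transformer.h[0].attn.k_proj.weight
deq = F.dequantize_4bit(q.data, quant_state=copy.deepcopy(q.quant_state),
                        quant_type="nf4").to(torch.float32).cpu()

ref_w = ref_fp16.transformer.h[0].attn.k_proj.weight.to(torch.float32)
print((ref_w - deq).abs().max())  # NF4 is lossy, but a large gap would point at the export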
