Issue inferencing HuggingFace's GPT-J 4-bit model #539
This is a follow-up of #371 (comment).

After converting a GPT-J 4-bit model into ggml using the convert-h5-to-ggml.py script, inference fails with a tensor size mismatch. Apparently ggml expects double the tensor's size, as I found after adding some verbosity here:
https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L333

The original tensor has 8388608 elements while ggml expects 16777216. I assume this is related to the model being 4-bit, but I'm not yet sure what to change.
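For what it's worth, the 2x gap is consistent with 4-bit packing: NF4 stores two 4-bit values per byte, so the packed storage holds half as many elements as the logical tensor. A quick sanity check, assuming the failing tensor is one of the n_embd x n_embd attention projections (n_embd = 4096 for this model):

```python
# Sanity check (assumption: the failing tensor is an n_embd x n_embd projection)
n_embd = 4096
expected_elements = n_embd * n_embd       # 16777216 -- what ggml expects
packed_elements = expected_elements // 2  # 8388608  -- two NF4 values per byte
print(expected_elements, packed_elements)
```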
I partially solved this, but it's generating a bunch of `A`:

```
❯ ./build/bin/gpt-j -m ./ggml-model-f16.bin -p "A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n" -n 512 --top_p 0.8 --temp 0.2
main: seed = 1696200850
gptj_model_load: loading model from './ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: ftype = 1
gptj_model_load: qntvr = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ............................ done
gptj_model_load: model size = 11540.60 MB / num tensors = 229
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 68
A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n
A
A
A
A
A
A
A
A
A
A
```
My current implementation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load quantized version of https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    low_cpu_mem_usage=True,
    device_map=device_map,  # split between 2 GPUs; device_map is defined elsewhere
    torch_dtype='auto',
    quantization_config=bnb_config,
    use_cache=False,
)
```
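Before exporting, it can help to confirm which modules were actually quantized and how their storage is laid out. A minimal inspection sketch (the q_proj path is my assumption of a representative Linear4bit module in GPT-J):

```python
# Minimal inspection sketch -- assumes `model` from the snippet above.
lin = model.transformer.h[0].attn.q_proj
print(type(lin))                     # expect bitsandbytes.nn.Linear4bit
print(lin.weight.data.shape)         # packed uint8 storage (half the logical elements)
print(lin.weight.quant_state.shape)  # logical shape, e.g. torch.Size([4096, 4096])
```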
Proceed to quantization and export:

```python
import copy
import struct
import sys

import bitsandbytes as bnb
import bitsandbytes.functional as F
import numpy as np

cls = bnb.nn.Linear4bit

# `fout`, `ftype`, and `list_vars` come from the surrounding conversion script
# (adapted from convert-h5-to-ggml.py).
def write_data(name, d):
    orig_data_shape = d.shape
    orig_data_size = sys.getsizeof(d)
    d = d.to(torch.float16).squeeze().to('cpu').numpy()
    n_dims = len(d.shape)
    print("Writing: " + name + " with shape: ", d.shape)
    print('Original shape: ', orig_data_shape)
    print((orig_data_size, sys.getsizeof(d)))
    ftype_cur = 0
    if ftype != 0:
        if name[-7:] == ".weight" and n_dims == 2:
            print("  Converting to float16")
            d = d.astype(np.float16)
            ftype_cur = 1
        else:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    else:
        if d.dtype != np.float32:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    # header
    name_bytes = name.encode('utf-8')
    fout.write(struct.pack("iii", n_dims, len(name_bytes), ftype_cur))
    for i in range(n_dims):
        fout.write(struct.pack("i", d.shape[n_dims - 1 - i]))
    fout.write(name_bytes)
    # write data
    d.tofile(fout)

# dequantize (if required) and export modules
with torch.no_grad():
    for orig_name, module in model.named_modules():
        if orig_name.endswith("attn.masked_bias") or orig_name.endswith(".attn.bias"):
            print("  Skipping variable: " + orig_name)
            continue
        if isinstance(module, cls):
            name = f'{orig_name}.weight'
            print(f"Dequantizing `{orig_name}`...")
            quant_state = copy.deepcopy(module.weight.quant_state)
            # quant_state.dtype = torch.bfloat16
            weight_deq = F.dequantize_4bit(
                module.weight.data, quant_state=quant_state, quant_type="nf4").to(torch.bfloat16)
            write_data(name, weight_deq)
        elif f'{orig_name}.weight' in list_vars or \
                f'{orig_name}.bias' in list_vars:
            if hasattr(module, 'weight'):
                name = f'{orig_name}.weight'
                data = module.weight.data
                write_data(name, data)
            if hasattr(module, 'bias'):
                name = f'{orig_name}.bias'
                data = module.bias.data
                write_data(name, data)

fout.close()
```

Somehow, the embeddings or tokenization is messed up and I can't find the reason.
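To narrow down whether the exported weights themselves are wrong (rather than the tokenizer or embeddings), one option is to compare a dequantized tensor against the unquantized checkpoint. A sketch, assuming enough CPU RAM to hold an fp16 copy of the model:

```python
# Sketch: compare one dequantized 4-bit weight against the fp16 reference.
# Assumes `model` (4-bit) from above; loads a CPU fp16 copy for comparison.
import torch
import bitsandbytes.functional as F
from transformers import AutoModelForCausalLM

ref = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

q = model.transformer.h[0].attn.q_proj.weight
deq = F.dequantize_4bit(q.data, quant_state=q.quant_state, quant_type="nf4")
r = ref.transformer.h[0].attn.q_proj.weight

print(deq.shape, r.shape)                            # shapes should match
print((deq.float().cpu() - r.float()).abs().mean())  # expect a small mean abs error
```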