Quantizing GPT-J Produces Nonsense #71

Closed
zanussbaum opened this issue Apr 6, 2023 · 4 comments

@zanussbaum commented Apr 6, 2023

Hey, thanks for the great package! When I try to quantize an fp16 ggml file of GPT-J, the outputs from chat are nonsense. The output of the gpt-j-quantize binary also looks off: I'd expect the histogram to have several non-zero values, as in other examples like llama.cpp's quantize.

 0.000 0.000 
                     transformer.h.0.ln_1.weight - [ 4096,     1], type =    f32 size =    0.016 MB
                       transformer.h.0.ln_1.bias - [ 4096,     1], type =    f32 size =    0.016 MB
              transformer.h.0.attn.k_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
              transformer.h.0.attn.v_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
              transformer.h.0.attn.q_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
            transformer.h.0.attn.out_proj.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                transformer.h.0.mlp.fc_in.weight - [ 4096, 16384], type =    f16 quantizing .. size =   256.00 MB ->    40.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                  transformer.h.0.mlp.fc_in.bias - [16384,     1], type =    f32 size =    0.062 MB
               transformer.h.0.mlp.fc_out.weight - [16384,  4096], type =    f16 quantizing .. size =   256.00 MB ->    40.00 MB | hist: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 
                 transformer.h.0.mlp.fc_out.bias - [ 4096,     1], type =    f32 size =    0.016 MB
                     transformer.h.1.ln_1.weight - [ 4096,     1], type =    f32 size =    0.016 

Is quantizing from fp16 not possible for GPT-J?
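
For reference, the hist: columns in the log appear to be normalized counts of the quantized 4-bit values across 16 buckets, so a healthy tensor should spread mass over several buckets. Below is a minimal sketch of how such a histogram can be tallied; the two-nibbles-per-byte packing is an assumption for illustration, not a claim about the exact Q4_0 block layout.

    // Illustrative sketch only: tally 4-bit quantized values into 16 buckets and
    // print normalized fractions, mirroring the "hist:" line in the quantize log.
    #include <stdint.h>
    #include <stdio.h>

    static void print_hist(const uint8_t * qs, size_t n_bytes) {
        int64_t hist[16] = {0};
        const int64_t total = 2 * (int64_t) n_bytes;   // two 4-bit values per byte (assumed packing)

        for (size_t i = 0; i < n_bytes; ++i) {
            hist[qs[i] & 0x0F]++;          // low nibble
            hist[(qs[i] >> 4) & 0x0F]++;   // high nibble
        }

        printf("hist:");
        for (int i = 0; i < 16; ++i) {
            // A healthy tensor spreads mass across buckets; all of it landing in a
            // single bucket (as in the log above) means every weight quantized to
            // the same value.
            printf(" %.3f", total > 0 ? (double) hist[i] / (double) total : 0.0);
        }
        printf("\n");
    }

    int main(void) {
        const uint8_t collapsed[] = { 0x88, 0x88, 0x88, 0x88 };   // every 4-bit value == 8
        print_hist(collapsed, sizeof(collapsed));                  // -> 1.000 in bucket 8 only
        return 0;
    }

Read that way, the all-in-one-bucket histograms above suggest every f16 weight is being converted to the same value before quantization, which points at the f16 conversion rather than the quantizer itself.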

@RaymondCrandall commented Apr 6, 2023

I might not be explaining this well, but it looks like the shapes printed for

transformer.h.0.mlp.fc_in.bias
transformer.h.0.mlp.fc_out.weight

might need to be in the opposite order.

The comment

// The multi-dimensional tensors are stored in row-major order. The ggml_tensor struct contains fields for the

and the rest of the model structure suggest the shapes would plausibly be described as [rows, columns], but maybe I'm wrong or confused.

EDIT: definitely wrong and confused.

@LostRuins (Contributor)

Hi @RaymondCrandall @zanussbaum @ggerganov, I think I have figured out this issue: the f16-to-f32 tables were not properly initialized in the quantize examples.

This can be fixed by adding the following code to main() in quantize.cpp:

    {
        // Creating and immediately freeing a throwaway context forces ggml_init()
        // to populate its internal f16 <-> f32 conversion tables before any
        // weights are quantized.
        struct ggml_init_params params = { 0, NULL };
        struct ggml_context * ctx = ggml_init(params);
        ggml_free(ctx);
    }

Please refer to my PR #77
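
As a toy sketch of the failure mode described above (not the actual ggml code): the f16 weights are converted through precomputed lookup tables that ggml_init() fills, and if such a table is still all zeros, every weight converts to 0.0f, so every value quantizes identically, matching the single-bucket histograms in the original report. The table name and layout below are assumptions for illustration only.

    #include <stdint.h>
    #include <stdio.h>

    // Hypothetical stand-in for an internal conversion table: one float per
    // possible 16-bit half-precision pattern, meant to be filled during init.
    static float f16_to_f32_table[1 << 16];

    // Convert by table lookup. If nothing has filled the table yet, every
    // input maps to 0.0f.
    static float f16_to_f32(uint16_t bits) {
        return f16_to_f32_table[bits];
    }

    int main(void) {
        const uint16_t half_one = 0x3C00;   // 1.0 in IEEE 754 half precision
        // Prints 0.000000: with an unfilled table, all f16 weights collapse to
        // the same value, and quantization then puts everything in one bucket.
        printf("uninitialized lookup: %f\n", f16_to_f32(half_one));
        return 0;
    }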

@manyoso (Contributor) commented Apr 14, 2023

This was integrated and can be closed, yes?

@LostRuins (Contributor)

It should be. I am already using it in my fork with correct results; the quantization works. You can tell from the histogram outputs during quantization: if they show a spread of different values, the quantization is correct.

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023