replit-code-v1-3b #131

Closed
vgrichina opened this issue May 4, 2023 · 17 comments

@vgrichina

https://huggingface.co/replit/replit-code-v1-3b gives solid results for both code generation and chat:

[Screenshot, 2023-05-04: example of code generation and chat output]

another example here:
https://twitter.com/vgrichina/status/1653872353825419264

TBH this seems like the best assistant behaviour I've seen from a < 3B model; it would be super cool to get it running on user devices.

@Green-Sky
Contributor

xref ggerganov/llama.cpp#1299

@lukasmoellerch
Contributor

I'd be interested in this as well; I'll start looking into it tomorrow if nobody else is working on it yet...

@lukasmoellerch
Contributor

[Screenshot, 2023-05-07: sample output from the work-in-progress port]

Getting there... Just kinda messed up the alibi attention bias I think.

Also: the tokenizer uses unigrams and is based on SentencePiece. What's the idea here? Integrating an optimal unigram tokenizer (some construction with lattices is required to make it efficient, I think) would increase the complexity a lot.

@Leoputera2407

@lukasmoellerch Could you share your branch? Maybe a second pair of eyes can help.

@lukasmoellerch
Contributor

lukasmoellerch commented May 9, 2023

Sure, the branch is here: https://github.com/lukasmoellerch/ggml/tree/replit - The temp=0 output is completely correct if the alibi attention bias is disabled both in their implementation and in mine, so I am convinced that the problem is somewhere in that part. I suspect that we might need to do something like

double bias = (1 - n_seq + n_past) * m_k;
pdst[0] = bias + src[0];

in ggml_compute_forward_alibi_f32, which gives me the correct output bias for n_seq = 2048, but the output seems to be unaltered.

Also: The input-ids are hardcoded to def fibonacci right now, the tokenizer is not implemented yet.

@Leoputera2407

Leoputera2407 commented May 9, 2023

It seems like the alibi bias in ReplitLM is calculated differently from how ggml calculates it.
ReplitLM applies an exponentially decreasing bias per attention head: it takes max_alibi_bias=8 and gives a different m to each head. This seems to reflect how ALiBi was described in this YouTube video. It isn't how Bloom was implemented in Ofir's repo or in HuggingFace, which follows his repo.

The alibi bias in ggml was copied over from bloomz.cpp and implements HuggingFace's alibi bias for Bloom. The main difference is that there the bias power is scaled dynamically based on num_heads (the "closest power of 2" thing), while in Replit's version the "max" power is set through the hyperparameter "max_alibi_bias". I think we need to implement the bias for replit (and the Mosaic version) specifically. I tried to reconcile the two, but it seems non-trivial.

static void ggml_compute_forward_mosaic_alibi_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(src1->type == GGML_TYPE_I32);
    assert(ggml_nelements(src1) == 2);

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n_head = ((int32_t *) src1->data)[1];

    const int ne0 = src0->ne[0];
    const int ne1 = src0->ne[1];

    const int n  = ggml_nrows(src0);
    const int ne2_ne3 = n/ne1;   // number of heads (times batch)

    const int nb0 = src0->nb[0];
    const int nb1 = src0->nb[1];
    const int nb2 = src0->nb[2];

    assert(nb0 == sizeof(float));

    // Replit/MPT-style slopes: head k gets m_k = 2^(-(k+1) * alibi_bias_max / n_head)
    const float alibi_bias_max = 8.0f;
    float m0 = alibi_bias_max / n_head;

    for (int i = 0; i < ne0; i++) {
        for (int j = 0; j < ne1; j++) {
            for (int k = 0; k < ne2_ne3; k++) {
                float * const src = (float *)((char *) src0->data + i*nb0 + j*nb1 + k*nb2);
                float *      pdst = (float *)((char *)  dst->data + i*nb0 + j*nb1 + k*nb2);

                // ggml.c is C, so use powf from <math.h> instead of std::pow
                float m_k = 1.0f / powf(2.0f, (k + 1) * m0);
                // bias grows linearly with the key position j, scaled by the per-head slope
                pdst[0] = (j + 1) * m_k + src[0];
            }
        }
    }
}

@lukasmoellerch
Contributor

lukasmoellerch commented May 9, 2023

That's what I thought as well, but the "closest power of two" thing doesn't apply here because n_heads is a power of two for the replit model and probably also for the other mpt models. At the same time the m0 term already scales correctly I think:

m0 = powf(2.0f, -8.0f / n_heads_log2_floor);
m_k = powf(m0, k + 1);

means that

m_k = 1 / 2^((8/32) * (k + 1))

which is already equivalent to the replit implementation of

alibi_bias * (1. / (2**torch.arange(1, n_heads + 1).mul(alibi_bias_max / n_heads).view(n_heads, 1, 1)))

Right?

The only difference I see is

alibi_bias = torch.arange(1 - seq_len, 1, dtype=dtype,
                              device=device).view(1, 1, seq_len)

as opposed to j used in the ggml implementation.
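
(A quick numeric check of that slope equivalence, in plain Python and independent of either implementation; it assumes n_heads = 32 as for replit, so that n_heads_log2_floor == n_heads:)

# compare ggml's m_k = m0^(k+1) with m0 = 2^(-8/n_heads)
# against replit/MPT's 1 / 2^((k+1) * alibi_bias_max / n_heads)
n_heads = 32
alibi_bias_max = 8.0

m0 = 2.0 ** (-alibi_bias_max / n_heads)
ggml_slopes = [m0 ** (k + 1) for k in range(n_heads)]
replit_slopes = [1.0 / 2.0 ** ((k + 1) * alibi_bias_max / n_heads) for k in range(n_heads)]

print(max(abs(a - b) for a, b in zip(ggml_slopes, replit_slopes)))  # ~0: the per-head slopes match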

@Leoputera2407

Leoputera2407 commented May 9, 2023

Actually, you’re right. The max_alibi_bias is fortunately hard-coded to 8, the same as replit, and it indeed scales the same way replit does; I tested it out by hand.

I think I see what you mean with bias = (1 - n_seq + n_past): the bias computation in ggml starts from the beginning of the sequence, hence the offset (1 + j), while replit starts from the end of the sequence, so it should be (1 + j - ne1), since ne1 = n_seq - n_past? In other words, the offset is counted from the end instead of the start of the sequence, the same as replit’s torch.arange(1 - seq_len, 1, ...). However, my hypothesis is that even though the positional bias is different, the overall effect should be the same, because the relative differences between positions in the sequence are unchanged, which is why I think the output is the same.

I’m starting to conclude that the alibi bias is not an issue at all and we can keep ggml’s alibi bias as is. What do you think? Was the output of the model problematic when you tested it?
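
(To make the shift-invariance argument concrete: within one row, the two bias schemes differ only by the constant seq_len * m_k, and adding a constant to every logit in a row leaves the softmax unchanged. A quick check in plain Python, not tied to the ggml code:)

import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

seq_len, m_k = 8, 0.25
scores = [0.3 * j for j in range(seq_len)]  # stand-in attention scores for one query row

ggml_bias = [(j + 1) * m_k for j in range(seq_len)]              # offsets 1 .. seq_len
replit_bias = [(j + 1 - seq_len) * m_k for j in range(seq_len)]  # offsets 1 - seq_len .. 0

p_ggml = softmax([s + b for s, b in zip(scores, ggml_bias)])
p_replit = softmax([s + b for s, b in zip(scores, replit_bias)])
print(max(abs(x - y) for x, y in zip(p_ggml, p_replit)))  # ~1e-16: identical after the softmax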

@Leoputera2407

Leoputera2407 commented May 10, 2023

I can see if I can help implement the tokenizer. What’s the idea behind making the pieces into a trie?

I saw that other ggml ports, including bloomz and neox, seem to do the tokenizer hack mentioned in the conversion script without implementing a custom tokenizer.

Also, while not relevant to replit, we need to think about how to handle qkv clipping to help port MPT-7B, which is what I’m currently trying to do.
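
(My understanding of MPT’s clip_qkv, which may be off, is that it just clamps the output of the fused qkv projection to [-clip_qkv, clip_qkv] before splitting into q/k/v. A rough torch-side illustration of that idea, not the actual MPT code; on the ggml side it would presumably need an elementwise clamp after the qkv matmul:)

import torch

def fused_qkv_with_clip(x, Wqkv, clip_qkv=None):
    # x: (seq, d_model); Wqkv: (d_model, 3 * d_model) fused projection weights (illustrative names)
    qkv = x @ Wqkv
    if clip_qkv is not None:
        qkv = qkv.clamp(min=-clip_qkv, max=clip_qkv)  # elementwise clamp
    q, k, v = qkv.chunk(3, dim=-1)
    return q, k, v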

@lukasmoellerch
Contributor

Yes, the output of the model seems to be different from the reference implementation and quickly degrades. The initial logits are correct, but it degrades after a few lines. The outputs match perfectly if the bias is removed, though.

You might be correct in that adding the same bias everywhere probably doesn’t change anything due to the softmax that’s applied right afterwards.

The trie I implemented is completely unnecessary and not relevant for replit. The unigram tokenizer it uses has both “pieces” and their corresponding probabilities, which can’t easily be accessed through the official interface. Instead we’d have to load the protobuf file and encode the logits into the ggml model as well. For actually implementing the tokenizer we can probably get away with splitting on whitespace and then encoding both whitespace and non-whitespace sections using a brute-force approach. Hugging Face has a blog post about this: https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt - this isn’t quite correct I think, but could be a good starting point.
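
(To make that brute-force idea concrete, a rough sketch of what I have in mind, assuming we can pull the pieces and their log-probabilities out of the sentencepiece protobuf; the names here are illustrative, not an actual interface. It just does a best-segmentation DP over each whitespace-split chunk instead of a full lattice:)

def encode_chunk(chunk, pieces, max_piece_len=16):
    # pieces: dict mapping piece string -> log-probability
    n = len(chunk)
    best = [float("-inf")] * (n + 1)  # best[i] = best total log-prob of chunk[:i]
    back = [None] * (n + 1)           # back[i] = start index of the last piece ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            piece = chunk[j:i]
            if piece in pieces and best[j] + pieces[piece] > best[i]:
                best[i] = best[j] + pieces[piece]
                back[i] = j
    if back[n] is None:
        return [chunk]  # no segmentation found; a real version would fall back to byte/unk pieces
    out, i = [], n
    while i > 0:
        j = back[i]
        out.append(chunk[j:i])
        i = j
    return out[::-1]

def encode(text, pieces):
    # sentencepiece marks word boundaries with '▁'; assuming the same convention here
    tokens = []
    for word in text.split(" "):
        tokens.extend(encode_chunk("▁" + word, pieces))
    return tokens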

@lukasmoellerch
Contributor

NouamaneTazi/bloomz.cpp#27 fixes it

@lukasmoellerch
Contributor

@Leoputera2407 did you do any work on the tokenizer? Otherwise I'd give it a shot.

@Leoputera2407

I’m trying a whitespace tokenizer hack I saw in cformers, but don’t let that deter you, haha. I’m still new to the ggml framework and not super proficient in C++.

Btw, could you share the updated replit branch? Mine seems to hang when I try to run inference on the converted model, possibly because I didn’t do the sentence_proto thing correctly.

@lukasmoellerch
Contributor

lukasmoellerch commented May 10, 2023

Yes, all updated now, also with somewhat okay tokenization support

@Leoputera2407

Leoputera2407 commented May 10, 2023

@lukasmoellerch I think I’ve managed to port MPT to ggml. The tokenizer is GPT-NeoX’s, which we can copy-paste, and it seems to work OK without the qkv clamp. I can try to write the quantization script. I can push to your branch, or shall we open a PR for it in ggml?

@lukasmoellerch
Contributor

lukasmoellerch commented May 10, 2023

Oh, I already have mpt on my branch and wrote about that in the mpt issue - sorry for the miscommunication.

I have quantization for replit but not for mpt; the code should be the same though, except for the vocab.

@matthiasgeihs

I think this issue can be closed? @ggerganov

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023