replit-code-v1-3b #131

Closed
vgrichina opened this issue May 4, 2023 · 17 comments

@vgrichina

https://huggingface.co/replit/replit-code-v1-3b gives solid results for both code generation and chat:

[Screenshot, 2023-05-04: example of code generation and chat output]

another example here:
https://twitter.com/vgrichina/status/1653872353825419264

TBH this seems like the best assistant behaviour I've seen from a < 3B model; it would be super cool to get it running on user devices.

@Green-Sky
Contributor

xref ggerganov/llama.cpp#1299

@lukasmoellerch
Contributor

I'd be interested in this as well; I'll start looking into it tomorrow if nobody else is working on it yet...

@lukasmoellerch
Contributor

[Screenshot, 2023-05-07: sample output from the work-in-progress port]

Getting there... Just kinda messed up the alibi attention bias I think.

Also: the tokenizer uses unigrams and is based on SentencePiece. What's the idea here? Integrating an optimal unigram tokenizer (some construction with lattices is required to make it efficient, I think) would increase the complexity a lot.

@Leoputera2407

@lukasmoellerch Could you share your branch? Maybe a second pair of eyes can help.

@lukasmoellerch
Contributor

lukasmoellerch commented May 9, 2023

Sure, the branch is here: https://github.com/lukasmoellerch/ggml/tree/replit - The temp=0 output is completely correct if the alibi attention bias is disabled both in their implementation and in mine, so I am convinced that the problem is somewhere in that part. I suspect that we might need to do something like

double bias = (1 - n_seq + n_past) * m_k;
pdst[0] = bias + src[0];

in ggml_compute_forward_alibi_f32, which gives me the correct output bias for n_seq = 2048, but the output seems to be unaltered.

Also: The input-ids are hardcoded to def fibonacci right now, the tokenizer is not implemented yet.

@Leoputera2407

Leoputera2407 commented May 9, 2023

It seems like the alibi bias in ReplitLM is calculated differently from how ggml calculates it.
ReplitLM applies an exponentially decreasing bias per attention head: it takes max_alibi_bias=8 and gives a different m to each head. This seems to reflect how ALiBi was described in this YouTube video. It isn't how Bloom was implemented in Ofir's repo or in HuggingFace, which follows his repo.

The alibi bias in ggml was copied over from bloomz.cpp and implements HuggingFace's alibi bias for Bloom. The main difference is that there the bias power is scaled dynamically based on num_heads (the "closest power of 2" thing), while in Replit's version the "max" power is set through the hyperparameter "max_alibi_bias". I think we need to implement the bias for replit (and the Mosaic version) specifically. I tried to reconcile the two, but it seems non-trivial.

static void ggml_compute_forward_mosaic_alibi_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(src1->type == GGML_TYPE_I32);
    assert(ggml_nelements(src1) == 2);

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n_head = ((int32_t *) src1->data)[1];

    const int ne0 = src0->ne[0];
    const int ne1 = src0->ne[1];

    const int n  = ggml_nrows(src0);
    const int ne2_ne3 = n/ne1;   // number of heads (times batch)

    const int nb0 = src0->nb[0];
    const int nb1 = src0->nb[1];
    const int nb2 = src0->nb[2];

    assert(nb0 == sizeof(float));

    // Replit/MPT-style slopes: head k gets m_k = 2^(-(k+1) * alibi_bias_max / n_head)
    const float alibi_bias_max = 8.0f;
    float m0 = alibi_bias_max / n_head;

    for (int i = 0; i < ne0; i++) {
        for (int j = 0; j < ne1; j++) {
            for (int k = 0; k < ne2_ne3; k++) {
                float * const src = (float *)((char *) src0->data + i*nb0 + j*nb1 + k*nb2);
                float *      pdst = (float *)((char *)  dst->data + i*nb0 + j*nb1 + k*nb2);

                // ggml.c is C, so use powf from <math.h> instead of std::pow
                float m_k = 1.0f / powf(2.0f, (k + 1) * m0);
                // bias grows linearly with the key position j, scaled by the per-head slope
                pdst[0] = (j + 1) * m_k + src[0];
            }
        }
    }
}

@lukasmoellerch
Contributor

lukasmoellerch commented May 9, 2023

That's what I thought as well, but the "closest power of two" thing doesn't apply here because n_heads is a power of two for the replit model and probably also for the other mpt models. At the same time the m0 term already scales correctly I think:

m0 = powf(2.0f, -8.0f / n_heads_log2_floor);
m_k = powf(m0, k + 1);

means that

m_k = 1 / 2^((8/32) * (k + 1))

which is already equivalent to the replit implementation of

alibi_bias * (1. / (2**torch.arange(1, n_heads + 1).mul(alibi_bias_max / n_heads).view(n_heads, 1, 1)))

Right?

The only difference I see is

alibi_bias = torch.arange(1 - seq_len, 1, dtype=dtype,
                              device=device).view(1, 1, seq_len)

as opposed to j used in the ggml implementation.
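
(A quick numeric check of that slope equivalence, in plain Python and independent of either implementation; it assumes n_heads = 32 as for replit, so that n_heads_log2_floor == n_heads:)

# compare ggml's m_k = m0^(k+1) with m0 = 2^(-8/n_heads)
# against replit/MPT's 1 / 2^((k+1) * alibi_bias_max / n_heads)
n_heads = 32
alibi_bias_max = 8.0

m0 = 2.0 ** (-alibi_bias_max / n_heads)
ggml_slopes = [m0 ** (k + 1) for k in range(n_heads)]
replit_slopes = [1.0 / 2.0 ** ((k + 1) * alibi_bias_max / n_heads) for k in range(n_heads)]

print(max(abs(a - b) for a, b in zip(ggml_slopes, replit_slopes)))  # ~0: the per-head slopes match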

@Leoputera2407

Leoputera2407 commented May 9, 2023

Actually, you’re right. The max_alibi_bias is fortunately hard-coded to 8, the same as replit, and it indeed scales the same way replit does; I tested it out by hand.

I think I see what you mean with bias = (1 - n_seq + n_past): the bias computation in ggml starts from the beginning of the sequence, hence the offset (1 + j), while replit starts from the end of the sequence, so it should be (1 + j - ne1), since ne1 = n_seq - n_past? In other words, the offset is counted from the end instead of the start of the sequence, the same as replit’s torch.arange(1 - seq_len, 1, ...). However, my hypothesis is that even though the positional bias is different, the overall effect should be the same, because the relative differences between positions in the sequence are unchanged, which is why I think the output is the same.

I’m starting to conclude that the alibi bias is not an issue at all and we can keep ggml’s alibi bias as is. What do you think? Was the output of the model problematic when you tested it?
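
(To make the shift-invariance argument concrete: within one row, the two bias schemes differ only by the constant seq_len * m_k, and adding a constant to every logit in a row leaves the softmax unchanged. A quick check in plain Python, not tied to the ggml code:)

import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

seq_len, m_k = 8, 0.25
scores = [0.3 * j for j in range(seq_len)]  # stand-in attention scores for one query row

ggml_bias = [(j + 1) * m_k for j in range(seq_len)]              # offsets 1 .. seq_len
replit_bias = [(j + 1 - seq_len) * m_k for j in range(seq_len)]  # offsets 1 - seq_len .. 0

p_ggml = softmax([s + b for s, b in zip(scores, ggml_bias)])
p_replit = softmax([s + b for s, b in zip(scores, replit_bias)])
print(max(abs(x - y) for x, y in zip(p_ggml, p_replit)))  # ~1e-16: identical after the softmax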

@Leoputera2407

Leoputera2407 commented May 10, 2023

I can see if I can help implement the tokenizer. What’s the idea behind making the pieces into a trie?

I saw that other ggml ports, including bloomz and neox, seem to do the tokenizer hack mentioned in the conversion script without implementing a custom tokenizer.

Also, while not relevant to replit, we need to think about how to handle qkv clipping to help port MPT-7B, which is what I’m currently trying to do.
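
(My understanding of MPT’s clip_qkv, which may be off, is that it just clamps the output of the fused qkv projection to [-clip_qkv, clip_qkv] before splitting into q/k/v. A rough torch-side illustration of that idea, not the actual MPT code; on the ggml side it would presumably need an elementwise clamp after the qkv matmul:)

import torch

def fused_qkv_with_clip(x, Wqkv, clip_qkv=None):
    # x: (seq, d_model); Wqkv: (d_model, 3 * d_model) fused projection weights (illustrative names)
    qkv = x @ Wqkv
    if clip_qkv is not None:
        qkv = qkv.clamp(min=-clip_qkv, max=clip_qkv)  # elementwise clamp
    q, k, v = qkv.chunk(3, dim=-1)
    return q, k, v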

@lukasmoellerch
Contributor

Yes, the output of the model seems to be different from the reference implementation and quickly degrades. The initial logits are correct, but it degrades after a few lines. The outputs match perfectly if the bias is removed, though.

You might be correct in that adding the same bias everywhere probably doesn’t change anything due to the softmax that’s applied right afterwards.

The trie I implemented is completely unnecessary and not relevant for replit. The unigram tokenizer it uses has both “pieces” and their corresponding probabilities, which can’t easily be accessed through the official interface. Instead we’d have to load the protobuf file and encode the logits into the ggml model as well. For actually implementing the tokenizer we can probably get away with splitting on whitespace and then encoding both whitespace and non-whitespace sections using a brute-force approach. Hugging Face has a blog post about this: https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt - this isn’t quite correct I think, but could be a good starting point.
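
(To make that brute-force idea concrete, a rough sketch of what I have in mind, assuming we can pull the pieces and their log-probabilities out of the sentencepiece protobuf; the names here are illustrative, not an actual interface. It just does a best-segmentation DP over each whitespace-split chunk instead of a full lattice:)

def encode_chunk(chunk, pieces, max_piece_len=16):
    # pieces: dict mapping piece string -> log-probability
    n = len(chunk)
    best = [float("-inf")] * (n + 1)  # best[i] = best total log-prob of chunk[:i]
    back = [None] * (n + 1)           # back[i] = start index of the last piece ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_piece_len), i):
            piece = chunk[j:i]
            if piece in pieces and best[j] + pieces[piece] > best[i]:
                best[i] = best[j] + pieces[piece]
                back[i] = j
    if back[n] is None:
        return [chunk]  # no segmentation found; a real version would fall back to byte/unk pieces
    out, i = [], n
    while i > 0:
        j = back[i]
        out.append(chunk[j:i])
        i = j
    return out[::-1]

def encode(text, pieces):
    # sentencepiece marks word boundaries with '▁'; assuming the same convention here
    tokens = []
    for word in text.split(" "):
        tokens.extend(encode_chunk("▁" + word, pieces))
    return tokens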

@lukasmoellerch
Contributor

NouamaneTazi/bloomz.cpp#27 fixes it

@lukasmoellerch
Contributor

@Leoputera2407 did you do any work on the tokenizer? Otherwise I'd give it a shot.

@Leoputera2407

I’m trying a whitespace tokenizer hack I saw in cformers, but don’t let that deter you, haha. I’m still new to the ggml framework and not super proficient in C++.

Btw, could you share the updated replit branch? Mine seems to hang when I try to run inference on the converted model, possibly because I didn’t do the sentence_proto thing correctly.

@lukasmoellerch
Contributor

lukasmoellerch commented May 10, 2023

Yes, all updated now, also with somewhat okay tokenization support

@Leoputera2407

Leoputera2407 commented May 10, 2023

@lukasmoellerch I think I’ve managed to port MPT to ggml. The tokenizer is GPT-NeoX’s, which we can copy-paste, and it seems to work OK without the qkv clamp. I can try to write the quantization script. I can push to your branch, or shall we open a PR for it in ggml?

@lukasmoellerch
Contributor

lukasmoellerch commented May 10, 2023

Oh, I already have mpt on my branch and wrote about that in the mpt issue - sorry for the miscommunication.

I have quantization for replit but not for mpt; the code should be the same though, except for the vocab.

@matthiasgeihs

I think this issue can be closed? @ggerganov

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023