
ggml : update mul_mat_id to use the same tensor for all the experts #6387

Merged
merged 36 commits from sl/moe-rework-1 into master on Apr 3, 2024

Conversation

slaren
Collaborator

@slaren slaren commented Mar 29, 2024

Changes the storage of experts in memory from a tensor per expert to a single 3D tensor with all the experts. This will allow us to support models with a large number of experts, such as qwen2moe.

Existing MoE model files (i.e. mixtral and grok) with split experts are still usable; however, for the CPU and Metal backends, the data will be copied to a buffer without mmap, which may increase load times slightly. Additionally, imatrices created after this change is merged cannot be used to quantize old models, and imatrices created with previous versions of llama.cpp cannot be used to quantize new models.
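For illustration, here is a minimal stand-alone sketch (plain C++ with made-up sizes, not the actual loader code) of what the merge amounts to: the per-expert 2D matrices are copied into one contiguous 3D tensor, with expert e occupying slice e along the third dimension.

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // Hypothetical sizes, for illustration only.
    const int n_expert  = 8; // e.g. mixtral has 8 experts
    const int n_rows    = 4; // rows per expert matrix (tiny here)
    const int n_per_row = 6; // columns per row

    // Old layout: one 2D tensor per expert.
    std::vector<std::vector<float>> split(n_expert, std::vector<float>(n_rows * n_per_row));
    for (int e = 0; e < n_expert; ++e) {
        for (int i = 0; i < n_rows * n_per_row; ++i) {
            split[e][i] = (float) e; // mark each expert with its index
        }
    }

    // New layout: a single contiguous 3D tensor [n_expert][n_rows][n_per_row].
    std::vector<float> merged((size_t) n_expert * n_rows * n_per_row);
    for (int e = 0; e < n_expert; ++e) {
        memcpy(merged.data() + (size_t) e * n_rows * n_per_row,
               split[e].data(), sizeof(float) * n_rows * n_per_row);
    }

    // Element (expert e, row r, column c) of the merged tensor:
    const int e = 3, r = 2, c = 1;
    printf("merged[%d][%d][%d] = %.1f\n", e, r, c,
           merged[((size_t) e * n_rows + r) * n_per_row + c]); // prints 3.0
    return 0;
}

When the experts are already stored merged in the file, no such copy is needed and the tensor can be used directly from the mmapped data, which is why new files keep the fast mmap path.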

Fixes #6082

  • CPU backend
  • CUDA backend
  • Metal backend
  • Merge experts into a single tensor in convert scripts
    • convert.py
    • convert-hf-to-gguf.py mixtral
    • convert-hf-to-gguf.py grok
  • Clean up loading code
  • imatrix
  • quantize : handle imatrix correctly with 3D tensors

@slaren
Collaborator Author

slaren commented Mar 31, 2024

I tried to update convert-hf-to-gguf.py, but eventually it crashes with SIGKILL. I think it is running out of memory because the merged expert tensors stay in memory. Somebody with more experience with Python will need to figure out how to fix that.

Also for grok, I don't even have enough free disk space to try that one.

@ggerganov
Owner

convert-hf-to-gguf.py finishes successfully and the converted model runs correctly, but it does take a lot of memory to convert. I will take a look as well, but my Python is also not great. Pinging @cebtenzzre

@slaren
Collaborator Author

slaren commented Mar 31, 2024

Ok, that's good to know. I guess the same code should be usable for grok, but the memory usage may be a problem. For mixtral, convert.py also works with HF models anyway, and it doesn't have the memory usage problem.

@foldl
Contributor

foldl commented Apr 1, 2024

How about making a new op and keeping ggml_mul_mat_id for backward compatibility?

@slaren
Collaborator Author

slaren commented Apr 1, 2024

I think it is better to convert the weights to the new format at load time than to maintain multiple versions of the op. I also plan to make further changes to ggml_mul_mat_id in a followup PR to allow all the experts to be computed in a single call, so that would be yet another version of the op to maintain.

@foldl
Contributor

foldl commented Apr 1, 2024

IMHO, developers who use ggml would prefer backward compatibility.

@ggerganov
Owner

We generally try to maintain backward compatibility to a good extent, but sometimes it can be too difficult. MoE functionality is trending towards becoming the standard, so we need good support. Maintaining duplicate kernels would significantly increase the required effort.

@foldl
Contributor

foldl commented Apr 1, 2024

In this case, if the original ggml_mul_mat_id is kept, it could be marked as "deprecated" and no longer maintained.

Pros:

  • Perfect backward compatibility;
  • No need to merge experts into a single tensor at runtime, mmap works as before.

Cons:

  • The name ggml_mul_mat_id is wasted.

Anyway, if a breaking change is unavoidable, then just do it as soon as possible.

@ggerganov ggerganov merged commit 08a0c02 into master Apr 3, 2024
60 of 61 checks passed
@sorasoras

Good to know, maybe it deserves a breaking change label ?

Yes. It is actually a bit worse than that. After this change, an imatrix generated with an old model, cannot be used to quantize that same model. The imatrix will only work with newer models regardless of what model was used to create it.

If my understanding is correct, this only affects MoE models, right?

@ggerganov
Owner

Yes

@JMPSequeira

I'm sorry, but as a heavy user of Mixtral 8x7b Instruct, can someone tell me:
Does this require a reconversion from the HF model to GGUF, or can I use the same fp16 GGUF I already had?
Are there performance benefits from this PR?
Are there any special flags needed during the convert-to-quantize flow?

@slaren slaren deleted the sl/moe-rework-1 branch April 4, 2024 12:07
@slaren
Collaborator Author

slaren commented Apr 4, 2024

If you want to create new quants with an imatrix, you need to convert the model again from an hf or pth model; you cannot use the same fp16 gguf. You don't need to use any special flags.

@maddes8cht
Contributor

Thanks for this work.
Again, also being a heavy user of mixtral 8x7b (instruct):
After doing the reconversion, can I expect some kind of performance benefit from this PR?

@slaren
Collaborator Author

slaren commented Apr 5, 2024

If you are fully offloading the model there is very little advantage to converting the models again. For CPU and Metal, it allows using mmap which can improve load times (especially if you are restarting llama.cpp repeatedly), but in most cases it won't really matter. For performance benefits, see #6505 (but that also works with older models).

@he29-net

the data will be copied to a buffer without mmap, which may increase load times slightly

I would say that is a slight understatement. In my case, for Mixtral 8x7B, the load time increased from a few seconds to 3 minutes or so.

At first I thought my HW was dying, because the loading process was accompanied by a flood of kernel messages (related to Intel IOMMU*) so severe that it made the machine unresponsive and consumed all available space in /var/log within a few minutes.

That turned out to be likely an unrelated issue, perhaps just randomly triggered by the way llama.cpp uses ROCm while loading the model. Ruling out a HW problem, it occurred to me to try an older build. And sure enough, the slow loading went away and git bisect eventually led me to this commit.

It may have been a good idea to print a warning to the log when a model starts being loaded in a "compatibility mode". Would have saved me a lot of frustration and time. :) (But I have only myself to be angry at, because I even remember seeing this PR and the notes about breaking compatibility..)


*) In case anyone else on ROCm and Intel CPU also experiences a flood of messages like:

DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set (to 7bf32003 not 24c563801)

followed by a stack trace and other related info for a bug in drivers/iommu/intel/iommu.c, try adding iommu.passthrough=1 to your kernel boot parameters. This should bypass some DMA mapping stuff done by the OS, avoiding the issue. I don't use virtualization so I can't say if it breaks anything else, but the flood went away and so far I don't see any other side effects.

@slaren
Collaborator Author

slaren commented Apr 12, 2024

It is impossible to account for the unpredictability of AMD drivers; this change doesn't cause any meaningful overhead when offloading an old model. Without offloading, and with Metal, it may cause the model to be evicted from the system cache if you don't have enough memory for two copies. This will increase the load time of a second use, but the time of the first load should be essentially the same.

@he29-net

evicted from the system cache if you don't have enough memory for two copies

Ah, I suppose that would be the main slowdown in my case. I did not realize the conversion is not happening in place. Thanks for the background.

Agreed on the AMD drivers though, they still have plenty of work to do to make the whole stack rock solid and reliable (e.g., the GPU in my new AMD laptop crashed in three different ways in the past two months..)

@slaren
Collaborator Author

slaren commented Apr 12, 2024

If you are not offloading a large portion of the model while using a CUDA or HIP build, one difference is that it will try to allocate a pinned buffer for the CPU portion of the model (the same as what happens when disabling mmap). On some systems this can cause instability if there is too much pinned memory, since this memory is not available to other processes. You can disable this behavior by defining the environment variable GGML_CUDA_NO_PINNED, at the expense of slightly slower prompt processing performance. When using mmap (either with an older version, or with a new model), the CPU buffer is not pinned.
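To illustrate the difference (a sketch only, not ggml's actual allocation code), pinned memory is allocated through the CUDA runtime and its pages stay locked in RAM, so the OS cannot swap them or hand them to other processes, while a regular allocation can be paged out; the environment variable check mirrors how a user would opt out:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: pick pinned (page-locked) or pageable host memory for the CPU
// portion of the model, honoring GGML_CUDA_NO_PINNED as described above.
static void * alloc_host_buffer(size_t size, bool & pinned) {
    pinned = false;
    if (getenv("GGML_CUDA_NO_PINNED") == nullptr) {
        void * ptr = nullptr;
        // Pinned memory: faster host<->device copies (better prompt processing),
        // but unavailable to the rest of the system until freed.
        if (cudaMallocHost(&ptr, size) == cudaSuccess) {
            pinned = true;
            return ptr;
        }
        fprintf(stderr, "pinned allocation failed, falling back to pageable memory\n");
    }
    // Pageable memory: slower transfers, but the OS can swap it as needed.
    return malloc(size);
}

int main() {
    bool pinned = false;
    void * buf = alloc_host_buffer(64 * 1024 * 1024, pinned); // 64 MiB for the demo
    printf("allocated a %s buffer\n", pinned ? "pinned" : "pageable");
    if (pinned) { cudaFreeHost(buf); } else { free(buf); }
    return 0;
}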

@he29-net

he29-net commented Apr 13, 2024

Now I realize I first misunderstood your comment about the model being evicted from the cache; I'm loading from a fast SSD, so the cache alone would not explain such a big slowdown. But what you say about pinned memory probably explains it: I'm offloading about 10 of 33 to the GPU, and the portion left to the CPU takes up almost all the RAM. So if the CPU portion is pinned, there is barely anything left to work with, and the slowdown is probably caused by swapping. Indeed, after setting GGML_CUDA_NO_PINNED, the loading time becomes reasonable even with an old model + new commit.

EDIT: Now reading about memory pinning, and the IOMMU errors may be related after all, since another, less common message present in the flood was "amdgpu: init_user_pages: Failed to get user pages: -1". And get_user_pages() is mentioned right there, in the article about pinning... You learn something every day. :)

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…6387)

* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <[email protected]>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <[email protected]>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@yuhai-china

This is a very bad change since lots of mixtral models need to be created again; please consider backward compatibility, as llama.cpp is so popular.

@jukofyork
Contributor

jukofyork commented May 5, 2024

It's possible I have found a problem with the MoE imatrix calculations as a result of this PR's changes, but I posted the info in #6515 as DBRX was causing me the problems.

I don't really know what the intentions are with the weighting (should less frequently selected experts have lower importance now, etc.), and the actual quant code is very obtuse, but it looks a lot like the experts' MLP weights are all getting downscaled by a factor of n_experts too much, and it's possibly not just a benign change of scale for all weights in the same tensor. From ggml-quants.c:

        const float * xbl = x + QK_K*ibl;
        float sumx2 = 0;
        for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
        float sigma2 = 2*sumx2/QK_K;

        for (int ib = 0; ib < QK_K/32; ++ib) {
            const float * xb = xbl + 32*ib;
            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

@slaren
Collaborator Author

slaren commented May 5, 2024

Hi @jukofyork, thanks for looking into this. It's very possible that this PR introduced some issue in the imatrix generation of MoE models, I tried to maintain the previous behavior, but I don't know how it works and the code is entirely uncommented, so I depend on code review.

@jukofyork
Contributor

jukofyork commented May 5, 2024

Hi @jukofyork, thanks for looking into this. It's very possible that this PR introduced some issue in the imatrix generation of MoE models, I tried to maintain the previous behavior, but I don't know how it works and the code is entirely uncommented, so I depend on code review.

Hi, I just tested this change to quantize and it does look like this helps:

                for (int j = 0; j < (int)src1->ne[0]; ++j) {
                    e.values[e_start + j] += (x[j]*x[j])*static_cast<float>(n_as);
                }

and it definitely isn't just a benign change of scale (the same model acts quite differently). The dbrx model is so flaky, though, that it's hard to tell, so I'm going to give mixtral-8x22b-instruct another try (that was completely broken for me before) and see what happens.

Is the intention that less frequently selected experts' MLPs have their weights downscaled proportionally? I assume that under the old scheme the allocation of bits was done on a per-tensor basis, so this is a pretty big change: even with the fix above, bits are now allocated based on the proportion of times the top-k gating network selects an expert's MLP. I can see arguments both for and against this though, so it's not 100% clear which would be best...

If the old behaviour is to be returned then it's not going to be easy to pass a vector of ncall without a breaking change of the imatrix file format, but I think this:

struct Stats {
    std::vector<float> values;
    int ncall = 0;
};

Could be adapted inside of imatrix.cpp to keep track and then each time save_imatrix() is called, apply a correction factor so that the division by ncall down in quantize generates the correct result.

There is probably a far better way to do both these fixes by altering collect_imatrix() though. With:

        ++e.ncall;
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        //if (idx == t->src[0]->ne[0] - 1) ++e.ncall;

also needing a look. This actually makes the printout confusing too:

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 248.05 ms
compute_imatrix: computing over 91 chunks with batch_size 512
compute_imatrix: 36.59 seconds per pass - ETA 55.48 minutes
[1]6.8864,[2]5.5590,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix
[3]4.6385,[4]5.2093,
save_imatrix: stored collected data after 20 chunks in dbrx:16x12b-instruct-f16.imatrix
[5]5.6050,[6]4.6732,[7]4.7876,
save_imatrix: stored collected data after 30 chunks in dbrx:16x12b-instruct-f16.imatrix
[8]5.3775,[9]5.6677,

@slaren
Collaborator Author

slaren commented May 5, 2024

The intention was that during quantization, each expert is quantized separately using the fragment of the imatrix that corresponds to that expert:

llama.cpp/llama.cpp

Lines 14894 to 14900 in 628b299

for (int64_t i03 = 0; i03 < tensor->ne[2]; ++i03) {
    const float * f32_data_03 = f32_data + i03 * nelements_matrix;
    void * new_data_03 = (char *)new_data + ggml_row_size(new_type, n_per_row) * i03 * nrows;
    const float * imatrix_03 = imatrix ? imatrix + i03 * n_per_row : nullptr;
    new_size += llama_tensor_quantize_internal(new_type, f32_data_03, new_data_03, chunk_size, nrows, n_per_row, imatrix_03, workers, nthread_use);
}

But I may have gotten that wrong. Ultimately the goal was to preserve the same behavior.

@jukofyork
Contributor

jukofyork commented May 5, 2024

I think this should maintain the old "per-expert" scaling behaviour:

struct Stats {
    std::vector<float> values;
    std::vector<int> counts; // +++
    int ncall = 0;
};
        //++e.ncall; // ---
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        if (idx == t->src[0]->ne[0] - 1) ++e.ncall; // +++
            if (e.values.empty()) {
                e.values.resize(src1->ne[0]*n_as, 0);
                e.counts.resize(src1->ne[0]*n_as, 0); // +++
            }
.
.
.
                for (int j = 0; j < (int)src1->ne[0]; ++j) {
                    e.values[e_start + j] += x[j]*x[j];
                    e.counts[e_start + j]++; // +++
                }
        if (e.values.empty()) {
            e.values.resize(src1->ne[0], 0);
            e.counts.resize(src1->ne[0], 0); // +++
        }
.
.
.
            for (int j = 0; j < (int)src1->ne[0]; ++j) {
                e.values[j] += x[j]*x[j];
                e.counts[j]++; // +++
            }
void IMatrixCollector::save_imatrix(const char * fname) const {
    std::ofstream out(fname, std::ios::binary);
    int n_entries = m_stats.size();
    out.write((const char*)&n_entries, sizeof(n_entries));
    for (auto& p : m_stats) {
        int len = p.first.size();
        out.write((const char*)&len, sizeof(len));
        out.write(p.first.c_str(), len);
        out.write((const char*)&p.second.ncall, sizeof(p.second.ncall));
        int nval = p.second.values.size();
        // +++
        std::vector<float> tmp(nval);
        for (int i = 0; i < nval; i++) {
            tmp[i] = (p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall);
        }
        out.write((const char*)&nval, sizeof(nval));
        if (nval > 0) out.write((const char*)tmp.data(), nval*sizeof(float));
        // +++
    }
    if (m_params.verbosity > 0) {
        fprintf(stderr, "\n%s: stored collected data after %d chunks in %s\n",__func__,m_last_call,fname);
    }
}

and shouldn't require any breaking changes to the imatrix file format, should work correctly when joining imatrix files of differing sample sizes, and the for (auto& v : e) v /= ncall should work as expected in quantize without any changes needed.

It's O(d) currently (for simplicity) but could easily be reduced to O(n_experts) later as all the counts will be the same for each expert.
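For reference, a rough stand-alone sketch of that O(n_experts) variant (illustrative only, with made-up names, not the code from this PR or #7099): one counter per expert instead of one per value, since all values belonging to an expert are accumulated together and therefore share the same count.

#include <cstdio>
#include <vector>

// Sketch: per-expert activation statistics with one counter per expert
// rather than one per value.
struct ExpertStats {
    int n_as      = 0;          // number of experts
    int n_per_row = 0;          // values per expert
    std::vector<float> values;  // n_as * n_per_row accumulated x[j]*x[j]
    std::vector<int>   counts;  // n_as counters, one per expert
};

// Accumulate one activation row x (length n_per_row) routed to expert ex.
static void accumulate(ExpertStats & s, int ex, const float * x) {
    if (s.values.empty()) {
        s.values.assign((size_t) s.n_as * s.n_per_row, 0.0f);
        s.counts.assign(s.n_as, 0);
    }
    float * dst = s.values.data() + (size_t) ex * s.n_per_row;
    for (int j = 0; j < s.n_per_row; ++j) {
        dst[j] += x[j] * x[j];
    }
    s.counts[ex]++; // every value of this expert was updated exactly once
}

// Per-call average for expert ex, element j (what quantize ultimately needs).
static float mean_value(const ExpertStats & s, int ex, int j) {
    const int c = s.counts[ex];
    return c > 0 ? s.values[(size_t) ex * s.n_per_row + j] / (float) c : 0.0f;
}

int main() {
    ExpertStats s;
    s.n_as = 2; s.n_per_row = 3;
    const float x[3] = {1.0f, 2.0f, 3.0f};
    accumulate(s, 1, x);
    printf("%.1f\n", mean_value(s, 1, 2)); // 3*3 accumulated over 1 call -> 9.0
    return 0;
}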

I'll re-quant dbrx and mixtral-8x22b and report back (possibly tomorrow now).


It also fixes the weird block saving behaviour for MoEs:

[1]6.8864,[2]5.5590,[3]4.6385,[4]5.2093,[5]5.6050,[6]4.6732,[7]4.7876,[8]5.3775,[9]5.6677,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix

@jukofyork
Contributor

jukofyork commented May 5, 2024

The intention was that during quantization, each expert is quantized separately using the fragment of the imatrix that corresponds to that expert:

llama.cpp/llama.cpp

Lines 14894 to 14900 in 628b299

for (int64_t i03 = 0; i03 < tensor->ne[2]; ++i03) {
    const float * f32_data_03 = f32_data + i03 * nelements_matrix;
    void * new_data_03 = (char *)new_data + ggml_row_size(new_type, n_per_row) * i03 * nrows;
    const float * imatrix_03 = imatrix ? imatrix + i03 * n_per_row : nullptr;
    new_size += llama_tensor_quantize_internal(new_type, f32_data_03, new_data_03, chunk_size, nrows, n_per_row, imatrix_03, workers, nthread_use);
}

But I may have gotten that wrong. Ultimately the goal was to preserve the same behavior.

Sorry, missed your post. Looking some more here:

        const float * xbl = x + QK_K*ibl;
        float sumx2 = 0;
        for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
        float sigma2 = 2*sumx2/QK_K;

        for (int ib = 0; ib < QK_K/32; ++ib) {
            const float * xb = xbl + 32*ib;
            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

then a change of scale shouldn't affect weight[i] so long as it is normalised later on, as it will just scale all the values in the 32-element block by the same constant.

BUT: I did get a different result earlier by multiplying all e.values[e_start + j] += x[j]*x[j] by 16, which makes me think that somewhere it is important to maintain the absolute scales...

Perhaps as a test it might be worth rounding each of the 16 experts differently and then checking all the complicated slicing loops are really dealing with the experts as expected using asserts up until they get quantized?

@jukofyork
Contributor

jukofyork commented May 5, 2024

This commented out bit of code didn't work:

        //++e.ncall;
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        if (idx == t->src[0]->ne[0] - 1) ++e.ncall;

Just got ncall = 0 saved.


I'm still no wiser what t->src[0]->ne[0] is holding and have been staring at this for 30 minutes 🤣

Luckily, by printing out some debugging info, I found this also works:

if (idx == 0) ++e.ncall;

(but probably triggers the if (e.ncall > m_last_call) ... save_imatrix() condition earlier than it should)

I guess t->src[0]->ne[0] should have held the number of top-k experts, but it's not quite this?

@jukofyork
Contributor

It took a lot of digging, but I think I can see where absolute scales matter:

static void quantize_row_q4_0_impl(const float * restrict x, block_q4_0 * restrict y, int64_t n_per_row, const float * quant_weights) {
    static_assert(QK4_0 == 32, "QK4_0 must be 32");
    
    if (!quant_weights) {
        quantize_row_q4_0_reference(x, y, n_per_row);
        return;
    }   
        
    float weight[QK4_0]; 
    int8_t L[QK4_0];
        
    float sum_x2 = 0;
    for (int j = 0; j < n_per_row; ++j) sum_x2 += x[j]*x[j];
    float sigma2 = sum_x2/n_per_row;
        
    const int64_t nb = n_per_row/QK4_0;
    for (int ib = 0; ib < nb; ++ib) {
        const float * xb = x + QK4_0 * ib;
        const float * qw = quant_weights + QK4_0 * ib;
        for (int j = 0; j < QK4_0; ++j) weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
        float d = make_qx_quants(QK4_0, 8, xb, L, 1, weight); 
        y[ib].d = GGML_FP32_TO_FP16(d);
        for (int j = 0; j < 16; ++j) {
            y[ib].qs[j] = L[j] | (L[j+16] << 4);
        }
    }    
}       

The important bit: weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]).

static float make_qx_quants(int n, int nmax, const float * restrict x, int8_t * restrict L, int rmse_type,
        const float * restrict qw) {
    float max = 0;
    float amax = 0;
    for (int i = 0; i < n; ++i) {
        float ax = fabsf(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }   
    if (amax < 1e-30f) { // all zero
        for (int i = 0; i < n; ++i) {
            L[i] = 0;
        }   
        return 0.f;
    }   
    float iscale = -nmax / max;
    if (rmse_type == 0) {
        for (int i = 0; i < n; ++i) {
            int l = nearest_int(iscale * x[i]);
            L[i] = nmax + MAX(-nmax, MIN(nmax-1, l));
        }
        return 1/iscale;
    }
    bool return_early = false;
    if (rmse_type < 0) {
        rmse_type = -rmse_type;
        return_early = true;
    }
    float sumlx = 0; 
    float suml2 = 0;
#ifdef HAVE_BUGGY_APPLE_LINKER
    // use 'volatile' to prevent unroll and work around a bug in Apple ld64 1015.7
    for (volatile int i = 0; i < n; ++i) {
#else
    for (int i = 0; i < n; ++i) {
#endif          
        int l = nearest_int(iscale * x[i]);
        l = MAX(-nmax, MIN(nmax-1, l));
        L[i] = l + nmax;
        float w = qw ? qw[i] : rmse_type == 1 ? x[i] * x[i] : rmse_type == 2 ? 1 : rmse_type == 3 ? fabsf(x[i]) : sqrtf(fabsf(x[i]));
        sumlx += w*x[i]*l;
        suml2 += w*l*l;
    }   
    float scale = sumlx/suml2;
    if (return_early) return suml2 > 0 ? 0.5f*(scale + 1/iscale) : 1/iscale;
    float best = scale * sumlx;
    for (int is = -9; is <= 9; ++is) {
        if (is == 0) {
            continue;
        }
        iscale = -(nmax + 0.1f*is) / max;
        sumlx = suml2 = 0;
        for (int i = 0; i < n; ++i) {
            int l = nearest_int(iscale * x[i]);
            l = MAX(-nmax, MIN(nmax-1, l));
            float w = qw ? qw[i] : rmse_type == 1 ? x[i] * x[i] : rmse_type == 2 ? 1 : rmse_type == 3 ? fabsf(x[i]) : sqrtf(fabsf(x[i]));
            sumlx += w*x[i]*l;
            suml2 += w*l*l;
        }
        if (suml2 > 0 && sumlx*sumlx > best*suml2) {
            for (int i = 0; i < n; ++i) {
                int l = nearest_int(iscale * x[i]);
                L[i] = nmax + MAX(-nmax, MIN(nmax-1, l));
            }
            scale = sumlx/suml2; best = scale*sumlx;
        }
    }
    return scale;
}

So scale = sumlx/suml2 does indeed cancel out an arbitrary scale factor applied to all of qw, but:

best = scale*sumlx = (sumlx/suml2)*sumlx = sumlx^2 / suml2.

and the non-linearity of sumlx^2 means this will affect this loop (I think - this code is hard to follow!).
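As a quick way to check this, a throwaway harness along these lines could be used (not llama.cpp code; it assumes the make_qx_quants snippet quoted above is pasted in where indicated, uses a simplified stand-in for nearest_int since the real one lives in ggml-quants.c, and neutralises the C-only restrict keyword). It compares the scale and quantized codes produced by qw against 16*qw:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define restrict  // neutralise the C99 keyword so the C snippet compiles as C++
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Simplified stand-in; the real nearest_int is in ggml-quants.c.
static inline int nearest_int(float f) { return (int) roundf(f); }

// >>> paste the make_qx_quants definition quoted above here <<<

int main() {
    const int n = 32, nmax = 8;
    float x[32], qw[32], qw16[32];
    int8_t L1[32], L2[32];
    srand(42);
    for (int i = 0; i < n; ++i) {
        x[i]    = (float) rand() / RAND_MAX - 0.5f;
        qw[i]   = (float) rand() / RAND_MAX + 0.01f;
        qw16[i] = 16.0f * qw[i]; // uniform rescaling, as in the experiment above
    }
    const float s1 = make_qx_quants(n, nmax, x, L1, 1, qw);
    const float s2 = make_qx_quants(n, nmax, x, L2, 1, qw16);
    printf("scale %g vs %g, codes %s\n", s1, s2,
           memcmp(L1, L2, n) == 0 ? "identical" : "differ");
    return 0;
}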

@jukofyork
Contributor

jukofyork commented May 6, 2024

I added pull request #7099, but this shouldn't just be pushed as is: it unnecessarily doubles the memory overhead, but hopefully it can be used as a test to refactor, etc.

@jukofyork
Contributor

I can confirm this is actually doing something useful: before, Mixtral-8x22b-Instruct was a broken mess that kept stopping mid-sentence all the time, and now it appears to work normally.

Labels
breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

llama : combine expert tensors into a single tensor