llama : add DeepSeek-v2-Chat support #7118
That would be awesome. |
Impressive model, and potentially a CPU-friendly one (if you have >96 GB of memory) |
@ggerganov I'd be very interested in helping; I want to get into porting models to inference engines. Would you be so kind as to provide a rough outline of what needs to be done here? I'd then submit a draft PR and ask about the little details that don't work |
Interesting - can we get a rundown of the multi-head latent KV cache technique? @SinanAkkoyun Look at PRs that have already been merged and add support for new model arches |
Sure thing. Here's their tech report: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf |
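In short, per the report: instead of caching full per-head K/V, MLA caches one low-rank latent vector per token and up-projects it into keys and values at attention time. A minimal sketch (dimensions and weight names are illustrative, and the decoupled-RoPE part of the real design is omitted):

```python
import torch

d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512  # d_latent << n_heads * d_head

W_dkv = torch.randn(d_latent, d_model) * 0.02            # shared down-projection
W_uk  = torch.randn(n_heads * d_head, d_latent) * 0.02   # up-projection to per-head K
W_uv  = torch.randn(n_heads * d_head, d_latent) * 0.02   # up-projection to per-head V

def compress(h: torch.Tensor) -> torch.Tensor:
    # h: (seq, d_model) -> (seq, d_latent); this latent is all the KV cache stores
    return h @ W_dkv.T

def expand(c_kv: torch.Tensor):
    # reconstruct per-head K and V from the cached latent at attention time
    k = (c_kv @ W_uk.T).view(-1, n_heads, d_head)
    v = (c_kv @ W_uv.T).view(-1, n_heads, d_head)
    return k, v

# cache cost per token: d_latent floats instead of 2 * n_heads * d_head for standard MHA
```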
Thanks, very cool work! Adding this to the roadmap to give it more visibility |
+1 |
I'm working on it right now: https://youtu.be/1AG-GUtDvaw |
@fairydreaming Oh wow how awesome!! How does the ppl look? |
@SinanAkkoyun At this moment it's somewhat high (Q8_0):
|
This is normal for non-base models |
Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation. |
You can try my branch if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2
|
There is this PR from a while ago: #4093. Though DS2 seems not to use the "GPT-NeoX RoPE", as we call it, so it's probably not relevant
How many parameters are there? I don't think we have a better solution than adding them to the GGUF header |
@ggerganov here they are:
What do you think? |
I think it's fine to add those parameters |
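For a rough idea of what adding them could look like on the conversion side (the key names below are hypothetical, not necessarily what was eventually merged):

```python
# sketch: inside the model class in convert-hf-to-gguf.py; the gguf-py writer
# supports typed KV metadata, so the extra hyperparameters could be emitted as:
self.gguf_writer.add_uint32("deepseek2.attention.q_lora_rank",  self.hparams["q_lora_rank"])
self.gguf_writer.add_uint32("deepseek2.attention.kv_lora_rank", self.hparams["kv_lora_rank"])
self.gguf_writer.add_float32("deepseek2.expert_weights_scale",
                             self.hparams["routed_scaling_factor"])
```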
The difference in YaRN RoPE that I noticed is that llama.cpp scales sin and cos values with mscale calculated like this:
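(Roughly, this excerpt paraphrases the relevant lines of ggml's rope_yarn():)

```c
/* YaRN magnitude correction, paraphrased from ggml's rope_yarn() */
if (ext_factor != 0.0f) {
    /* Get n-d magnitude scaling corrected for interpolation */
    mscale *= 1.0f + 0.1f * logf(1.0f / freq_scale);
}
*cos_theta = cosf(theta) * mscale;
*sin_theta = sinf(theta) * mscale;
```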
while the DeepSeek-V2 PyTorch implementation uses the following code:
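(Approximately, this excerpt is from DeepseekV2YarnRotaryEmbedding in modeling_deepseek.py:)

```python
_mscale = float(
    yarn_get_mscale(self.scaling_factor, self.mscale)
    / yarn_get_mscale(self.scaling_factor, self.mscale_all_dim)
)
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", (emb.cos() * _mscale).to(dtype), persistent=False)
self.register_buffer("sin_cached", (emb.sin() * _mscale).to(dtype), persistent=False)
```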
where yarn_get_mscale is:
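```python
import math

def yarn_get_mscale(scale=1, mscale=1):
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0
```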
It uses the same calculation as llama.cpp, but twice - first for self.mscale (which is 0.707 in the config.json), then for self.mscale_all_dim (which is also 0.707 in the config.json), and then divides the first calculated value by the second. However, the result will be 1.0 since both mscales are the same. The DeepSeek-V2 vLLM implementation does this as well. There's even a comment:
In the DeepSeek-V2 paper there is: "Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy", but I'm not sure if they are talking about the difference I noticed. |
Hm, that's strange - what's the point of multiplying by a factor that always works out to 1.0? |
@CyberTimon I added support for the lite model in my branch, you can try it out now if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2 |
@ggerganov I think YaRN also affects the calculation of the sin/cos frequencies (the theta variable), so we can't simply disable it. Anyway, I found another quirk of DeepSeek-V2 - it uses a scalar value to scale the expert weights instead of normalizing them. After taking this into account, perplexity looks much better in the chat model (Q8_0):
Of course it will require another parameter to be added to the model headers. |
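A sketch of the routing difference being described, assuming the scalar is the routed_scaling_factor field from DeepSeek-V2's config.json (an illustration, not the actual llama.cpp code):

```python
import torch

def route(router_logits, k, routed_scaling_factor=16.0, normalize=False):
    scores = router_logits.softmax(dim=-1)
    weights, expert_idx = torch.topk(scores, k, dim=-1)
    if normalize:
        # what many MoE models do: renormalize the top-k weights to sum to 1
        weights = weights / weights.sum(dim=-1, keepdim=True)
    else:
        # what DeepSeek-V2 does instead: multiply by a fixed scalar
        weights = weights * routed_scaling_factor
    return weights, expert_idx
```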
https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite :P A model for everyone to test |
The MLA approach can probably be combined with the Pyramid KV cache - https://arxiv.org/abs/2405.12532 |
Is the main branch code now able to support DeepSeek-V2 inference? |
No, not yet |
For those who want to test DeepSeek-V2-Chat Lite: chatllm.cpp now supports it (with conditions). Compared to @fairydreaming's code, this one tries to follow the paper rather than the transformers implementation. |
@foldl Neat, what perplexity did you get on the lite model on wiki.test.raw? |
@fairydreaming I don't like to test perplexity. Instead, I compared each tensor of each layer against the reference implementation's output. |
Possibly relevant - #2445 (comment) |
I got an error like this:
This is with the latest version of llama.cpp. |
@DirtyKnightForVi It doesn't work for me either in the current master:
Out of memory apparently. Right, I don't have 805 GB of mem. I have no idea what's going on. |
Allow me to elaborate further: the error occurs when running on Windows. I encounter the same error when using the same llama.cpp to run DeepSeek-V2-Chat-Lite (mradermacher/DeepSeek-V2-Lite-GGUF). There seems to be an issue with the conversion of the vocabulary. Are you running it on Linux? |
@fairydreaming Try to add '-c 512'. Recently the examples started using a KV cache size equal to the model's training context by default - in this case 160k |
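For example (the model filename here is just a placeholder):

```sh
./main -m DeepSeek-V2-Chat.Q8_0.gguf -c 512 -p "Hello"
```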
Yes, I use Linux. I tried the smallest one from mradermacher and it ran without problems. |
I’ve encountered another issue, and I’m not sure whether any of the parameters in my command are having an effect: both my GPU and memory usage are below 10%, yet the model is running. My machine has an A4500 with 20 GB of VRAM and 64 GB of RAM.
My command: |
@ggerganov OK, thanks for the info. |
@DirtyKnightForVi Did you try some other model to see if your environment works correctly? |
Running other models poses no issue. However, I'm curious why you encountered an OOM error while I was able to smoothly run inference on a 200B model with minimal resource consumption. At least, the data on the monitoring dashboard seems to suggest that a non-existent device is running the model for me. LOL |
@DirtyKnightForVi I have limited knowledge of Windows, but I guess there is some disk swap mechanism in use. |
@fairydreaming I'm running it on Ubuntu, and CPU offload may be the reason why it works well. @ggerganov This might be a default setting, but are there other configurations that can fully utilize my CPU or GPU? I'm quite curious about the origin of this setting. |
@DirtyKnightForVi It's because you run it with context size (n_ctx) set to 512, while on my machine it was set to default training context size value of 163840. |
please support deepseek-ai/DeepSeek-V2-Chat
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat