
llama : add DeepSeek-v2-Chat support #7118

Closed
DirtyKnightForVi opened this issue May 7, 2024 · 41 comments · Fixed by #7519
Labels
good first issue (Good for newcomers), model (Model specific)

Comments

@DirtyKnightForVi

please support deepseek-ai/DeepSeek-V2-Chat

https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat

@SinanAkkoyun

That would be awesome.

@jeff31415

Impressive model, and potentially a CPU-friendly one (if you have >96 GB of memory)

@SinanAkkoyun

@ggerganov I'd be very interested in helping; I want to get into porting models to inference engines.

Would you be so kind as to provide a rough outline of what needs to be done here? I'd then submit a draft PR and ask about the small details that don't work.

@ggerganov
Owner

Interesting - can we get a rundown of the multi-head latent KV cache technique:

[image: Multi-head Latent Attention (MLA) diagram from the DeepSeek-V2 paper]

@SinanAkkoyun Look at PRs that have already been merged which add support for new model arches

@DirtyKnightForVi
Author

Sure thing. Here's their tech report: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf
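For a rough sense of the memory savings the MLA KV cache targets, here is a back-of-the-envelope comparison using the DeepSeek-V2 dimensions that show up later in this thread (60 layers, 128 heads, K/V head sizes 192/128, kv_lora_rank 512, 64 RoPE dims). This is only an illustrative sketch of the idea described in the paper, not llama.cpp code:

# Per-token KV cache, f16 (2 bytes per element); numbers are from the GGUF
# metadata dumped later in this thread. The MLA line follows the paper's
# description of caching the compressed latent plus the decoupled RoPE key.
n_layer, n_head = 60, 128
k_head_dim, v_head_dim = 192, 128   # qk_nope (128) + qk_rope (64); v head dim is 128
kv_lora_rank, rope_dim = 512, 64

mha_per_token = n_layer * n_head * (k_head_dim + v_head_dim)  # cache full K and V
mla_per_token = n_layer * (kv_lora_rank + rope_dim)           # cache latent + RoPE key

print(f"full K/V cache : {mha_per_token * 2 / 1e6:.1f} MB per token")  # ~4.9 MB
print(f"MLA latent     : {mla_per_token * 2 / 1e3:.1f} KB per token")  # ~69 KB
print(f"reduction      : ~{mha_per_token / mla_per_token:.0f}x")       # ~71x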

@ggerganov
Owner

Thanks, very cool work! Adding this to the roadmap to give it more visibility

@ggerganov changed the title from "Please Support DeepSeek-v2-Chat" to "llama : add DeepSeek-v2-Chat support" on May 9, 2024
@ggerganov added the good first issue (Good for newcomers) and model (Model specific) labels on May 9, 2024
@taozhiyuai

+1

@fairydreaming
Collaborator

I'm working on it right now: https://youtu.be/1AG-GUtDvaw
The code needs some cleanup, so it's not published yet.

@SinanAkkoyun

@fairydreaming Oh wow how awesome!! How does the ppl look?

@fairydreaming
Collaborator

fairydreaming commented May 15, 2024

@fairydreaming Oh wow how awesome!! How does the ppl look?

@SinanAkkoyun At this moment it's somewhat high (Q8_0):

perplexity: tokenizing the input ..
perplexity: tokenization took 1107.87 ms
perplexity: calculating perplexity over 596 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 91.94 seconds per pass - ETA 3 hours 48.32 minutes
[1]6.4552,[2]7.7478,[3]6.8637,[4]7.1755,[5]7.5298,[6]8.4102,[7]8.7088,[8]9.0019,[9]9.5003,[10]9.8350,[11]9.9215,[12]10.1602,[13]10.2808,[14]10.3361,[15]10.2942,[16]10.4948,[17]9.7985,[18]9.8037,[19]9.8295,[20]9.6260

@ggerganov
Owner

At this moment it's somewhat high (Q8_0)

This is normal for non-base models

@CyberTimon

Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation.

@fairydreaming
Collaborator

fairydreaming commented May 17, 2024

You can try my branch if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2
The model works but there are several issues:

  • The implementation is suboptimal, since it permutes K and Q tensors during inference. I will try to avoid this by permuting model tensors during conversion instead.
  • I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?
  • The implementation still caches whole K and V tensors instead of the parts marked on the model diagram above (I don't think I'm going to change this, even the transformers implementation does the same).
  • Some model-specific parameters are hardcoded in the code. I'm not sure what to do with them; I don't think we want to add every little parameter from the myriad of model architectures to GGUF model files.

@ggerganov
Owner

I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?

There is this PR from a while ago: #4093

Though DS2 seems to not use the "GPT-NeoX RoPE" as we call it, so probably not relevant

Some model-specific parameters are hardcoded in the code. I'm not sure what to do with them; I don't think we want to add every little parameter from the myriad of model architectures to GGUF model files.

How many parameters are there? I don't think we have a better solution than adding them to the GGUF header

@fairydreaming
Collaborator

How many parameters are there? I don't think we have a better solution than adding them to the GGUF header

@ggerganov here they are:

                    // TODO maybe move some of these to hparams
                    const uint32_t n_shared_experts = 2;
                    const uint32_t moe_intermediate_size = 1536;
                    const uint32_t q_lora_rank = 1536;
                    const uint32_t kv_lora_rank = 512;
                    const uint32_t first_k_dense_replace = 1;
  • moe_intermediate_size is needed because intermediate_size is used for the dense FFN intermediate size,
  • q_lora_rank and kv_lora_rank are the latent compressed Q and KV dimensions (consult the image above),
  • first_k_dense_replace says from which layer on MoE is used instead of a dense FFN (so layer 0 has no MoE, but a dense FFN instead).

What do you think?
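If they do end up in the GGUF header, a plain mapping might look like the sketch below. The key names are the ones that appear in the converted model dump further down in this thread; writing them as a simple dict here is just an illustration, not the actual convert script:

# Hypothetical mapping of the hardcoded DeepSeek-V2 parameters to GGUF metadata
# keys (key names taken from the model dump later in this thread).
deepseek2_hparams = {
    "deepseek2.expert_shared_count":        2,     # n_shared_experts
    "deepseek2.expert_feed_forward_length": 1536,  # moe_intermediate_size
    "deepseek2.attention.q_lora_rank":      1536,  # latent compressed Q dim
    "deepseek2.attention.kv_lora_rank":     512,   # latent compressed KV dim
    "deepseek2.leading_dense_block_count":  1,     # first_k_dense_replace
}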

@ggerganov
Owner

I think it's fine to add those parameters

@fairydreaming
Collaborator

fairydreaming commented May 17, 2024

I see some differences in YaRN implementation between DeepSeek-V2 and llama.cpp (calculation of mscale). Is there any YaRN expert on board?

There is this PR from a while ago: #4093

Though DS2 seems to not use the "GPT-NeoX RoPE" as we call it, so probably not relevant

The difference in YaRN RoPE that I noticed is that llama.cpp scales sin and cos values with mscale calculated like this:

mscale *= 1.0f + 0.1f * logf(1.0f / freq_scale);

while the DeepSeek-V2 transformers implementation uses the following code:

        _mscale = float(
            yarn_get_mscale(self.scaling_factor, self.mscale)
            / yarn_get_mscale(self.scaling_factor, self.mscale_all_dim)
        )

where yarn_get_mscale is:

def yarn_get_mscale(scale=1, mscale=1):
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

It uses the same calculation as llama.cpp, but twice: first for self.mscale (which is 0.707 in the config.json), then for self.mscale_all_dim (also 0.707 in the config.json), and then divides the first value by the second. However, the result will be 1.0 since both mscales are the same. The DeepSeek-V2 vLLM implementation does the same. There's even a comment:

# Get n-d magnitude scaling corrected for interpolation.

In the DeepSeek-V2 paper there is: "Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy", but I'm not sure if they are talking about the difference I noticed.
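A quick sanity check of the "this will be 1.0" claim, re-running the yarn_get_mscale function quoted above with DeepSeek-V2-Chat's values (scaling factor 40 and both mscales 0.707, per the config and the GGUF dump later in this thread):

import math

def yarn_get_mscale(scale=1, mscale=1):
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

scaling_factor = 40.0            # rope scaling factor
mscale = mscale_all_dim = 0.707  # both identical in config.json

ratio = yarn_get_mscale(scaling_factor, mscale) / yarn_get_mscale(scaling_factor, mscale_all_dim)
print(ratio)  # 1.0 -- the two corrections cancel exactly when the mscales are equal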

@ggerganov
Owner

Hm, that's strange - what's the point of multiplying by 1.0? Not sure if we should modify our implementation - probably we just need to disable YaRN for DS2 since it's basically a no-op based on the Python implementations

@fairydreaming
Collaborator

Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation.

@CyberTimon I added support for the lite model in my branch, you can try it out now if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2

@fairydreaming
Collaborator

Hm, that's strange - what's the point of multiplying by 1.0? Not sure if we should modify our implementation - probably we just need to disable YaRN for DS2 since it's basically a no-op based on the Python implementations

@ggerganov I think YaRN also affects the calculation of sin/cos frequencies (the theta variable), so we can't simply disable it. Anyway, I found another quirk of DeepSeek-V2: it uses a scalar value to scale the expert weights instead of normalizing them. After taking it into account, perplexity looks much better in the chat model (Q8_0):

perplexity: calculating perplexity over 596 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 63.76 seconds per pass - ETA 2 hours 38.33 minutes
[1]2.7414,[2]3.6534,[3]3.1132,[4]3.3036,[5]3.5037,[6]3.9715,[7]4.1896,[8]4.2031,[9]4.4069,[10]4.5289,[11]4.6015,[12]4.7431,[13]4.8987,[14]4.7905,[15]4.6756,[16]4.6905,[17]4.5251,[18]4.6219,[19]4.6456,[20]4.4898,[21]4.5219,[22]4.5331,[23]4.4675,[24]4.3658,[25]4.2529,[26]4.1937,[27]4.0689,[28]3.9773,[29]3.9261

Of course it will require another parameter to be added to the model headers.
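To illustrate the quirk (a toy sketch, not the actual routing code): a typical MoE router renormalizes the top-k expert weights so they sum to 1, while DeepSeek-V2 keeps the raw softmax probabilities of the selected experts and multiplies them by a fixed scalar (16 for the chat model, matching the expert_weights_scale value in the logs below):

import numpy as np

def route(logits, top_k=6, scale=16.0):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all routed experts
    top = np.argsort(probs)[-top_k:]            # pick the top-k experts

    normalized = probs[top] / probs[top].sum()  # common MoE: renormalize to sum to 1
    scaled     = probs[top] * scale             # DeepSeek-V2 style: fixed scalar

    return top, normalized, scaled

_, w_norm, w_scaled = route(np.random.randn(160))  # 160 routed experts
print(w_norm.sum())    # 1.0
print(w_scaled.sum())  # generally != 1.0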

@SinanAkkoyun

https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite

:P A model for everyone to test

@YavorGIvanov

YavorGIvanov commented May 22, 2024

The MLA approach can probably be combined with the Pyramid KV cache - https://arxiv.org/abs/2405.12532

@DirtyKnightForVi
Author

Is the main branch code now able to support DeepseekV2 inference?

@fairydreaming
Collaborator

Is the main branch code now able to support DeepseekV2 inference?

No, not yet

@foldl
Contributor

foldl commented May 24, 2024

For those who want to test DeepSeek-V2-Lite Chat: chatllm.cpp now supports it (with conditions).

Compared to @fairydreaming's code, this one tries to follow the paper rather than modeling_deepseek.py.

@fairydreaming
Collaborator

For those who want to test DeepSeek-V2-Lite Chat: chatllm.cpp now supports it (with conditions).

Compared to @fairydreaming's code, this one tries to follow the paper rather than modeling_deepseek.py.

@foldl Neat, what perplexity did you get on the lite model on wiki.test.raw?

@foldl
Contributor

foldl commented May 24, 2024

@fairydreaming I don't like testing perplexity. Instead, I compared each tensor of each layer against modeling_deepseek.py. The results show that the differences are only rounding errors.
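For anyone wanting to do a similar layer-by-layer check, it usually comes down to dumping intermediate activations from both implementations and comparing maximum absolute/relative differences. The snippet below is a generic sketch of that idea (file names and tolerances are made up), not chatllm.cpp's actual tooling:

import numpy as np

def compare_layer(ref, test, name, atol=1e-3):
    # Report max absolute and relative error between two activation dumps.
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    abs_err = np.abs(ref - test)
    rel_err = abs_err / (np.abs(ref) + 1e-12)
    status = "OK (rounding-level)" if abs_err.max() < atol else "MISMATCH"
    print(f"{name}: max_abs={abs_err.max():.2e} max_rel={rel_err.max():.2e} {status}")

# Hypothetical usage: activations dumped to .npy files from modeling_deepseek.py
# and from the C++ implementation, e.g.:
# compare_layer(np.load("ref_l0_attn.npy"), np.load("cpp_l0_attn.npy"), "layer0.attn_out")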

@oldmanjk

oldmanjk commented Jun 1, 2024

Possibly relevant - #2445 (comment)

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

I got an error like this:

E:/WorkingArea/llama_cpp/llama.cpp $ main --override-kv deepseek2.attention.q_lora_rank=int:1536 --override-kv deepseek2.attention.kv_lora_rank=int:512 --override-kv deepseek2.expert_shared_count=int:2 --override-kv deepseek2.expert_weights_scale=float:16 --override-kv deepseek2.expert_feed_forward_length=int:1536 --override-kv deepseek2.leading_dense_block_count=int:1 --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707 -m E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf -c 128 --color -i
Log start
main: build = 3083 (adc9ff38)
main: built with cc (GCC) 14.1.0 for x86_64-w64-mingw32
main: seed  = 1717592663
llama_model_loader: loaded meta data with 46 key-value pairs and 959 tensors from E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = Deepseek-V2-Chat
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 60
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 5120
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  13:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  14:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  15:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  16: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  17:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  18:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  19:                     deepseek2.expert_count u32              = 160
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:            deepseek2.attention.q_lora_rank i32              = 1536
llama_model_loader: - kv  33:           deepseek2.attention.kv_lora_rank i32              = 512
llama_model_loader: - kv  34:              deepseek2.expert_shared_count i32              = 2
llama_model_loader: - kv  35:             deepseek2.expert_weights_scale f32              = 16.000000
llama_model_loader: - kv  36:       deepseek2.expert_feed_forward_length i32              = 1536
llama_model_loader: - kv  37:        deepseek2.leading_dense_block_count i32              = 1
llama_model_loader: - kv  38: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  39:                      quantize.imatrix.file str              = imatrix.dat
llama_model_loader: - kv  40:                   quantize.imatrix.dataset str              = groups_merged.txt
llama_model_loader: - kv  41:             quantize.imatrix.entries_count i32              = 716
llama_model_loader: - kv  42:              quantize.imatrix.chunks_count i32              = 62
llama_model_loader: - kv  43:                                   split.no u16              = 0
llama_model_loader: - kv  44:                                split.count u16              = 0
llama_model_loader: - kv  45:                        split.tensors.count i32              = 959
llama_model_loader: - type  f32:  300 tensors
llama_model_loader: - type q8_0:  659 tensors
validate_override: Using metadata override (  int) 'deepseek2.leading_dense_block_count' = 1
validate_override: Using metadata override (  int) 'deepseek2.attention.q_lora_rank' = 1536
validate_override: Using metadata override (  int) 'deepseek2.attention.kv_lora_rank' = 512
validate_override: Using metadata override (  int) 'deepseek2.expert_feed_forward_length' = 1536
validate_override: Using metadata override (  int) 'deepseek2.expert_shared_count' = 2
validate_override: Using metadata override (float) 'deepseek2.expert_weights_scale' = 16.000000
validate_override: Using metadata override (float) 'deepseek2.rope.scaling.yarn_log_multiplier' = 0.070700
llama_model_load: error loading model: error loading model vocabulary: wstring_convert::from_bytes
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'E:/model_tmps/DeepSeek-V2-Chat.Q8_0.gguf'
main: error: unable to load model

Using the latest version of llama.cpp.
Model from leafspark/DeepSeek-V2-Chat-GGUF, merged with gguf-split.

@fairydreaming
Collaborator

@DirtyKnightForVi It doesn't work for me either in the current master:

...
llama_new_context_with_model: n_ctx      = 163840
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 805306368032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '/mnt/md0/models/deepseek-v2-chat-Q8_0.gguf'
main: error: unable to load model

Out of memory apparently. Right, I don't have 805 GB of mem. I have no idea what's going on.

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

Allow me to elaborate further: the error occurs when running on Windows. I encounter the same error when using the same llama.cpp to run deepseek-v2-chat-lite (mradermacher/DeepSeek-V2-Lite-GGUF).

There seems to be an issue with the conversion of the vocabulary list. Are you running it on Linux?

@ggerganov
Owner

@fairydreaming Try to add '-c 512'. Recently the examples started using a KV cache size equal to the model training context by default - in this case 160k

@fairydreaming
Collaborator

Allow me to elaborate further: the error occurs when running on Windows. I encounter the same error when using the same llama.cpp to run deepseek-v2-chat-lite (mradermacher/DeepSeek-V2-Lite-GGUF).

There seems to be an issue with the conversion of the vocabulary list. Are you running it on Linux?

Yes, I use Linux. I tried the smallest one from mradermacher and it ran without problems.

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

I've encountered another issue, and I'm not sure if any of the parameters in my command are having an effect: both my GPU and memory usage are below 10%, yet the model is running. My machine has an A4500 with 20 GB of VRAM and 64 GB of system memory.

Thu Jun  6 07:42:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               Off | 00000000:01:00.0  On |                  Off |
| 30%   35C    P8              14W / 200W |    532MiB / 20470MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1389      G   /usr/lib/xorg/Xorg                          261MiB |
|    0   N/A  N/A      1687      G   /usr/bin/gnome-shell                        170MiB |
|    0   N/A  N/A      6343      G   ...13,262144 --variations-seed-version       88MiB |
+---------------------------------------------------------------------------------------+

(base) jiyin@jiyin:/media/jiyin/ResearchSpace1/llama.cpp$ free -m
               total        used        free      shared  buff/cache   available
Mem:       64137        2910        1027           7       60200       60503
Swap:      62499        3704       58795

(base) jiyin@jiyin:/media/jiyin/ResearchSpace1/llama.cpp$ top

top - 07:53:21 up 56 min,  1 user,  load average: 6.26, 6.26, 5.36
Tasks: 401 total,   1 running, 399 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.0 us,  1.4 sy,  0.0 ni, 62.2 id, 29.3 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64137.8 total,   1157.0 free,   3072.0 used,  59908.8 buff/cache
MiB Swap:  62500.0 total,  58802.2 free,   3697.8 used.  60342.8 avail Mem 

my command:
(base) jiyin@jiyin:/media/jiyin/ResearchSpace1/llama.cpp$ ./main --override-kv deepseek2.attention.q_lora_rank=int:1536 --override-kv deepseek2.attention.kv_lora_rank=int:512 --override-kv deepseek2.expert_shared_count=int:2 --override-kv deepseek2.expert_weights_scale=float:16 --override-kv deepseek2.expert_feed_forward_length=int:1536 --override-kv deepseek2.leading_dense_block_count=int:1 --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707 -m /media/jiyin/新加卷3/model_tmps/DeepSeek-V2-Chat.q4_k_m.split-00001-of-00008.gguf -p "hello"

@fairydreaming
Collaborator

@fairydreaming Try to add '-c 512'. Recently the examples started using a kv cache size equal to the model training context by default - in this case 16k

@ggerganov OK, thanks for the info.

@fairydreaming
Collaborator

@DirtyKnightForVi Did you try some other model to see if your environment works correctly?

@DirtyKnightForVi
Author

@DirtyKnightForVi Did you try some other model to see if your environment works correctly?

Running other models poses no issue. However, I'm curious why you encountered an OOM error while I was able to smoothly run inference on a 200B model with minimal resource consumption. At least, the data on the monitoring dashboard seems to suggest that a non-existent device is running the model for me. LOL

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

[Screenshot 2024-06-06 08-03-06]

Deepseek still running

@fairydreaming
Collaborator

@DirtyKnightForVi I have limited knowledge of Windows, but I guess there is some disk swap mechanism in use.

@DirtyKnightForVi
Author

DirtyKnightForVi commented Jun 5, 2024

[Screenshot 2024-06-06 08-13-38]

@fairydreaming I'm running it on Ubuntu. CPU offloading may be the reason why it works well.

@ggerganov This might be a default setting, but are there other configurations that can fully utilize my CPU or GPU? I'm quite curious where this setting comes from.

@fairydreaming
Collaborator

@DirtyKnightForVi Did you try some other model to see if your environment works correctly?

Running other models poses no issue. However, I'm curious as to why you encountered an OOM error, while I was able to smoothly infer a 200B large model with minimal resource consumption?

@DirtyKnightForVi It's because you ran it with the context size (n_ctx) set to 512, while on my machine it was set to the default training context size of 163840.
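The 805 GB allocation earlier in the thread matches that explanation exactly: with whole K and V tensors cached (as noted above), the buffer grows linearly with n_ctx. A rough check, illustrative arithmetic only:

# KV cache size with full K/V caching, f16 (2 bytes per element).
n_layer, n_head = 60, 128
k_head_dim, v_head_dim = 192, 128   # attention.key_length / value_length

def kv_cache_bytes(n_ctx):
    return n_ctx * n_layer * n_head * (k_head_dim + v_head_dim) * 2

print(kv_cache_bytes(163840) / 1e9)  # ~805.3 GB -- the failed allocation above
print(kv_cache_bytes(512) / 1e9)     # ~2.5 GB with -c 512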
