
Implementing Code Llama 7B #370

Closed
flavienbwk opened this issue Aug 25, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

@flavienbwk

Please describe the feature you want

Code Llama, released yesterday by Meta, claims better performance than GPT-3.5 for code generation.

I saw the following project: https://huggingface.co/TabbyML/CodeLlama-7B

When is it scheduled to be released?

Thanks a lot to the TabbyML team.

@flavienbwk flavienbwk added the enhancement New feature or request label Aug 25, 2023
@wsxiaoys
Member

I have an under-development version hosted at https://huggingface.co/TabbyML/CodeLlama-7B. However, we are still working on implementing the tokenization of stop words for line breaks.

I will keep you updated on our progress regarding this issue.

@wsxiaoys
Member

wsxiaoys commented Aug 28, 2023

Once #371 is merged and released in the daily docker build, TabbyML/CodeLlama-7B shall work as intended.

Please note that this model (7B) is significantly larger than our current recommendation for a T4 GPU, such as SantaCoder-1B.

@flavienbwk
Author

flavienbwk commented Aug 30, 2023

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

@wsxiaoys
Member

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
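
For a rough sanity check on that figure, here is a back-of-the-envelope sketch (an estimate only, not Tabby's actual memory accounting): int8 stores one byte per weight, so the weights of a 7B-parameter model alone come to roughly 6.5 GiB, and the remaining headroom goes to activations, the KV cache and CUDA overhead.

# Back-of-the-envelope VRAM estimate for CodeLlama-7B in int8
# (a rough sketch, not Tabby's real memory accounting).
params = 7_000_000_000
bytes_per_weight = 1                              # int8: one byte per parameter
weights_gib = params * bytes_per_weight / 2**30
print(f"weights alone: ~{weights_gib:.1f} GiB")   # ~6.5 GiB
# Activations, the KV cache and CUDA overhead account for the rest,
# which is why ~8GB of VRAM is the practical requirement.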

@jeff31415

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.

Will there be further support for quantisation, like GPTQ, to make even bigger models more usable?

@flavienbwk
Author

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.

I am surprised, because I've tested it on my Nvidia P100 (16GB VRAM) and the container returns:

2023-08-30T14:13:44.171766Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/vocabulary.json
  [00:00:00] [##################################################################] 496.94 KiB/496.94 KiB (2.92 MiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/tabby.json
  [00:00:00] [############################################################################] 143B/143B (303.65 KiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/tokenizer.json
  [00:00:00] [######################################################################] 1.76 MiB/1.76 MiB (5.33 MiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/config.json
  [00:00:00] [############################################################################] 103B/103B (211.54 KiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/model.bin
  [00:00:43] [##################################################################] 12.55 GiB/12.55 GiB (293.80 MiB/s, 0s)2023-08-30T14:14:29.690394Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error out of memory

With the command:

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda

Would you have any idea?

@wsxiaoys
Member

For the code completion use case, the rough threshold is around 10 billion parameters, beyond which tensor parallelism (model parallelism) becomes necessary to keep latency reasonable. Therefore, it's unlikely that we will invest significant effort in that direction.

As for FAQ use cases, since the latency requirements are considerably more relaxed in this scenario, we are very interested in exploring inference with GPT-Q.
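
As a rough illustration of what GPTQ inference looks like outside of Tabby (a sketch only; the checkpoint name below is a community example rather than something Tabby ships, and it needs the optimum and auto-gptq packages alongside transformers):

# Sketch: loading a GPTQ-quantized checkpoint with transformers + optimum + auto-gptq.
# The repo name is illustrative, not part of Tabby.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/CodeLlama-7B-GPTQ"   # example community GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))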

@flavienbwk
Author

Maybe I had forgotten the --compute-type option, but adding it outputs this error:

# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

@wsxiaoys
Member

Maybe I had forgotten the --compute-type option, but adding it outputs this error:

# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

Could you share your CUDA setup? Maybe attach the output of nvidia-smi?

@flavienbwk
Author

Here it is:

# nvidia-smi
Wed Aug 30 14:40:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Is the Pascal architecture too old?

@wsxiaoys
Member

wsxiaoys commented Aug 30, 2023

Is the Pascal architecture too old?

Yes - int8 precision requires CUDA compute capability >= 7.0 or 6.1, while the P100 has a compute capability of 6.0 and thus can only use float32 inference.

I added an FAQ section to the website to elaborate on this further: https://tabbyml.github.io/tabby/docs/faq
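
For anyone hitting the same error, a quick way to check what your GPU reports (a sketch assuming a working PyTorch + CUDA install; this is independent of Tabby's own tooling):

# Print the CUDA compute capability of the local GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
# int8 inference here needs >= 7.0 (or 6.1); a Tesla P100 reports 6.0,
# so it has to fall back to float32.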

@flavienbwk
Author

Very clear, thank you.

@flavienbwk
Author

Confirmed working on RTX3070 with 6849MiB / 8192MiB of VRAM.

@jeff31415

For the code completion use case, the rough threshold is around 10 billion parameters, beyond which tensor parallelism (model parallelism) becomes necessary to keep latency reasonable. Therefore, it's unlikely that we will invest significant effort in that direction.

As for FAQ use cases, since the latency requirements are considerably more relaxed in this scenario, we are very interested in exploring inference with GPT-Q.

As far as I know, there is still something that can be done to decrease latency on larger models, with some tricks to overcome the VRAM bandwidth bottleneck and increase GPU utilization.
For example, the assisted generation trick discussed in this blog: https://huggingface.co/blog/assisted-generation
It achieved a ~2x speedup in single-stream generation. Such a speedup could be even more substantial on VRAM-bandwidth-starved gaming GPUs (the RTX cards). In this way, running a ~15B or even ~30B SOTA model with reasonable latency might be achievable 🤔
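
For context, here is a minimal sketch of what assisted generation looks like with the transformers API from that blog post (model names are placeholders, the draft model must share the main model's tokenizer, and this is not how Tabby's inference stack is wired up):

# Sketch of assisted generation (speculative decoding) with transformers,
# following https://huggingface.co/blog/assisted-generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

main_id = "codellama/CodeLlama-7b-hf"                 # large main model
draft_id = "path/to/small-draft-with-same-tokenizer"  # hypothetical small draft model

tokenizer = AutoTokenizer.from_pretrained(main_id)
model = AutoModelForCausalLM.from_pretrained(main_id, torch_dtype=torch.float16, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# The draft model proposes several tokens at a time; the main model verifies
# them in one forward pass, cutting single-stream latency roughly in half.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))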

@wsxiaoys wsxiaoys reopened this Sep 1, 2023
@wsxiaoys
Member

wsxiaoys commented Sep 1, 2023

https://x.com/ggerganov/status/1694775472658198604?s=46

Might be worth prioritizing llama.cpp support/integration, since speculative decoding (assisted generation) gives such a high performance bump…
