
Implementing Code Llama 7B #370

Closed
flavienbwk opened this issue Aug 25, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

@flavienbwk

Please describe the feature you want

Code Llama, released yesterday by Meta, claims better performance than GPT-3.5 for code generation.

I saw the following project: https://huggingface.co/TabbyML/CodeLlama-7B

When is it scheduled to be released?

Thanks a lot to the TabbyML team.

@flavienbwk flavienbwk added the enhancement New feature or request label Aug 25, 2023
@wsxiaoys
Member

I have an under-development version hosted at https://huggingface.co/TabbyML/CodeLlama-7B. However, we are still working on implementing the tokenization of stop words for line breaks.

I will keep you updated on our progress regarding this issue.

@wsxiaoys
Member

wsxiaoys commented Aug 28, 2023

Once #371 is merged and released in the daily docker build, TabbyML/CodeLlama-7B shall work as intended.

Please note that this model (7B) is significantly larger than our current recommendation for a T4 GPU, such as SantaCoder-1B.

@flavienbwk
Author

flavienbwk commented Aug 30, 2023

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

@wsxiaoys
Member

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
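
For a rough sanity check on that figure, here is a back-of-the-envelope sketch (an estimate only, not Tabby's actual memory accounting): int8 stores one byte per weight, so the weights of a 7B-parameter model alone come to roughly 6.5 GiB, and the remaining headroom goes to activations, the KV cache and CUDA overhead.

# Back-of-the-envelope VRAM estimate for CodeLlama-7B in int8
# (a rough sketch, not Tabby's real memory accounting).
params = 7_000_000_000
bytes_per_weight = 1                              # int8: one byte per parameter
weights_gib = params * bytes_per_weight / 2**30
print(f"weights alone: ~{weights_gib:.1f} GiB")   # ~6.5 GiB
# Activations, the KV cache and CUDA overhead account for the rest,
# which is why ~8GB of VRAM is the practical requirement.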

@jeff31415

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.

Will there be further support for quantisation, like GPTQ, to make even bigger models more usable?

@flavienbwk
Author

@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.

By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.

I am surprised, because I've tested it on my Nvidia P100 (16GB VRAM) and the container returns:

2023-08-30T14:13:44.171766Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/vocabulary.json
  [00:00:00] [##################################################################] 496.94 KiB/496.94 KiB (2.92 MiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/tabby.json
  [00:00:00] [############################################################################] 143B/143B (303.65 KiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/tokenizer.json
  [00:00:00] [######################################################################] 1.76 MiB/1.76 MiB (5.33 MiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/config.json
  [00:00:00] [############################################################################] 103B/103B (211.54 KiB/s, 0s)Downloaded /data/models/TabbyML/CodeLlama-7B/ctranslate2/model.bin
  [00:00:43] [##################################################################] 12.55 GiB/12.55 GiB (293.80 MiB/s, 0s)2023-08-30T14:14:29.690394Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error out of memory

With the command:

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda

Would you have any idea?

@wsxiaoys
Member

For the code completion use case, the rough threshold is around 10 billion parameters, beyond which tensor parallelism (model parallelism) becomes necessary to keep latency reasonable. Therefore, it's unlikely that we will invest significant effort in that direction.

As for FAQ use cases, since the latency requirements are considerably more relaxed in this scenario, we are very interested in exploring inference with GPT-Q.
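
As a rough illustration of what GPTQ inference looks like outside of Tabby (a sketch only; the checkpoint name below is a community example rather than something Tabby ships, and it needs the optimum and auto-gptq packages alongside transformers):

# Sketch: loading a GPTQ-quantized checkpoint with transformers + optimum + auto-gptq.
# The repo name is illustrative, not part of Tabby.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/CodeLlama-7B-GPTQ"   # example community GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))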

@flavienbwk
Author

Maybe I had forgotten the --compute-type option, but adding it outputs this error:

# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

@wsxiaoys
Member

Maybe I had forgotten the --compute-type option, but adding it outputs this error:

# docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda --compute-type int8
2023-08-30T14:27:23.824958Z  INFO tabby_download: crates/tabby-download/src/lib.rs:66: Start downloading model `TabbyML/CodeLlama-7B`
2023-08-30T14:27:23.826825Z  INFO tabby::serve: crates/tabby/src/serve/mod.rs:128: Starting server, this might takes a few minutes...
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Requested int8 compute type, but the target device or backend do not support efficient int8 computation.

Could you share your CUDA setup? Maybe attach the output of nvidia-smi?

@flavienbwk
Author

Here it is:

# nvidia-smi
Wed Aug 30 14:40:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Is the Pascal architecture too old?

@wsxiaoys
Member

wsxiaoys commented Aug 30, 2023

Is the Pascal architecture too old?

Yes - int8 precision requires CUDA compute capability >= 7.0 or 6.1, while the P100 has a compute capability of 6.0 and thus can only use float32 inference.

I added an FAQ section to the website to elaborate on this further: https://tabbyml.github.io/tabby/docs/faq
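
For anyone hitting the same error, a quick way to check what your GPU reports (a sketch assuming a working PyTorch + CUDA install; this is independent of Tabby's own tooling):

# Print the CUDA compute capability of the local GPU.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
# int8 inference here needs >= 7.0 (or 6.1); a Tesla P100 reports 6.0,
# so it has to fall back to float32.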

@flavienbwk
Author

Very clear, thank you.

@flavienbwk
Author

Confirmed working on RTX3070 with 6849MiB / 8192MiB of VRAM.

@jeff31415

For the code completion use case, the rough threshold is around 10 billion parameters, beyond which tensor parallelism (model parallelism) becomes necessary to keep latency reasonable. Therefore, it's unlikely that we will invest significant effort in that direction.

As for FAQ use cases, since the latency requirements are considerably more relaxed in this scenario, we are very interested in exploring inference with GPT-Q.

As far as I know, there is still something that can be done to decrease latency on larger models, with some tricks to overcome the VRAM bandwidth bottleneck and increase GPU utilization.
For example, the assisted generation trick discussed in this blog: https://huggingface.co/blog/assisted-generation
It achieved a ~2x speedup in single-stream generation. Such a speedup could be even more substantial on VRAM-bandwidth-starved gaming GPUs (the RTX cards). In this way, running a ~15B or even ~30B SOTA model with reasonable latency might be achievable 🤔
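
For context, here is a minimal sketch of what assisted generation looks like with the transformers API from that blog post (model names are placeholders, the draft model must share the main model's tokenizer, and this is not how Tabby's inference stack is wired up):

# Sketch of assisted generation (speculative decoding) with transformers,
# following https://huggingface.co/blog/assisted-generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

main_id = "codellama/CodeLlama-7b-hf"                 # large main model
draft_id = "path/to/small-draft-with-same-tokenizer"  # hypothetical small draft model

tokenizer = AutoTokenizer.from_pretrained(main_id)
model = AutoModelForCausalLM.from_pretrained(main_id, torch_dtype=torch.float16, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# The draft model proposes several tokens at a time; the main model verifies
# them in one forward pass, cutting single-stream latency roughly in half.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))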

@wsxiaoys wsxiaoys reopened this Sep 1, 2023
@wsxiaoys
Member

wsxiaoys commented Sep 1, 2023

https://x.com/ggerganov/status/1694775472658198604?s=46

Might be worth prioritizing llama.cpp support/integration, since speculative decoding (assisted generation) gives such a high performance bump…
