
Inference starcoder(4bit\8bit) with GPU #417

Open · curname opened this issue Jul 26, 2023 · 4 comments


curname commented Jul 26, 2023

First of all, thank you for your work! I used ggml to quantize the StarCoder model to 8-bit (and 4-bit), but I ran into difficulties when using the GPU for inference. If you could provide an example, I would be very grateful.

johnson442 (Contributor) commented:

./bin/starcoder-mmap -m /models/WizardCoder-15B-1.0.ggmlv3.q5_1.bin -ngl 20 -p "def fibonacci(n):"
main: seed = 1690402839
starcoder_model_load: loading model from '/models/WizardCoder-15B-1.0.ggmlv3.q5_1.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2009
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml map size = 13596.73 MB
starcoder_model_load: ggml ctx size =   0.24 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1
starcoder_model_load: kv_cache memory size =  7680.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.24 MB
starcoder_model_load: [cublas] offloading 20 layers to GPU
starcoder_model_load: [cublas] total VRAM used: 6480 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'def fibonacci(n):'
main: number of tokens in prompt = 6
main: token[0] =    589, def
main: token[1] =  28176,  fib
main: token[2] =  34682, onacci
main: token[3] =     26, (
main: token[4] =     96, n
main: token[5] =    711, ):


Calling starcoder_eval
def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

n = int(input("Enter a positive integer: "))
if n < 0:
    print("Invalid input!")
else:
    print("The", n, "th Fibonacci number is:", fibonacci(n))<|endoftext|>

main: mem per token =   460268 bytes
main:     load time =  2807.24 ms
main:   sample time =    24.13 ms
main:  predict time = 21730.91 ms / 231.18 ms per token
main:    total time = 29279.35 ms
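For reference, the working recipe implied by the log above is roughly the following. This is a sketch rather than an authoritative guide: the model path and layer count are taken from the run above, and the cmake flag is the one used later in this thread.

```sh
# Build the ggml examples with cuBLAS support
mkdir -p build && cd build
cmake -DGGML_CUBLAS=ON ..
cmake --build . --target starcoder-mmap

# Run the mmap variant and offload layers to the GPU with -ngl;
# in the log above, 20 layers of the q5_1 model used ~6.5 GB of VRAM
./bin/starcoder-mmap -m /models/WizardCoder-15B-1.0.ggmlv3.q5_1.bin -ngl 20 -p "def fibonacci(n):"
```

Judging by the logs in the next comment, only starcoder-mmap honors -ngl; the plain starcoder binary loads the model without offloading anything.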


staviq commented Jul 27, 2023

It's not working for me either.

#>cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc ..
-- The C compiler identification is GNU 12.2.1
-- The CXX compiler identification is GNU 12.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.39.1") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Found CUDAToolkit: /opt/cuda/include (found version "12.2.91") 
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.2.91
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- GGML CUDA sources found, configuring CUDA architecture
-- x86 detected
-- Linux detected
-- Configuring done
-- Generating done
-- Build files have been written to: /storage/ggml/build

And starcoder doesn't even try using the GPU:

./starcoder -m /storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin -p "sqrt(4)" -ngl 1   
main: seed = 1690477492
starcoder_model_load: loading model from '/storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2007
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml ctx size = 34536.48 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 19176.25 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.

starcoder-mmap does this:

./starcoder-mmap -m /storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin -p "sqrt(4)" -ngl 1
main: seed = 1690477603
starcoder_model_load: loading model from '/storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2007
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml map size = 19176.73 MB
starcoder_model_load: ggml ctx size =   0.24 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
starcoder_model_load: kv_cache memory size =  7680.00 MB, n_mem = 327680
starcoder_model_load: model size  = 19176.25 MB
starcoder_model_load: [cublas] offloading 1 layers to GPU
starcoder_model_load: [cublas] total VRAM used: 459 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'sqrt(4)'
main: number of tokens in prompt = 4
main: token[0] =   8663, sqrt
main: token[1] =     26, (
main: token[2] =     38, 4
main: token[3] =     27, )


Calling starcoder_eval
CUDA error 222 at /storage/ggml/src/ggml-cuda.cu:3509: the provided PTX was compiled with an unsupported toolchain.

EDIT: llama.cpp works just fine for me, though.
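A hedged note on this failure: CUDA error 222 is cudaErrorUnsupportedPtxVersion, i.e. the PTX embedded in the binary was produced by a newer toolkit (CUDA 12.2 here) than the installed driver knows how to JIT-compile. Assuming nothing beyond what the log shows, two plausible fixes are updating the NVIDIA driver to one that supports CUDA 12.2, or building native device code for the card (compute capability 7.5 above) so no PTX JIT is needed:

```sh
# Sketch: pin the CUDA architecture to sm_75 (RTX 2070 SUPER) so nvcc emits
# SASS the driver can load directly, instead of PTX it has to JIT-compile
cmake -DGGML_CUBLAS=ON \
      -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc \
      -DCMAKE_CUDA_ARCHITECTURES=75 ..
```

This could also explain why llama.cpp works: if its build targets the card's architecture directly, the driver never has to JIT the 12.2 PTX. (That last part is an inference, not something the thread confirms.)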

slaren (Collaborator) commented Jul 28, 2023

The CUDA backend requires some changes to the code to do full offloading; otherwise it is only used for multiplying large matrices (which generally only happens when evaluating long prompts). It will become easier to use once we implement a common interface for all the backends, but that is going to take a while.

For an example of how to use it, you can look at the llama.cpp source code. In the future, llama.cpp will also be extended to support other LLMs.
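To make the pointer to llama.cpp concrete, here is a rough sketch of the per-layer offload pattern it used at the time (and that starcoder-mmap mirrors). The field and function names follow the ggml-cuda API of that era and may differ between versions:

```c
// Offload one weight tensor: tag it as GPU-resident, then upload its data
// so ggml's CUDA kernels use it directly, instead of the GPU only being
// borrowed for occasional large matrix multiplications.
#include "ggml.h"
#include "ggml-cuda.h"

static void offload_layer_weight(struct ggml_tensor * w) {
    w->backend = GGML_BACKEND_GPU;          // route ops on w to the CUDA backend
    ggml_cuda_transform_tensor(w->data, w); // copy (and lay out) the data in VRAM
}
```

The -ngl flag simply applies this to the weights of the requested number of transformer layers, which is why the VRAM figures in the logs above scale with the layer count (459 MB for 1 layer vs. 6480 MB for 20).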


staviq commented Jul 28, 2023

> The CUDA backend requires some changes to the code to do full offloading; otherwise it is only used for multiplying large matrices (which generally only happens when evaluating long prompts). It will become easier to use once we implement a common interface for all the backends, but that is going to take a while.
>
> For an example of how to use it, you can look at the llama.cpp source code. In the future, llama.cpp will also be extended to support other LLMs.

Thank you for the explanation.
