
Add StarCoder/SantaCoder example #146

Merged: 14 commits merged into ggerganov:master on May 13, 2023

Conversation

@NouamaneTazi (Contributor) commented May 11, 2023

Adds support for StarCoder and SantaCoder (aka smol StarCoder).

Quickstart:

# Convert HF model to ggml
python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
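# (this should write models/bigcode/gpt_bigcode-santacoder-ggml.bin, which the quantize step below reads)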

mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize

# quantize the model
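# (the trailing 3 selects 4-bit q4_1; it shows up as ftype = 3 in the load logs below)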
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

# run inference
./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2
Performance for Santacoder on M1 Pro:
$ ./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 0 --top_p 0.95 --temp 0.2      
main: seed = 1683881276
gpt2_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7 

def fibonnaci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibo(10))
main: mem per token =  9597928 bytes
main:     load time =   480.43 ms
main:   sample time =    26.21 ms
main:  predict time =  3987.95 ms / 19.36 ms per token
main:    total time =  4580.56 ms
Performance for StarCoder on M1 Pro:

Pretty slow, as it requires 30GB of RAM whilst my laptop only has 16GB (the memory requirement could still be reduced by using MQA instead of MHA; that could be done in a follow-up PR)

$ ./bin/starcoder -m ../starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 1 -n 2
main: seed = 1683838878
gpt2_model_load: loading model from '../starcoder-ggml/starcoder-ggml-q4_1.bin'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
gpt2_model_load: memory size = 15360.00 MB, n_mem = 327680
gpt2_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 3997 83 1871 26 

def fibonnaci(n):

main: mem per token = 46788904 bytes
main:     load time = 15578.03 ms
main:   sample time =     0.78 ms
main:  predict time = 52459.57 ms / 6557.45 ms per token
main:    total time = 89458.72 ms
Performance for StarCoder on DGX (device with plenty of CPU RAM)
$ ./bin/starcoder -m /home/nouamane/projects/ggml/starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibonnaci("
main: seed = 1683824507
gpt2_model_load: loading model from '/home/nouamane/projects/ggml/starcoder-ggml/starcoder-ggml-q4_1.bin2'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
gpt2_model_load: memory size = 15360.00 MB, n_mem = 327680
gpt2_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 3997 83 1871 26 

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

main: mem per token = 46788904 bytes
main:     load time =  4577.47 ms
main:   sample time =     6.29 ms
main:  predict time = 20115.85 ms / 352.91 ms per token
main:    total time = 25859.86 ms

Result of quantizing the models:

Model                           Original size   Quantized size   Quantization type
bigcode/gpt_bigcode-santacoder  5396.45 MB      1026.83 MB       4-bit integer (q4_1)
bigcode/starcoder               71628.23 MB     13596.23 MB      4-bit integer (q4_1)

Next TODOs:

  • Use MQA instead of MHA to reduce memory requirements (see the KV-cache sketch below)
  • Fix the endoftext token for the SantaCoder model
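
A minimal back-of-the-envelope sketch (Python) of what the MQA change should buy, using the hyperparameters from the StarCoder load logs above. The MHA figure reproduces the logged memory size; the MQA figure assumes a single shared K/V head (the standard multi-query setup), so treat it as an estimate rather than a measured number:

# KV-cache size for StarCoder, hyperparameters taken from the load logs above
n_ctx, n_embd, n_layer, n_head = 8192, 6144, 40, 48
f32 = 4  # the example stores the KV cache as 32-bit floats

# MHA: one K and one V vector of n_embd floats per position, per layer
mha_mb = 2 * n_layer * n_ctx * n_embd * f32 / 1024**2
print(f"MHA KV cache: {mha_mb:.2f} MB")  # 15360.00 MB, matching the log

# MQA: K/V are shared across all heads, so the cache shrinks by n_head
mqa_mb = mha_mb / n_head
print(f"MQA KV cache: {mqa_mb:.2f} MB")  # 320.00 MB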

cc @ggerganov

@NouamaneTazi marked this pull request as ready for review May 12, 2023 09:08
@NouamaneTazi changed the title from "Add StarCoder example" to "Add StarCoder/SantaCoder example" May 12, 2023
@mparrett commented

Hi, thanks for your work on this. I'm also interested in getting this model working.

Were you able to get any code completions from this model in its current state?

@ggerganov (Owner) left a comment


Great stuff!

Looking forward to the MQA improvement

@ggerganov merged commit 1330f32 into ggerganov:master May 13, 2023
ggerganov added a commit that referenced this pull request May 13, 2023
@NouamaneTazi (Author) commented

Yes @mparrett, please check the collapsible sections.

For example using the prompt def fibonnaci(, the quantized model outputs:

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

@danforbes (Contributor) commented

I am having trouble running the example using the GGML file that I found here https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

build: ./bin/starcoder -m ~/.ggml-models/starcoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1683991335
gpt2_model_load: loading model from '/home/dan/.ggml-models/starcoder.bin'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
GGML_ASSERT: /home/dan/Code/ggml/src/ggml.c:4567: ctx->mem_buffer != NULL
Aborted (core dumped)

@NouamaneTazi (Author) commented

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

@danforbes (Contributor) commented

Ah, yes...it may indeed be a RAM thing. Is there a GGML format of that model floating around somewhere? A quick search on 🤗 didn't turn one up...

@danforbes (Contributor) commented

Also, another question while I have you 😅 Can you help me understand how the StarCoder example is different from the GPT-2 example? It seems the files are identical except for the way the memory is prepared for the weights?

@NouamaneTazi (Author) commented

I tried uploading the q4_1 quantized model of starcoder, you can find it here: https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

@danforbes (Contributor) commented

I tried uploading the q4_1 quantized model of starcoder, you can find it here: https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

Yes, this is the one I originally tested with. I guess I will take a stab at converting SantaCoder myself and seeing how that works.

@danforbes (Contributor) commented

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

Yes, that worked much better

build: ./bin/starcoder -m ~/.ggml-models/santacoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1684001711
gpt2_model_load: loading model from '/home/dan/.ggml-models/santacoder.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7 

def fibonacci(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.exports = {
  get: get,
  set: set,
  remove: remove,
  clear: clear,
  getKeys: getKeys,
  getKeysByPrefix: getKeysByPrefix,
  getKeysByPrefixAndSuffix: getKeysByPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffix: getKeysByPrefixAndSuffixAndPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefixAndSuffix: getKeysBy

main: mem per token =  9597928 bytes
main:     load time =   414.56 ms
main:   sample time =    64.71 ms
main:  predict time = 10515.31 ms / 51.05 ms per token
main:    total time = 11098.49 ms

@mparrett commented May 13, 2023

Yes @mparrett, please check the collapsible sections.

For example using the prompt def fibonnaci(, the quantized model outputs:

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

Interesting. I checked out your PR and converted the models myself, but could only get this result with both the original and quantized versions of santacoder.

./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml.bin.orig -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2

main: seed = 1684010783
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1
starcoder_model_load: ggml ctx size = 3475.52 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  2707.45 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7

def fibonacci(!

main: mem per token =  9603048 bytes
main:     load time =   812.26 ms
main:   sample time =     0.21 ms
main:  predict time =    95.32 ms / 13.62 ms per token
main:    total time =   998.52 ms

I pulled from main this morning (but didn't re-convert the models) and noticed the same behavior, after working through this fun error during the build. TL;DR: make sure to rm -rf build if you've upgraded the SDK since your last build.

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/wchar.h:123:15: fatal error: 'wchar.h' file not found
#include_next <wchar.h>

I also noticed that if I change top_k=1 (which makes sampling effectively greedy) and set n=64 (otherwise it goes on too long), I can get this output:

main: seed = 1684010982
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml.bin.orig'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1
starcoder_model_load: ggml ctx size = 3475.52 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  2707.45 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fiboonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.log(err

main: mem per token =  9603048 bytes
main:     load time =   794.18 ms
main:   sample time =     7.98 ms
main:  predict time =  3067.45 ms / 43.82 ms per token
main:    total time =  3949.93 ms

I also confirmed the quantized santacoder and quantized starcoder models work this way. I can't try the original starcoder because my RAM is limited (24GB).

./bin/starcoder -m ../models/bigcode/starcoder-ggml-q4_1.bin -p "def fibonacci(" --top_k 1 --top_p 0.95 --temp 0.2 -n 64

main: seed = 1684011161
starcoder_model_load: loading model from '../models/bigcode/starcoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 3
starcoder_model_load: ggml ctx size = 28956.35 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 46278 85 91 26

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))
<|endoftext|>

main: mem per token = 46788904 bytes
main:     load time = 10596.82 ms
main:   sample time =     7.34 ms
main:  predict time = 14553.64 ms / 234.74 ms per token
main:    total time = 27610.36 ms

Sharing all of this in case it's helpful to someone. Thanks again for your work!

@appvoid (Contributor) commented May 14, 2023

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

Yes, that worked much better

build: ./bin/starcoder -m ~/.ggml-models/santacoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1684001711
gpt2_model_load: loading model from '/home/dan/.ggml-models/santacoder.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7 

def fibonacci(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.exports = {
  get: get,
  set: set,
  remove: remove,
  clear: clear,
  getKeys: getKeys,
  getKeysByPrefix: getKeysByPrefix,
  getKeysByPrefixAndSuffix: getKeysByPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffix: getKeysByPrefixAndSuffixAndPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefixAndSuffix: getKeysBy

main: mem per token =  9597928 bytes
main:     load time =   414.56 ms
main:   sample time =    64.71 ms
main:  predict time = 10515.31 ms / 51.05 ms per token
main:    total time = 11098.49 ms

Can you please share that model on huggingface?

@danforbes (Contributor) commented

Can you please share that model on huggingface?

Here you go https://huggingface.co/danforbes/santacoder-ggml-q4_1/blob/main/santacoder-ggml-q4_1.bin

@kohlerm commented May 17, 2023

Can you please share that model on huggingface?

Here you go https://huggingface.co/danforbes/santacoder-ggml-q4_1/blob/main/santacoder-ggml-q4_1.bin

I tried to run this model but I only get:


./bin/starcoder -m santacoder-ggml-q4_1.bin  -p "def hello_world(" --top_k 1 --top_p 0.95 --temp 0.2
main: seed = 1684311617
starcoder_model_load: loading model from 'santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 3
starcoder_model_load: qntvr   = 0
starcoder_model_load: ggml ctx size = 1794.97 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  1026.83 MB
main: prompt: 'def hello_world('
main: number of tokens in prompt = 5, first 8 tokens: 563 16300 62 3881 7

def hello_world(verterverterverterverterverterverterverterverterverter

Any idea what is going wrong?

@bluecoconut commented

@kohlerm There was a breaking change to the quantization formats -- I also ran into this problem after pulling master of ggml. Specifically, you'll have to re-quantize the weights using the new codebase, or check out a commit from before #154 to use the older weights you have.
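
(For reference, re-quantizing amounts to re-running the two quickstart steps from the PR description against the current codebase, e.g. for SantaCoder:)

python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3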

@x4080 commented May 19, 2023

@danforbes do you have the new ggml model uploaded?

Can my M2 Pro 16GB run StarCoder? Or is the memory not enough?

Thanks

@NouamaneTazi (Author) commented May 20, 2023

Can my M2 Pro 16GB run StarCoder? Or is the memory not enough?

It should still run even though it exceeds your 16GB of RAM, by using swap memory, but it will be extremely slow (like the "Performance for StarCoder on M1 Pro" case in the PR description) @x4080

@x4080 commented May 20, 2023

@NouamaneTazi Thanks
