
Add StarCoder/SantaCoder example #146

Merged: 14 commits merged into ggerganov:master on May 13, 2023

Conversation

@NouamaneTazi (Contributor) commented May 11, 2023

Adds support for StarCoder and SantaCoder (aka smol StarCoder).

Quickstart:

# Convert HF model to ggml
python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
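# (this should write models/bigcode/gpt_bigcode-santacoder-ggml.bin, which the quantize step below reads)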

mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize

# quantize the model
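# (the trailing 3 selects 4-bit q4_1; it shows up as ftype = 3 in the load logs below)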
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

# run inference
./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2
Performance for Santacoder on M1 Pro:
$ ./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 0 --top_p 0.95 --temp 0.2      
main: seed = 1683881276
gpt2_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7 

def fibonnaci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibo(10))
main: mem per token =  9597928 bytes
main:     load time =   480.43 ms
main:   sample time =    26.21 ms
main:  predict time =  3987.95 ms / 19.36 ms per token
main:    total time =  4580.56 ms
Performance for StarCoder on M1 Pro:

Pretty slow, as it requires 30GB of RAM whilst my laptop only has 16GB (the memory requirement could still be reduced by using MQA instead of MHA; that could be done in a follow-up PR)

$ ./bin/starcoder -m ../starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 1 -n 2
main: seed = 1683838878
gpt2_model_load: loading model from '../starcoder-ggml/starcoder-ggml-q4_1.bin'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
gpt2_model_load: memory size = 15360.00 MB, n_mem = 327680
gpt2_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 3997 83 1871 26 

def fibonnaci(n):

main: mem per token = 46788904 bytes
main:     load time = 15578.03 ms
main:   sample time =     0.78 ms
main:  predict time = 52459.57 ms / 6557.45 ms per token
main:    total time = 89458.72 ms
Performance for StarCoder on DGX (device with plenty of CPU RAM)
$ ./bin/starcoder -m /home/nouamane/projects/ggml/starcoder-ggml/starcoder-ggml-q4_1.bin -p "def fibonnaci("
main: seed = 1683824507
gpt2_model_load: loading model from '/home/nouamane/projects/ggml/starcoder-ggml/starcoder-ggml-q4_1.bin2'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
gpt2_model_load: memory size = 15360.00 MB, n_mem = 327680
gpt2_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 3997 83 1871 26 

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

main: mem per token = 46788904 bytes
main:     load time =  4577.47 ms
main:   sample time =     6.29 ms
main:  predict time = 20115.85 ms / 352.91 ms per token
main:    total time = 25859.86 ms

Result of quantizing the models:

Model                           Original size   Quantized size   Quantization type
bigcode/gpt_bigcode-santacoder  5396.45 MB      1026.83 MB       4-bit integer (q4_1)
bigcode/starcoder               71628.23 MB     13596.23 MB      4-bit integer (q4_1)

Next TODOs:

  • Use MQA instead of MHA to reduce memory requirements (see the KV-cache sketch below)
  • Fix the endoftext token for the SantaCoder model
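
A minimal back-of-the-envelope sketch (Python) of what the MQA change should buy, using the hyperparameters from the StarCoder load logs above. The MHA figure reproduces the logged memory size; the MQA figure assumes a single shared K/V head (the standard multi-query setup), so treat it as an estimate rather than a measured number:

# KV-cache size for StarCoder, hyperparameters taken from the load logs above
n_ctx, n_embd, n_layer, n_head = 8192, 6144, 40, 48
f32 = 4  # the example stores the KV cache as 32-bit floats

# MHA: one K and one V vector of n_embd floats per position, per layer
mha_mb = 2 * n_layer * n_ctx * n_embd * f32 / 1024**2
print(f"MHA KV cache: {mha_mb:.2f} MB")  # 15360.00 MB, matching the log

# MQA: K/V are shared across all heads, so the cache shrinks by n_head
mqa_mb = mha_mb / n_head
print(f"MQA KV cache: {mqa_mb:.2f} MB")  # 320.00 MB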

cc @ggerganov

@NouamaneTazi marked this pull request as ready for review May 12, 2023 09:08
@NouamaneTazi changed the title from "Add StarCoder example" to "Add StarCoder/SantaCoder example" May 12, 2023
@mparrett commented

Hi, thanks for your work on this. I'm also interested in getting this model working.

Were you able to get any code completions from this model in its current state?

@ggerganov (Owner) left a comment


Great stuff!

Looking forward to the MQA improvement

@ggerganov merged commit 1330f32 into ggerganov:master May 13, 2023
ggerganov added a commit that referenced this pull request May 13, 2023
@NouamaneTazi (Author) commented

Yes @mparrett, please check the collapsible sections.

For example using the prompt def fibonnaci(, the quantized model outputs:

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

@danforbes (Contributor) commented

I am having trouble running the example using the GGML file that I found here https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

build: ./bin/starcoder -m ~/.ggml-models/starcoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1683991335
gpt2_model_load: loading model from '/home/dan/.ggml-models/starcoder.bin'
gpt2_model_load: n_vocab = 49152
gpt2_model_load: n_ctx   = 8192
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 28956.35 MB
GGML_ASSERT: /home/dan/Code/ggml/src/ggml.c:4567: ctx->mem_buffer != NULL
Aborted (core dumped)

@NouamaneTazi (Author) commented

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

@danforbes (Contributor) commented

Ah, yes...it may indeed be a RAM thing. Is there a GGML format of that model floating around somewhere? A quick search on 🤗 didn't turn one up...

@danforbes (Contributor) commented

Also, another question while I have you 😅 Can you help me understand how the StarCoder example is different from the GPT-2 example? It seems the files are identical except for the way the memory is prepared for the weights?

@NouamaneTazi (Author) commented

I tried uploading the q4_1 quantized model of starcoder, you can find it here: https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

@danforbes (Contributor) commented

I tried uploading the q4_1 quantized model of starcoder, you can find it here: https://huggingface.co/nouamanetazi/starcoder-ggml/tree/main

Yes, this is the one I originally tested with. I guess I will take a stab at converting SantaCoder myself and seeing how that works.

@danforbes (Contributor) commented

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

Yes, that worked much better

build: ./bin/starcoder -m ~/.ggml-models/santacoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1684001711
gpt2_model_load: loading model from '/home/dan/.ggml-models/santacoder.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7 

def fibonacci(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.exports = {
  get: get,
  set: set,
  remove: remove,
  clear: clear,
  getKeys: getKeys,
  getKeysByPrefix: getKeysByPrefix,
  getKeysByPrefixAndSuffix: getKeysByPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffix: getKeysByPrefixAndSuffixAndPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefixAndSuffix: getKeysBy

main: mem per token =  9597928 bytes
main:     load time =   414.56 ms
main:   sample time =    64.71 ms
main:  predict time = 10515.31 ms / 51.05 ms per token
main:    total time = 11098.49 ms

@mparrett commented May 13, 2023

Yes @mparrett, please check the collapsible sections.

For example using the prompt def fibonnaci(, the quantized model outputs:

def fibonnaci(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(0))<|endoftext|>

Interesting. I checked out your PR and converted the models myself, but could only get this result with both the original and quantized versions of santacoder.

./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml.bin.orig -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2

main: seed = 1684010783
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1
starcoder_model_load: ggml ctx size = 3475.52 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  2707.45 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7

def fibonacci(!

main: mem per token =  9603048 bytes
main:     load time =   812.26 ms
main:   sample time =     0.21 ms
main:  predict time =    95.32 ms / 13.62 ms per token
main:    total time =   998.52 ms

I pulled from main this morning (but didn't re-convert the models) and noticed the same behavior, after working through this fun error during the build. TL;DR: make sure to rm -rf build if you've upgraded the SDK since your last build.

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/wchar.h:123:15: fatal error: 'wchar.h' file not found
#include_next <wchar.h>

I also noticed that if I change top_k=1 (which makes sampling effectively greedy) and set n=64 (otherwise it goes on too long), I can get this output:

main: seed = 1684010982
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml.bin.orig'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 1
starcoder_model_load: ggml ctx size = 3475.52 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  2707.45 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fiboonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.log(err

main: mem per token =  9603048 bytes
main:     load time =   794.18 ms
main:   sample time =     7.98 ms
main:  predict time =  3067.45 ms / 43.82 ms per token
main:    total time =  3949.93 ms

I also confirmed the quantized santacoder and quantized starcoder models work this way. I can't try the original starcoder because my RAM is limited (24GB).

./bin/starcoder -m ../models/bigcode/starcoder-ggml-q4_1.bin -p "def fibonacci(" --top_k 1 --top_p 0.95 --temp 0.2 -n 64

main: seed = 1684011161
starcoder_model_load: loading model from '../models/bigcode/starcoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49152
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 3
starcoder_model_load: ggml ctx size = 28956.35 MB
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.23 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 589 28176 97 46278 85 91 26

def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))
<|endoftext|>

main: mem per token = 46788904 bytes
main:     load time = 10596.82 ms
main:   sample time =     7.34 ms
main:  predict time = 14553.64 ms / 234.74 ms per token
main:    total time = 27610.36 ms

Sharing all of this in case it's helpful to someone. Thanks again for your work!

@appvoid (Contributor) commented May 14, 2023

How much RAM do you have @danforbes? Can you try with bigcode/gpt_bigcode-santacoder before starcoder?

Yes, that worked much better

build: ./bin/starcoder -m ~/.ggml-models/santacoder.bin -p "def fibonacci(" --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1684001711
gpt2_model_load: loading model from '/home/dan/.ggml-models/santacoder.bin'
gpt2_model_load: n_vocab = 49280
gpt2_model_load: n_ctx   = 2048
gpt2_model_load: n_embd  = 2048
gpt2_model_load: n_head  = 16
gpt2_model_load: n_layer = 24
gpt2_model_load: ftype   = 3
gpt2_model_load: ggml ctx size = 1794.90 MB
gpt2_model_load: memory size =   768.00 MB, n_mem = 49152
gpt2_model_load: model size  =  1026.83 MB
main: prompt: 'def fibonacci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 4357 66 2819 7 

def fibonacci(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(10))
<|endoftext|><fim-prefix><fim-suffix>e.exports = {
  get: get,
  set: set,
  remove: remove,
  clear: clear,
  getKeys: getKeys,
  getKeysByPrefix: getKeysByPrefix,
  getKeysByPrefixAndSuffix: getKeysByPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffix: getKeysByPrefixAndSuffixAndPrefixAndSuffix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix: getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefix,
  getKeysByPrefixAndSuffixAndPrefixAndSuffixAndPrefixAndSuffix: getKeysBy

main: mem per token =  9597928 bytes
main:     load time =   414.56 ms
main:   sample time =    64.71 ms
main:  predict time = 10515.31 ms / 51.05 ms per token
main:    total time = 11098.49 ms

Can you please share that model on huggingface?

@danforbes (Contributor) commented

Can you please share that model on huggingface?

Here you go https://huggingface.co/danforbes/santacoder-ggml-q4_1/blob/main/santacoder-ggml-q4_1.bin

@kohlerm commented May 17, 2023

Can you please share that model on huggingface?

Here you go https://huggingface.co/danforbes/santacoder-ggml-q4_1/blob/main/santacoder-ggml-q4_1.bin

I tried to run this model but I only get:


./bin/starcoder -m santacoder-ggml-q4_1.bin  -p "def hello_world(" --top_k 1 --top_p 0.95 --temp 0.2
main: seed = 1684311617
starcoder_model_load: loading model from 'santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 3
starcoder_model_load: qntvr   = 0
starcoder_model_load: ggml ctx size = 1794.97 MB
starcoder_model_load: memory size =   768.00 MB, n_mem = 49152
starcoder_model_load: model size  =  1026.83 MB
main: prompt: 'def hello_world('
main: number of tokens in prompt = 5, first 8 tokens: 563 16300 62 3881 7

def hello_world(verterverterverterverterverterverterverterverterverter

Any idea what is going wrong?

@bluecoconut commented

@kohlerm There was a breaking change to the quantization formats -- I also ran into this problem after pulling master of ggml. Specifically, you'll have to re-quantize the weights using the new codebase, or check out a commit from before #154 to use the older weights you have.
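
(For reference, re-quantizing amounts to re-running the two quickstart steps from the PR description against the current codebase, e.g. for SantaCoder:)

python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3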

@x4080 commented May 19, 2023

@danforbes do you have the new ggml model uploaded?

Can my M2 Pro 16GB run StarCoder? Or is the memory not enough?

Thanks

@NouamaneTazi (Author) commented May 20, 2023

Can my M2 Pro 16GB run StarCoder? Or is the memory not enough?

It should still run even though it exceeds your 16GB of RAM, by using swap memory, but it will be extremely slow (like the "Performance for StarCoder on M1 Pro" case in the PR description) @x4080

@x4080 commented May 20, 2023

@NouamaneTazi Thanks
