
feat: support StarCoder model architectures #3187

Merged
ggerganov merged 30 commits into ggerganov:master from TabbyML:support-starcoder on Sep 15, 2023

Conversation

@wsxiaoys (Contributor) commented Sep 15, 2023

#3076

Still in progress, but the model conversion / parameter loading part seems to be working.

Tabby has integrated llama.cpp and released v0.1.1 🎉. It now offers native support for Metal inference and the StarCoder model!

@ggerganov added the model (Model specific) label on Sep 15, 2023
@ggerganov (Owner)

Looks good so far - let us know if you hit any roadblocks

@wsxiaoys (Contributor, Author)

> Looks good so far - let us know if you hit any roadblocks

The remaining work is the graph build, roughly lines 3580 to 3718 in llama.cpp. It shouldn't be too hard to finish once I have a development environment set up to make sure the matrix shape arithmetic is correct...
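
For reference, the shape bookkeeping boils down to something like the sketch below (plain C++; the n_embd / n_head values are assumptions for a ~1B model, not numbers read from the GGUF file):

```cpp
// Rough shape arithmetic for a StarCoder-style attention block with
// multi-query attention (a single shared K/V head). All sizes below are
// assumed values for illustration, not read from the checkpoint.
#include <cassert>
#include <cstdint>

int main() {
    const int64_t n_embd    = 2048;            // model width (assumed)
    const int64_t n_head    = 16;              // query heads (assumed)
    const int64_t n_head_kv = 1;               // MQA: one K/V head
    const int64_t head_dim  = n_embd / n_head; // 128
    const int64_t n_ctx     = 512;             // context length used in the runs below

    // Fused QKV projection: n_embd columns for Q, plus head_dim for K and V each.
    const int64_t qkv_cols = n_embd + 2 * head_dim * n_head_kv;
    assert(qkv_cols == 2048 + 256);

    // KV cache per layer: with MQA only one K/V head is stored, so the cache is
    // n_head times smaller than a full multi-head cache of the same width.
    const int64_t kv_mha = 2 * n_ctx * n_embd;
    const int64_t kv_mqa = 2 * n_ctx * head_dim * n_head_kv;
    assert(kv_mqa * n_head == kv_mha);

    return 0;
}
```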

@wsxiaoys (Contributor, Author) commented Sep 15, 2023

OK, I think I got a version running on the CPU:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 0

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:


def dijkstra(graph, start):
    """
    Returns the shortest path from `start` to all other nodes in `graph`.

    The graph is represented as a dictionary of dictionaries. Each key represents a node and each value is another dictionary with keys 'to' and 'cost'.
    """
    # Initialize the distances array to infinity
    distances = [float('inf') for _ in range(len(graph))]
    distances[start] = 0

    # Initialize the previous array to None
    previous = [None for _ in range(len(graph))]

    # Loop through all nodes and find the shortest path
llama_print_timings:        load time =   110.20 ms
llama_print_timings:      sample time =   134.80 ms /   128 runs   (    1.05 ms per token,   949.55 tokens per second)
llama_print_timings: prompt eval time =   262.29 ms /    20 tokens (   13.11 ms per token,    76.25 tokens per second)
llama_print_timings:        eval time =  3485.94 ms /   127 runs   (   27.45 ms per token,    36.43 tokens per second)
llama_print_timings:       total time =  3914.92 ms

But it's currently buggy on Metal:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1


system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:

<|endoftext|> [end of text]

llama_print_timings:        load time =   232.01 ms
llama_print_timings:      sample time =     1.26 ms /     1 runs   (    1.26 ms per token,   791.14 tokens per second)
llama_print_timings: prompt eval time =    21.64 ms /    20 tokens (    1.08 ms per token,   924.17 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    23.29 ms
ggml_metal_free: deallocating
Log end

Looking into it...


Edit: the Metal backend seems to output NaN for the entire logits array.
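
A minimal sketch of the check behind that observation (assuming a pointer to the logits buffer, e.g. the one returned by llama_get_logits(), and the vocab size; not the exact code used here):

```cpp
// Scan the logits for the last evaluated token and count non-finite values.
// Assumes `logits` points at n_vocab floats.
#include <cmath>
#include <cstdio>

static int count_non_finite(const float * logits, int n_vocab) {
    int bad = 0;
    for (int i = 0; i < n_vocab; ++i) {
        if (!std::isfinite(logits[i])) {
            ++bad;
        }
    }
    return bad;
}

// Hypothetical usage after an eval call:
//   const int bad = count_non_finite(llama_get_logits(ctx), n_vocab);
//   if (bad > 0) {
//       fprintf(stderr, "%d / %d logits are NaN/inf\n", bad, n_vocab);
//   }
```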

@ggerganov (Owner) commented Sep 15, 2023

@wsxiaoys There was a bug in the soft max Metal kernel. Can you give me access to push a fix?

$ git push tabbyml HEAD:support-starcoder
remote: Permission to TabbyML/llama.cpp.git denied to ggerganov.
fatal: unable to access 'https://github.com/TabbyML/llama.cpp/': The requested URL returned error: 403

Or I can push it to a branch in this repo? Either way works for me.

Edit: created a PR here - TabbyML#2
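
(Aside: per the squashed commit list below, the fix addressed an out-of-bounds access in the Metal soft_max kernels. When chasing this kind of kernel bug, the usual cross-check is a scalar CPU softmax like the sketch below — illustrative only, not the actual patch in TabbyML#2.)

```cpp
// Numerically stable scalar softmax over one row, for diffing a GPU kernel's
// output against a CPU reference. Not the change made in TabbyML#2.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> softmax_ref(const std::vector<float> & x) {
    // Subtract the row max before exponentiating to avoid overflow.
    const float max_val = *std::max_element(x.begin(), x.end());

    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - max_val);
        sum += y[i];
    }
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```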

@wsxiaoys (Contributor, Author) commented Sep 15, 2023

Thanks for the fix! I'll clean up the implementation a bit and then send it out for review. Here are some benchmark numbers:

llama_print_timings:        load time =   114.00 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   107.79 ms /    22 tokens (    4.90 ms per token,   204.11 tokens per second)
llama_print_timings:        eval time =  1315.10 ms /   127 runs   (   10.36 ms per token,    96.57 tokens per second)
llama_print_timings:       total time =  1427.08 ms

@wsxiaoys marked this pull request as ready for review on September 15, 2023 16:11
@wsxiaoys (Contributor, Author) commented Sep 15, 2023

Follow-up PRs:

@wsxiaoys (Contributor, Author)

PR is ready for review now 🥇

Review threads on llama.cpp (outdated, resolved).

@ggerganov (Owner) commented in review:

@monatis Do we need to bump gguf.py version after this change?
@ggerganov merged commit 4fe09df into ggerganov:master on Sep 15, 2023
30 of 33 checks passed
@wsxiaoys deleted the support-starcoder branch on September 16, 2023 01:04
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
* add placeholder of starcoder in gguf / llama.cpp

* support convert starcoder weights to gguf

* convert MQA to MHA

* fix ffn_down name

* add LLM_ARCH_STARCODER to llama.cpp

* set head_count_kv = 1

* load starcoder weight

* add max_position_embeddings

* set n_positions to max_position_embeddings

* properly load all starcoder params

* fix head count kv

* fix comments

* fix vram calculation for starcoder

* store mqa directly

* add input embeddings handling

* add TBD

* working in cpu, metal buggy

* cleanup useless code

* metal : fix out-of-bounds access in soft_max kernels

* llama : make starcoder graph build more consistent with others

* refactor: cleanup comments a bit

* add other starcoder models: 3B, 7B, 15B

* support-mqa-directly

* fix: remove max_position_embeddings, use n_train_ctx

* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix: switch to space from tab

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Labels: model (Model specific)
Projects: None yet
Linked issues: None yet
Participants: 3