
feat: support StarCoder model architectures #3187

Merged
ggerganov merged 30 commits into ggerganov:master from TabbyML:support-starcoder on Sep 15, 2023

Conversation

@wsxiaoys (Contributor) commented Sep 15, 2023

#3076

Still in progress, but the model conversion / parameter loading part seems to be working.

Tabby has integrated llama.cpp and released v0.1.1 🎉. It now offers native support for Metal inference and the StarCoder model!

@ggerganov added the model (Model specific) label on Sep 15, 2023
@ggerganov (Owner)

Looks good so far - let us know if you hit any roadblocks

@wsxiaoys (Contributor, Author)

> Looks good so far - let us know if you hit any roadblocks

The remaining work is the graph build, roughly lines 3580 to 3718 in llama.cpp. It shouldn't be too hard to finish once I have a development environment set up to make sure the matrix shape arithmetic is correct...
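
For reference, the shape bookkeeping boils down to something like the sketch below (plain C++; the n_embd / n_head values are assumptions for a ~1B model, not numbers read from the GGUF file):

```cpp
// Rough shape arithmetic for a StarCoder-style attention block with
// multi-query attention (a single shared K/V head). All sizes below are
// assumed values for illustration, not read from the checkpoint.
#include <cassert>
#include <cstdint>

int main() {
    const int64_t n_embd    = 2048;            // model width (assumed)
    const int64_t n_head    = 16;              // query heads (assumed)
    const int64_t n_head_kv = 1;               // MQA: one K/V head
    const int64_t head_dim  = n_embd / n_head; // 128
    const int64_t n_ctx     = 512;             // context length used in the runs below

    // Fused QKV projection: n_embd columns for Q, plus head_dim for K and V each.
    const int64_t qkv_cols = n_embd + 2 * head_dim * n_head_kv;
    assert(qkv_cols == 2048 + 256);

    // KV cache per layer: with MQA only one K/V head is stored, so the cache is
    // n_head times smaller than a full multi-head cache of the same width.
    const int64_t kv_mha = 2 * n_ctx * n_embd;
    const int64_t kv_mqa = 2 * n_ctx * head_dim * n_head_kv;
    assert(kv_mqa * n_head == kv_mha);

    return 0;
}
```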

@wsxiaoys (Contributor, Author) commented Sep 15, 2023

OK, I think I got a version running on the CPU:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 0

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:


def dijkstra(graph, start):
    """
    Returns the shortest path from `start` to all other nodes in `graph`.

    The graph is represented as a dictionary of dictionaries. Each key represents a node and each value is another dictionary with keys 'to' and 'cost'.
    """
    # Initialize the distances array to infinity
    distances = [float('inf') for _ in range(len(graph))]
    distances[start] = 0

    # Initialize the previous array to None
    previous = [None for _ in range(len(graph))]

    # Loop through all nodes and find the shortest path
llama_print_timings:        load time =   110.20 ms
llama_print_timings:      sample time =   134.80 ms /   128 runs   (    1.05 ms per token,   949.55 tokens per second)
llama_print_timings: prompt eval time =   262.29 ms /    20 tokens (   13.11 ms per token,    76.25 tokens per second)
llama_print_timings:        eval time =  3485.94 ms /   127 runs   (   27.45 ms per token,    36.43 tokens per second)
llama_print_timings:       total time =  3914.92 ms

But it's currently buggy on Metal:

> make main && ./bin/main -m ../models/starcoder-1b.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 1


system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = -1.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:

<|endoftext|> [end of text]

llama_print_timings:        load time =   232.01 ms
llama_print_timings:      sample time =     1.26 ms /     1 runs   (    1.26 ms per token,   791.14 tokens per second)
llama_print_timings: prompt eval time =    21.64 ms /    20 tokens (    1.08 ms per token,   924.17 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    23.29 ms
ggml_metal_free: deallocating
Log end

Looking into it...


Edit: the Metal backend seems to output NaN for the entire logits array.
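
A minimal sketch of the check behind that observation (assuming a pointer to the logits buffer, e.g. the one returned by llama_get_logits(), and the vocab size; not the exact code used here):

```cpp
// Scan the logits for the last evaluated token and count non-finite values.
// Assumes `logits` points at n_vocab floats.
#include <cmath>
#include <cstdio>

static int count_non_finite(const float * logits, int n_vocab) {
    int bad = 0;
    for (int i = 0; i < n_vocab; ++i) {
        if (!std::isfinite(logits[i])) {
            ++bad;
        }
    }
    return bad;
}

// Hypothetical usage after an eval call:
//   const int bad = count_non_finite(llama_get_logits(ctx), n_vocab);
//   if (bad > 0) {
//       fprintf(stderr, "%d / %d logits are NaN/inf\n", bad, n_vocab);
//   }
```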

@ggerganov (Owner) commented Sep 15, 2023

@wsxiaoys There was a bug in the soft max Metal kernel. Can you give me access to push a fix?

$ git push tabbyml HEAD:support-starcoder
remote: Permission to TabbyML/llama.cpp.git denied to ggerganov.
fatal: unable to access 'https://github.com/TabbyML/llama.cpp/': The requested URL returned error: 403

Or I can push it to a branch in this repo? Either way works for me.

Edit: created a PR here - TabbyML#2
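
(Aside: per the squashed commit list below, the fix addressed an out-of-bounds access in the Metal soft_max kernels. When chasing this kind of kernel bug, the usual cross-check is a scalar CPU softmax like the sketch below — illustrative only, not the actual patch in TabbyML#2.)

```cpp
// Numerically stable scalar softmax over one row, for diffing a GPU kernel's
// output against a CPU reference. Not the change made in TabbyML#2.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> softmax_ref(const std::vector<float> & x) {
    // Subtract the row max before exponentiating to avoid overflow.
    const float max_val = *std::max_element(x.begin(), x.end());

    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - max_val);
        sum += y[i];
    }
    for (float & v : y) {
        v /= sum;
    }
    return y;
}
```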

@wsxiaoys (Contributor, Author) commented Sep 15, 2023

Thanks for the fix! I'll clean up the implementation a bit and then send it out for review. Here are some benchmark numbers:

llama_print_timings:        load time =   114.00 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   107.79 ms /    22 tokens (    4.90 ms per token,   204.11 tokens per second)
llama_print_timings:        eval time =  1315.10 ms /   127 runs   (   10.36 ms per token,    96.57 tokens per second)
llama_print_timings:       total time =  1427.08 ms

@wsxiaoys marked this pull request as ready for review on September 15, 2023 16:11
@wsxiaoys (Contributor, Author) commented Sep 15, 2023

Follow-up PRs:

@wsxiaoys (Contributor, Author)

PR is ready for review now 🥇

Review threads on llama.cpp (outdated, resolved).

@ggerganov (Owner) commented in review:

@monatis Do we need to bump gguf.py version after this change?
@ggerganov merged commit 4fe09df into ggerganov:master on Sep 15, 2023
30 of 33 checks passed
@wsxiaoys deleted the support-starcoder branch on September 16, 2023 01:04
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
* add placeholder of starcoder in gguf / llama.cpp

* support convert starcoder weights to gguf

* convert MQA to MHA

* fix ffn_down name

* add LLM_ARCH_STARCODER to llama.cpp

* set head_count_kv = 1

* load starcoder weight

* add max_position_embeddings

* set n_positions to max_position_embeddings

* properly load all starcoder params

* fix head count kv

* fix comments

* fix vram calculation for starcoder

* store mqa directly

* add input embeddings handling

* add TBD

* working in cpu, metal buggy

* cleanup useless code

* metal : fix out-of-bounds access in soft_max kernels

* llama : make starcoder graph build more consistent with others

* refactor: cleanup comments a bit

* add other starcoder models: 3B, 7B, 15B

* support-mqa-directly

* fix: remove max_position_embeddings, use n_train_ctx

* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix: switch to space from tab

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Labels: model (Model specific)
Projects: None yet
Linked issues: None yet
Participants: 3