infer gpt2-13G-fp16 837.65 ms per token #215

Closed · vicwer opened this issue May 30, 2023 · 0 comments

vicwer commented May 30, 2023

Inference with the gpt2-13G-fp16 model: ./gpt-2 -m ./ggml-model-f16.bin -n 128

main: seed = 1685426504
gpt2_model_load: loading model from './ggml-model-f16.bin'
gpt2_model_load: n_vocab = 50432
gpt2_model_load: n_ctx   = 4096
gpt2_model_load: n_embd  = 5120
gpt2_model_load: n_head  = 40
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size = 224 bytes
gpt2_model_load: ggml ctx size = 31475.43 MB
gpt2_model_load: memory size =  6400.00 MB, n_mem = 163840
gpt2_model_load: model size  = 25075.20 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'He'
main: number of tokens in prompt = 1, first 8 tokens: 3876

main: mem per token = 39071352 bytes
main:     load time = 11362.40 ms
main:   sample time =    24.45 ms
main:  predict time = 107219.80 ms / 837.65 ms per token
main:    total time = 119965.73 ms

For comparison, llama-7b-fp16 inference speed: ./main -m ./hf_llama/7b/ggml-model-f16.bin -n 128

main: build = 583 (7e4ea5b)
main: seed  = 1685426839
llama.cpp: loading model from ./hf_llama/7b/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)

llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =  8700.02 ms
llama_print_timings:      sample time =    73.76 ms /   128 runs   (    0.58 ms per token)
llama_print_timings: prompt eval time =   201.22 ms /     2 tokens (  100.61 ms per token)
llama_print_timings:        eval time = 22030.23 ms /   127 runs   (  173.47 ms per token)
llama_print_timings:       total time = 30836.57 ms

Is the inference speed of these two models normal?
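
A rough way to sanity-check this: single-token generation for a dense fp16 model is usually memory-bandwidth bound, so ms per token should scale roughly with the weight bytes streamed per token. The Python sketch below is only a back-of-envelope estimate under that assumption; it uses llama.cpp's reported "mem required" (14645.09 MB) as a stand-in for the fp16 weight size, which slightly overstates it.

# If generation is memory-bandwidth bound, both runs on the same machine
# should imply a similar effective bandwidth:
#   bandwidth ~ (weight bytes streamed per token) / (time per token)

def implied_bandwidth_gb_s(size_mb: float, ms_per_token: float) -> float:
    return (size_mb / 1024.0) / (ms_per_token / 1000.0)

# Numbers copied from the logs above.
print(f"gpt2-13G-fp16: {implied_bandwidth_gb_s(25075.20, 837.65):.1f} GB/s")   # ~29.2
print(f"llama-7b-fp16: {implied_bandwidth_gb_s(14645.09, 173.47):.1f} GB/s")   # ~82.4

The large gap between the two implied figures (roughly 29 GB/s vs. 82 GB/s) suggests the 13B run is not purely weight-streaming bound, e.g. because of the different code paths (the gpt-2 example vs. llama.cpp's main) and the much larger context (n_ctx = 4096 vs. 512), rather than the model size alone.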

vicwer closed this as completed Jun 8, 2023