Running inference with the gpt2-13G-fp16 model:

```
./gpt-2 -m ./ggml-model-f16.bin -n 128
```
```
main: seed = 1685426504
gpt2_model_load: loading model from './ggml-model-f16.bin'
gpt2_model_load: n_vocab = 50432
gpt2_model_load: n_ctx   = 4096
gpt2_model_load: n_embd  = 5120
gpt2_model_load: n_head  = 40
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 1
gpt2_model_load: qntvr   = 0
gpt2_model_load: ggml tensor size = 224 bytes
gpt2_model_load: ggml ctx size = 31475.43 MB
gpt2_model_load: memory size = 6400.00 MB, n_mem = 163840
gpt2_model_load: model size = 25075.20 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'He'
main: number of tokens in prompt = 1, first 8 tokens: 3876
main: mem per token = 39071352 bytes
main:     load time = 11362.40 ms
main:   sample time =    24.45 ms
main:  predict time = 107219.80 ms / 837.65 ms per token
main:    total time = 119965.73 ms
```
By comparison, the llama-7b-fp16 inference speed:

```
./main -m ./hf_llama/7b/ggml-model-f16.bin -n 128
```
```
main: build = 583 (7e4ea5b)
main: seed  = 1685426839
llama.cpp: loading model from ./hf_llama/7b/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 256.00 MB
llama_print_timings:        load time =  8700.02 ms
llama_print_timings:      sample time =    73.76 ms /   128 runs   (   0.58 ms per token)
llama_print_timings: prompt eval time =   201.22 ms /     2 tokens ( 100.61 ms per token)
llama_print_timings:        eval time = 22030.23 ms /   127 runs   ( 173.47 ms per token)
llama_print_timings:       total time = 30836.57 ms
```
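For context, here is a quick back-of-the-envelope comparison of the two runs. This is only a sketch: the numbers are copied from the logs above, the "effective read rate" column assumes single-batch fp16 inference is memory-bandwidth bound (every weight read once per generated token), and it uses the logged `model size` / `mem required` figures as a rough proxy for the bytes touched per token:

```python
# Rough arithmetic on the per-token eval times from the two logs above.
# Assumption: generation is memory-bandwidth bound, so
#   effective read rate ~= (weight bytes read per token) / (seconds per token)
# The sizes below are the logged "model size" (gpt-2) and "mem required"
# (llama.cpp), used only as rough proxies for bytes read per token.

runs = {
    # name: (size in MB from the log, eval ms per token from the log)
    "gpt-2 13B f16 (ggml example)": (25075.20, 837.65),
    "llama 7B f16 (llama.cpp)":     (14645.09, 173.47),
}

for name, (size_mb, ms_per_tok) in runs.items():
    tok_per_s = 1000.0 / ms_per_tok
    eff_bw_gb_s = (size_mb / 1024.0) / (ms_per_tok / 1000.0)
    print(f"{name}: {tok_per_s:.2f} tok/s, ~{eff_bw_gb_s:.1f} GB/s effective read rate")

# gpt-2 13B f16 (ggml example): 1.19 tok/s, ~29.2 GB/s effective read rate
# llama 7B f16 (llama.cpp):     5.76 tok/s, ~82.4 GB/s effective read rate
```

By that rough estimate, the gpt-2 example is moving weights at about a third of the rate the llama.cpp run sustains on the same machine, so the gap looks larger than the 13B-vs-7B size difference alone would explain.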
Is the inference speed of these two models normal?