Save the model in bfloat16 in Python -> run convert-h5-to-ggml passing 1 as the target type -> run the ggml model: 983.91 ms per token.
Quantise that f16 artifact to q8_0: 21 ms per token.
vs
Save the model in float32 -> run convert-h5-to-ggml passing 1 as the target type -> run the ggml model: 41.68 ms per token.
Tested on an MPT model. So it is solvable by saving to float32 in torch, but why?
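The float32 workaround above can be sketched as a small pre-conversion step. This is a minimal sketch, not the project's official tooling; the function name and paths are placeholders. It loads a checkpoint, widens every floating-point tensor (including bfloat16) to float32, and re-saves it so convert-h5-to-ggml sees float32 weights:

```python
import torch

def save_as_float32(src_path: str, dst_path: str) -> None:
    """Re-save a torch state dict with all float tensors cast to float32."""
    state = torch.load(src_path, map_location="cpu")
    state = {
        # Cast bf16/f16/f32 tensors to float32; leave int tensors alone.
        k: v.float() if v.is_floating_point() else v
        for k, v in state.items()
    }
    torch.save(state, dst_path)
```

Run this on the checkpoint before invoking the converter; the widening cast itself does not change any weight values.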
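For context on why the extra save step is harmless: bfloat16 is just a float32 with the low 16 mantissa bits dropped (same sign bit, same 8 exponent bits), so widening bf16 to float32 is lossless. A stdlib-only sketch of the bit relationship:

```python
import struct

def bf16_to_f32(bits: int) -> float:
    # bfloat16 keeps the sign, all 8 exponent bits and the top 7 mantissa
    # bits of a float32, so widening is just a 16-bit left shift.
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

def f32_to_bf16(x: float) -> int:
    # Truncating narrow (real frameworks use round-to-nearest-even).
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16
```

Values exactly representable in bfloat16 round-trip unchanged, which is why casting the checkpoint to float32 before conversion changes nothing numerically, only which code path the converter takes.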
Sounds like magic - yet:

> tested on mpt model. So it is solvable by saving to float32 in torch, but why?