NaN value for truthfulqa_mc2 on fully finetuned TinyLlama model #1340
Could you try running with
@lintangsutawika any idea?
Could it be the model? I tried with this (default model, gpt2).
It does look like your model is giving NaN outputs. What datatype was it trained with? When you try to generate from the model, does it give reasonable results?
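A minimal sketch of that check, assuming the checkpoint loads through transformers' auto classes (the path matches the report below; the prompt is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "fsdp-model/"  # local finetuned checkpoint from the report below
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# A single NaN here propagates into every loglikelihood the harness computes.
print("any NaN in logits:", torch.isnan(logits).any().item())
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```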
@haileyschoelkopf It was trained on a completion task (local JSON file). The problem is that if I fully finetune the model with DeepSpeed, I get a value for truthfulqa_mc2, not a NaN!
Sorry, I meant to ask about the torch dtype / precision -- was it 16-bit? Lower? It would be worth trying to manually specify the dtype.
Do you mean you only did a LoRA / PEFT method? What is the save format?
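For reference, the harness's hf model accepts a dtype entry in --model_args, so forcing float32 at load time is one way to rule out precision problems. A sketch via the Python API (same checkpoint path as the report below; the CLI equivalent is appending ,dtype=float32 to --model_args):

```python
from lm_eval import simple_evaluate

# Load the checkpoint in full precision to rule out fp16 overflow at inference time.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=fsdp-model/,dtype=float32",
    tasks=["truthfulqa_mc2"],
)
print(results["results"]["truthfulqa_mc2"])
```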
@haileyschoelkopf I fully trained the TinyLlama model with float16.
The base model is 4 GB and the float16 model is 2.1 GB.
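(That halving is consistent with the precision change: roughly 1.1B parameters at 4 bytes each is about 4.4 GB in float32, versus 2 bytes each, about 2.2 GB, in float16.)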
@lintangsutawika @haileyschoelkopf
Are you willing to push your model somewhere public? It's difficult to say what the problem is without being able to test. It looks like running inference on your model is giving floating-point overflows / NaNs (and the same may be happening under the hood for arc_challenge as well).
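For intuition on how easily 16-bit inference can overflow (a generic illustration, not specific to this model): float16's largest finite value is about 65504, so large intermediate activations become inf, and inf minus inf yields NaN.

```python
import torch

x = torch.tensor([70000.0], dtype=torch.float16)
print(x)      # tensor([inf], dtype=torch.float16): overflow past float16's max (~65504)
print(x - x)  # tensor([nan], dtype=torch.float16): inf - inf is NaN
```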
I checked, and this issue describes a similar problem to mine; however, using the latest main branch doesn't solve it!
Model:
TinyLlama/TinyLlama-1.1B-step-50K-105b
Finetuned using axolotl with FSDP on a completion dataset, on a single machine with two GPUs, with these settings: gradient_accumulation_steps: 12, micro-batch: 1.
Evaluation:
accelerate launch -m lm_eval --model hf --model_args pretrained=fsdp-model/ --task truthfulqa_mc2 --verbosity DEBUG
Result:
truthfulqa_mc2 is NaN and truthfulqa_mc1 is 1.
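One plausible reading of these symptoms (an inference from the numbers above, not confirmed in the thread): truthfulqa_mc2 is a probability-weighted score, so a single NaN among the answer-choice logits makes the softmax, and therefore the metric, NaN, while an argmax-style metric can still return a number.

```python
import torch

# One NaN logit poisons the entire softmax over answer choices.
logits = torch.tensor([2.0, float("nan"), 0.5])
print(torch.softmax(logits, dim=-1))  # tensor([nan, nan, nan])
```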