Improve Eval Harness #471
Conversation
I used this to run the eval harness on the saved checkpoints and it didn't have any problems. Does that address the additional testing that you wanted, @sdtblck, or are there additional configurations you still want to try?
By the way, greedy_until requests for alibi fail without PR #452. It would be nice to merge that at some point.
Merged. |
Note so we don't forget: SquadV2 fails at this line with "list index out of range". `cont` is not always a list of length greater than 0, so this needs to be more robust. gpt-neox/eval_tasks/eval_adapter.py Line 75 in bbbc5fb
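The failure described above comes from indexing the first element of `cont` without checking that it is non-empty. A minimal sketch of a defensive guard (the helper name `first_continuation` is hypothetical, not the actual gpt-neox code):

```python
def first_continuation(cont):
    """Return the first generated continuation, or an empty string
    when the model produced nothing (hypothetical helper; the real
    fix lives in eval_tasks/eval_adapter.py)."""
    # `cont` may be an empty list for some SquadV2 requests, so
    # `cont[0]` would raise "list index out of range" here.
    if isinstance(cont, list) and len(cont) > 0:
        return cont[0]
    return ""
```

Returning an empty string keeps the metric computation running; the empty continuation is simply scored as a wrong answer rather than crashing the whole evaluation.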
Hey @sweinbach, I'm spending the day today getting to all the past issues I've been too busy to look at 😆. Thanks for looking into this one; I think you're probably on the money about the last batch not being equal in size. However, I suspect just naively padding might break things further down the line. I'll spend a couple of hours testing this out now, and also add docstrings to clarify the eval harness code, as it's a bit of a mess right now.
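The concern above is that the final batch can be shorter than the fixed batch size the model-parallel forward pass expects. A common pattern is to pad the batch up but remember how many items were real, so the padded results can be discarded afterwards (this is an illustrative sketch, not the actual gpt-neox implementation):

```python
def pad_last_batch(batch, batch_size, pad_item):
    """Pad a possibly-short final batch up to `batch_size`.

    Returns the padded batch and the count of real items, so callers
    can slice the model outputs back down and avoid scoring padding.
    (Hypothetical helper for illustration.)
    """
    n_real = len(batch)
    padded = batch + [pad_item] * (batch_size - n_real)
    return padded, n_real
```

The key point is the second return value: naively padding without tracking `n_real` is exactly what could "break things further down the line", because padded entries would be scored as real examples.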
OK @sweinbach, I think this should be fixed now. The problem with SquadV2 is threefold, and even with this fixed, you won't get satisfactory results. To explain:
So, despite 2 and 3 being fixed, 1 is still an issue, but more of an issue on lm_eval_harness's side. I suggest we open an issue there. Anyway, if someone can approve, I think this is ready to merge.
I have tested this, we are currently using it to evaluate models, and Sid has reviewed it as well. There are some lingering issues, but those appear to be issues with the design of the eval harness rather than with this code.
Adds the following:
Apologies, the autoformatter also made a lot of cosmetic changes.