llama3-base gsm8k score #1896
Comments
The 79.6 GSM8K number reported by Meta comes from their instruct model. I'm not certain whether they've reported "official" GSM8K scores for the base model.
Sorry for not paying attention to the subtitle. Yes, 79.6 is the llama3-8B-it score :)
I ran into a problem related to #1799.
The scores of LLaMA3-8B-base on the GSM8K benchmark are significantly lower than the officially reported scores.
Using the gsm8k_cot task in lm_eval_harness, I get the same 50+ score as in #1799 (comment), while the officially reported score is 79.6.
![image](https://private-user-images.githubusercontent.com/88258534/334325881-a8b13fe1-e0e8-40c8-bc46-e2ddd1e081d0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE5MjMzNDMsIm5iZiI6MTcyMTkyMzA0MywicGF0aCI6Ii84ODI1ODUzNC8zMzQzMjU4ODEtYThiMTNmZTEtZTBlOC00MGM4LWJjNDYtZTJkZGQxZTA4MWQwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MjUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzI1VDE1NTcyM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk3MmJkM2JkYTQyZTc1OGMzNzFlZmVlYTJmMmYwNDMyNTYwNzFkZTczYWQwMTc1YmE1YjU3YmNmMjZiOGEzMTgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.YtQ3oMrsGaMQzEm3LGngncZnSm5j9u99KHxOZJllGUY)
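For reference, here is a minimal sketch of how a run like this might be invoked, so the base and instruct checkpoints can be compared under the same settings. It is not necessarily the exact command used above: the Hugging Face model IDs, the 8-shot setting, and the batch size are assumptions, and `build_lm_eval_command` is a hypothetical helper that only assembles the `lm_eval` command line without running it.

```python
import shlex


def build_lm_eval_command(pretrained, task="gsm8k_cot", num_fewshot=8, batch_size=8):
    """Assemble an lm_eval CLI invocation as an argument list (hypothetical helper).

    The defaults (8-shot, batch size 8) are assumptions, not values taken
    from the issue.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


# Assumed Hugging Face model IDs for the base and instruct checkpoints.
base_cmd = build_lm_eval_command("meta-llama/Meta-Llama-3-8B")
instruct_cmd = build_lm_eval_command("meta-llama/Meta-Llama-3-8B-Instruct")

# Print shell-ready commands; run them manually to compare the two scores.
print(shlex.join(base_cmd))
print(shlex.join(instruct_cmd))
```

Running both commands with identical settings should make it easier to tell whether a gap comes from the checkpoint (base vs. instruct) or from the evaluation setup.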
Any idea about this?