Note: this is a re-run by mantle2048.
Raw model outputs can be found at this Google Drive link.
Model | GSM8K | MATH | MMLU | BBH |
---|---|---|---|---|
gpt-3.5-turbo | 78.6(our), 78.9(origin) | - | 68.18(our), 67.3(origin) | 69.62(our), 70.1(origin) |
Dataset: MMLU (high-school and college-level knowledge)
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
Overall | 9488/13917, 68.18 | ≈9366/13917, 67.3 | ≈122/13917, +0.88 |
MMLU/abstract_algebra | 43/99, 43.43 | 46/99, 46.46 | 3/99, -3.03 |
MMLU/anatomy | 80/134, 59.70 | 79/134, 58.95 | 1/134, +0.75 |
MMLU/astronomy | 97/151, 64.23 | 98/151, 64.90 | 1/151, -0.67 |
MMLU/business_ethics | 67/99, 67.68 | 67/99, 67.68 | 0/99, 0.00 |
MMLU/clinical_knowledge | 210/264, 79.55 | 208/264, 78.79 | 2/264, +0.76 |
MMLU/college_biology | 84/143, 58.74 | 90/143, 62.94 | 6/143, -4.20 |
MMLU/college_chemistry | 48/99, 48.48 | 51/99, 51.51 | 3/99, -3.03 |
MMLU/college_computer_science | 40/99, 40.40 | 38/99, 38.38 | 2/99, +2.02 |
MMLU/college_mathematics | 36/99, 36.36 | 29/99, 29.29 | 7/99, +7.07 |
MMLU/college_medicine | 103/172, 59.88 | 101/172, 58.72 | 2/172, +1.16 |
MMLU/college_physics | 58/101, 57.43 | n/a | n/a |
MMLU/computer_security | 76/99, 76.77 | n/a | n/a |
MMLU/conceptual_physics | 182/234, 77.78 | n/a | n/a |
MMLU/econometrics | 48/113, 42.48 | n/a | n/a |
MMLU/electrical_engineering | 91/144, 63.19 | n/a | n/a |
MMLU/elementary_mathematics | 286/377, 75.86 | n/a | n/a |
MMLU/formal_logic | 53/125, 42.40 | n/a | n/a |
MMLU/global_facts | 56/99, 56.57 | n/a | n/a |
MMLU/high_school_biology | 225/309, 72.82 | n/a | n/a |
MMLU/high_school_chemistry | 104/202, 51.49 | n/a | n/a |
MMLU/high_school_computer_science | 69/99, 69.70 | n/a | n/a |
MMLU/high_school_european_history | 127/164, 77.44 | n/a | n/a |
MMLU/high_school_geography | 176/197, 89.34 | n/a | n/a |
MMLU/high_school_government_and_politics | 169/192, 88.02 | n/a | n/a |
MMLU/high_school_macroeconomics | 287/389, 73.78 | n/a | n/a |
MMLU/high_school_mathematics | 97/201, 48.26 | n/a | n/a |
MMLU/high_school_microeconomics | 176/237, 74.26 | n/a | n/a |
MMLU/high_school_physics | 70/150, 46.67 | n/a | n/a |
MMLU/high_school_psychology | 473/544, 86.95 | n/a | n/a |
MMLU/high_school_statistics | 124/215, 57.67 | n/a | n/a |
MMLU/high_school_us_history | 150/203, 73.89 | n/a | n/a |
MMLU/high_school_world_history | 176/236, 74.58 | n/a | n/a |
MMLU/human_aging | 159/222, 71.62 | n/a | n/a |
MMLU/human_sexuality | 100/130, 76.92 | n/a | n/a |
MMLU/international_law | 101/120, 84.17 | n/a | n/a |
MMLU/jurisprudence | 81/107, 75.70 | n/a | n/a |
MMLU/logical_fallacies | 117/162, 72.22 | n/a | n/a |
MMLU/machine_learning | 57/111, 51.35 | n/a | n/a |
MMLU/management | 82/102, 80.39 | n/a | n/a |
MMLU/marketing | 212/233, 90.99 | n/a | n/a |
MMLU/medical_genetics | 80/99, 80.81 | n/a | n/a |
MMLU/miscellaneous | 685/782, 87.60 | n/a | n/a |
MMLU/moral_disputes | 255/345, 73.91 | n/a | n/a |
MMLU/moral_scenarios | 478/894, 53.47 | n/a | n/a |
MMLU/nutrition | 211/305, 69.18 | n/a | n/a |
MMLU/philosophy | 233/310, 75.16 | n/a | n/a |
MMLU/prehistory | 257/323, 79.57 | n/a | n/a |
MMLU/professional_accounting | 145/281, 51.60 | n/a | n/a |
MMLU/professional_law | 761/1533, 49.64 | n/a | n/a |
MMLU/professional_medicine | 225/271, 83.03 | n/a | n/a |
MMLU/professional_psychology | 457/611, 74.80 | n/a | n/a |
MMLU/public_relations | 73/109, 66.97 | n/a | n/a |
MMLU/security_studies | 149/244, 61.07 | n/a | n/a |
MMLU/sociology | 171/200, 85.50 | n/a | n/a |
MMLU/us_foreign_policy | 87/99, 87.88 | n/a | n/a |
MMLU/virology | 88/165, 53.33 | n/a | n/a |
MMLU/world_religions | 143/170, 84.12 | n/a | n/a |
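The Overall row above is a micro-average of the per-subject counts (total correct over total questions), not a mean of per-subject percentages. A minimal sketch of the aggregation, using only the first three subjects from the table:

```python
# Micro-averaged accuracy: sum the per-subject correct counts and totals,
# then divide once. Counts below are copied from the table above.
subjects = {
    "abstract_algebra": (43, 99),
    "anatomy": (80, 134),
    "astronomy": (97, 151),
}

correct = sum(c for c, _ in subjects.values())
total = sum(n for _, n in subjects.values())
print(f"{correct}/{total}, {100 * correct / total:.2f}")  # -> 220/384, 57.29
```

Summing across all 57 subjects the same way reproduces the 9488/13917, 68.18 figure in the Overall row.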
Dataset: GSM8K (grade-school math). Performance gains on this dataset translate directly into the everyday math ability users experience when interacting with LLMs.
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
GSM8K/complex_prompt_greedy_decoding | 1037/1319, 78.62 | 1040/1319, 78.85 | 3/1319, -0.22 |
Dataset: BBH (a collection of 27 hard reasoning tasks)
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
Overall | 4533/6511, 69.62 | ≈4564/6511, 70.10 | ≈31/6511, -0.48 |
BBH/temporal_sequences | 151/250, 60.40 | n/a | n/a |
BBH/disambiguation_qa | 165/250, 66.00 | n/a | n/a |
BBH/date_understanding | 200/250, 80.00 | n/a | n/a |
BBH/tracking_shuffled_objects_three_objects | 150/250, 60.00 | n/a | n/a |
BBH/penguins_in_a_table | 113/146, 77.40 | 115/146, 78.77 | 2/146, -1.37 |
BBH/geometric_shapes | 153/250, 61.20 | n/a | n/a |
BBH/snarks | 106/178, 59.55 | n/a | n/a |
BBH/ruin_names | 168/250, 67.20 | n/a | n/a |
BBH/tracking_shuffled_objects_seven_objects | 136/250, 54.40 | n/a | n/a |
BBH/tracking_shuffled_objects_five_objects | 148/250, 59.20 | n/a | n/a |
BBH/logical_deduction_three_objects | 216/250, 86.40 | n/a | n/a |
BBH/hyperbaton | 202/250, 80.80 | n/a | n/a |
BBH/logical_deduction_five_objects | 150/250, 60.00 | n/a | n/a |
BBH/logical_deduction_seven_objects | 108/250, 43.20 | n/a | n/a |
BBH/movie_recommendation | 202/250, 80.80 | n/a | n/a |
BBH/salient_translation_error_detection | 142/250, 56.80 | n/a | n/a |
BBH/reasoning_about_colored_objects | 217/250, 86.80 | n/a | n/a |
BBH/multistep_arithmetic_two | 169/250, 67.60 | n/a | n/a |
BBH/navigate | 231/250, 92.40 | n/a | n/a |
BBH/dyck_languages | 62/250, 24.80 | n/a | n/a |
BBH/word_sorting | 150/250, 60.00 | n/a | n/a |
BBH/web_of_lies | 248/250, 99.20 | n/a | n/a |
BBH/sports_understanding | 241/250, 96.40 | n/a | n/a |
BBH/boolean_expressions | 240/250, 96.00 | n/a | n/a |
BBH/object_counting | 230/250, 92.00 | n/a | n/a |
BBH/formal_fallacies | 130/250, 52.00 | n/a | n/a |
BBH/causal_judgement | 105/187, 56.15 | n/a | n/a |
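For reference, the Difference column in these tables is the absolute count delta paired with the signed percentage-point delta between the two runs. A minimal sketch (the function name is illustrative, not from the evaluation code):

```python
def difference(our_correct, origin_correct, total):
    """Count delta and signed percentage-point delta between two runs."""
    delta_pct = 100 * (our_correct - origin_correct) / total
    return abs(our_correct - origin_correct), round(delta_pct, 2)

# BBH overall: 4533/6511 (our) vs ~4564/6511 (origin)
print(difference(4533, 4564, 6511))   # -> (31, -0.48)
# BBH/penguins_in_a_table: 113/146 (our) vs 115/146 (origin)
print(difference(113, 115, 146))      # -> (2, -1.37)
```

Both outputs match the corresponding rows above; a negative percentage-point delta means the re-run scored below the original.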