Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance

Note: this is a re-run by mantle2048.

Raw model outputs are available via the Google Drive link.

Results - Overall

| Model | GSM8K | MATH | MMLU | BBH |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo | 78.6 (ours), 78.9 (original) | - | 68.18 (ours), 67.3 (original) | 69.62 (ours), 70.1 (original) |

Results - Separate

Dataset: MMLU: high school and college knowledge

| Model: gpt-3.5-turbo | Ours | Original | Difference |
| --- | --- | --- | --- |
| Overall | 9488/13917, 68.18 | ≈9366/13917, 67.3 | ≈122/13917, +0.88 |
| MMLU/abstract_algebra | 43/99, 43.43 | 46/99, 46.46 | 3/99, -3.03 |
| MMLU/anatomy | 80/134, 59.70 | 79/134, 58.95 | 1/134, +0.75 |
| MMLU/astronomy | 97/151, 64.23 | 98/151, 64.90 | 1/151, -0.67 |
| MMLU/business_ethics | 67/99, 67.68 | 67/99, 67.68 | 0/99, 0.00 |
| MMLU/clinical_knowledge | 210/264, 79.55 | 208/264, 78.79 | 2/264, +0.76 |
| MMLU/college_biology | 84/143, 58.74 | 90/143, 62.94 | 6/143, -4.20 |
| MMLU/college_chemistry | 48/99, 48.48 | 51/99, 51.51 | 3/99, -3.03 |
| MMLU/college_computer_science | 40/99, 40.40 | 38/99, 38.38 | 2/99, +2.02 |
| MMLU/college_mathematics | 36/99, 36.36 | 29/99, 29.29 | 7/99, +7.07 |
| MMLU/college_medicine | 103/172, 59.88 | 101/172, 58.72 | 2/172, +1.16 |
| MMLU/college_physics | 58/101, 57.43 | nan | nan |
| MMLU/computer_security | 76/99, 76.77 | nan | nan |
| MMLU/conceptual_physics | 182/234, 77.78 | nan | nan |
| MMLU/econometrics | 48/113, 42.48 | nan | nan |
| MMLU/electrical_engineering | 91/144, 63.19 | nan | nan |
| MMLU/elementary_mathematics | 286/377, 75.86 | nan | nan |
| MMLU/formal_logic | 53/125, 42.40 | nan | nan |
| MMLU/global_facts | 56/99, 56.57 | nan | nan |
| MMLU/high_school_biology | 225/309, 72.82 | nan | nan |
| MMLU/high_school_chemistry | 104/202, 51.49 | nan | nan |
| MMLU/high_school_computer_science | 69/99, 69.70 | nan | nan |
| MMLU/high_school_european_history | 127/164, 77.44 | nan | nan |
| MMLU/high_school_geography | 176/197, 89.34 | nan | nan |
| MMLU/high_school_government_and_politics | 169/192, 88.02 | nan | nan |
| MMLU/high_school_macroeconomics | 287/389, 73.78 | nan | nan |
| MMLU/high_school_mathematics | 97/201, 48.26 | nan | nan |
| MMLU/high_school_microeconomics | 176/237, 74.26 | nan | nan |
| MMLU/high_school_physics | 70/150, 46.67 | nan | nan |
| MMLU/high_school_psychology | 473/544, 86.95 | nan | nan |
| MMLU/high_school_statistics | 124/215, 57.67 | nan | nan |
| MMLU/high_school_us_history | 150/203, 73.89 | nan | nan |
| MMLU/high_school_world_history | 176/236, 74.58 | nan | nan |
| MMLU/human_aging | 159/222, 71.62 | nan | nan |
| MMLU/human_sexuality | 100/130, 76.92 | nan | nan |
| MMLU/international_law | 101/120, 84.17 | nan | nan |
| MMLU/jurisprudence | 81/107, 75.70 | nan | nan |
| MMLU/logical_fallacies | 117/162, 72.22 | nan | nan |
| MMLU/machine_learning | 57/111, 51.35 | nan | nan |
| MMLU/management | 82/102, 80.39 | nan | nan |
| MMLU/marketing | 212/233, 90.99 | nan | nan |
| MMLU/medical_genetics | 80/99, 80.81 | nan | nan |
| MMLU/miscellaneous | 685/782, 87.60 | nan | nan |
| MMLU/moral_disputes | 255/345, 73.91 | nan | nan |
| MMLU/moral_scenarios | 478/894, 53.47 | nan | nan |
| MMLU/nutrition | 211/305, 69.18 | nan | nan |
| MMLU/philosophy | 233/310, 75.16 | nan | nan |
| MMLU/prehistory | 257/323, 79.57 | nan | nan |
| MMLU/professional_accounting | 145/281, 51.60 | nan | nan |
| MMLU/professional_law | 761/1533, 49.64 | nan | nan |
| MMLU/professional_medicine | 225/271, 83.03 | nan | nan |
| MMLU/professional_psychology | 457/611, 74.80 | nan | nan |
| MMLU/public_relations | 73/109, 66.97 | nan | nan |
| MMLU/security_studies | 149/244, 61.07 | nan | nan |
| MMLU/sociology | 171/200, 85.50 | nan | nan |
| MMLU/us_foreign_policy | 87/99, 87.88 | nan | nan |
| MMLU/virology | 88/165, 53.33 | nan | nan |
| MMLU/world_religions | 143/170, 84.12 | nan | nan |
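
As a rough illustration of how the columns in the table above relate, the sketch below recomputes a per-subject accuracy from its correct/total counts, the signed difference between the two runs, and the approximate count behind an entry marked with ≈ (where only a percentage was reported). The helper names `accuracy`, `difference`, and `estimate_correct` are hypothetical and are not taken from the repository's evaluation code.

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of correct answers, rounded to two decimals."""
    return round(100 * correct / total, 2)


def difference(ours: int, original: int, total: int) -> float:
    """Signed gap in percentage points (ours minus original)."""
    return round(100 * (ours - original) / total, 2)


def estimate_correct(percent: float, total: int) -> int:
    """Approximate a correct-answer count from a reported percentage."""
    return round(percent / 100 * total)


# (ours_correct, original_correct, total), copied from two rows of the table
rows = {
    "MMLU/abstract_algebra": (43, 46, 99),
    "MMLU/college_mathematics": (36, 29, 99),
}

for name, (ours, original, total) in rows.items():
    print(
        f"{name}: {accuracy(ours, total)} vs {accuracy(original, total)}, "
        f"{difference(ours, original, total):+.2f}"
    )

# The original overall MMLU score is given only as 67.3%, so its count is an estimate.
print(estimate_correct(67.3, 13917))
```

Computing the difference from the raw counts, rather than subtracting two already-rounded percentages, avoids small rounding drift in the last digit.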

Dataset: GSM8K: elementary school math. Performance improvements on this dataset translate directly to everyday math abilities when interacting with LLMs.

| Model: gpt-3.5-turbo | Ours | Original | Difference |
| --- | --- | --- | --- |
| GSM8K/complex_prompt_greedy_decoding | 1037/1319, 78.62 | 1040/1319, 78.85 | 3/1319, -0.23 |

Dataset: BBH: a collection of 27 hard reasoning tasks

| Model: gpt-3.5-turbo | Ours | Original | Difference |
| --- | --- | --- | --- |
| Overall | 4533/6511, 69.62 | ≈4564/6511, 70.10 | ≈31/6511, -0.48 |
| BBH/temporal_sequences | 151/250, 60.40 | nan | nan |
| BBH/disambiguation_qa | 165/250, 66.00 | nan | nan |
| BBH/date_understanding | 200/250, 80.00 | nan | nan |
| BBH/tracking_shuffled_objects_three_objects | 150/250, 60.00 | nan | nan |
| BBH/penguins_in_a_table | 113/146, 77.40 | 115/146, 78.77 | 2/146, -1.37 |
| BBH/geometric_shapes | 153/250, 61.20 | nan | nan |
| BBH/snarks | 106/178, 59.55 | nan | nan |
| BBH/ruin_names | 168/250, 67.20 | nan | nan |
| BBH/tracking_shuffled_objects_seven_objects | 136/250, 54.40 | nan | nan |
| BBH/tracking_shuffled_objects_five_objects | 148/250, 59.20 | nan | nan |
| BBH/logical_deduction_three_objects | 216/250, 86.40 | nan | nan |
| BBH/hyperbaton | 202/250, 80.80 | nan | nan |
| BBH/logical_deduction_five_objects | 150/250, 60.00 | nan | nan |
| BBH/logical_deduction_seven_objects | 108/250, 43.20 | nan | nan |
| BBH/movie_recommendation | 202/250, 80.80 | nan | nan |
| BBH/salient_translation_error_detection | 142/250, 56.80 | nan | nan |
| BBH/reasoning_about_colored_objects | 217/250, 86.80 | nan | nan |
| BBH/multistep_arithmetic_two | 169/250, 67.60 | nan | nan |
| BBH/navigate | 231/250, 92.40 | nan | nan |
| BBH/dyck_languages | 62/250, 24.80 | nan | nan |
| BBH/word_sorting | 150/250, 60.00 | nan | nan |
| BBH/web_of_lies | 248/250, 99.20 | nan | nan |
| BBH/sports_understanding | 241/250, 96.40 | nan | nan |
| BBH/boolean_expressions | 240/250, 96.00 | nan | nan |
| BBH/object_counting | 230/250, 92.00 | nan | nan |
| BBH/formal_fallacies | 130/250, 52.00 | nan | nan |
| BBH/causal_judgement | 105/187, 56.15 | nan | nan |
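
The Overall row for this re-run is consistent with pooling the per-task counts: summing the correct and total columns over the 27 tasks above gives 4533/6511. The following is a minimal sketch of that pooled (micro-averaged) calculation, alongside the unweighted per-task mean for comparison. The function names `pooled_accuracy` and `macro_accuracy` are hypothetical, and this is not the repository's own scoring script.

```python
# (correct, total) per BBH task, copied from this re-run's column of the table
bbh = {
    "temporal_sequences": (151, 250),
    "disambiguation_qa": (165, 250),
    "date_understanding": (200, 250),
    "tracking_shuffled_objects_three_objects": (150, 250),
    "penguins_in_a_table": (113, 146),
    "geometric_shapes": (153, 250),
    "snarks": (106, 178),
    "ruin_names": (168, 250),
    "tracking_shuffled_objects_seven_objects": (136, 250),
    "tracking_shuffled_objects_five_objects": (148, 250),
    "logical_deduction_three_objects": (216, 250),
    "hyperbaton": (202, 250),
    "logical_deduction_five_objects": (150, 250),
    "logical_deduction_seven_objects": (108, 250),
    "movie_recommendation": (202, 250),
    "salient_translation_error_detection": (142, 250),
    "reasoning_about_colored_objects": (217, 250),
    "multistep_arithmetic_two": (169, 250),
    "navigate": (231, 250),
    "dyck_languages": (62, 250),
    "word_sorting": (150, 250),
    "web_of_lies": (248, 250),
    "sports_understanding": (241, 250),
    "boolean_expressions": (240, 250),
    "object_counting": (230, 250),
    "formal_fallacies": (130, 250),
    "causal_judgement": (105, 187),
}


def pooled_accuracy(results):
    """Micro-average: pool every question, then compute one percentage."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct, total, round(100 * correct / total, 2)


def macro_accuracy(results):
    """Unweighted mean of per-task percentages (each task counts equally)."""
    per_task = [100 * c / t for c, t in results.values()]
    return round(sum(per_task) / len(per_task), 2)


print(pooled_accuracy(bbh))  # (4533, 6511, 69.62)
print(macro_accuracy(bbh))
```

Because three tasks have fewer than 250 questions, the pooled figure and the per-task mean are not the same quantity, so it matters which one is being compared against the original score.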

DataFlow

[Figure: DataFlow]