Note: this is a re-run by mantle2048.
Raw model outputs can be found at this Google Drive link.
Model | GSM8K | MATH | MMLU | BBH |
---|---|---|---|---|
gpt-3.5-turbo | 78.6(our), 78.9(origin) | - | 68.18(our), 67.3(origin) | 69.62(our), 70.1(origin) |
Dataset: MMLU (high-school and college-level knowledge)
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
Overall | 9488/13917, 68.18 | ≈9366/13917, 67.3 | ≈122/13917, +0.88 |
MMLU/abstract_algebra | 43/99, 43.43 | 46/99, 46.46 | 3/99, -3.03 |
MMLU/anatomy | 80/134, 59.70 | 79/134, 58.95 | 1/134, +0.75 |
MMLU/astronomy | 97/151, 64.23 | 98/151, 64.90 | 1/151, -0.67 |
MMLU/business_ethics | 67/99, 67.68 | 67/99, 67.68 | 0/99, 0.00 |
MMLU/clinical_knowledge | 210/264, 79.55 | 208/264, 78.79 | 2/264, +0.76 |
MMLU/college_biology | 84/143, 58.74 | 90/143, 62.94 | 6/143, -4.20 |
MMLU/college_chemistry | 48/99, 48.48 | 51/99, 51.51 | 3/99, -3.03 |
MMLU/college_computer_science | 40/99, 40.40 | 38/99, 38.38 | 2/99, +2.02 |
MMLU/college_mathematics | 36/99, 36.36 | 29/99, 29.29 | 7/99, +7.07 |
MMLU/college_medicine | 103/172, 59.88 | 101/172, 58.72 | 2/172, +1.16 |
MMLU/college_physics | 58/101, 57.43 | n/a | n/a |
MMLU/computer_security | 76/99, 76.77 | n/a | n/a |
MMLU/conceptual_physics | 182/234, 77.78 | n/a | n/a |
MMLU/econometrics | 48/113, 42.48 | n/a | n/a |
MMLU/electrical_engineering | 91/144, 63.19 | n/a | n/a |
MMLU/elementary_mathematics | 286/377, 75.86 | n/a | n/a |
MMLU/formal_logic | 53/125, 42.40 | n/a | n/a |
MMLU/global_facts | 56/99, 56.57 | n/a | n/a |
MMLU/high_school_biology | 225/309, 72.82 | n/a | n/a |
MMLU/high_school_chemistry | 104/202, 51.49 | n/a | n/a |
MMLU/high_school_computer_science | 69/99, 69.70 | n/a | n/a |
MMLU/high_school_european_history | 127/164, 77.44 | n/a | n/a |
MMLU/high_school_geography | 176/197, 89.34 | n/a | n/a |
MMLU/high_school_government_and_politics | 169/192, 88.02 | n/a | n/a |
MMLU/high_school_macroeconomics | 287/389, 73.78 | n/a | n/a |
MMLU/high_school_mathematics | 97/201, 48.26 | n/a | n/a |
MMLU/high_school_microeconomics | 176/237, 74.26 | n/a | n/a |
MMLU/high_school_physics | 70/150, 46.67 | n/a | n/a |
MMLU/high_school_psychology | 473/544, 86.95 | n/a | n/a |
MMLU/high_school_statistics | 124/215, 57.67 | n/a | n/a |
MMLU/high_school_us_history | 150/203, 73.89 | n/a | n/a |
MMLU/high_school_world_history | 176/236, 74.58 | n/a | n/a |
MMLU/human_aging | 159/222, 71.62 | n/a | n/a |
MMLU/human_sexuality | 100/130, 76.92 | n/a | n/a |
MMLU/international_law | 101/120, 84.17 | n/a | n/a |
MMLU/jurisprudence | 81/107, 75.70 | n/a | n/a |
MMLU/logical_fallacies | 117/162, 72.22 | n/a | n/a |
MMLU/machine_learning | 57/111, 51.35 | n/a | n/a |
MMLU/management | 82/102, 80.39 | n/a | n/a |
MMLU/marketing | 212/233, 90.99 | n/a | n/a |
MMLU/medical_genetics | 80/99, 80.81 | n/a | n/a |
MMLU/miscellaneous | 685/782, 87.60 | n/a | n/a |
MMLU/moral_disputes | 255/345, 73.91 | n/a | n/a |
MMLU/moral_scenarios | 478/894, 53.47 | n/a | n/a |
MMLU/nutrition | 211/305, 69.18 | n/a | n/a |
MMLU/philosophy | 233/310, 75.16 | n/a | n/a |
MMLU/prehistory | 257/323, 79.57 | n/a | n/a |
MMLU/professional_accounting | 145/281, 51.60 | n/a | n/a |
MMLU/professional_law | 761/1533, 49.64 | n/a | n/a |
MMLU/professional_medicine | 225/271, 83.03 | n/a | n/a |
MMLU/professional_psychology | 457/611, 74.80 | n/a | n/a |
MMLU/public_relations | 73/109, 66.97 | n/a | n/a |
MMLU/security_studies | 149/244, 61.07 | n/a | n/a |
MMLU/sociology | 171/200, 85.50 | n/a | n/a |
MMLU/us_foreign_policy | 87/99, 87.88 | n/a | n/a |
MMLU/virology | 88/165, 53.33 | n/a | n/a |
MMLU/world_religions | 143/170, 84.12 | n/a | n/a |
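The Overall row above is a micro-average of the per-subject counts (total correct over total questions), not a mean of per-subject percentages. A minimal sketch of the aggregation, using only the first three subjects from the table:

```python
# Micro-averaged accuracy: sum the per-subject correct counts and totals,
# then divide once. Counts below are copied from the table above.
subjects = {
    "abstract_algebra": (43, 99),
    "anatomy": (80, 134),
    "astronomy": (97, 151),
}

correct = sum(c for c, _ in subjects.values())
total = sum(n for _, n in subjects.values())
print(f"{correct}/{total}, {100 * correct / total:.2f}")  # -> 220/384, 57.29
```

Summing across all 57 subjects the same way reproduces the 9488/13917, 68.18 figure in the Overall row.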
Dataset: GSM8K (grade-school math). Performance gains on this dataset translate directly into the everyday math ability users experience when interacting with LLMs.
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
GSM8K/complex_prompt_greedy_decoding | 1037/1319, 78.62 | 1040/1319, 78.85 | 3/1319, -0.22 |
Dataset: BBH (a collection of 27 hard reasoning tasks)
Model: gpt-3.5-turbo | Our | Origin | Difference |
---|---|---|---|
Overall | 4533/6511, 69.62 | ≈4564/6511, 70.10 | ≈31/6511, -0.48 |
BBH/temporal_sequences | 151/250, 60.40 | n/a | n/a |
BBH/disambiguation_qa | 165/250, 66.00 | n/a | n/a |
BBH/date_understanding | 200/250, 80.00 | n/a | n/a |
BBH/tracking_shuffled_objects_three_objects | 150/250, 60.00 | n/a | n/a |
BBH/penguins_in_a_table | 113/146, 77.40 | 115/146, 78.77 | 2/146, -1.37 |
BBH/geometric_shapes | 153/250, 61.20 | n/a | n/a |
BBH/snarks | 106/178, 59.55 | n/a | n/a |
BBH/ruin_names | 168/250, 67.20 | n/a | n/a |
BBH/tracking_shuffled_objects_seven_objects | 136/250, 54.40 | n/a | n/a |
BBH/tracking_shuffled_objects_five_objects | 148/250, 59.20 | n/a | n/a |
BBH/logical_deduction_three_objects | 216/250, 86.40 | n/a | n/a |
BBH/hyperbaton | 202/250, 80.80 | n/a | n/a |
BBH/logical_deduction_five_objects | 150/250, 60.00 | n/a | n/a |
BBH/logical_deduction_seven_objects | 108/250, 43.20 | n/a | n/a |
BBH/movie_recommendation | 202/250, 80.80 | n/a | n/a |
BBH/salient_translation_error_detection | 142/250, 56.80 | n/a | n/a |
BBH/reasoning_about_colored_objects | 217/250, 86.80 | n/a | n/a |
BBH/multistep_arithmetic_two | 169/250, 67.60 | n/a | n/a |
BBH/navigate | 231/250, 92.40 | n/a | n/a |
BBH/dyck_languages | 62/250, 24.80 | n/a | n/a |
BBH/word_sorting | 150/250, 60.00 | n/a | n/a |
BBH/web_of_lies | 248/250, 99.20 | n/a | n/a |
BBH/sports_understanding | 241/250, 96.40 | n/a | n/a |
BBH/boolean_expressions | 240/250, 96.00 | n/a | n/a |
BBH/object_counting | 230/250, 92.00 | n/a | n/a |
BBH/formal_fallacies | 130/250, 52.00 | n/a | n/a |
BBH/causal_judgement | 105/187, 56.15 | n/a | n/a |
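For reference, the Difference column in these tables is the absolute count delta paired with the signed percentage-point delta between the two runs. A minimal sketch (the function name is illustrative, not from the evaluation code):

```python
def difference(our_correct, origin_correct, total):
    """Count delta and signed percentage-point delta between two runs."""
    delta_pct = 100 * (our_correct - origin_correct) / total
    return abs(our_correct - origin_correct), round(delta_pct, 2)

# BBH overall: 4533/6511 (our) vs ~4564/6511 (origin)
print(difference(4533, 4564, 6511))   # -> (31, -0.48)
# BBH/penguins_in_a_table: 113/146 (our) vs 115/146 (origin)
print(difference(113, 115, 146))      # -> (2, -1.37)
```

Both outputs match the corresponding rows above; a negative percentage-point delta means the re-run scored below the original.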