
ConceptMath

A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models (ACL 2024 Findings)

📃 Paper    🏆 Leaderboard (WIP)

Introduction

ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models.

How to run

Step 1: Installation

conda create --name conceptmath python=3.9
conda activate conceptmath
pip install sympy scipy pandas
git clone https://github.com/conceptmath/conceptmath.git
cd conceptmath

Step 2: Data Preparation

Run the model you want to evaluate and save its responses in the inference folder, following the format of inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl.
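
A minimal sketch of writing responses in JSONL form is given below. The field names ("id", "question", "response") are assumptions for illustration only; check inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl for the exact schema the evaluation script expects.

# Hypothetical helper for saving model responses as JSONL (one JSON object per line).
# Field names here are assumed; match them to the reference file in the repo.
import json
from pathlib import Path

def save_responses(records, path):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Example with made-up data; replace with your model's real outputs.
records = [{"id": 0, "question": "What is 3 + 5?", "response": "The answer is 8."}]
save_responses(records, "inference/YourModel/middle_en.jsonl")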

Step 3: Evaluation

After preparing the model responses, run the following script:

python evaluation/main.py --path_in inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl --dir_out inference/Meta-Llama-3-70B-Instruction/ --model Meta-Llama-3-70B-Instruction --grade middle_en

You will then find the overall accuracy, per-concept accuracies, and bad cases in the dir_out directory.
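
If you have prepared responses for several splits, one possible way (not part of the repo) to evaluate them in a single pass is to call the script in a loop; the split names other than middle_en below are assumptions, so use whichever files you actually created.

# Hypothetical batch driver around evaluation/main.py.
import subprocess

model = "Meta-Llama-3-70B-Instruction"
for grade in ["middle_en", "middle_zh"]:  # assumed split names
    subprocess.run(
        [
            "python", "evaluation/main.py",
            "--path_in", f"inference/{model}/{grade}.jsonl",
            "--dir_out", f"inference/{model}/",
            "--model", model,
            "--grade", grade,
        ],
        check=True,
    )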

Leaderboard

Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones.


Results of different models on our constructed ConceptMath benchmark dataset.


(a) Concept accuracies on Middle-EN; (b) mean concept accuracies on Middle-EN.

About ConceptMath

ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates the concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies.
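
As a rough illustration (not the official evaluation code) of what a concept-wise accuracy is, one can group graded problems by their concept label and average the correctness within each group; the field names "concept" and "correct" below are assumptions for this sketch.

from collections import defaultdict

def concept_accuracies(graded):
    # graded: list of dicts with a concept label and a 0/1 correctness flag
    totals, hits = defaultdict(int), defaultdict(int)
    for item in graded:
        totals[item["concept"]] += 1
        hits[item["concept"]] += item["correct"]
    return {c: hits[c] / totals[c] for c in totals}

graded = [
    {"concept": "fractions", "correct": 1},
    {"concept": "fractions", "correct": 0},
    {"concept": "linear equations", "correct": 1},
]
print(concept_accuracies(graded))  # {'fractions': 0.5, 'linear equations': 1.0}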



ConceptMath comprises a total of 4011 math problems across 214 math concepts.

How to efficiently address the weaknesses of existing LLMs

We also introduce an efficient fine-tuning strategy that targets the weak concepts of existing LLMs.


Left: the concept-wise accuracies of LLaMA2-13B and the fine-tuned version based on our efficient fine-tuning method (i.e., LLaMA2-FT). Right: introducing CS data specifically for the bottom 10 concepts significantly enhances performance on these concepts, while slightly improving performance across the remaining 33 concepts.
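
To make the targeting idea concrete, a hypothetical sketch (not the paper's training code) would rank concepts by their accuracy and pick the bottom k as the ones to collect extra concept-specific (CS) training data for:

def bottom_k_concepts(concept_acc, k=10):
    # concept_acc: dict mapping concept name -> accuracy in [0, 1]
    return sorted(concept_acc, key=concept_acc.get)[:k]

concept_acc = {"fractions": 0.35, "linear equations": 0.80, "geometry": 0.55}
print(bottom_k_concepts(concept_acc, k=2))  # ['fractions', 'geometry']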

Citation

If you find ConceptMath useful, please cite our paper:

@article{wu2024conceptmath,
  title={ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models},
  author={Wu, Yanan and Liu, Jie and Bu, Xingyuan and Liu, Jiaheng and Zhou, Zhanhui and Zhang, Yuanxing and Zhang, Chenchen and Bai, Zhiqi and Chen, Haibin and Ge, Tiezheng and others},
  journal={arXiv preprint arXiv:2402.14660},
  year={2024}
}
