TheoremQA

The dataset and code for the paper TheoremQA: A Theorem-driven Question Answering dataset (https://arxiv.org/abs/2305.12524).

Introduction

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics, and Finance. The dataset was annotated by human experts to ensure high quality, and we provide it as a new benchmark to test the limits of large language models in applying theorems to solve challenging university-level questions. Below we provide a pipeline to prompt LLMs and evaluate their outputs with Wolfram|Alpha.

The dataset covers a wide range of topics; see the topic and example figures in the repository.

Huggingface

Our dataset is now available on Huggingface: https://huggingface.co/datasets/TIGER-Lab/TheoremQA

from datasets import load_dataset
dataset = load_dataset("TIGER-Lab/TheoremQA")
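
A quick sanity check after loading (a minimal sketch; the "test" split name is an assumption, so consult the dataset card for the exact splits and fields):

from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/TheoremQA")
example = dataset["test"][0]    # split name is an assumption; check the dataset card
print(example.keys())           # discover the record fields
print(example)                  # inspect one full QA record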

Files

  • theoremqa_test.json: this file contains all the annotated question-answer pairs (a loading sketch follows this list).
  • theoremqa_visual_subset_test.json: this file contains the subset of visual questions if you want to specifically test that.
  • all_theorems.json: this file contains the textual description of all the theorems being covered.
  • error_analysis/*: this folder contains the error analysis results on the 180-question subset.
  • solutions/*: this folder contains solutions for roughly 180 questions, which correspond to the problems used in error_analysis/.
  • outputs/*.json.corrected: these files contain all the model outputs after evaluation.
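
The JSON files can also be loaded directly without the Huggingface wrapper (a minimal sketch, assuming the file holds a list of records; inspect one record to see its exact fields):

import json

# Load the annotated question-answer pairs.
with open("theoremqa_test.json") as f:
    data = json.load(f)

print(len(data), "records")
print(data[0])   # inspect the fields of one record (assumes a list of dicts)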

Visualize the GPT-4 output at https://github.com/wenhuchen/TheoremQA/blob/main/visualize.ipynb.

Running Instructions

Dependencies

  • openai == 0.27.6
  • wolframalpha == 5.0.0
  • pytorch (conda build py3.8_cuda11.8_cudnn8.7.0_0)
  • sympy == 1.11.1
  • transformers == 4.29.1
  • accelerate == 0.19.0
  • anthropic == 0.2.9
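
To set up a matching environment, the pip-installable packages can be pinned directly (a suggested command based on the versions above, not taken from the repository; pytorch is listed as a conda build string, so install it via conda separately):

pip install openai==0.27.6 wolframalpha==5.0.0 sympy==1.11.1 transformers==4.29.1 accelerate==0.19.0 anthropic==0.2.9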

Chain-of-Thoughts Prompting

python run_gpt4.py

This will write output to outputs/GPT4_s0...
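
For reference, here is a minimal sketch of what Chain-of-Thoughts prompting looks like with the pinned openai==0.27.6 API; the actual prompt, few-shot examples, and decoding settings in run_gpt4.py may differ:

import os
import openai

openai.api_key = os.environ["OPENAI_KEY"]

# Chain-of-Thoughts: ask the model to reason step by step before answering.
prompt = (
    "Solve the following problem step by step, then state the final answer.\n\n"
    "Question: What is the determinant of [[1, 2], [3, 4]]?"
)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])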

Program-of-Thoughts Prompting

python run_gpt4_pot.py

This will write output to outputs/GPT4_PoT_s0...
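
By contrast, Program-of-Thoughts has the model write a program whose execution produces the answer. A minimal sketch follows (the actual prompt and execution harness in run_gpt4_pot.py may differ; this assumes the model replies with bare code):

import os
import openai

openai.api_key = os.environ["OPENAI_KEY"]

# Program-of-Thoughts: the model emits a program instead of a free-form answer.
prompt = (
    "Write a Python program that computes the answer to the question and "
    "stores it in a variable named ans. Reply with code only.\n\n"
    "Question: What is the determinant of [[1, 2], [3, 4]]?"
)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
generated_code = response["choices"][0]["message"]["content"]

# Execute the generated program in a fresh namespace and read off ans
# (assumes the reply contains no markdown fences).
namespace = {}
exec(generated_code, namespace)
print(namespace.get("ans"))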

Evaluate model output

You need to register a Wolfram|Alpha account to use their free API; check out https://products.wolframalpha.com/api to register. Once you are done, you will receive an API key.

export OPENAI_KEY=[YOUR_KEY]
export WOLFRAM_KEY=[YOUR_KEY]
python predict_accuracy.py outputs/[YOUR_FILE]

This will write the evaluation output to outputs/[YOUR_FILE].corrected
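
Under the hood, answers are checked against Wolfram|Alpha. A minimal sketch of querying it with the pinned wolframalpha==5.0.0 client (the equivalence checking in predict_accuracy.py is more involved than this):

import os
import wolframalpha

client = wolframalpha.Client(os.environ["WOLFRAM_KEY"])

# Ask Wolfram|Alpha to evaluate an expression and read the first result pod.
res = client.query("integrate x^2 from 0 to 1")
print(next(res.results).text)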

Analyze the model output

python analyze_results.py outputs/[YOUR_FILE].corrected

Leaderboard

Model                        Method   Accuracy (%)
GPT-4                        PoT      52.4
GPT-4                        CoT      43.8
ChatGPT                      PoT      35.6
PaLM-2 (unicorn)             CoT      31.8
ChatGPT                      CoT      30.2
GPT-3.5 (text-davinci-003)   PoT      27.8
Claude-v1                    PoT      25.9
Claude-v1                    CoT      24.9
Claude-v2                    CoT      24.6
Codex (code-davinci-002)     PoT      23.9
Claude-instant               CoT      23.6
GPT-3.5 (text-davinci-003)   CoT      22.8
PaLM-2 (bison)               CoT      21.0
GPT-3 (text-davinci-002)     PoT      20.6
GPT-3 (text-davinci-002)     CoT      16.6
Alpaca                       CoT      13.5
Vicuna                       CoT      12.9
MOSS                         CoT      12.2
StarChat                     PoT      12.2
InstructCodeT5+              PoT      11.6
OpenAssistant                CoT      10.7

Cite our Work

@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}