Title "A fantasy graph illustrating a chain of stars in a dark night with blue sky, digital art, super resolution" Dall-E

Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance

Yao Fu and Litu Ou

University of Edinburgh

[email protected]

Recently there has been a lot of progress in LLMs, and many claim that a small model with fewer than 10B parameters can achieve performance comparable to GPT-3.5. Really?

In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when *the complexity of the task reaches a sufficient threshold* — GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. -- GPT-4 release blog

The key differentiator is whether a model can do complex tasks, as the old saying goes: "chit-chat is cheap, show me the reasoning." This is why we compile a list of complex reasoning tasks covering math (GSM8K), science (MATH), symbolic reasoning (BBH), and knowledge (MMLU), to measure which models are really better.

Results - Overall

| Model | # Params | GSM8K | MATH | MMLU | BBH |
| --- | --- | --- | --- | --- | --- |
| gpt-4 | ? | 92.0 | 42.5 | 86.4 | - |
| claude-v1.3 | ? | 81.8 | - | 74.8 | - |
| gpt-3.5-turbo | ? | 78.9 | - | 67.3 | 70.1 |
| claude-instant | ? | 74.8 | - | - | - |
| text-davinci-003 | ? | - | - | 64.6 | 70.7 |
| code-davinci-002 | ? | 66.6 | 19.1 | 64.5 | 73.7 |
| Minerva | 540B | 58.8 | 33.6 | - | - |
| Flan-PaLM | 540B | - | - | 70.9 | 66.3 |
| Flan-U-PaLM | 540B | - | - | 69.8 | 64.9 |
| PaLM | 540B | 56.9 | 8.8 | 62.9 | 62.0 |
| text-davinci-002 | ? | 55.4 | - | 60.0 | 67.2 |
| PaLM | 64B | 52.4 | 4.4 | 49.0 | 42.3 |
| LLaMA | 65B | 50.9 | 10.6 | 63.4 | - |
| LLaMA | 33B | 35.6 | 7.1 | 57.8 | - |
| LLaMA | 13B | 17.8 | 3.9 | 46.9 | - |
| Flan-T5 | 11B | 16.1 | - | 48.6 | 41.4 |
| LLaMA | 7B | 11.0 | 2.9 | 35.1 | - |

Some raw model outputs are available in this Google Drive link.

What's different from HELM?

  • HELM uses answer-only prompting; we use chain-of-thought prompting (see the sketch after this list for the difference).
  • HELM evaluates everything. We only focus on complex reasoning, the key differentiator of LLMs' capabilities.
  • As of Apr 26, 2023, our results are newer than those in the reasoning section of HELM.
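
To make the first point concrete, here is a minimal sketch (not taken from this repo's scripts) of how an answer-only prompt differs from a chain-of-thought prompt on a GSM8K-style question; the demonstration question and rationale are made up for illustration.

# Illustration only: answer-only (AO) vs. chain-of-thought (CoT) prompting.
question = (
    "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?"
)

# Answer-only prompting: the demonstration maps question -> answer directly.
answer_only_prompt = (
    "Question: There are 15 trees. Workers plant more until there are 21 trees. "
    "How many trees did they plant?\n"
    "Answer: 6\n\n"
    f"Question: {question}\n"
    "Answer:"
)

# Chain-of-thought prompting: the demonstration spells out the intermediate steps.
chain_of_thought_prompt = (
    "Question: There are 15 trees. Workers plant more until there are 21 trees. "
    "How many trees did they plant?\n"
    "Answer: There were originally 15 trees and there are 21 afterwards, "
    "so the workers planted 21 - 15 = 6 trees. The answer is 6.\n\n"
    f"Question: {question}\n"
    "Answer:"
)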

Generally:

  • We rank model performance by GSM8K, the classical benchmark for measuring chain-of-thought math reasoning. This is definitely not the only metric, but a good interpretation is "how well the model can do math while maintaining other generic abilities" -- which is also very hard. A sketch of how GSM8K outputs are typically scored follows this list.
  • Still under construction. The code may be a bit messy and many values are missing. Apologies in advance.
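
For reference, here is a minimal sketch of how GSM8K chain-of-thought outputs are commonly scored: extract the final number from the completion and compare it with the gold answer. This is an illustration under that assumption, not the exact extraction logic used in this repo's notebooks.

import re

def extract_final_number(text):
    # Return the last number mentioned in a chain-of-thought completion,
    # e.g. "... so the workers planted 21 - 15 = 6 trees. The answer is 6." -> "6"
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(model_outputs, gold_answers):
    # Exact-match accuracy of extracted final answers against gold answers.
    correct = sum(
        extract_final_number(out) == gold.strip()
        for out, gold in zip(model_outputs, gold_answers)
    )
    return correct / len(gold_answers)

print(gsm8k_accuracy(["15 + 6 = 21. The answer is 21."], ["21"]))  # 1.0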

Sources of the MMLU and BBH results:

  • GPT-4 results are from its website and Bubeck et al., Mar 2023.
  • *-davinci-00* and *PaLM* results are from the Flan-PaLM paper appendix.
  • LLaMA results are from the LLaMA paper (TODO: test LLaMA on BBH).
  • Claude results are from our own test script; see below for how to run it.

Current results:

  • GPT-4 clearly outperforms all other models on GSM8K and MMLU.
  • **The 65B LLaMA is very close to text/code-davinci-002, which means that if SFT and RLHF are done correctly, we could very likely reproduce ChatGPT on top of the 65B LLaMA.**
  • Claude is the only model family comparable to the GPT family.
  • On GSM8K, gpt-3.5-turbo improves over text-davinci-003. This confirms the "improved mathematical capabilities" noted in OpenAI's Jan 30, 2023 release notes.
  • On MMLU, gpt-3.5-turbo is slightly better than text-davinci-003, but this margin is NOT significant.
  • Also remember that gpt-3.5-turbo is 10 times cheaper than text-davinci-003.
  • Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot: the GPT-4 report states that a portion of the GSM8K training set was mixed into the training data.
  • LLaMA's MMLU performance is from its paper and is probably answer-only (AO) rather than CoT. On MMLU, AO is generally only slightly better than CoT, so the LLaMA numbers on MMLU might be slightly overestimated.

Why choose the above tasks?

  • We mostly care about complex reasoning.
    • Other abilities of LLMs, such as summarization or translation, are not considered here, as they are rather standard and probably not challenging enough.
  • We consider the following (a small data-loading sketch follows this list):
    • MMLU: high school and college knowledge
    • GSM8K: elementary school math. Performance improvements on this dataset directly translate to everyday math abilities when interacting with LLMs.
    • MATH: very hard math and natural science. All current models struggle.
    • BBH: a collection of 27 hard reasoning problems
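
As a quick way to inspect one of these benchmarks, here is a small sketch (not part of this repo) that loads GSM8K from the public Hugging Face Hub with the `datasets` library; the dataset id `gsm8k` and its field names belong to that public mirror and are not necessarily what the scripts here use.

# pip install datasets
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # splits: "train", "test"
example = gsm8k["test"][0]

print(example["question"])              # the math word problem
print(example["answer"])                # reference solution ending with "#### <gold answer>"

# The gold label that chain-of-thought predictions are compared against:
gold = example["answer"].split("####")[-1].strip()
print("gold answer:", gold)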

Run

# MMLU
cd MMLU
mkdir outputs
API_KEY=<your_api_key>
python run_mmlu_gpt_3.5_turbo.py --api_key=${API_KEY}
python run_mmlu_claude.py --api_key=${API_KEY} --engine=claude-v1.3

########################################################
## GSM8K
cd gsm8k 
mkdir outputs

# run gpt-3.5
# codex_gsm8k_complex.ipynb         -- code-davinci-002 + complex prompt
# gpt3.5turbo_gsm8k_complex.ipynb   -- gpt-3.5-turbo + complex prompt

# run claude
python run_gsm8k_claude.py --anthropic_key=${API_KEY} --output_file=outputs/gsm8k_claude_test.txt

########################################################
## BBH
cd BBH
mkdir outputs
# then run jupyter notebook to see an example penguins dataset
cd penguins
# gpt3.5trubo_penguins_original.ipynb

# Or run the script for all datasets
API_KEY=<your_api_key>
TASK=<all | multiple_choice | free_form>
python run_bbh_gpt_3.5_turbo.py --api_key=${API_KEY} --task=${TASK} # task=all by default
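
For orientation, here is a minimal sketch of what these evaluation scripts do: prepend a few-shot chain-of-thought prompt to a test question, query the model, and collect the completion. It is not the repo's actual code; it assumes the pre-1.0 `openai` Python SDK, and the prompt file path and system message are placeholders (see the FAQ below for the prompt files).

import openai

openai.api_key = "<your_api_key>"

# Assumed location of the few-shot CoT demonstrations; adjust to where the
# prompt file lives in your checkout.
with open("prompt_hardest.txt") as f:
    few_shot_prompt = f.read()

question = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.0,
    messages=[
        {"role": "system", "content": "Follow the given examples and answer the question."},
        {"role": "user", "content": few_shot_prompt + "\n\nQuestion: " + question
                                     + "\nLet's think step by step."},
    ],
)
print(response["choices"][0]["message"]["content"])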

FAQ

  • What are the prompts used in the complexity-based prompting paper?
    • For GSM8K they are:
      • prompt_hardest.txt
      • prompt_mid.txt
      • prompt_easy.txt
      • prompt_original.txt
  • Some prompts contain wrong answers. Is that a problem?
    • Yes, some do, but we keep them because they are used in the original papers.
    • Generally, models are robust under prompt perturbation: even if there are occasional errors in the prompt, as long as the prompt follows the format of the corresponding task, the model tends to look only at the format, ignore the errors, and make its own prediction.
    • See https://arxiv.org/abs/2202.12837 and https://arxiv.org/abs/2212.10001 for more analysis of how models can ignore errors in the prompt.

References

We first discuss the recipe for building models with strong reasoning abilities, which is the same as the generic LLM recipe: pretraining, finetuning, and reinforcement learning. Then we discuss prompting methods for unlocking the reasoning power of large language models.

Pretraining / Continued Training

Finetuning

Reinforcement Learning

Prompting

More literature

TODO

Engineering

  • Detailed results in a shared Google Sheet
  • More reasoning datasets
  • Make the prompts and the associated data into a Hugging Face dataset
  • Draw a figure of model scale vs. reasoning accuracy
  • Add Alpaca and Vicuna
  • Test LLaMA on BBH
  • Fancy tricks in prompt engineering
  • Add smaller LLaMA models
  • Add Flan-T5

Research

  • Decoding space
    • Do larger models have a "larger" decoding space than smaller models?
    • Can instruction finetuning "close" some meaningful/reasonable decoding paths? Can it "open" new ones?
  • Advanced prompt engineering
    • Planning and deductive prompting
    • Dialog in-context learning
  • CoT prompt engineering documentation, including
    • Stable:
      • complexity-based prompting
    • Test:
      • concise prompting
      • emphasizing important steps
