Improve MMMU performance with prompt engineering (openai#1450)

With this change, 0-shot performance on the MMMU validation set is 59.6% (averaged over 3 eval runs), which beats the 56.8% reported in the [MMMU paper](https://arxiv.org/pdf/2311.16502.pdf).
hvaara · Jan 3, 2024 · 2981e65
1 parent f1bb7cb
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/evals/elsuite/mmmu/eval.py b/evals/elsuite/mmmu/eval.py
@@ -88,7 +88,7 @@ def eval_sample(self, sample: Sample, rng):
  rng=rng,
  )
  prompt = sample.question + "\n" + options
- system_prompt = f'You are an expert in {self.subject} whose job is to answer questions from the user using images. First, reason about the correct answer. Then write the answer in the following format where X is exactly one of A,B,C,D: "ANSWER: X"'
+ system_prompt = f'You are an expert in {self.subject} whose job is to answer questions from the user using images. First, reason about the correct answer. Then write the answer in the following format where X is exactly one of A,B,C,D: "ANSWER: X". If you are uncertain of the correct answer, guess the most likely one.'
  else:
  correct_answer = sample.label
  prompt = sample.question
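
One plausible reason the added sentence helps: if the grader looks for the `ANSWER: X` pattern, a completion that declines to commit is scored as wrong, while a guess among four options is right some of the time. The following is a minimal sketch under that assumption. `extract_answer` and `is_correct` are hypothetical names for illustration, and the actual parsing in `evals/elsuite/mmmu/eval.py` may differ.

```python
import re
from typing import Optional

# Hypothetical pattern and helpers for illustration only; not taken from eval.py.
ANSWER_PATTERN = re.compile(r"ANSWER:\s*([A-D])")


def extract_answer(completion: str) -> Optional[str]:
    """Return the letter of the last 'ANSWER: X' in the completion, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def is_correct(completion: str, correct_answer: str) -> bool:
    # A completion with no parsable answer can never match, so it counts as incorrect.
    return extract_answer(completion) == correct_answer


# A response that refuses to commit versus one that guesses when uncertain.
hedged = "The image is ambiguous, so I cannot determine the answer."
guessed = "The image is ambiguous, but B fits best. ANSWER: B"

print(is_correct(hedged, "B"))
print(is_correct(guessed, "B"))
```

Under this scoring, the refusal is always marked wrong regardless of the label, whereas the guess at least has a chance of matching it.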