Improvements to MGSM #1391

leocnj · 2024-02-02T19:44:26Z

lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_en.yaml

Line 3 in 7411947

 doc_to_target: '{% if answer is not none %}{{answer[6+1]}}{% else %}{{answer_number|string}}{% 

As I understand it, the target should be the substring of answer starting at index 6+1. The existing Jinja2 expression, however, returns only the single character at index 7. Is this a bug?


baberabb · 2024-02-03T05:00:25Z

Hi! @juletx should be able to confirm but I think just using {{answer_number|string}} without the condition should work here. Not quite sure what we are indexing here. The COT prompts also seem to be set incorrectly.

answer:
Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
answer_number:
11

juletx · 2024-02-03T11:25:05Z

Hi @leocnj @baberabb! Not sure why it's implemented like that but I'd say it's a bug. None of the mgsm tasks generates the correct few-shot prompt when testing them with write_out with num_fewshot=5.

mgsm_direct_en: it should be Answer: answer_number

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer-

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer-

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer-

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Answer-

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Answer-

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer

mgsm_en_direct: Step-by-Step Answer should contain the full answer text

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: T

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer: R

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Step-by-Step Answer: J

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Step-by-Step Answer: 5

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Step-by-Step Answer: M

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Step-by-Step Answer:

mgsm_en_native_cot: Step-by-Step Answer should contain the full answer text

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: T

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer: R

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Step-by-Step Answer: J

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Step-by-Step Answer: 5

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Step-by-Step Answer: M

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Step-by-Step Answer:

baberabb · 2024-02-03T18:41:41Z

Thanks for the confirmation @juletx! Also looks like the \nAnswer: string in doc_to_text should be in the native language for the direct variation, which doesn't seem to be true for the majority of the languages.

leocnj · 2024-02-04T17:53:42Z

Did more reading on why [6+1] was used.

Inside mgsm/utils.py, we can find

"en": {  # English
        "QUESTION": "Question:",
        "ANSWER": "Step-by-Step Answer:",
        "DIRECT": "Answer:",
        "REGEX": "The answer is (\\-?[0-9\\.\\,]+)",

...

yaml.dump(
    {
        "include": yaml_template,
        "dataset_name": lang,
        "task": f"mgsm_{lang}_direct",
        "doc_to_text": f"""{{% if answer is not none %}}"""
        f"""{{{{question+"\\n{ANSWER}"}}}}"""
        f"""{{% else %}}"""
        f"""{{{{"{QUESTION} "+question+"\\n{ANSWER}"}}}}"""
        f"""{{% endif %}}""",
        "doc_to_target": f"""{{% if answer is not none %}}"""
        f"""{{{{answer[{len(ANSWER)}+1]}}}}"""
        f"""{{% else %}}"""
        f"""{{{{answer_number|string}}}}"""
        f"""{{% endif %}}""",
        **filter_list,
    },

This will generate a yaml file like

# Generated by utils.py
dataset_name: en
doc_to_target: '{% if answer is not none %}{{answer[6+1]}}{% else %}{{answer_number|string}}{%
  endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer"}}{% else %}{{"Question:
  "+question+"\nAnswer"}}{% endif %}'
include: direct_yaml
task: mgsm_direct_en

It looks like answer[6+1] is intended to skip the pre-defined ANSWER string, "Step-by-Step Answer:", which has a len of 6, at the start of the answer string. For that purpose, however, we need to use answer[6+1:]
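The brace escaping in the utils.py f-string can be checked in isolation. This is a minimal sketch, not the actual utils.py code: it assumes `ANSWER = "Answer"`, a value chosen here only because its length of 6 reproduces the `6+1` seen in the generated YAML. It prints the template fragment as generated, and the same fragment with the slice colon added.

```python
# Assumption: ANSWER is the prompt prefix whose length feeds the template.
# "Answer" is used only because len("Answer") == 6 reproduces the `6+1` above;
# the value actually used in utils.py may differ.
ANSWER = "Answer"

# As in the excerpt: `{{{{` and `}}}}` each collapse to a literal `{{` / `}}`.
indexed = f"""{{{{answer[{len(ANSWER)}+1]}}}}"""

# Same fragment with a trailing colon, so the template slices instead of indexing.
sliced = f"""{{{{answer[{len(ANSWER)}+1:]}}}}"""

print(indexed)  # {{answer[6+1]}}
print(sliced)   # {{answer[6+1:]}}
```

The only difference between the two generated fragments is the colon before the closing bracket.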

baberabb · 2024-02-04T19:14:50Z

It looks like answer[6+1] is intended to skip the pre-defined ANSWER string, "Step-by-Step Answer:", which has a len of 6, at the start of the answer string. For that purpose, however, we need to use answer[6+1:]

aah that makes more sense. "Step-by-Step Answer:" has 20 characters by my count though ("answer" has 6).
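The character count and the index-versus-slice difference can be confirmed with plain Python strings, since the Jinja2 subscript expressions in these templates follow Python indexing and slicing. The sketch below reuses the English example answer quoted earlier in the thread; the variable names are illustrative only.

```python
PREFIX = "Step-by-Step Answer:"
answer = (
    "Step-by-Step Answer: Roger started with 5 balls. "
    "2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11."
)

print(len(PREFIX))    # 20
print(len("Answer"))  # 6

# Indexing returns one character: index 7 is the second hyphen of "Step-by-Step".
print(repr(answer[6 + 1]))  # '-'

# Slicing from len(PREFIX) + 1 skips the prefix and the space after it,
# leaving the rationale text that starts with "Roger started".
rationale = answer[len(PREFIX) + 1:]
print(rationale)
```

The single `-` matches the `Answer-` lines in the mgsm_direct_en output shown earlier in the thread.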

leocnj · 2024-02-18T13:43:24Z

Made a PR to fix the issues we observed above. Now generated contexts look normal.

For example, for mgsm_direct_en, the generated context is now

!!@@##@@!! -- Example 1
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer:Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer:There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
Answer:

For mgsm_native_cot_zh, the generated context is now

!!@@##@@!! -- Example 1
问题：罗杰有 5 个网球。他又买了 2 罐网球。每罐有 3 个网球。他现在有多少个网球？
逐步解答: 杰一开始有 5 个球。2 罐各 3 个网球就是 6 个网球。5 + 6 = 11。答案是 11。

问题：如果停车场里有 3 辆车，又来了 2 辆车，停车场里有多少辆车？
逐步解答: 开始有 3 辆车，又来了 2 辆，所以现在应该有 3 + 2 = 5 辆车。答案是 5。

问题：服务器机房里有九台电脑。从周一到周四，每天又安装了五台电脑。服务器机房里现在有多少台电脑？
逐步解答:

* fix the issue #1391, wrong contexts in mgsm tasks
* fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default)
* regenerate all task yaml files
  - change naming so that file name will match with task name
  - task|file follows a consistent naming way, mgsm_(mode)_(lang) for three modes, i.e., direct, en_cot, and native_cot
* English CoTs should have a space as target_delimiter
* Update utils.py
* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf

nitsanluke · 2024-02-22T15:48:25Z

The filter for the native-language tasks also needs updating; it currently uses the English format for the answer.

lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/cot_yaml

Line 28 in a72babb

regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"

Hugging Face dataset French few-shot example:

Question : Roger a 5 balles de tennis. Il achète 2 autres boîtes de balles de tennis en plus. Si chaque boîte contient 3 balles de tennis, combien de balles de tennis a-t-il maintenant ?	Réponse étape par étape : Roger a commencé avec 5 balles. 2 boîtes de 3 balles de tennis chacune représentent 6 balles de tennis. 5 + 6 = 11. La réponse est 11.	11	5 + 6 = 11.

https://huggingface.co/datasets/juletxara/mgsm/viewer/fr?row=0
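The mismatch can be reproduced with Python's `re` module. In this sketch the English pattern is the one quoted from cot_yaml above, while `FR_PATTERN` is a hypothetical French counterpart written here purely for illustration; it is not taken from the repository.

```python
import re

# Pattern quoted from cot_yaml (YAML escaping removed).
EN_PATTERN = r"The answer is (\-?[0-9\.\,]+)"
# Hypothetical French counterpart, for illustration only.
FR_PATTERN = r"La réponse est (\-?[0-9\.\,]+)"

fr_answer = (
    "Roger a commencé avec 5 balles. 2 boîtes de 3 balles de tennis "
    "chacune représentent 6 balles de tennis. 5 + 6 = 11. La réponse est 11."
)

en_match = re.search(EN_PATTERN, fr_answer)
fr_match = re.search(FR_PATTERN, fr_answer)

print(en_match)           # None: the English phrase never occurs
print(fr_match.group(1))  # 11.
```

Note that the character class includes `.`, so the captured group keeps the sentence-final period; any comparison against `answer_number` would have to account for that.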

haileyschoelkopf · 2024-02-22T16:26:00Z

Agreed-- just taking the final number from the response (as in

lm-evaluation-harness/lm_eval/tasks/gsm8k/gsm8k-cot.yaml

Lines 43 to 48 in a72babb

 - name: "flexible-extract"
   filter:
     - function: "regex"
       group_select: -1
       regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
     - function: "take_first"

) would be an improvement already as well
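A rough sketch of what "taking the final number" means for a native-language response, using the regex from the gsm8k filter above. The helper `last_number` is hypothetical and only approximates `group_select: -1`; the harness's own filter implementation may behave differently. Stripping the trailing period is an extra step added here, not part of the quoted filter.

```python
import re

# Regex quoted from the gsm8k "flexible-extract" filter.
FLEXIBLE = r"(-?[$0-9.,]{2,})|(-?[0-9]+)"


def last_number(response):
    """Return the last number-like match in `response`, or None.

    Hypothetical helper: picks the final match, which is how
    `group_select: -1` is read in this sketch.
    """
    matches = re.findall(FLEXIBLE, response)  # list of (group1, group2) tuples
    if not matches:
        return None
    # Exactly one of the two alternation groups is non-empty per match.
    value = next(group for group in matches[-1] if group)
    # Added assumption: drop a sentence-final period captured by group 1.
    return value.rstrip(".")


fr_response = "5 + 6 = 11. La réponse est 11."
print(last_number(fr_response))          # 11
print(last_number("The answer is 5"))    # 5
print(last_number("no digits here"))     # None
```

Because the pattern does not depend on the phrase "The answer is", the same extraction works regardless of the response language.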


naiarapm · 2024-04-08T15:10:23Z

Hi! Shouldn't doc_to_target simply be {{answer_number|string}} in direct task variants? Right now all examples for in-context learning include the step-by-step answers, also in the direct setup. If I understood correctly, I don't think that was the intended setup in the original article.

Native CoT

doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question: "+question+"\nStep-by-Step Answer:"}}{% endif %}'

Example

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer:

✅ Independently of whether the evaluation is zero- or few-shot, the model is asked to generate a chain of thought.
✅ In few-shot, in-context examples will contain step-by-step answers.

Direct

doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'

Example:

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer:There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer:

✅ Independently of whether the evaluation is zero- or few-shot, the model is asked to directly output the answer.
❓ In few-shot, in-context examples will contain step-by-step answers. Models will then tend to also generate a CoT?

As it is now, the only difference between these two setups seems to be that the prompt says "Step-by-Step Answer:" for CoT, and "Answer:" for direct. Am I missing something? Thank you!
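One way to read the suggestion is sketched below in plain Python rather than Jinja2. The function `doc_to_target` and its `mode` argument are hypothetical names used only for illustration; they mirror the template logic quoted above and are not part of the harness.

```python
PREFIX = "Step-by-Step Answer:"


def doc_to_target(doc, mode):
    """Hypothetical mirror of the doc_to_target templates discussed above."""
    if mode == "direct":
        # Suggested behaviour: few-shot targets are the bare number.
        return str(doc["answer_number"])
    # CoT behaviour as currently generated: rationale with the prefix skipped.
    if doc["answer"] is not None:
        return doc["answer"][len(PREFIX) + 1:]
    return str(doc["answer_number"])


doc = {
    "answer": "Step-by-Step Answer: There are 3 cars in the beginning, "
    "2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.",
    "answer_number": 5,
}

print(doc_to_target(doc, "direct"))      # 5
print(doc_to_target(doc, "native_cot"))  # rationale starting "There are 3 cars"
```

Under this reading, the direct few-shot examples would no longer contain a chain of thought, so the two setups would differ in more than the prompt label.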


leocnj added a commit to leocnj/lm-evaluation-harness that referenced this issue Feb 17, 2024

fix the issue EleutherAI#1391, wrong contexts in mgsm tasks

a5042e2

leocnj mentioned this issue Feb 18, 2024

PR fixing the issue #1391 (wrong contexts in the mgsm task) #1440

Merged

haileyschoelkopf changed the title ~~A question about getting target from doc['answer'] in mgsm task~~ Improvements to MGSM Feb 24, 2024
