Implementing temperature sampling and extending description to enable majority voting for tasks #376

albertqjiang · 2023-01-06T20:49:40Z

Currently, the temperature sampling is only implemented for HFLM and maj@k is only implemented for MATH/algebra through the hendrycks_math.MathAlgebraMaj task.
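For readers unfamiliar with the term, temperature sampling can be sketched in plain Python as below. This is only an illustration of the idea, not the code in this PR: the function names `temperature_probs` and `sample_token` are hypothetical, and how the PR wires sampling into HFLM is not shown here.

```python
import math
import random
from typing import List, Sequence


def temperature_probs(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a positive temperature (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lower temperatures concentrate probability on the highest-scoring token, so a value such as 0.3 yields samples that differ from each other but stay close to the greedy output.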

The problems to solve before upstreaming are stylistic, not technical:

Is it OK if we just add a new task for each task we want to do majority voting on? This is clumsy but probably necessary, because different tasks may need different post-processing to determine the vote.
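To make the post-processing concern concrete, here is a minimal sketch of maj@k in which the task-specific part is passed in as a function. The names `majority_vote` and `last_dollar_span` are hypothetical and not taken from this PR; the extractor simply assumes the answer sits between the last pair of `$` signs.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(samples: Sequence[str], extract_answer: Callable[[str], str]) -> str:
    """Return the most common extracted answer among the sampled generations."""
    if not samples:
        raise ValueError("need at least one sample to vote")
    answers = [extract_answer(sample) for sample in samples]
    # Counter.most_common orders equal counts by first occurrence (Python 3.7+).
    return Counter(answers).most_common(1)[0][0]


def last_dollar_span(text: str) -> str:
    """Hypothetical extractor: the text between the last two '$' characters."""
    indices = [pos for pos, char in enumerate(text) if char == "$"]
    if len(indices) < 2:
        return text.strip()
    return text[indices[-2] + 1 : indices[-1]].strip()
```

With this shape, a different task would only swap the extractor, for example `majority_vote(generations, last_dollar_span)` for a math task, rather than requiring a whole new task class.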

albertqjiang · 2023-01-13T20:44:51Z

Updated the PR to enable majority voting without adding extra tasks.

Majority voting can now be enabled through the task description, e.g., by saving {"math_algebra": "majority_voting=32,sampling_temperature=0.3"} as config.json and invoking the evaluation with:
python main.py --model gpt2 --tasks math_algebra --device cuda --num_fewshot 3 --description_dict_path config.json

GPT2 performance with 3-shot on task math_algebra:

Without majority voting: 0.08 ± 0.08%
With majority voting @ 32 and sampling temperature 0.3: 0.51 ± 0.21%

StellaAthena · 2023-01-15T23:20:07Z

I will endeavor to look at this during the coming week.

StellaAthena · 2023-02-25T20:49:34Z

Obviously I didn’t have time to look at this as I had hoped. I’m currently discussing a re-factoring of this library with @haileyschoelkopf and @jon-tow which will likely take this feature into account.

Apologies for the delays, things have been quite hectic.

albertqjiang · 2023-02-26T10:04:44Z

> Obviously I didn’t have time to look at this as I had hoped. I’m currently discussing a re-factoring of this library with @haileyschoelkopf and @jon-tow which will likely take this feature into account.
>
> Apologies for the delays, things have been quite hectic.

No worries! It's great to see the re-factoring effort.

wellecks · 2023-03-13T19:50:11Z

lm_eval/base.py

+ greedy = False
+ _model_generate_kwargs = {"k": k, "temperature": temperature}
+ elif len(request) == 5:
+ context, until, k, temperature, k_batch = request


Perhaps rename k and k_batch to line up with what is specified in the config? For example, the config currently uses majority_voting (which could itself be renamed to majority_voting_k or majority_voting_num_votes).

This could help with keeping track of (a) which hyperparameters are supported, and (b) which part of the code corresponds to what is specified in the config.

I think that this is a very good suggestion. Improving consistency of internal names for things is a huge add for hackability.

Agreed. After thinking about it some more, though, the k here isn't specific to majority voting, so perhaps just num_samples.
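One way to act on the naming suggestion is to unpack the positional request into a small named container, sketched below. Everything here is an assumption for illustration: `GenerationRequest`, `samples_per_batch`, and `from_tuple` are hypothetical names, and the 2- and 4-element tuple layouts are guessed from the 5-element branch visible in the diff.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class GenerationRequest:
    """Named view of a generation request (hypothetical, not part of the PR)."""
    context: str
    until: List[str]
    num_samples: int = 1                      # called `k` in the diff
    temperature: float = 0.0
    samples_per_batch: Optional[int] = None   # called `k_batch` in the diff

    @property
    def greedy(self) -> bool:
        # Assumption: a zero temperature means greedy decoding.
        return self.temperature == 0.0


def from_tuple(request: Sequence) -> GenerationRequest:
    """Convert a positional request tuple into a GenerationRequest."""
    if len(request) == 2:
        context, until = request
        return GenerationRequest(context, list(until))
    if len(request) == 4:
        context, until, k, temperature = request
        return GenerationRequest(context, list(until), k, temperature)
    if len(request) == 5:
        context, until, k, temperature, k_batch = request
        return GenerationRequest(context, list(until), k, temperature, k_batch)
    raise ValueError(f"unsupported request length: {len(request)}")
```

Named fields make it easier to see which hyperparameters are supported and where each config key ends up in the code.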

wellecks · 2023-03-14T04:13:33Z

lm_eval/evaluator.py

@@ -215,7 +218,10 @@ def evaluate(
 ctx = task.fewshot_context(
 doc=doc, num_fewshot=num_fewshot, rnd=rnd, description=description


Will the key-value pair string description be prepended to the few-shot context?

Confirmed that it is prepended.

To support both cases (a natural-language description and evaluation parameters), one option is to separate the two in the config, e.g. {"math_algebra": {"description": "You will solve mathematical problems. Here are some examples: ", "params": {"majority_voting": 4,"sampling_temperature":0.3,"eval_batch_size":4}}}

This would also remove the need to implement the parsing logic in parse_description (see below).
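A minimal sketch of how such a nested config entry could be read follows. The helper name `split_task_config` is hypothetical; the fallback for plain-string entries is an assumption meant to keep existing description files working.

```python
import json
from typing import Any, Dict, Tuple, Union


def split_task_config(entry: Union[str, Dict[str, Any]]) -> Tuple[str, Dict[str, Any]]:
    """Return (description, params) for one task entry (hypothetical helper)."""
    if isinstance(entry, str):
        # Assumed legacy form: the whole entry is the prompt description.
        return entry, {}
    return entry.get("description", ""), dict(entry.get("params", {}))


config = json.loads(
    '{"math_algebra": {"description": "You will solve mathematical problems. '
    'Here are some examples: ", "params": {"majority_voting": 4, '
    '"sampling_temperature": 0.3, "eval_batch_size": 4}}}'
)
description, params = split_task_config(config["math_algebra"])
```

Only `description` would then be prepended to the few-shot context, while `params` would be passed to the evaluation code without ever reaching the prompt.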

wellecks · 2023-03-14T04:16:25Z

lm_eval/tasks/hendrycks_math.py

- def process_results(self, doc, results):
- retval = 0
- indices = [pos for pos, char in enumerate(results[0]) if char == "$"]
+ def parse_description(self, description):


Is it possible to reuse lm_eval.utils.simple_parse_args_string?

(the logic is slightly different in that function, so maybe you already tried it and ran into an issue :))
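For reference, the kind of parsing under discussion can be sketched as a standalone function. This does not call `lm_eval.utils.simple_parse_args_string` and is not the PR's `parse_description`; `parse_key_value_string` and `_coerce` are hypothetical names, and the numeric coercion is an assumption about what the task would need.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def _coerce(raw: str) -> Value:
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_key_value_string(text: str) -> Dict[str, Value]:
    """Parse 'a=1,b=0.3' into {'a': 1, 'b': 0.3} (hypothetical helper)."""
    result: Dict[str, Value] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty strings and trailing commas
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = _coerce(value.strip())
    return result
```

For example, `parse_key_value_string("majority_voting=32,sampling_temperature=0.3")` yields an integer vote count and a float temperature.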

wellecks · 2023-03-14T04:24:58Z

@albertqjiang also if it's helpful to integrate or expand upon, I wrote some documentation while looking through the code:

CLAassistant · 2023-04-23T02:51:29Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ StellaAthena
❌ Albert Jiang

Albert Jiang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

albertqjiang requested review from jon-tow, leogao2 and StellaAthena as code owners January 6, 2023 20:49

albertqjiang changed the title ~~[WIP don't merge] implementing temperature sampling and maj@k for HFLM~~ Implementing temperature sampling and extending description to enable majority voting for tasks Jan 13, 2023

wellecks reviewed Mar 13, 2023

wellecks reviewed Mar 14, 2023

wellecks mentioned this pull request Mar 15, 2023

Lila wellecks/lm-evaluation-harness#2

Merged

albertqjiang closed this Apr 25, 2023

albertqjiang force-pushed the master branch from 2a6b4e8 to bdd8cc7 Compare April 25, 2023 11:27
