#1442 inverse scaling tasks implementation #1589

Merged: 23 commits merged into main from the 1442-inverse-scaling-tasks branch on Jul 3, 2024

Conversation

@h-albert-lee (Contributor)

Hi! I am implementing the inverse scaling tasks requested in #1442.

I implemented the tasks based on the information provided in the paper and the dataset released for the Inverse Scaling Prize, but after contacting the authors and checking some resources, a few issues remain.

As far as I know, the way they measure 'accuracy' in the paper is the same as the multiple_choice method in lm-eval-harness. However, if you look at the experimental results below (run with the models that the Inverse Scaling Prize used in its actual tests), you can see some differences.
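
For context, here is a rough conceptual sketch of what that multiple_choice scoring amounts to (illustrative only, not the harness's actual code): each answer choice is scored by its log-likelihood given the context, and the prediction is the highest-scoring choice (acc_norm additionally normalizes by choice length).

```python
# Conceptual sketch of multiple-choice accuracy: pick the choice with the
# highest log-likelihood under the model and compare it to the gold index.
def mc_accuracy(choice_loglikelihoods: list[float], gold_index: int) -> float:
    pred = max(range(len(choice_loglikelihoods)), key=choice_loglikelihoods.__getitem__)
    return 1.0 if pred == gold_index else 0.0
```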

I'm not sure why this difference occurs, but I think we need to take a closer look at the evaluation code provided by the inverse scaling prize and work on it further.

However, I'm posting this pull request primarily to reflect my current implementation.

For now, the task group is 'inverse_scaling_mc', and the README file notes that this is not an official implementation.

Comparison table using opt-1.3b:

| Task | lm-eval-harness | original paper |
|------|----------------:|---------------:|
| inverse_scaling_hindsight_neglect_10shot | 0.4508 | 0.454 |
| inverse_scaling_neqa | 0.4733 | 0.513 |
| inverse_scaling_quote_repetition | 0.8767 | 0.950 |
| inverse_scaling_redefine_math | 0.7022 | 0.669 |

Comparison table using opt-2.7b:

| Task | lm-eval-harness | original paper |
|------|----------------:|---------------:|
| inverse_scaling_hindsight_neglect_10shot | 0.5238 | 0.448 |
| inverse_scaling_neqa | 0.4600 | 0.527 |
| inverse_scaling_quote_repetition | 0.9000 | 0.943 |
| inverse_scaling_redefine_math | 0.7444 | 0.720 |

P.S. I apologize for the limited experiments. I initially ran them with the facebook/opt models (125m–6.7b), but later found the models that the Inverse Scaling Prize used for its actual experiments, so in a hurry I only ran the experiments on those two models.

Also, I added 'inverse_scaling_winobias_antistereotype' just before opening this pull request, because I realized it was in a participant's personal repository rather than the inverse_scaling repository; therefore no experimental results are attached for it. I will run further experiments on it as soon as we have computing resources.

@h-albert-lee h-albert-lee linked an issue Mar 16, 2024 that may be closed by this pull request
@lintangsutawika (Contributor)

@h-albert-lee do you have a list of models to evaluate? I can also help run them.

@h-albert-lee (Contributor, Author)

@lintangsutawika Thank you so much! Since these are scaling-related tasks, it would be nice to see results on larger models to better verify the reliability of this implementation.

Models that need experiments (Hugging Face):

- inverse-scaling/opt-13b_eval
  (expected scores: neqa 0.497 / quote 0.800 / redefine 0.593 / hindsight_neglect_10shot 0.270)
- inverse-scaling/opt-30b_eval
  (expected scores: neqa 0.550 / quote 0.790 / redefine 0.659 / hindsight_neglect_10shot 0.356)
- inverse-scaling/opt-66b_eval
  (expected scores: neqa 0.537 / quote 0.837 / redefine 0.614 / hindsight_neglect_10shot 0.232)

@lintangsutawika (Contributor)

lintangsutawika commented Mar 17, 2024

@h-albert-lee
inverse-scaling/opt-13b_eval

hf (pretrained=inverse-scaling/opt-13b_eval), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                   Tasks                   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------------------------------------|-------|------|------|--------|-----:|---|-----:|
|inverse_scaling_mc                         |N/A    |none  |None  |acc_norm|0.5999|±  |0.0098|
|                                           |       |none  |None  |acc     |0.5896|±  |0.0098|
| - inverse_scaling_hindsight_neglect_10shot|      0|none  |None  |acc     |0.5429|±  |0.0281|
|                                           |       |none  |None  |acc_norm|0.5429|±  |0.0281|
| - inverse_scaling_neqa                    |      0|none  |None  |acc     |0.4367|±  |0.0287|
|                                           |       |none  |None  |acc_norm|0.4367|±  |0.0287|
| - inverse_scaling_quote_repetition        |      0|none  |None  |acc     |0.9100|±  |0.0166|
|                                           |       |none  |None  |acc_norm|0.9267|±  |0.0151|
| - inverse_scaling_redefine_math           |      0|none  |None  |acc     |0.6500|±  |0.0159|
|                                           |       |none  |None  |acc_norm|0.6500|±  |0.0159|
| - inverse_scaling_winobias_antistereotype |      0|none  |None  |acc     |0.3714|±  |0.0238|
|                                           |       |none  |None  |acc_norm|0.4150|±  |0.0243|

inverse-scaling/opt-30b_eval

hf (pretrained=inverse-scaling/opt-30b_eval,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                   Tasks                   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------------------------------------|-------|------|------|--------|-----:|---|-----:|
|inverse_scaling_mc                         |N/A    |none  |None  |acc_norm|0.6237|±  |0.0097|
|                                           |       |none  |None  |acc     |0.6134|±  |0.0097|
| - inverse_scaling_hindsight_neglect_10shot|      0|none  |None  |acc     |0.4825|±  |0.0282|
|                                           |       |none  |None  |acc_norm|0.4825|±  |0.0282|
| - inverse_scaling_neqa                    |      0|none  |None  |acc     |0.5433|±  |0.0288|
|                                           |       |none  |None  |acc_norm|0.5433|±  |0.0288|
| - inverse_scaling_quote_repetition        |      0|none  |None  |acc     |0.9233|±  |0.0154|
|                                           |       |none  |None  |acc_norm|0.9367|±  |0.0141|
| - inverse_scaling_redefine_math           |      0|none  |None  |acc     |0.6833|±  |0.0155|
|                                           |       |none  |None  |acc_norm|0.6833|±  |0.0155|
| - inverse_scaling_winobias_antistereotype |      0|none  |None  |acc     |0.3859|±  |0.0240|
|                                           |       |none  |None  |acc_norm|0.4320|±  |0.0244|

inverse-scaling/opt-66b_eval

hf (pretrained=inverse-scaling/opt-66b_eval,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|                   Tasks                   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------------------------------------|-------|------|------|--------|-----:|---|-----:|
|inverse_scaling_mc                         |N/A    |none  |None  |acc_norm|0.6111|±  |0.0097|
|                                           |       |none  |None  |acc     |0.6017|±  |0.0098|
| - inverse_scaling_hindsight_neglect_10shot|      0|none  |None  |acc     |0.4794|±  |0.0282|
|                                           |       |none  |None  |acc_norm|0.4794|±  |0.0282|
| - inverse_scaling_neqa                    |      0|none  |None  |acc     |0.5267|±  |0.0289|
|                                           |       |none  |None  |acc_norm|0.5267|±  |0.0289|
| - inverse_scaling_quote_repetition        |      0|none  |None  |acc     |0.8767|±  |0.0190|
|                                           |       |none  |None  |acc_norm|0.9133|±  |0.0163|
| - inverse_scaling_redefine_math           |      0|none  |None  |acc     |0.6900|±  |0.0154|
|                                           |       |none  |None  |acc_norm|0.6900|±  |0.0154|
| - inverse_scaling_winobias_antistereotype |      0|none  |None  |acc     |0.3568|±  |0.0236|
|                                           |       |none  |None  |acc_norm|0.3811|±  |0.0240|

The results seem a bit off.

@h-albert-lee (Contributor, Author)

The discrepancy seems to get worse at larger scales, and the experimental results do not show inverse scaling. Something seems to be wrong. There might be an issue with the dataset itself (it would be strange for inverse scaling to disappear depending on the evaluation method), but I'll try to make the implementations as identical as possible.

@h-albert-lee (Contributor, Author)

@lintangsutawika Should I close this pull request for now? Or is it better to keep it as the multiple-choice implementation and handle the original inverse-scaling version in a follow-up pull request?

@haileyschoelkopf (Collaborator)

haileyschoelkopf commented Mar 18, 2024

@h-albert-lee it's ok to keep this PR open to make the changes!

> it's strange that inverse-scaling disappears depending on the evaluation method

If this is indeed the case, it'd likely be worth writing up somewhere, perhaps as a blog post or as part of a larger exploration.

@h-albert-lee (Contributor, Author)

@lintangsutawika @haileyschoelkopf I'll leave the PR open, add a generative method and an original method, and request review again. If further experiments show that inverse-scaling is not happening, I'll head over to the EleutherAI channel and talk about it. Thank you!

@h-albert-lee (Contributor, Author)

Things have been busy at work lately, so I apologize for the lack of PR work. I'll get back to it soon!

@haileyschoelkopf (Collaborator)

I was able to match results for OPT-2.7B up to the 3rd decimal place with the following changes:

- running OPT in float32 rather than the default precision
- using `add_bos_token=True` to prepend OPT's BOS token
- setting `target_delimiter: ""`, since the multiple-choice choice strings in the inverse-scaling org datasets already have the whitespace prepended

I'll aim to fix up this PR and get these tasks merged!
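
For reference, here is a minimal sketch (via the harness's Python API; illustrative only, not the exact command used for the numbers above) of how the first two settings can be passed as model args. The `target_delimiter` change lives in the task YAML itself, not in `model_args`.

```python
# Sketch: evaluate OPT-2.7B on the inverse_scaling_mc group with float32
# weights and an explicit BOS token, matching the settings described above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-2.7b,dtype=float32,add_bos_token=True",
    tasks=["inverse_scaling_mc"],
    batch_size=1,
)
print(results["results"])
```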

@h-albert-lee (Contributor, Author)

@haileyschoelkopf I apologize for forgetting about this PR for so long; I got halfway through a fix and stalled. Shall I check whether it matches to 3 decimal places for the other sizes of OPT models?

@haileyschoelkopf (Collaborator)

@h-albert-lee go for it! I have my changes in the inverse-scaling-tasks branch.

I believe there are also more inverse scaling prize tasks in the paper than those listed here, if you are interested in investigating those.

@h-albert-lee (Contributor, Author)

@haileyschoelkopf Oh, the round-2 datasets for the Inverse Scaling Prize are not currently published as a Hugging Face dataset. I emailed the authors and they gave me permission to post the dataset privately and implement it in lm-eval-harness. I've downloaded the data now, plan to post it to a private repo, and will commit it to the branch here.

@haileyschoelkopf (Collaborator)

Ah I see, didn't realize the others were not publicly downloadable, thanks for the clarification! (and for all your work tracking down and working on this PR! The only issue was, quite literally, a single whitespace character. Don't you love evals? :/)

By "privately", do you mean a gated repo (similar to the official Llama 2 repos and the GPQA dataset repo), or an invisible-to-others private upload? It would be great to get our hands on these datasets to run results for research projects privately, but we want to make sure we respect the Inverse Scaling authors' and dataset creators' wishes and not expose the datasets to leakage.

@h-albert-lee (Contributor, Author)

@haileyschoelkopf It's funny yet challenging that a single whitespace character messed up the results—that's evals for you. By 'privately', I mean I'll probably post it under the HAERAE Hugging Face organization or my personal Hugging Face account. (I've got permission, so it should be fine!) It seems the authors didn't intend to keep the data private; it's more that there hasn't been much demand for it since the Prize concluded. I'll make sure you can access it; I'll work on it right after I get off work tomorrow!

@h-albert-lee (Contributor, Author)

@haileyschoelkopf I've uploaded the round2 datasets here! https://huggingface.co/Albertmade

@h-albert-lee (Contributor, Author)

Since 'prompt-injection' uses a different metric, I uploaded only the multiple-choice tasks first. The 'prompt-injection' task description says 'Objective is low loss on the target completion'. If you know of any other tasks I could use as a reference for implementing this one, please let me know!

@lintangsutawika (Contributor)

> Since 'prompt-injection' uses a different metric, I uploaded only the multiple-choice tasks first. The 'prompt-injection' task description says 'Objective is low loss on the target completion'. If you know of any other tasks I could use as a reference for implementing this one, please let me know!

Would it be possible to implement that different metric in lm-eval, or specifically for this task?

@h-albert-lee (Contributor, Author)

@lintangsutawika The following function seems to give the same result as the original paper. I will try to implement a metric in lm_eval that uses it:

```python
# Method excerpt (class context omitted); assumes `import torch`,
# `import torch.nn.functional as F`, and `from typing import Sequence`.
def _evaluate_sequence_prob(
    self, examples: list[SequenceProbExample]
) -> dict[str, Sequence[float]]:
    # Tokenize the full prompt + completion sequences.
    prompts = [example.prompt + example.completion for example in examples]
    tokenized_inputs = self.tokenizer(
        prompts, return_tensors="pt", truncation=True
    ).to(self.device)
    # Number of tokens in each target completion (corrected for any start token).
    target_sequences = [example.completion for example in examples]
    target_token_lengths = [
        len(self.tokenizer(word)["input_ids"]) - self.correction_for_startarg
        for word in target_sequences
    ]
    outputs = self.model(**tokenized_inputs)
    logits = outputs["logits"].detach().to(device="cpu", dtype=torch.float32)
    losses = []
    for i in range(len(examples)):
        tokens = tokenized_inputs["input_ids"][i]
        # Logits at the positions that predict the target tokens.
        sequence_logits = logits[i, -target_token_lengths[i] - 1 : -1]
        sequence_tokens = tokens[-target_token_lengths[i] :]
        # Sum the negative log-likelihood over the target tokens.
        logprobs = -F.log_softmax(sequence_logits, dim=-1)
        loss = sum(logprobs[j, token] for j, token in enumerate(sequence_tokens))
        losses.append(loss.item())
    return {"loss": losses}
```

@haileyschoelkopf (Collaborator)

@h-albert-lee I would recommend modeling the prompt-injection task off of lambada_openai's perplexity metric! We may well have to add a new metric to support this, though (to normalize by target string length).
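
To make the 'low loss on the target completion' objective concrete, here is a minimal sketch of the kind of length-normalized metric mentioned above (the function name and arguments are hypothetical, not an existing harness metric):

```python
# Hypothetical metric sketch: summed negative log-likelihood of the target
# completion, normalized by the number of target tokens. `loglikelihood` is
# the summed log-probability of the target under the model; `target_len` is
# the target's token count.
def normalized_target_loss(loglikelihood: float, target_len: int) -> float:
    return -loglikelihood / max(target_len, 1)
```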

@h-albert-lee (Contributor, Author)

[screenshot of the failing pytest check]
The pytest failure doesn't seem to be caused by the current code, am I correct...? :/

@haileyschoelkopf (Collaborator)

Nope, the test failures are unrelated--though if you merge from the most recent main, they might go away.

I need to debug this new issue : (

@h-albert-lee (Contributor, Author)

I did merge from the main branch just before the commit, though. 😕..!

@haileyschoelkopf (Collaborator)

No worries--not related to your change. So are the other inv. scaling tasks tested to match OPT scores? If so, could you just add an entry to lm_eval/tasks/README.md linking to this folder?

@h-albert-lee (Contributor, Author)

@haileyschoelkopf I've tested it on the OPT-30b model, and it worked well. I just added a link to lm_eval/tasks/README.md as well!

@haileyschoelkopf (Collaborator) left a review comment:

Thanks so much @h-albert-lee !

Review comment on lm_eval/tasks/README.md (outdated; resolved)
@haileyschoelkopf (Collaborator) left a review comment:

Actually, @h-albert-lee, sanity checking--shouldn't `target_delimiter: ""` be required? (This is what I observed when testing on smaller models.) The official inverse scaling datasets include the prepended space in the MC answer choices, so we want to avoid adding our own space separator.
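
For illustration (a toy example, not the actual dataset contents): the harness builds each scored string roughly as context + target_delimiter + choice, so with choices that already carry a leading space, the default single-space delimiter would introduce a double space:

```python
# Toy illustration of why target_delimiter: "" matters here.
context = "Question: ... Answer:"
choice = " Yes"  # the dataset ships choices with a leading space

with_default_delim = context + " " + choice  # "... Answer:  Yes" (double space)
with_empty_delim = context + "" + choice     # "... Answer: Yes"  (matches the paper)
```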

Review comment on lm_eval/tasks/inverse_scaling/README.md (outdated; resolved)
@h-albert-lee (Contributor, Author)

Oh, thanks for pointing that out! And yes, there are some differences between the data used in the paper and the published data (I think there was some preprocessing, like removing some duplicates from the dataset). Please keep that in mind!

@haileyschoelkopf (Collaborator)

Thanks for your hard work on this @h-albert-lee !

@haileyschoelkopf merged commit d855d0b into main on Jul 3, 2024
9 checks passed
@haileyschoelkopf deleted the 1442-inverse-scaling-tasks branch on July 3, 2024 at 12:17
@h-albert-lee (Contributor, Author)

@haileyschoelkopf Thanks a lot!

mariagrandury pushed a commit to somosnlp/lm-evaluation-harness that referenced this pull request Jul 25, 2024
* initial_implementation (test has to be proceeded)

* minor fix

* revised task name and implemented new task

* minor fixes

* new tasks implement

* minor fix

* added 'prompt injection' task

* delete prompt injection task (will be implemented at next PR)

* trust remote code

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* added readme

* Update lm_eval/tasks/README.md

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

* Update lm_eval/tasks/inverse_scaling/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update lm_eval/tasks/inverse_scaling/_inverse_scaling_mc_yaml

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update README.md

* precommit?

* run precommit on readme

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
Co-authored-by: haileyschoelkopf <[email protected]>
mansicer added a commit to mansicer/lm-evaluation-harness that referenced this pull request Aug 1, 2024
Successfully merging this pull request may close these issues: Inverse Scaling Tasks? (#1442)