Model evaluation is a technical field of AI safety and ML research focused on benchmarking models and evaluating their capabilities and risks. Sub-fields include Benchmarking, Dangerous Capability Evaluations, and Demonstrations. Related research topics include AI governance, alignment research, and scenario planning for AI.
This is a short overview of introductory material in the field of model evaluations and serves as a guide for engaging with the field.
This is a work in progress, and we invite interested parties to submit pull requests for new materials.
This guide accompanies the democracy x AI hackathon, where the main topic is material for demonstrating and extrapolating risks to society from AI.
Project examples include developing an LLM that contains a sleeper agent activating on election day, training agents to skew poll results and inflate support for particular policies, or using an LLM to draft uncontroversial legislative proposals that, if implemented, indirectly impact more contentious or harmful policies.
These are ideas to get you started, and you can check out the results from previous hackathons to see the types of projects you can develop in just one weekend. For example, EscalAItion found that LLMs have a propensity to escalate in military scenarios; after further development, it was accepted at the multi-agent security workshop at NeurIPS 2023.
Here is some interesting material to get inspiration for the hackathon:
- CivAI Demonstrations for Policy Makers
- A paper presenting a framework for AI and Democracy
- Article by Yoshua Bengio in the Journal of Democracy
- Why it Matters podcast on The Year of AI and Elections
- Guardian article about the influence of AI on the US elections
- 80,000 Hours podcast with Nina Schick on disinformation and the rise of synthetic media
- Rest of World tracks key global incidents of AI-generated election content
- Weapons of Mass Destruction Proxy (WMDP) Benchmark
- Spending $200 to make LLMs unsafe
- An experimental demonstration of strategic deception in LLMs
Loading Open-source models in Google Colab
See this Colab notebook to use the TransformerLens model downloader utility to easily load language models in Colab. It includes all the models available there from EleutherAI, OpenAI, Facebook AI Research, Neel Nanda, and more. Alternatively, you can use this notebook to run the latest open-source language models, such as Llama 3, via the Replicate API.
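For orientation, here is a minimal sketch of loading a model with TransformerLens in a notebook; the model name "gpt2" and the prompt are purely illustrative, and the linked notebook handles this setup for you:

```python
# Assumes `pip install transformer_lens` has been run in the Colab environment.
from transformer_lens import HookedTransformer

# Load one of the supported open-source checkpoints by name (e.g. "gpt2").
model = HookedTransformer.from_pretrained("gpt2")

# Quick sanity check that the model loaded and can generate text.
print(model.generate("Model evaluations matter because", max_new_tokens=20))
```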
You can also use the Hugging Face Transformers library directly like this.
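A minimal sketch with the Transformers library directly (the model name and prompt are again just placeholders):

```python
# Assumes `pip install transformers` (plus torch) in the environment.
from transformers import pipeline

# A small open model keeps the example lightweight; swap in any causal LM from the Hub.
generator = pipeline("text-generation", model="gpt2")
output = generator("Model evaluations matter because", max_new_tokens=20)
print(output[0]["generated_text"])
```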
Extrapolating a trend into the future
A simple notebook that fits a polynomial regression to data and plots an extrapolation into the future.
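The core idea is simple enough to sketch with NumPy and Matplotlib; the data points below are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical yearly scores on some capability metric (illustrative numbers only).
years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
scores = np.array([10.0, 14.0, 21.0, 30.0, 43.0, 60.0])

# Fit a degree-2 polynomial and extrapolate a few years beyond the observed data.
coeffs = np.polyfit(years, scores, deg=2)
future_years = np.arange(2018, 2029)
predicted = np.polyval(coeffs, future_years)

plt.scatter(years, scores, label="observed")
plt.plot(future_years, predicted, label="degree-2 extrapolation")
plt.xlabel("year")
plt.ylabel("score")
plt.legend()
plt.show()
```

Keep in mind that polynomial extrapolation is very sensitive to the chosen degree and should be treated as a rough illustration rather than a forecast.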
Notebooks by Hugging Face
These notebooks all pertain to the usage of transformers and show how to use the library. See them all here. Some notable notebooks include:
Finetuning a language model with OpenAI or Replicate
This notebook provides an overview of how you can finetune LLMs, including via the Replicate API. Go to the notebook here.
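As a rough illustration, launching a fine-tuning job through the OpenAI Python SDK looks something like the sketch below; the file name "train.jsonl" and the base model are placeholders, and the notebook above covers Replicate and other options:

```python
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Upload chat-formatted training examples, then start a fine-tuning job on a base model.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model; pick one that supports fine-tuning
)
print(job.id)  # use this id to poll the job status later
```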
Cloning your voice
This notebook allows you to clone your own voice.
Language Model Evaluation Harness
The LMEH is a set of over 200 tasks that you can automatically run your models through. You can install it with pip install lm-eval (or !pip install lm-eval at the top of a Colab notebook).
See a Colab notebook that briefly introduces how to use it here.
Check out the GitHub repository and the guide to adding a new benchmark so you can test your own tasks using their easy interface.
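A minimal sketch of running the harness from Python is shown below; the backend string, checkpoint, and task are illustrative, and the exact entry point can differ between versions (the CLI equivalent is the lm_eval command documented in the repository):

```python
# Assumes `pip install lm-eval` (v0.4 or later) and an environment with torch installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face causal LM backend
    model_args="pretrained=gpt2",  # which checkpoint to load
    tasks=["hellaswag"],           # any subset of the 200+ available tasks
    batch_size=8,
)
print(results["results"])          # per-task metrics such as accuracy
```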
METR's task suite tests for the ability of LLMs to complete tasks relevant to understanding autonomous and ML R&D AI capabilities. At the moment, they all require a level of cyber capability since they are computer-based tasks.
SWE-bench is the latest academic benchmark for LLM performance on difficult programming tasks, based on about 2,000 GitHub issues and their corresponding pull requests across 12 Python repositories (hf). It appears to be the most widely used in recent months. For an overview of 11 earlier benchmarks, see the review in the CodeAgent paper.
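A quick way to browse the benchmark data is via the Hugging Face datasets library; the hub path below is an assumption based on the public release, so check the project page if it has moved:

```python
# Assumes `pip install datasets`; the path "princeton-nlp/SWE-bench" is an assumption.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))          # number of task instances
print(swe_bench[0]["repo"])    # which of the 12 Python repositories the issue comes from
```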
OpenAI's preparedness framework describes four levels of cybersecurity risk, corresponding to 1) weak cyber assistance, 2) expert replacement, 3) development of MVP high-value exploits, and 4) devising and executing end-to-end novel attacks on hardened targets (OpenAI, 2023).
The responsible scaling policy (Anthropic, 2023) defines which safeguards to put in place for models at different capability levels, with a focus on catastrophic risk and containment strategies, where capability levels are measured by a task-based evaluation (METR, 2024).
Google DeepMind (2024) defines levels of AGI according to competence on narrow and general tasks, with the highest level (level 5) defined as outperforming 100% of humans. Additionally, they define levels of autonomy, with the highest (level 5) being AI as an independent agent, as opposed to AI as an expert (level 4).
The Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024) measures cyber risk via accuracy on questions that proxy the potential for misuse assistance along four stages of a cyberattack: 1) reconnaissance, 2) weaponization, 3) exploitation, and 4) post-exploitation.
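For orientation, the questions can be browsed with the datasets library; the hub path, config name, and field names below are assumptions based on the public release, so double-check against the paper's repository:

```python
# Assumes `pip install datasets`; "cais/wmdp" and "wmdp-cyber" are assumed identifiers.
from datasets import load_dataset

wmdp_cyber = load_dataset("cais/wmdp", "wmdp-cyber", split="test")
example = wmdp_cyber[0]
print(example["question"])   # a multiple-choice question proxying misuse-relevant knowledge
print(example["choices"])    # the answer options
print(example["answer"])     # index of the correct option
```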
From academia, Giudici et al. (2024) define a risk management framework for errors introduced by AI. Outside labs, Kokotajlo (2022) describes overpowering as an important metric. OpenAI's head of superalignment, Leike (2023), describes self-exfiltration as an important capability to mitigate due to its control implications.
Apollo Research argues that if AI model evaluations want to have meaningful real-world impact, we need a “Science of Evals” (Apollo Research, 2024). They provide a small overview of current work in this direction:
- Many different papers (e.g. Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023) find that different phrasings of the same question can lead to very different results, suggesting that LMs should always be evaluated on a diverse set of prompts.
- Multiple papers investigate how different ways of structuring an evaluation (e.g. as multiple-choice or generative evaluation) can lead to substantially different results, e.g. Robinson et al., 2022; Wang et al., 2023; Savelka et al., 2023; Khatun et al., 2024. Since model evaluations often try to make statements about the maximal capability of a model, it is important to be aware of how a question is structured (e.g. discriminative vs. generative evaluation) and worded.
- Several papers investigate the relationship between fine-tuning and prompting, which has important implications for capability elicitation, e.g. C. Wang et al., 2022; Liu et al., 2022; and Lin et al., 2023.
- “With Little Power Comes Great Responsibility” (Card et al., 2023) investigates the statistical significance of typical ML experiments. This is a good example of how more rigorous hypothesis testing could be applied to evaluations.
- “Are emergent capabilities a mirage?” (Schaeffer et al., 2023) argues that previously reported emergent capabilities of LMs (Wei et al., 2022; Srivastava et al., 2022) primarily depend on the metric used, e.g. accuracy vs. log-likelihood. While these flaws had already been recognized by Wei et al., 2022 and Srivastava et al., 2022, it is very valuable to rigorously understand how the choice of metric can influence the perceived capabilities.
- “True few-shot learning with Language Models” (Perez et al., 2021) argues that common few-shot techniques at the time would bias the results and thus overestimate the true abilities of LMs. Concretely, many evaluations would select few-shot examples based on a held-out validation set, instead of randomly sampling them. This emphasizes the importance of adequately designing the evals, e.g. not accidentally leaking information from the test set.
- “Elo Uncovered: Robustness and Best Practices in Language Model Evaluation” (Boubdir et al., 2023) investigates whether the commonly used Elo ranking for comparing LMs (Zheng et al., 2023) fulfills two core desiderata, reliability and transitivity, in practice. It is thus a good example of empirically validating evals methodology (a minimal Elo update is sketched after this list).
- Model evaluation survey papers like Chang et al., 2023 summarize the state of the field, discuss trends and examples, and explicitly call for treating model evaluations as a discipline in its own right. Zhang et al., 2023 and Ivanova, 2023 ("Running cognitive evaluations on large language models: The do's and the don'ts") are initial work in meta-evaluating model evaluations as a field and proposing concrete recommendations.
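Since Elo ratings come up repeatedly in LM comparisons, here is a minimal sketch of a single Elo update after one pairwise model comparison; the K-factor of 32 is just a common default, not what any particular leaderboard uses:

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two Elo ratings after one head-to-head comparison (ties ignored for brevity)."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    # Both ratings move by the same magnitude in opposite directions.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: model A (1000) beats model B (1000); A gains 16 points, B loses 16.
print(elo_update(1000.0, 1000.0, a_wins=True))
```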
- Language Model Evaluation Harness aims to provide a unified framework for testing generative language models on evaluation tasks. It features 200+ tasks and support for both open-source and commercial model APIs.
- Holistic Evaluation of Language Models is a framework to evaluate generative language models on a collection of scenarios. Models can be accessed via a unified interface.