Model evaluation is a technical field of AI safety and ML research focused on benchmarking models and evaluating their capabilities and risks. Sub-fields include Benchmarking, Dangerous Capability Evaluations, and Demonstrations. Related research topics include AI governance, alignment research, and scenario planning for AI.
This is a short overview of introductory material in the field of model evaluations and serves as a guide for engaging with the field.
This is a work in progress, and we invite interested parties to submit pull requests for new materials.
This guide accompanies the democracy x AI hackathon, where the main topic is material for demonstrating and extrapolating risks to society from AI.
Project examples include developing an LLM that contains a sleeper agent activating on election day, training agents to skew poll results and inflate support for particular policies, or using an LLM to draft uncontroversial legislative proposals that, if implemented, indirectly impact more contentious or harmful policies.
These are ideas to get you started, and you can check out the results from previous hackathons to see the types of projects you can develop in just one weekend. For example, EscalAItion found that LLMs have a propensity to escalate in military scenarios; after further development, it was accepted at the multi-agent security workshop at NeurIPS 2023.
Here is some interesting material to get inspiration for the hackathon:
- CivAI Demonstrations for Policy Makers
- A paper presenting a framework for AI and Democracy
- Article by Yoshua Bengio in the Journal of Democracy
- Why it Matters podcast on The Year of AI and Elections
- Guardian article about the influence of AI on the US elections
- 80,000 Hours podcast with Nina Schick on disinformation and the rise of synthetic media
- Rest of World tracks key global incidents of AI-generated election content
- Weapons of Mass Destruction Proxy (WMDP) Benchmark
- Spending $200 to make LLMs unsafe
- An experimental demonstration of strategic deception in LLMs
Loading Open-source models in Google Colab
See this Colab notebook to use the TransformerLens model downloader utility to easily load language models in Colab. It includes all the models available there from EleutherAI, OpenAI, Facebook AI Research, Neel Nanda, and more. Alternatively, you can use this notebook to run the latest open-source language models, such as Llama 3, via the Replicate API.
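For orientation, here is a minimal sketch of loading a model with TransformerLens in a notebook; the model name "gpt2" and the prompt are purely illustrative, and the linked notebook handles this setup for you:

```python
# Assumes `pip install transformer_lens` has been run in the Colab environment.
from transformer_lens import HookedTransformer

# Load one of the supported open-source checkpoints by name (e.g. "gpt2").
model = HookedTransformer.from_pretrained("gpt2")

# Quick sanity check that the model loaded and can generate text.
print(model.generate("Model evaluations matter because", max_new_tokens=20))
```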
You can also use the Hugging Face Transformers library directly like this.
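A minimal sketch with the Transformers library directly (the model name and prompt are again just placeholders):

```python
# Assumes `pip install transformers` (plus torch) in the environment.
from transformers import pipeline

# A small open model keeps the example lightweight; swap in any causal LM from the Hub.
generator = pipeline("text-generation", model="gpt2")
output = generator("Model evaluations matter because", max_new_tokens=20)
print(output[0]["generated_text"])
```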
Extrapolating a trend into the future
A simple notebook that fits a polynomial regression to data and plots an extrapolation into the future.
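The core idea is simple enough to sketch with NumPy and Matplotlib; the data points below are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical yearly scores on some capability metric (illustrative numbers only).
years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
scores = np.array([10.0, 14.0, 21.0, 30.0, 43.0, 60.0])

# Fit a degree-2 polynomial and extrapolate a few years beyond the observed data.
coeffs = np.polyfit(years, scores, deg=2)
future_years = np.arange(2018, 2029)
predicted = np.polyval(coeffs, future_years)

plt.scatter(years, scores, label="observed")
plt.plot(future_years, predicted, label="degree-2 extrapolation")
plt.xlabel("year")
plt.ylabel("score")
plt.legend()
plt.show()
```

Keep in mind that polynomial extrapolation is very sensitive to the chosen degree and should be treated as a rough illustration rather than a forecast.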
Notebooks by Hugging Face
These notebooks all pertain to the usage of transformers and show how to use the library. See them all here. Some notable notebooks include:
Finetuning a language model with OpenAI or Replicate
This notebook provides an overview of how you can finetune LLMs, including via the Replicate API. Go to the notebook here.
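As a rough illustration, launching a fine-tuning job through the OpenAI Python SDK looks something like the sketch below; the file name "train.jsonl" and the base model are placeholders, and the notebook above covers Replicate and other options:

```python
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Upload chat-formatted training examples, then start a fine-tuning job on a base model.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model; pick one that supports fine-tuning
)
print(job.id)  # use this id to poll the job status later
```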
Cloning your voice
This notebook allows you to clone your own voice.
Language Model Evaluation Harness
The LMEH is a set of over 200 tasks that you can automatically run your models through. You can install it with pip install lm-eval (or !pip install lm-eval at the top of a Colab notebook).
See a Colab notebook that briefly introduces how to use it here.
Check out the GitHub repository and the guide to adding a new benchmark so you can test your own tasks using their easy interface.
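A minimal sketch of running the harness from Python is shown below; the backend string, checkpoint, and task are illustrative, and the exact entry point can differ between versions (the CLI equivalent is the lm_eval command documented in the repository):

```python
# Assumes `pip install lm-eval` (v0.4 or later) and an environment with torch installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face causal LM backend
    model_args="pretrained=gpt2",  # which checkpoint to load
    tasks=["hellaswag"],           # any subset of the 200+ available tasks
    batch_size=8,
)
print(results["results"])          # per-task metrics such as accuracy
```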
METR's task suite tests for the ability of LLMs to complete tasks relevant to understanding autonomous and ML R&D AI capabilities. At the moment, they all require a level of cyber capability since they are computer-based tasks.
SWE-bench is the latest academic benchmark for LLM performance on difficult programming tasks, based on about 2,000 GitHub issues and their corresponding pull requests across 12 Python repositories (hf). It appears to be the most widely used in recent months. For an overview of 11 earlier benchmarks, see the review in the CodeAgent paper.
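A quick way to browse the benchmark data is via the Hugging Face datasets library; the hub path below is an assumption based on the public release, so check the project page if it has moved:

```python
# Assumes `pip install datasets`; the path "princeton-nlp/SWE-bench" is an assumption.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))          # number of task instances
print(swe_bench[0]["repo"])    # which of the 12 Python repositories the issue comes from
```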
OpenAI's preparedness framework describes four levels of cybersecurity risk, corresponding to 1) weak cyber assistance, 2) expert replacement, 3) development of MVP high-value exploits, and 4) devising and executing end-to-end novel attacks on hardened targets (OpenAI, 2023).
The responsible scaling policy (Anthropic, 2023) defines which safeguards to put in place for models at different capability levels, with a focus on catastrophic risk and containment strategies, where capability levels are measured by a task-based evaluation (METR, 2024).
Google DeepMind (2024) defines levels of AGI according to competence on narrow and general tasks, with the highest level (level 5) defined as outperforming 100% of humans. Additionally, they define levels of autonomy, with the highest (level 5) being AI as an independent agent, as opposed to AI as an expert (level 4).
The Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024) measures cyber risk via accuracy on questions that proxy the potential for misuse assistance along four stages of a cyberattack: 1) reconnaissance, 2) weaponization, 3) exploitation, and 4) post-exploitation.
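For orientation, the questions can be browsed with the datasets library; the hub path, config name, and field names below are assumptions based on the public release, so double-check against the paper's repository:

```python
# Assumes `pip install datasets`; "cais/wmdp" and "wmdp-cyber" are assumed identifiers.
from datasets import load_dataset

wmdp_cyber = load_dataset("cais/wmdp", "wmdp-cyber", split="test")
example = wmdp_cyber[0]
print(example["question"])   # a multiple-choice question proxying misuse-relevant knowledge
print(example["choices"])    # the answer options
print(example["answer"])     # index of the correct option
```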
From academia, Giudici et al. (2024) define a risk management framework for errors introduced by AI. Outside labs, Kokotajlo (2022) describes overpowering as an important metric. OpenAI's head of superalignment, Leike (2023), describes self-exfiltration as an important capability to mitigate due to its control implications.
Apollo Research argues that if AI model evaluations want to have meaningful real-world impact, we need a “Science of Evals” (Apollo Research, 2024). They provide a small overview of current work in this direction:
- Many different papers (e.g. Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023) find that different phrasings of the same question can lead to very different results, suggesting that LMs should always be evaluated on a diverse set of prompts.
- Multiple papers investigate how different ways of structuring an evaluation (e.g. as multiple-choice or generative evaluation) can lead to substantially different results, e.g. Robinson et al., 2022; Wang et al., 2023; Savelka et al., 2023; Khatun et al., 2024. Since model evaluations often try to make statements about the maximal capability of a model, it is important to be aware of how a question is structured (e.g. discriminative vs. generative evaluation) and worded.
- Several papers investigate the relationship between fine-tuning and prompting, which has important implications for capability elicitation, e.g. C. Wang et al., 2022; Liu et al., 2022; and Lin et al., 2023.
- “With Little Power Comes Great Responsibility” (Card et al., 2023) investigates the statistical significance of typical ML experiments. This is a good example of how more rigorous hypothesis testing could be applied to evaluations.
- “Are emergent capabilities a mirage?” (Schaeffer et al., 2023) argues that previously reported emergent capabilities of LMs (Wei et al., 2022; Srivastava et al., 2022) primarily depend on the metric used, e.g. accuracy vs. log-likelihood. While these flaws had already been recognized by Wei et al., 2022 and Srivastava et al., 2022, it is very valuable to rigorously understand how the choice of metric can influence the perceived capabilities.
- “True few-shot learning with Language Models” (Perez et al., 2021) argues that common few-shot techniques at the time would bias the results and thus overestimate the true abilities of LMs. Concretely, many evaluations would select few-shot examples based on a held-out validation set, instead of randomly sampling them. This emphasizes the importance of adequately designing the evals, e.g. not accidentally leaking information from the test set.
- “Elo Uncovered: Robustness and Best Practices in Language Model Evaluation” (Boubdir et al., 2023) investigates whether the commonly used Elo ranking for comparing LMs (Zheng et al., 2023) fulfills two core desiderata, reliability and transitivity, in practice. It is thus a good example of empirically validating evals methodology (a minimal Elo update is sketched after this list).
- Model evaluation survey papers like Chang et al., 2023 summarize the state of the field, discuss trends and examples, and explicitly call for treating model evaluations as a discipline in its own right. Zhang et al., 2023 and Ivanova, 2023 ("Running cognitive evaluations on large language models: The do's and the don'ts") are initial work in meta-evaluating model evaluations as a field and proposing concrete recommendations.
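Since Elo ratings come up repeatedly in LM comparisons, here is a minimal sketch of a single Elo update after one pairwise model comparison; the K-factor of 32 is just a common default, not what any particular leaderboard uses:

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two Elo ratings after one head-to-head comparison (ties ignored for brevity)."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    # Both ratings move by the same magnitude in opposite directions.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: model A (1000) beats model B (1000); A gains 16 points, B loses 16.
print(elo_update(1000.0, 1000.0, a_wins=True))
```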
- Language Model Evaluation Harness aims to provide a unified framework for testing generative language models on evaluation tasks. It features 200+ tasks and support for both open-source and commercial model APIs.
- Holistic Evaluation of Language Models is a framework to evaluate generative language models on a collection of scenarios. Models can be accessed via a unified interface.