forked from openai/evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.


laiso/evals

 
 

Evals

Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.

We now support evaluating the behavior of any system, including prompt chains and tool-using agents, via the Completion Function Protocol.

With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior. To get started, we recommend that you follow these steps:

To get set up with evals, follow the setup instructions below.

Running evals

Writing evals

Writing CompletionFns

If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.


🚨 For a limited time, we will be granting GPT-4 access to those who contribute high-quality evals. Please follow the instructions mentioned above, and note that spam or low-quality submissions will be ignored❗️

Access will be granted to the email address associated with an accepted Eval. Due to high volume, we are unable to grant access to any email other than the one used for the pull request.


Setup

To run evals, you will need to set up and specify your OpenAI API key. You can generate one at https://platform.openai.com/account/api-keys. After you obtain an API key, specify it using the OPENAI_API_KEY environment variable. Please be aware of the costs
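
As a quick sanity check before running anything, you may want to confirm that the key is actually visible to your Python process. The snippet below is a minimal sketch, not part of the Evals package: the helper name `require_api_key` is hypothetical, and the only things taken from the instructions above are the `OPENAI_API_KEY` variable name and the key-generation URL.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is unset.

    `require_api_key` is a hypothetical helper for illustration only; it is
    not provided by Evals. `env` defaults to os.environ and exists so the
    check can be exercised with a plain dict.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; generate a key at "
            "https://platform.openai.com/account/api-keys and export it first."
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("OPENAI_API_KEY is set.")
    except RuntimeError as err:
        print(err)
```

Failing early like this avoids starting a run that would otherwise stop at the first API call. The script deliberately never prints the key itself.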