forked from openai/evals
Commit 5517674 (1 parent: cb453ab)

This is a V2 of the Schelling Point eval. Changes:
- Moved utility functions to utils.py
- Improved prompting
- New combined dataset and config to run the eval with this combined dataset

Co-authored-by: ojaffe <[email protected]>

Showing 7 changed files with 392 additions and 120 deletions.
# Eval description

This evaluation explores a model's ability to coordinate with other AI systems (either copies of itself or instances of other models) in what is known as a Schelling point setup. A [Schelling point](https://en.wikipedia.org/wiki/Focal_point_(game_theory)) is a solution that people tend to converge on in the absence of communication, because it seems like a natural choice when each party reasons about what the others will do. In this evaluation, AI models are presented with differently shuffled lists of numbers or words, and we test one model's ability to converge on the same answer as another model. The evaluation encompasses several datasets, including randomly sampled numbers, words, and passages from various texts. The headline metric is the success rate at coordinating on an answer choice with another model, without the opportunity to communicate directly.

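As an illustration of the headline metric, here is a minimal sketch in Python. The function names, the toy word list, and the sort-then-pick-first strategy are all illustrative assumptions, not the eval's actual implementation:

```python
import random

def coordination_rate(pairs):
    """Fraction of trials in which the two models gave the same answer.

    pairs: list of (answer_a, answer_b) strings, one tuple per trial.
    """
    if not pairs:
        return 0.0
    return sum(a.strip().lower() == b.strip().lower() for a, b in pairs) / len(pairs)

# One natural focal-point strategy available without communication: each
# model sorts its own (differently shuffled) copy of the list and picks the
# first item, which agrees across models regardless of shuffle order.
def schelling_pick(shuffled_items):
    return sorted(shuffled_items)[0]

items = ["apple", "banana", "cherry", "date"]
view_a = random.sample(items, len(items))  # model A's shuffled view
view_b = random.sample(items, len(items))  # model B's shuffled view
print(coordination_rate([(schelling_pick(view_a), schelling_pick(view_b))]))  # 1.0
```

The sort-then-pick strategy succeeds here because both views contain the same underlying items; real models must instead converge on such a convention implicitly from the prompt alone.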
# Token estimate

Below is a rough estimate of the total number of tokens consumed by the eval, including both input and output tokens. These estimates come from running the base eval `oaieval {model} schelling_point`:

| Model            | Tokens     |
|------------------|------------|
| text-davinci-002 | 33 000 000 |
| code-davinci-002 | 35 000 000 |
| gpt-3.5-turbo    | 4 000 000  |
| gpt-4-base       | -          |
| gpt-4            | 4 800 000  |

Different variants of the Schelling point eval may consume different numbers of tokens.

On Oct 31, 2023, OpenAI API pricing was $0.002 / 1K input tokens for `davinci-002`; $0.003 / 1K input tokens and $0.004 / 1K output tokens for `gpt-3.5-turbo-16k`; $0.03 / 1K input tokens and $0.06 / 1K output tokens for `gpt-4`; and $0.06 / 1K input tokens and $0.12 / 1K output tokens for `gpt-4-32k`. We count input and output tokens together, so these rates give a lower and an upper bound on the cost of each variant.

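The bounding arithmetic above can be sketched as follows for two rows of the token table (a back-of-envelope calculation under the quoted Oct 31, 2023 prices; the helper name `cost_bounds` is illustrative):

```python
# Because input and output tokens are counted together, pricing every token
# at the input rate gives a lower bound on cost, and pricing every token at
# the output rate gives an upper bound.
tokens = {
    "gpt-3.5-turbo": 4_000_000,  # from the token table above
    "gpt-4": 4_800_000,
}
rates = {  # ($ per 1K input tokens, $ per 1K output tokens)
    "gpt-3.5-turbo": (0.003, 0.004),  # quoted gpt-3.5-turbo-16k pricing
    "gpt-4": (0.03, 0.06),
}

def cost_bounds(model):
    n_thousands = tokens[model] / 1000
    lo_rate, hi_rate = rates[model]
    return n_thousands * lo_rate, n_thousands * hi_rate

for model in tokens:
    lo, hi = cost_bounds(model)
    print(f"{model}: ${lo:,.2f} - ${hi:,.2f}")
# gpt-3.5-turbo: $12.00 - $16.00
# gpt-4: $144.00 - $288.00
```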
# Contribution statement

Eval design, implementation, and results evaluation were primarily conducted by Oam Patel, under the guidance of (alphabetically by last name) Steven Adler, James Aung, Rosie Campbell, and Jade Leung, who provided research input and project management support. Richard Ngo provided initial inspiration for the idea and iterated on research methodologies.