tasks and metrics for Biomedicine from lmj #13

Open
wants to merge 668 commits into
base: main

Conversation

Linmj-Judy
Collaborator

Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

PLEASE READ THIS:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.

Also, please note that we're using Git LFS for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available here.

Eval details 📑

Eval name

Adds the following evals:

  • semantic_role_recognition
  • chemical_entities_recognition
  • CDR
  • disease_entities_recognition

Eval description

[Insert a short description of what your eval does here]

What makes this a useful eval?

[Insert why this eval is worth including and any additional context]

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

  • Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • Include at least 15 high-quality examples.

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Insert what makes your eval high quality that was not mentioned above. (Not required)

Eval structure 🏗️

Your eval should

  • Check that your data is in evals/registry/data/{name}
  • Check that your YAML is registered at evals/registry/evals/{name}.yaml
  • Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
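The two path checks above can be sketched as a small script. This is a minimal illustration, not part of the Evals tooling: the helper name `missing_eval_paths` is hypothetical, and it only tests that the two locations named in the checklist exist under a local checkout.

```python
from pathlib import Path


def missing_eval_paths(repo_root: str, name: str) -> list[str]:
    """Return the expected registry paths that do not exist for eval `name`.

    Hypothetical helper: it mirrors the two locations named in the checklist
    and does not inspect the contents of either one.
    """
    root = Path(repo_root)
    expected = [
        root / "evals" / "registry" / "data" / name,             # data directory
        root / "evals" / "registry" / "evals" / f"{name}.yaml",  # registration file
    ]
    # Report paths relative to the repository root, with forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]
```

An empty list means both locations exist; it says nothing about whether the YAML is valid or the data is tracked by Git LFS.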

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

Submit eval

  • I have filled out all required fields of this form
  • I have used Git LFS for the Eval JSON data
  • (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that mypy, black, isort, autoflake and ruff are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

INSERT_EVAL_HERE

jwang47 and others added 30 commits June 6, 2023 11:29
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject since GPT-4 is already capable of completing the
task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑
### Eval name
russian-verse

### Eval description

Tests whether the model can continue the most popular Russian poems,
which nearly every Russian speaker can recall.

### What makes this a useful eval?

Understanding a basic Russian poem, or any foreign literature, matters
for a large language model (LLM) like GPT-4 because it enhances
multilingual ability, provides cultural context, and improves
understanding of language structure. It makes the model globally useful
and culturally sensitive.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Continue verse with no
punctuation marks:\nМороз и солнце день чудесный\nЕще ты дремлешь друг
прелестный \nПора красавица проснись\nОткрой сомкнуты негой
взоры\nНавстречу северной Авроры"}], "ideal": "Звездою севера явись"}
{"input": [{"role": "system", "content": "Continue verse with no
punctuation marks:\nУ лукоморья дуб зелёный\nЗлатая цепь на дубе том\nИ
днём и ночью кот учёный\nВсё ходит по цепи кругом\nИдёт направо песнь
заводит"}], "ideal": "Налево сказку говорит"}
{"input": [{"role": "system", "content": "Continue verse with no
punctuation marks:\nЯ к вам пишу чего же боле\nЧто я могу еще
сказать\nТеперь я знаю в вашей воле\nМеня презреньем наказать\nНо вы к
моей несчастной доле"}], "ideal": "Хоть каплю жалости храня"}
{"input": [{"role": "system", "content": "Continue verse with no
punctuation marks:\nЯ помню чудное мгновенье\nПередо мной явилась
ты\nКак мимолетное виденье\nКак гений чистой красоты\nВ томленьях грусти
безнадежной"}], "ideal": "В тревогах шумной суеты"}
{"input": [{"role": "system", "content": "Continue verse with no
punctuation marks:\nЛюбви надежды тихой славы\nНедолго нежил нас
обман\nИсчезли юные забавы\nКак сон как утренний туман\nНо в нас горит
еще желанье"}], "ideal": "Под гнетом власти роковой"}
  ```
</details>
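
The sample format above (an `input` list of role/content messages plus an `ideal` answer) can be checked before submission with a short script. This is a rough sketch under stated assumptions: `validate_samples` is a hypothetical helper, `ideal` is assumed to be a string as in the samples shown, and each record is assumed to sit on one physical line (the samples above are hard-wrapped for the commit message, so they would need to be re-joined first).

```python
import json

MIN_SAMPLES = 15  # minimum eval count stated in the PR template


def validate_samples(jsonl_text: str) -> list[str]:
    """Return a list of problems found in JSONL eval samples (empty if none)."""
    problems = []
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines, start=1):
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {i}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(sample, dict):
            problems.append(f"line {i}: record must be a JSON object")
            continue
        messages = sample.get("input")
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {i}: 'input' must be a non-empty list")
        elif not all(
            isinstance(m, dict) and {"role", "content"} <= m.keys()
            for m in messages
        ):
            problems.append(f"line {i}: every message needs 'role' and 'content'")
        # Assumption: 'ideal' is a single string, as in the samples shown.
        if not isinstance(sample.get("ideal"), str):
            problems.append(f"line {i}: 'ideal' must be a string")
    if len(lines) < MIN_SAMPLES:
        problems.append(f"only {len(lines)} samples; at least {MIN_SAMPLES} expected")
    return problems
```

The check is structural only; it cannot tell whether an `ideal` line is the correct continuation of the verse.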
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject since GPT-4 is already capable of completing the
task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

## Eval details 📑
### Eval name
cybersecurity-filepaths

### Eval description

Assesses cybersecurity skills by identifying the malicious Windows
filepath from a given collection of filepaths. Experienced threat
analysts over time learn to recognize the patterns of malicious
filepaths, and this eval tests that ability with a set of tricky clean
applications and malicious attacks that an expert human can identify
based on their years of experience.

The eval includes 93 tests constructed from 44 clean filepaths and 32
malicious filepaths. Only malicious filepaths that stand out to an
expert security researcher based on context were included, and the clean
paths were also chosen to be challenging and realistic. Usernames are
stripped from the filepaths. Each test is constructed by selecting a
single malicious filepath and mixing it with 5, 10, or 20 clean
filepaths to model increasing levels of difficulty.
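
The construction described above could look roughly like the sketch below. The script actually used to generate the tests is not shown in this PR, so `build_sample` is a hypothetical helper; the prompt wording and the convention of dropping the leading `C:\` from `ideal` are taken from the samples in the Eval JSON data section.

```python
import random

SYSTEM_PROMPT = (
    "You are acting as an expert cybersecurity researcher reviewing telemetry "
    "logs of filepaths in your organization. The task is to identify the "
    "most-malicious looking filepath given a set of filepaths."
)
FENCE = "`" * 3  # the samples wrap the path list in triple backticks


def build_sample(
    malicious: str, clean_pool: list[str], n_clean: int, rng: random.Random
) -> dict:
    """Mix one malicious path with n_clean clean paths into one eval sample."""
    paths = rng.sample(clean_pool, n_clean) + [malicious]
    rng.shuffle(paths)  # hide the malicious path at a random position
    user = (
        "Here are a bunch of filepaths in my organization. Please tell me "
        "which one looks the most malicious. "
        + FENCE + ",".join(paths) + FENCE
        + " Answer: C:\\"
    )
    return {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
        ],
        # The prompt already ends with "C:\", so the ideal omits that prefix.
        "ideal": malicious.removeprefix("C:\\"),
    }
```

Calling it with `n_clean` set to 5, 10, or 20 gives the three difficulty levels; passing a seeded `random.Random` keeps the generated set reproducible.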

gpt-3.5-turbo scores 18.3% accuracy, and based on some manual tests it
looks like ChatGPT 4 might score around 80% (though I'll need access to
check).

### What makes this a useful eval?
Getting an accurate understanding of the cybersecurity space poses a
unique challenge. Content scraped from the web might over-index on
certain filepaths being related to malware. For example,
SolarWinds-related filepaths are generally clean but may be incorrectly
understood as malware due to the amount of content on the web related to
the supply chain attack leveraging their software.

Applying ChatGPT 4 to the cybersecurity landscape to protect the world
against attacks is an important emerging space for these large models.
They are starting to show human-like insight into telemetry. If we can
deploy the type of intelligence demonstrated by this eval at a cheaper
cost point, we'd be able to make a big difference in protecting the
world from attacks.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files
(x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\Program
Files (x86)\\Prococious Technology
Inc\\ClearDent\\txtools.dll,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Program
Files (x86)\\EaseUS\\Todo
Backup\\bin\\Loader.exe,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\chrome\\chrome.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\chrome\\chrome.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Program
Files (x86)\\Prococious Technology
Inc\\ClearDent\\txtools.dll,C:\\Users\\username\\Desktop\\rufus-3.20p
(1).exe,C:\\Windows\\svchost.exe,C:\\Windows\\SysWOW64\\mfc120deu.dll```
Answer: C:\\"}], "ideal": "Windows\\svchost.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Program Files
(x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\ProgramData\\muIWiyPVLEpuZPVB\\QVEVjLh.wsf,C:\\windows\\system32\\fsprocsvc.exe,C:\\LDPlayer\\LDPlayer9\\libssl-1_1.dll```
Answer: C:\\"}], "ideal": "ProgramData\\muIWiyPVLEpuZPVB\\QVEVjLh.wsf"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\ProgramData\\{2FA73671-7C6C-4bf8-9F0C-E1A75A3860ED}.tmp\\HardwareProtectEx_x64.sys,C:\\Program
Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\Program
Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\unityhub.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\Microsoft\\unityhub.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\SysWOW64\\drivers\\lsass.exe,C:\\Users\\username\\Downloads\\Inventory
Update Tool GDSM
v5.docx,C:\\Users\\username\\Downloads\\NTW3-9.2-Campaigns.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Program
Files (x86)\\EaseUS\\Todo
Backup\\bin\\Loader.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar```
Answer: C:\\"}], "ideal": "Windows\\SysWOW64\\drivers\\lsass.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program
Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Program
Files
(x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll,C:\\Windows\\Installer\\a8a9a8f.msi,C:\\Windows\\Resources\\spoolsv.exe,C:\\Program
Files (x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe``` Answer: C:\\"}],
"ideal": "Windows\\Resources\\spoolsv.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Users\\username\\AppData\\Roaming\\imageclass\\Library.dll,C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Program
Files (x86)\\Steam\\steamapps\\common\\Grand Theft Auto
V\\GTA5.exe,C:\\Program Files (x86)\\EaseUS\\Todo
Backup\\bin\\Loader.exe,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\imageclass\\Library.dll"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\windows\\system32\\fsprocsvc.exe,C:\\Users\\username\\Downloads\\Inventory
Update Tool GDSM
v5.docx,C:\\Users\\username\\Downloads\\Order_Verication_363900421
(1).docm,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Program Files
(x86)\\Steam\\steamapps\\common\\Grand Theft Auto V\\GTA5.exe``` Answer:
C:\\"}], "ideal":
"Users\\username\\Downloads\\Order_Verication_363900421 (1).docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Users\\username\\Downloads\\Invoice_1232.docm,C:\\Users\\username\\Downloads\\Rotation
Schedule Planner for April
2023.doc,C:\\Windows\\System32\\svchost.exe,C:\\Windows\\System32\\SECOMN64.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll```
Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\Invoice_1232.docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\Downloads\\validationDocument
(1).docm,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Program
Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\validationDocument (1).docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Users\\username\\Downloads\\Rotation
Schedule Planner for April
2023.doc,C:\\windows\\system32\\fsprocsvc.exe,C:\\Users\\username\\Downloads\\PaymentReceipt.docm,C:\\Program
Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Program Files
(x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\PaymentReceipt.docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Program
Files\\SoftEther VPN
Client\\vpnsetup.exe,C:\\Users\\username\\Downloads\\scan.docm,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Windows\\Installer\\a8a9a8f.msi```
Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\scan.docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files\\SoftEther VPN
Client\\vpnsetup.exe,C:\\Program
Files\\qBittorrent\\qbittorrent.exe,C:\\Users\\username\\AppData\\Local\\2345Explorer\\User
Data\\Default\\Extensions\\lfjjlbddikjohbgnamejecaegefncbli\\1.0_0\\static\\js\\bfb.js,C:\\Users\\username\\Downloads\\Receipt(7682).docm,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys```
Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\Receipt(7682).docm"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program
Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\LDPlayer\\LDPlayer9\\libssl-1_1.dll,C:\\Program
Files
(x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\Downloads\\Advice_Paymet_IBGC825120123123254
Pdf.html,C:\\Windows\\System32\\cmd.exe``` Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\Advice_Paymet_IBGC825120123123254
Pdf.html"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files
(x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe,C:\\Windows\\System32\\cmd.exe,C:\\Program
Files
(x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Users\\username\\Downloads\\Complete_Setup_Use_2023_As_Passwrd.rar,C:\\Program
Files\\OpeniT\\Core\\bin\\curl.exe``` Answer: C:\\"}], "ideal":
"Users\\username\\Downloads\\Complete_Setup_Use_2023_As_Passwrd.rar"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\3545695800.exe,C:\\Users\\username\\Desktop\\rufus-3.20p
(1).exe,C:\\Program
Files\\OpeniT\\Core\\bin\\curl.exe,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Windows\\System32\\svchost.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\3545695800.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\PROGRA~1\\COMMON~1\\System\\symsrv.dll,C:\\Users\\username\\Downloads\\Inventory
Update Tool GDSM v5.docx,C:\\Program Files
(x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files
(x86)\\EaseUS\\Todo
Backup\\bin\\Loader.exe,C:\\Windows\\System32\\cmd.exe``` Answer:
C:\\"}], "ideal": "PROGRA~1\\COMMON~1\\System\\symsrv.dll"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program
Files\\qBittorrent\\qbittorrent.exe,C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Program
Files
(x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe,C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe,C:\\Windows\\SysWOW64\\WMIScriptingAPI\\nssy.exe```
Answer: C:\\"}], "ideal":
"Windows\\SysWOW64\\WMIScriptingAPI\\nssy.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\System32\\cmd.exe,C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Windows\\Temp\\analyse.exe,C:\\windows\\system32\\fsprocsvc.exe,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar```
Answer: C:\\"}], "ideal": "Windows\\Temp\\analyse.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```D:\\RECYCLER.BIN\\hex.dll,C:\\Program Files
(x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\ProgramData\\Overwolf\\Setup\\0.221.109.14\\OverwolfSetup.exe,C:\\Program
Files
(x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll,C:\\Windows\\System32\\cmd.exe,C:\\Windows\\SysWOW64\\mfc120deu.dll```
Answer: C:\\"}], "ideal": "RECYCLER.BIN\\hex.dll"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Program Files
(x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files
(x86)\\Google\\Update\\GoogleUpdate.exe,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start
Menu\\Programs\\Startup\\ui.exe,C:\\Windows\\System32\\svchost.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start
Menu\\Programs\\Startup\\ui.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\System32\\SECOMN64.exe,C:\\Program Files
(x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start
Menu\\Programs\\Startup\\drive.exe,C:\\Users\\username\\AppData\\Local\\2345Explorer\\User
Data\\Default\\Extensions\\lfjjlbddikjohbgnamejecaegefncbli\\1.0_0\\static\\js\\bfb.js,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Program
Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll``` Answer:
C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start
Menu\\Programs\\Startup\\drive.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\ProgramData\\MslBooster\\WindowsPaint-Ver3.8.8.3.exe,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe,C:\\Windows\\Installer\\a8a9a8f.msi,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\ProgramData\\Overwolf\\Setup\\0.221.109.14\\OverwolfSetup.exe```
Answer: C:\\"}], "ideal":
"ProgramData\\MslBooster\\WindowsPaint-Ver3.8.8.3.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files
(x86)\\Steam\\steamapps\\common\\Grand Theft Auto
V\\GTA5.exe,C:\\Program
Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Users\\username\\Downloads\\Inventory
Update Tool GDSM
v5.docx,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\1z6E2nwrGf05i3P\\jChGsIxjHw.exe,C:\\Windows\\System32\\cmd.exe```
Answer: C:\\"}], "ideal":
"Users\\username\\AppData\\Roaming\\1z6E2nwrGf05i3P\\jChGsIxjHw.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious. ```C:\\Program Files
(x86)\\Steam\\steamapps\\common\\Grand Theft Auto
V\\GTA5.exe,C:\\Windows\\Temp\\wk_4996_28103\\wk.dll,C:\\Program Files
(x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe,C:\\ProgramData\\modPackages\\mod\\AgentFix
Client\\1.0\\Source\\Setup.exe,C:\\Users\\username\\Desktop\\rufus-3.20p
(1).exe,C:\\Program Files\\qBittorrent\\qbittorrent.exe``` Answer:
C:\\"}], "ideal": "Windows\\Temp\\wk_4996_28103\\wk.dll"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Program
Files (x86)\\Prococious Technology
Inc\\ClearDent\\txtools.dll,C:\\Program
Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\WINDOWS\\system32\\drivers\\v7JViu6IUh.sys,C:\\Program
Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll``` Answer:
C:\\"}], "ideal": "WINDOWS\\system32\\drivers\\v7JViu6IUh.sys"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Users\\username\\AppData\\Local\\Microsoft\\OneDrive\\Secur32.dll,C:\\Users\\username\\Desktop\\Pinball-Arcade.rar,C:\\ProgramData\\{2FA73671-7C6C-4bf8-9F0C-E1A75A3860ED}.tmp\\HardwareProtectEx_x64.sys,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\Program
Files (x86)\\YouKu\\YoukuClient\\plugins\\accs.dll``` Answer: C:\\"}],
"ideal":
"Users\\username\\AppData\\Local\\Microsoft\\OneDrive\\Secur32.dll"}
{"input": [{"role": "system", "content": "You are acting as an expert
cybersecurity researcher reviewing telemetry logs of filepaths in your
organization. The task is to identify the most-malicious looking
filepath given a set of filepaths."}, {"role": "user", "content": "Here
are a bunch of filepaths in my organization. Please tell me which one
looks the most malicious.
```C:\\Users\\username\\Downloads\\NTW3-9.2-Campaigns.exe,C:\\Users\\username\\Downloads\\Set-up.exe,C:\\Program
Files
(x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\Users\\username\\Desktop\\rufus-3.20p
(1).exe,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Windows\\System32\\cmd.exe```
Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Set-up.exe"}
  ```
</details>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

medication_dose

### Eval description

This tests an LLM's ability to identify medication doses that are
inappropriate for therapeutic use.
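
As a rough illustration of the task, the sketch below rebuilds the first sample from this PR's JSONL as a Python dict and scores a reply against its `ideal` field. The `is_correct` helper is hypothetical: it is a simple case-insensitive substring check, assumed here for illustration, and is not the matching logic of any particular eval class.

```python
import json

# First sample from this PR's JSONL data, rebuilt as a Python dict.
sample = {
    "input": [
        {
            "role": "system",
            "content": (
                "Answer this question as concisely as possible. Of the following "
                "medications and doses which is incorrect for a standard dose for "
                "that medication? Metronidazole 500 mg, Icatibant 30 mg, "
                "Haegarda 3000 IU, Docusate 100 mg, Levofloxacin 750 mcg, "
                "Famotidine 40 mg, Fentanyl 100 mcg, Budesonide 0.25 mg, "
                "Metoprolol 12.5 mg, Atenolol 50 mg, Labetalol 100 mg"
            ),
        },
        {
            "role": "user",
            "content": "The medication that is not within a standard dose range is: ",
        },
    ],
    "ideal": "Levofloxacin",
}


def is_correct(completion: str, ideal: str) -> bool:
    """Hypothetical grader: True if the ideal drug name appears in the reply."""
    return ideal.lower() in completion.lower()


# In a JSONL file, each record occupies exactly one line.
jsonl_line = json.dumps(sample)

print(is_correct("Levofloxacin 750 mcg", sample["ideal"]))
print(is_correct("Fentanyl", sample["ideal"]))
```

A substring check like this would also accept a reply that simply lists every drug in the prompt, so a stricter comparison may be preferable in practice.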

### What makes this a useful eval?

For LLMs to have a role in medical applications, the ability to recognize
medication doses that are outside the therapeutic range is a very important
function. I have found that these models frequently fail at this task.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

LLMs have tremendous potential for medical applications. Many future
applications will need the LLM to identify appropriate medication dose
ranges. I have found that current LLMs frequently fail at this task, so
this is an area where improvement would be important.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
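
The two registry locations in the checklist above can be checked with a small script. This is only a sketch: `registry_paths` and `missing_paths` are hypothetical helper names, and the repository root is assumed to be the current directory unless one is passed in.

```python
from pathlib import Path


def registry_paths(name: str, repo_root: str = ".") -> dict:
    """Build the data and YAML locations named in the checklist for an eval."""
    root = Path(repo_root)
    return {
        "data": root / "evals" / "registry" / "data" / name,
        "yaml": root / "evals" / "registry" / "evals" / f"{name}.yaml",
    }


def missing_paths(name: str, repo_root: str = ".") -> list:
    """Return the expected locations that do not exist on disk."""
    paths = registry_paths(name, repo_root)
    return [p.as_posix() for p in paths.values() if not p.exists()]


paths = registry_paths("medication_dose")
print(paths["data"].as_posix())
print(paths["yaml"].as_posix())
```

Calling `missing_paths("medication_dose")` from the repository root should return an empty list once both the data directory and the YAML file are in place.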

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl

{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Metronidazole 500 mg,
Icatibant 30 mg, Haegarda 3000 IU, Docusate 100 mg, Levofloxacin 750
mcg, Famotidine 40 mg, Fentanyl 100 mcg, Budesonide 0.25 mg, Metoprolol
12.5 mg, Atenolol 50 mg, Labetalol 100 mg"}, {"role": "user", "content":
"The medication that is not within a standard dose range is: "}],
"ideal": "Levofloxacin"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Docusate 100 mg,
Diazepam 5 mg, Lanoxin 250 mcg, Privigen 25 g, Doxycycline 100 mg,
CellCept 1.5 g, Xolair 300 mcg, Hydrocortisone 1%, Keflex 250 mg,
Naloxone 400 mcg, Famotidine 40 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Xolair"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Budesonide 0.25 mg,
Theophylline 450 mg, Humira 160 mg, Haegarda 3000 IU, Neoral 50 mg,
Metronidazole 500 mg, Formoterol 20 mcg, Advair 500 mcg, Zosyn 3.375 mg,
Furosemide 20 mg, Dilantin 100 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Zosyn"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Breo 200/25 mcg,
Acetaminophen 1000 mg, Claratin 10 mg, Gammunex 100 gram, Levothyroxine
125 mg, Albuterol 108 mcg, Lovenox 40 mg, Betapace 120 mg, Levofloxacin
500 mg, Nystatin 100000 U, Warfarin 6.5 mg"}, {"role": "user",
"content": "The medication that is not within a standard dose range is:
"}], "ideal": "Levothyroxine"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Azithromycin 500 mg,
Heparin 5000 U, Atenolol 50 mg, Betapace 120 mg, Budesonide 0.25 mg,
Privigen 25 g, Furosemide 20 mg, Humira 160 mg, Keflex 250 mg, Verapamil
40 mg, Symbicort 500/4.5 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Symbicort"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Famotidine 40 mg,
Digoxin 0.125 mg, Rifampin 150 mg, Albuterol 108 mcg, Allegra 60 mg,
Azithromycin 5 mg, Spiriva 1.25 mcg, Warfarin 6.5 mg, Nasacort 220 mcg,
Cetirizine 10 mg, Azithromycin 500 mg"}, {"role": "user", "content":
"The medication that is not within a standard dose range is: "}],
"ideal": "Azithromycin"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Acyclovir 200 mg,
Patanol 0.2%, Famotidine 40 mg, Heparin 5000 U, Gammunex 100 gram,
Prednisone 40 mg, Amitriptyline 25 mg, Betapace 120 mg, Sotolol 80 mg,
Cetirizine 100 mg, Loperamide 2 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Cetirizine"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Fasenra 30 mg,
Sotolol 80 mg, Levothyroxine 125 mcg, Digoxin 0.125 mg, Budesonide 1 gm,
Loperamide 2 mg, Humira 160 mg, Patanol 0.2%, Dilantin 100 mg, Rifampin
150 mg, Keflex 250 mg"}, {"role": "user", "content": "The medication
that is not within a standard dose range is: "}], "ideal": "Budesonide"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Fasenra 130 mg,
Doxycycline 100 mg, Patanol 0.2%, Pulmicort 1 mg, Flonase 50 mcg,
Tiotropium 2.5 mcg, Nasacort 220 mcg, Acetaminophen 1000 mg, Icatibant
30 mg, Cetirizine 10 mg, Theophylline 450 mg"}, {"role": "user",
"content": "The medication that is not within a standard dose range is:
"}], "ideal": "Fasenra"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Plaquenil 400 mg,
Atorvastatin 2000 mg, Lovenox 40 mg, Keflex 250 mg, Pulmicort 1 mg,
Versed 5 mg, Spiriva 1.25 mcg, Cyclosporine 100 mg, Ventolin 90 mcg,
Icatibant 30 mg, Loperamide 2 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Atorvastatin"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Tiotropium 2.5 mcg,
Plaquenil 400 mg, Azelastine 137 mg, Haegarda 3000 IU, Albendazole 400
mg, Phenytoin 30 mg, Naloxone 400 mcg, Symbicort 160/4.5 mcg, Isoniazid
100 mg, Diazepam 5 mg, Metformin 500 mg"}, {"role": "user", "content":
"The medication that is not within a standard dose range is: "}],
"ideal": "Azelastine"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Budesonide 0.25 mg,
Fexofenadine 180 mg, Keflex 250 mg, Fexofenadine 1.8 mg, Allegra 60 mg,
Flonase 50 mcg, Rhinocort 32 mcg, Azithromycin 500 mg, Sotolol 80 mg,
Fluoxetine 20 mg, Pulmicort 1 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Fexofenadine"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Flecainide 50 mg,
Claratin 10 mg, Sotolol 80 mg, Atenolol 50 mg, Metformin 500 mg, Flonase
50 mcg, Advair 5 mcg, Labetalol 100 mg, Nasacort 220 mcg, Fentanyl 100
mcg, Budesonide 0.25 mg"}, {"role": "user", "content": "The medication
that is not within a standard dose range is: "}], "ideal": "Advair"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Prednisone 40 g,
Digoxin 0.125 mg, Budesonide 0.25 mg, Icatibant 30 mg, Betapace 120 mg,
Fluconazole 50 mg, Patanol 0.2%, Heparin 5000 U, Coumadin 3.5 mg,
Metformin 500 mg, Fasenra 30 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Prednisone"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Fluconazole 50 mg,
Furosemide 20 mg, Verapamil 40 mg, Nystatin 100000 U, Augmentin 875 mg,
Augmentin 8.75 mg, Theophylline 450 mg, Lanoxin 250 mcg, Neoral 50 mg,
Nasacort 220 mcg, Diazepam 5 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Augmentin"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Haegarda 3000 IU,
Lithium 600 mg, Acyclovir 200 mg, CellCept 1.5 g, Fasenra 30 mg,
Metformin 500 mg, Albendazole 400 mg, Advair 500 mcg, Zofran 4 mg,
Ciprofloxacin 500 mg, Haegarda 3000 mg"}, {"role": "user", "content":
"The medication that is not within a standard dose range is: "}],
"ideal": "Haegarda"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Orapred 15 mg,
Carvedilol 3.125 mg, Lanoxin 250 mcg, Lovenox 40 mg, Loperamide 2 mg,
Hydrocortisone 1%, Diazepam 5 mg, Zosyn 3.375 g, Warfarin 6.5 mg,
Synthroid 88 mg, Cetirizine 10 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Synthroid"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Patanol 0.2%,
Icatibant 30 g, Gammunex 100 gram, Versed 5 mg, Formoterol 20 mcg,
Rhinocort 32 mcg, Metformin 500 mg, Motrin 800 mg,, Enoxaprin 30 mg,
Metoprolol 12.5 mg, Xolair 300 mg"}, {"role": "user", "content": "The
medication that is not within a standard dose range is: "}], "ideal":
"Icatibant"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? CellCept 1.5 g,
Flonase 50 mcg, Trimethoprim/Sulfamethoxazole 160 mg/800 mg, Fluconazole
50 mg, Verapamil 40 mg, Loperamide 2 mg, Prednisone 40 mg, Pepcid 20 mg,
Plaquenil 4 mcg, Nystatin 100000 U, Labetalol 100 mg"}, {"role": "user",
"content": "The medication that is not within a standard dose range is:
"}], "ideal": "Plaquenil"}
{"input": [{"role": "system", "content": "Answer this question as
concisely as possible. Of the following medications and doses which is
incorrect for a standard dose for that medication? Carvedilol 3.125 mg,
Digoxin 0.125 mg, Fexofenadine 180 mg, Nasacort 220 mcg, Zofran 4 mg,
Fluoxetine 20 mg, Mupirocin 2%, CellCept 1.5 g, Keflex 250 mg,
Atorvastatin 20 mg, Amitriptyline 25 g"}, {"role": "user", "content":
"The medication that is not within a standard dose range is: "}],
"ideal": "Amitriptyline"}


  ```
</details>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

context-free-grammar

### Eval description

This tests the ability of GPT-4 to determine whether a string can be
produced by a given context-free grammar.
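
The expected answers for a grammar like the one in the samples can be checked mechanically. The sketch below is a minimal memoized recognizer written specifically for `S -> 01 | SS | 0S1 | ϵ`; the function name `in_language` is an assumption for illustration, and this is not how the eval itself grades responses. For `S -> SS` it only tries splits into two non-empty parts, because a split with an empty half yields the same string as the other half alone.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def in_language(s: str) -> bool:
    """Return True if s can be derived from S -> 01 | SS | 0S1 | ϵ."""
    # S -> ϵ and S -> 01
    if s == "" or s == "01":
        return True
    # S -> 0S1: strip the outer 0...1 and recurse on the middle.
    if len(s) >= 2 and s[0] == "0" and s[-1] == "1" and in_language(s[1:-1]):
        return True
    # S -> SS: try every split into two non-empty parts.
    return any(
        in_language(s[:i]) and in_language(s[i:]) for i in range(1, len(s))
    )


# The five strings from the JSON samples in this PR, all with ideal "true".
samples = ["01010101", "00011101", "00110101", "01001101", "01010011"]
for s in samples:
    print(s, in_language(s))
```

Every recursive call is on a strictly shorter string, so the function always terminates. This grammar generates exactly the balanced strings in which `0` acts as an opening bracket and `1` as a closing bracket, so a string such as `10` is rejected.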

### What makes this a useful eval?

This is an interesting computational task. Context-free languages are
important in linguistics, and it will be interesting to see
how a language model fares in handling this task.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

I've handcrafted many of these examples. Some of them are there to
'trick' the model, and I think they will be a useful test of how well the
language model handles such cases.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system","content": "You will be provided with a
context-free grammar, and a string. Respond whether or not that string
can be produced by the given grammar. Respond only with the word true or
false; you do not need to show your work. The empty string is
represented with ϵ, and productions are separated with the symbol ;. The
alphabet is always {0, 1}."}, {"role": "user", "content": "Given the
grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01010101' in the
language?"}], "ideal": "true"}
{"input": [{"role": "system","content": "You will be provided with a
context-free grammar, and a string. Respond whether or not that string
can be produced by the given grammar. Respond only with the word true or
false; you do not need to show your work. The empty string is
represented with ϵ, and productions are separated with the symbol ;. The
alphabet is always {0, 1}."}, {"role": "user", "content": "Given the
grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '00011101' in the
language?"}], "ideal": "true"}
{"input": [{"role": "system","content": "You will be provided with a
context-free grammar, and a string. Respond whether or not that string
can be produced by the given grammar. Respond only with the word true or
false; you do not need to show your work. The empty string is
represented with ϵ, and productions are separated with the symbol ;. The
alphabet is always {0, 1}."}, {"role": "user", "content": "Given the
grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '00110101' in the
language?"}], "ideal": "true"}
{"input": [{"role": "system","content": "You will be provided with a
context-free grammar, and a string. Respond whether or not that string
can be produced by the given grammar. Respond only with the word true or
false; you do not need to show your work. The empty string is
represented with ϵ, and productions are separated with the symbol ;. The
alphabet is always {0, 1}."}, {"role": "user", "content": "Given the
grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01001101' in the
language?"}], "ideal": "true"}
{"input": [{"role": "system","content": "You will be provided with a
context-free grammar, and a string. Respond whether or not that string
can be produced by the given grammar. Respond only with the word true or
false; you do not need to show your work. The empty string is
represented with ϵ, and productions are separated with the symbol ;. The
alphabet is always {0, 1}."}, {"role": "user", "content": "Given the
grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01010011' in the
language?"}], "ideal": "true"}
  ```
</details>
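
The labels in the samples above can be checked mechanically. The following is a minimal sketch, not part of the submitted eval: it hard-codes the one grammar used in these samples, `S -> 01 | SS | 0S1 | ϵ`, and the function name `in_language` is a hypothetical helper introduced only for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def in_language(s: str) -> bool:
    """Return True if s can be derived from S -> 01 | SS | 0S1 | ϵ."""
    # S -> ϵ and S -> 01
    if s == "" or s == "01":
        return True
    # S -> 0S1: strip the outer 0...1 and recurse on the shorter middle part
    if len(s) >= 2 and s[0] == "0" and s[-1] == "1" and in_language(s[1:-1]):
        return True
    # S -> SS: try every split into two non-empty parts
    # (a split with an empty half just reproduces S, so it adds nothing)
    return any(in_language(s[:i]) and in_language(s[i:]) for i in range(1, len(s)))


# The five strings from the samples above; each has "ideal": "true".
samples = ["01010101", "00011101", "00110101", "01001101", "01010011"]
for string in samples:
    print(string, in_language(string))
```

Every recursive call is on a strictly shorter string, so the check terminates, and memoization keeps it fast for strings of this length.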

---------

Co-authored-by: Arjun Taneja <[email protected]>
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples; we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

3D Globe Movement

### Eval description
This eval tests an LLM's ability to understand 3D movement through space,
in particular movement on or through planet Earth. Each example provides
a starting point and a path consisting of one or two movements, and the
expected answer is a state/province or ocean.

Similar to the evals from openai#462 and
openai#1060, this eval shows how difficult
movement is for LLMs to understand. It builds on them by showing that
the problem seems to be magnified by 3D movement and/or by asking for a
region as the answer rather than numerical positions.

On gpt-3.5-turbo, accuracy ranges from ~0.24 to ~0.31.

### What makes this a useful eval?
This eval demonstrates that a long series of steps is not necessary in
order to create a path that GPT is unable to follow, and that a simple
trip to the planet's core and back again, with a slight offset in any
direction, will often get it lost.

Whereas it can often handle "travel 15 degrees East", hiding that actual
travel behind a 2-step 3D path significantly hurts the model's
performance.
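
To make the equivalence between the two phrasings concrete, here is a rough sketch of the geometry. It rests on assumptions that are not stated in the eval: a perfectly spherical Earth, "offset to the East" read as a rotation about the Earth's axis (a pure change in longitude), and approximate coordinates for San Francisco (37.77° N, 122.42° W). All function names are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def rotate_east(vector, degrees_east):
    """Rotate a vector about the Earth's axis (z) by the given angle."""
    angle = math.radians(degrees_east)
    x, y, z = vector
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
        z,
    )


def to_lat_lon(vector):
    """Convert a 3D unit vector back to latitude/longitude in degrees."""
    x, y, z = vector
    z = max(-1.0, min(1.0, z))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.asin(z)), math.degrees(math.atan2(y, x))


def core_round_trip(lat_deg, lon_deg, offset_east_deg):
    """Go down to the core, then back up along a direction rotated eastward."""
    direction_up = rotate_east(to_unit_vector(lat_deg, lon_deg), offset_east_deg)
    return to_lat_lon(direction_up)


# Assumed approximate coordinates for the center of San Francisco.
lat, lon = core_round_trip(37.77, -122.42, 15)
print(round(lat, 2), round(lon, 2))  # latitude unchanged, longitude shifted 15 degrees east
```

Under these assumptions the round trip through the core lands at the same point as "travel due East 15 degrees", roughly 37.77° N, 107.42° W, which is consistent with both samples sharing the ideal answer Colorado.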

Interestingly, this eval seems to demonstrate the statistical rather
than reasoning nature of GPT, as regardless of starting point or path,
it shows a bias towards answering with oceans, especially the Pacific
Ocean.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 15 degree offset to the East"}], "ideal": "Colorado"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel due East 15
degrees"}], "ideal": "Colorado"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel due East 0.1
degrees"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel due East 1
meter"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel 0 meter"}],
"ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and make one complete
revolution around the planet"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 1 degree offset to the East"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 0.1 degree offset to the East"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 0.01 degree offset to the East"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 1 degree offset to the West"}], "ideal": "Pacific Ocean"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 90 degree offset to the East"}], "ideal": "Atlantic
Ocean"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 180 degree offset to the East"}], "ideal": "Ahal"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of San Francisco and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 1 degree offset to the North"}], "ideal": "California"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Kansas City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 15 degree offset to the East"}], "ideal": "West Virginia"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Kansas City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 15 degree offset to the West"}], "ideal": "Utah"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 1 degree offset to the East"}], "ideal": "Atlantic Ocean"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due East 10
degrees"}], "ideal": "Atlantic Ocean"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due East 500
kilometers"}], "ideal": "Atlantic Ocean"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 1 degree offset to the West"}], "ideal": "New Jersey"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 10 degree offset to the West"}], "ideal": "Ohio"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due West 10
degrees"}], "ideal": "Ohio"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due West 15
degrees"}], "ideal": "Illinois"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due West 25
degrees"}], "ideal": "Nebraska"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due West 30
degrees"}], "ideal": "Colorado"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of New York City and travel due West 35
degrees"}], "ideal": "Colorado"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Oklahoma City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface, at a 0.1 degree offset to the East"}], "ideal": "Oklahoma"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Oklahoma City and travel directly
downwards to the center of the Earth, then straight up, back to the
surface along the same path that was traveled downards."}], "ideal":
"Oklahoma"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Oklahoma City and travel directly
downwards to the center of the Earth, then continue onwards to the other
side of the planet"}], "ideal": "Xinjiang"}
{"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a
model that can accurately track 3D paths on, around, and through planet
Earth based on plain language descriptions. For each path, determine the
state, province, or ocean of the final destination. You may assume the
object traveling the path encounters no resistance, such as if it were a
neutrino. Try reasoning through the 3D path one step at a time, and at
the end, provide the final answer enclosed in square brackets like
[Europe]. Start at the center of Oklahoma City and travel directly
downwards through to the other side of the Earth"}], "ideal":
"Xinjiang"}

  ```
</details>
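
The prompts above ask the model to put its final answer in square brackets. As a minimal sketch of how such an answer could be pulled out of a completion, assuming the last bracketed span is the final answer (the eval's actual matching logic is not shown here, and `extract_bracketed_answer` is a hypothetical helper):

```python
import re


def extract_bracketed_answer(completion):
    """Return the text of the last [bracketed] span, or None if there is none."""
    matches = re.findall(r"\[([^\[\]]+)\]", completion)
    return matches[-1].strip() if matches else None


# Hypothetical completion text, written only to exercise the helper.
completion = "Step 1: go down to the core. Step 2: come back up. Final answer: [Colorado]"
print(extract_bracketed_answer(completion) == "Colorado")
```

Taking the last match rather than the first tolerates completions that mention the `[Europe]` placeholder from the prompt before giving their own answer.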
This is a fix for a problem with
https://github.com/openai/evals/blob/19352afd7ba6290b3320d2fe60a990efe9aaebd6/evals/registry/data/dutch-lexicon/samples.jsonl
that I noted here:
openai#616 (comment)

✅ Correct spelling (line 49): 

```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"faliekant"}],"ideal":"Y"}
```

✅ Incorrect spelling (line 100): 

```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"falikant"}],"ideal":"N"}
```

❌ But there's a mix-up on line 137: the correct spelling is labeled
`"ideal":"N"`, as if the word did not exist:
```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"faliekant"}],"ideal":"N"}
```
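
Mix-ups like this one can be found automatically by looking for words that appear with more than one label. The sketch below is an illustration only, assuming each line follows the sample format shown above (a user message holding the word and an `ideal` field); `find_conflicting_labels` is a hypothetical helper, not part of the evals repository.

```python
import json
from collections import defaultdict


def find_conflicting_labels(jsonl_lines):
    """Map each word that has more than one distinct label to its (label, line) pairs."""
    seen = defaultdict(list)
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        # The word under test is the content of the last user message.
        word = [m["content"] for m in sample["input"] if m["role"] == "user"][-1]
        seen[word].append((sample["ideal"], line_number))
    return {
        word: entries
        for word, entries in seen.items()
        if len({label for label, _ in entries}) > 1
    }


def make_sample(word, label):
    """Build one JSONL line in the format used by the samples above."""
    return json.dumps({"input": [{"role": "user", "content": word}], "ideal": label})


lines = [make_sample("faliekant", "Y"), make_sample("falikant", "N"), make_sample("faliekant", "N")]
print(find_conflicting_labels(lines))
```

To run it against the real file, pass an open file handle for `samples.jsonl` instead of the in-memory list; the reported line numbers then match the file.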

Co-authored-by: László van den Hoek <[email protected]>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

simple_math
logic_and_probability

### Eval description

The first eval checks the ability to answer simple math questions.
The second checks the ability to answer logic, physics, and statistics questions.


### What makes this a useful eval?

ChatGPT fails a simple car-arrival problem.
It also fails a simple car elevator probability question and the famous
ant-arrival problem.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [ ] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [ ] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [ ] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [ ] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [ ] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [ ] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [ ] I have filled out all required fields of this form
- [ ] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
  INSERT_EVAL_HERE
  ```
</details>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

mapping_to_matricies

### Eval description

Given an array of binary values (0 or 1), a request is made for the
array to be mapped to a two-dimensional array. The length of the
original array must be evenly divisible by the dimensions of the
two-dimensional array (e.g. an array of length 12 is evenly mappable
onto a 3x4 two-dimensional array). An evaluation is made by comparing
the final row of the mapped 2D array with the corresponding values of
the original array. To further demonstrate that the failure cases are
not due to poor prompting, I've included instructions in the prompt to
present some rationale in the response -- it is evident therein that the
LLM indeed understands the task, but fails to accomplish it. In fact,
when asked to verify the answer, the LLM appears to double down and
effectively "re-write" its own memory of the original input array so
that it can claim that its answer was valid.
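
For reference, the expected answer described above can be computed in a few lines. This is a minimal sketch, not the `data_generator.py` script mentioned in this PR; the function names `final_row` and `ideal_answer` are hypothetical, and the sketch assumes the array length equals rows times columns, as in every sample shown here.

```python
import json


def final_row(array, rows, cols):
    """Map a flat array onto a rows x cols grid (row-major) and return the last row."""
    if rows * cols != len(array):
        raise ValueError("array length must equal rows * cols")
    # Row-major mapping: row r holds array[r * cols : (r + 1) * cols].
    grid = [array[r * cols:(r + 1) * cols] for r in range(rows)]
    return grid[-1]


def ideal_answer(array, rows, cols):
    """Format the final row as minified JSON, like the samples' "ideal" field."""
    return json.dumps({"Final Row": final_row(array, rows, cols)}, separators=(",", ":"))


array = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(ideal_answer(array, 3, 4))  # {"Final Row":[1,1,1,0]}
```

Because the mapping is row-major, the final row is simply the last `cols` elements of the input, which is what makes the task easy for a human to verify.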

A small Python script used to generate the samples is included at
`/evals/registry/data/mapping_to_matricies/data_generator.py`

### What makes this a useful eval?

This eval demonstrates a task that a human can easily do, but LLMs have
trouble accomplishing. Further, it also demonstrates that the LLM
understands the task accurately, but confidently and consistently
provides the wrong answer; and when asked to check its answer, it alters
its own understanding of the original user input so that it can claim to
be correct.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 2x20"}], "ideal": "{\"Final Row\":[1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 4x10"}], "ideal": "{\"Final Row\":[1,0,1,0,0,0,0,0,0,1]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 5x8"}], "ideal": "{\"Final Row\":[1,0,0,0,0,0,0,1]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 2x21"}], "ideal": "{\"Final Row\":[1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 3x14"}], "ideal": "{\"Final Row\":[0,1,1,0,1,1,1,1,0,0,0,1,1,0]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 6x7"}], "ideal": "{\"Final Row\":[1,0,0,0,1,1,0]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]\nGrid Dimensions: 2x22"}], "ideal": "{\"Final Row\":[0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]\nGrid Dimensions: 4x11"}], "ideal": "{\"Final Row\":[1,1,0,1,1,0,1,0,0,1,1]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,0,1,1,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]\nGrid Dimensions: 3x15"}], "ideal": "{\"Final Row\":[0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]}"}
{"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,0,1,1,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]\nGrid Dimensions: 5x9"}], "ideal": "{\"Final Row\":[1,1,0,0,0,1,0,1,0]}"}
  ```
</details>
A bug in the handling of `--registry_path` was introduced in
https://github.com/openai/evals/pull/1036/files#diff-a694333152a5a73db19b8951647e60aed78f43fec6b119707bc6d489289be6c0R87

To repro, run the example from
https://github.com/openai/evals/blob/main/examples/retrieval-completionfn.ipynb

Before the fix, the following exception is thrown:
```
TypeError: unsupported operand type(s) for /: 'list' and 'str'
```
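
The message above is what Python reports when pathlib's `/` operator is applied to a list instead of a single path. The snippet below is only a sketch of that general failure pattern and one possible fix; it is not the code from the linked PR, and `resolve_broken`, `resolve_fixed`, and the example paths are hypothetical.

```python
from pathlib import Path


def resolve_broken(registry_paths, name):
    # Bug pattern: "/" is applied to the whole list rather than to each path.
    return registry_paths / name


def resolve_fixed(registry_paths, name):
    # Join each registry path individually.
    return [Path(p) / name for p in registry_paths]


paths = ["/tmp/registry_a", "/tmp/registry_b"]  # hypothetical registry locations
try:
    resolve_broken(paths, "evals")
except TypeError as err:
    print(err)  # unsupported operand type(s) for /: 'list' and 'str'

print(resolve_fixed(paths, "evals"))
```
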
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

Sindarin Fluency (Nouns)

### Eval description

This eval tests the GPT model's ability to translate Sindarin (from the
Tolkien works) to English. Sindarin was created in 1915, and was
expanded upon until 1973, being heavily inspired by Literary Welsh. As
of today, hundreds of poems and texts have been written in Sindarin, and
several publishers are making a major effort to preserve the
language. The eval uses a collection of 150 commonly used nouns.
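
As an illustration of the sample format, the sketch below builds JSONL lines from a word list. It assumes a plain mapping from each Sindarin noun to its accepted English translations; the helper `make_sample` and the `nouns` dictionary are hypothetical, with the entries and system prompt taken from the samples in this PR.

```python
import json

SYSTEM_PROMPT = (
    "Translate a word from the Sindarin language (from Tolkien's works) to English. "
    "Respond only with the translated word or phrase, or 'none' if there is no translation."
)


def make_sample(word, translations):
    """Build one eval sample: a chat-style input plus the list of accepted answers."""
    return {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": word},
        ],
        "ideal": translations,
    }


# Two entries taken from the samples listed in this PR.
nouns = {"ablad": ["prohibition", "refusal"], "adar": ["father"]}
for word, translations in nouns.items():
    print(json.dumps(make_sample(word, translations)))
```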

### What makes this a useful eval?

This eval provides an opportunity to understand how well GPT models can
understand the fictional language, and improve upon their overall
understanding in translating it to English.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [X] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [X] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [X] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [X] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> This eval will help preserve the historic fictional language, by
making it possible to improve upon retaining knowledge of Sindarin, and
potentially other languages used in fictional works.

## Eval structure 🏗️

Your eval should

- [X] Check that your data is in `evals/registry/data/{name}`
- [X] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [X] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [X] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [X] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [X] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [X] I have filled out all required fields of this form
- [X] I have used **Git LFS** for the Eval JSON data
- [X] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "ablad"}], "ideal": ["prohibition", "refusal"]}
{"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "achad"}], "ideal": ["rock ridge", "neck"]}
{"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "Adan"}], "ideal": ["man"]}
{"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "adar"}], "ideal": ["father"]}
{"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "aduial"}], "ideal": ["twilight"]}
  ```
</details>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

resource_id_extraction

### Eval description

This Eval asks the model to identify UI elements and extract their
resource ID from Android XML dumps.

Android allows you to serialize the content of the screen to XML using
accessibility information. A human can (with effort) read through this
XML and understand approximately what is visible on the screen and what
the semantic intent of each UI element is.
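
To give a feel for what such a dump contains, here is a minimal sketch that lists the class, resource ID, and bounds of every node using Python's standard `xml.etree.ElementTree`. The XML string is a tiny hand-written example in the same shape as the dump in this PR (with most attributes omitted), and `list_nodes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed example in the same shape as a real dump.
XML_DUMP = """
<hierarchy rotation="0">
  <node index="0" text="" resource-id="" class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,2138]">
    <node index="0" text="" resource-id="android:id/content" class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,2138]" />
  </node>
</hierarchy>
"""


def list_nodes(xml_text):
    """Return (class, resource-id, bounds) for every node, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [
        (node.get("class"), node.get("resource-id"), node.get("bounds"))
        for node in root.iter("node")
    ]


for cls, resource_id, bounds in list_nodes(XML_DUMP):
    print(cls, resource_id or "<no resource-id>", bounds)
```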

Each sample in the eval contains:
- Instructions to find and identify a UI element and extract its
resource ID (or return an error code)
- An XML dump from an actual Android app
- A description of a UI element that may or may not be present in the
XML

The eval uses the simple "includes" test to check whether the correct
resource ID is returned.
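
Conceptually, an "includes" check is a substring test. The sketch below shows the idea only; it is not the evals library's own implementation, and the function name `includes` is used here purely for illustration.

```python
def includes(completion, ideal):
    """Pass if the expected resource ID appears anywhere in the model's reply."""
    return ideal in completion


ideal = "com.google.android.apps.photos:id/selected_account_disc"
print(includes(f"The resource ID is {ideal}.", ideal))  # True
print(includes("element_not_found", ideal))  # False
```

A substring test passes even when the model wraps the ID in extra text, so it is lenient about formatting but still fails on a hallucinated or missing ID.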
 
### What makes this a useful eval?

This eval is useful because:
- It requires semantic understanding of UI elements (foundational
capacity) (e.g. that the ⭐ button means "add to favorites" and the
➕ button is used to create a new contact)
- It tests against hallucinations (e.g. sometimes the model may
hallucinate resource IDs that don't exist)
- It enforces error codes that distinguish between different problem
states (system message steerability)

Given certain successes, it seems that the model has the capacity to
return good responses, but often fails to do so. This eval can be
extended by:
1. Adding more test cases / XML dumps (especially more complex cases
using the larger GPT-4 context window)
2. Adding more obscure descriptions of UI elements
3. Extracting information other than resource IDs, for example the "center
point" of UI elements (computable via their bounding box)
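
The "center point" extension in item 3 could be computed as sketched below. This assumes the `bounds` attribute has the form `[left,top][right,bottom]`, as in the dump included in this PR; `center_point` is a hypothetical helper, and rounding down with integer division is an arbitrary choice.

```python
import re

# Matches bounds strings such as "[0,0][1080,2138]".
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


def center_point(bounds):
    """Return the (x, y) center of a bounds string, rounded down to whole pixels."""
    match = BOUNDS_RE.fullmatch(bounds)
    if match is None:
        raise ValueError(f"unrecognized bounds format: {bounds!r}")
    left, top, right, bottom = map(int, match.groups())
    return ((left + right) // 2, (top + bottom) // 2)


print(center_point("[0,0][1080,2138]"))  # (540, 1069)
```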

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

A lot of implicit information is conveyed in UIs. Models reading
serialized UIs have the ability to understand some of that implicit
information. This is quite valuable, not least because it is required in
order to give good results for code completion and generation.

This eval is uniquely valuable because it gives a structured way to
definitively test the model's understanding of code/UIs from something
like a "reading comprehension" perspective, rather than just generation.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

:warning: I was only able to add 1 due to length limits of the PR body.
Sorry!

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"ideal": "com.google.android.apps.photos:id/selected_account_disc",
"input": [{"role": "system", "content": "\nYou have two tasks:\n-
Identify an interactable UI element in an Android XML dump based on a
description of of what it looks like, where it is, or what it does.\n-
Retrieve the corresponding resource ID you would use to interact with
that UI element.\nYou'll be given two inputs:\n- An XML output from the
'dumpXMLHierarchy'.\n- A description of a UI element.\nBased on these
inputs your output should be as follows:\n- If you find the described UI
element and it has a resource ID, return the resource ID.\n- If you
don't find the described UI element, return \"element_not_found\".\n- If
the described UI element is there but it doesn't have a resource ID,
return \"no_resource_id\".\n"}, {"role": "user", "content": "XML output:
<hierarchy rotation=\"0\"><node index=\"0\" text=\"\" resource-id=\"\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/action_bar_root\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"android:id/content\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/touch_capture_view\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/photo_container\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\" /><node index=\"1\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/drawer_layout\"
class=\"androidx.drawerlayout.widget.DrawerLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/main_container\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/toolbar_parent\"
class=\"android.view.ViewGroup\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/touch_capture_view\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/empty_view_container\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.ScrollView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"true\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,312][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,965][1080,1621]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/empty_page_image\"
class=\"android.widget.ImageView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[237,965][842,1180]\" /><node index=\"1\"
text=\"Take a picture.&#10;Photos &amp; videos appear here.\"
resource-id=\"com.google.android.apps.photos:id/empty_page_caption\"
class=\"android.widget.TextView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[155,1230][925,1406]\" /><node index=\"2\"
text=\"No Photos\"
resource-id=\"com.google.android.apps.photos:id/empty_page_title_top\"
class=\"android.widget.TextView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[155,1406][925,1533]\" /><node index=\"3\"
text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[496,1533][584,1577]\"
/></node></node></node><node index=\"1\" text=\"\"
resource-id=\"com.google.android.apps.photos:id/fragment_container\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/photos_photogrid_date_scrubber_view\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/recycler_view\"
class=\"android.support.v7.widget.RecyclerView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"true\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,2138]\"
/></node></node></node></node></node><node index=\"1\" text=\"\"
resource-id=\"com.google.android.apps.photos:id/scrolling_toolbar_container\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,312]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/toolbar_container\"
class=\"android.widget.LinearLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,0][1080,312]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/notification_bar_spacer\"
class=\"android.view.View\" package=\"com.google.android.apps.photos\"
content-desc=\"\" checkable=\"false\" checked=\"false\"
clickable=\"false\" enabled=\"true\" focusable=\"false\"
focused=\"false\" scrollable=\"false\" long-clickable=\"false\"
password=\"false\" selected=\"false\" bounds=\"[0,0][1080,136]\" /><node
index=\"1\" text=\"\"
resource-id=\"com.google.android.apps.photos:id/toolbar\"
class=\"android.view.ViewGroup\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,136][1080,312]\"><node index=\"0\"
text=\"\" resource-id=\"\" class=\"android.widget.ImageButton\"
package=\"com.google.android.apps.photos\" content-desc=\"Show
Navigation Drawer\" checkable=\"false\" checked=\"false\"
clickable=\"true\" enabled=\"true\" focusable=\"true\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[0,147][154,301]\" /><node index=\"1\"
text=\"\" resource-id=\"\" class=\"android.view.ViewGroup\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[354,136][726,312]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/product_lockup_view\"
class=\"android.view.ViewGroup\"
package=\"com.google.android.apps.photos\" content-desc=\"Google
Photos\" checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[354,190][726,259]\"><node index=\"0\"
text=\"\" resource-id=\"com.google.android.apps.photos:id/logo\"
class=\"android.widget.ImageView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[354,198][541,259]\" /><node index=\"1\"
text=\"Photos\"
resource-id=\"com.google.android.apps.photos:id/product_name\"
class=\"android.widget.TextView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[552,190][726,259]\" /></node></node><node
index=\"2\" text=\"\" resource-id=\"\"
class=\"android.support.v7.widget.LinearLayoutCompat\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[925,147][1080,301]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/selected_account_disc\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"Sign in\"
checkable=\"false\" checked=\"false\" clickable=\"true\"
enabled=\"true\" focusable=\"true\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[925,157][1080,290]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/og_selected_account_disc_apd\"
class=\"android.widget.FrameLayout\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[925,157][1058,290]\"><node index=\"0\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/og_apd_internal_image_view\"
class=\"android.widget.ImageView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[931,163][1052,284]\" /><node index=\"1\"
text=\"\"
resource-id=\"com.google.android.apps.photos:id/og_apd_ring_view\"
class=\"android.widget.ImageView\"
package=\"com.google.android.apps.photos\" content-desc=\"\"
checkable=\"false\" checked=\"false\" clickable=\"false\"
enabled=\"true\" focusable=\"false\" focused=\"false\"
scrollable=\"false\" long-clickable=\"false\" password=\"false\"
selected=\"false\" bounds=\"[931,163][1052,284]\"
/></node></node></node></node></node></node></node></node></node></node></node></node></node></node><node
index=\"1\" text=\"\" resource-id=\"android:id/statusBarBackground\"
class=\"android.view.View\" package=\"com.google.android.apps.photos\"
content-desc=\"\" checkable=\"false\" checked=\"false\"
clickable=\"false\" enabled=\"true\" focusable=\"false\"
focused=\"false\" scrollable=\"false\" long-clickable=\"false\"
password=\"false\" selected=\"false\" bounds=\"[0,0][1080,136]\"
/></node></hierarchy>\n, target item description: clickable account disc
in the top right"}]}
  ```
</details>

## Eval details 📑

### Eval name

chinese_homonym

### Eval description

Check the model's ability to recognize Chinese homonyms, which are words
that have the same pronunciation (Hànyǔ Pīnyīn) but different meanings.
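
To make the task concrete, here is a minimal sketch of the check the eval asks for. It assumes the sentence has already been segmented into words and uses a tiny hand-written pinyin table; both the table `pinyinOf` and the function `hasHomonymPair` are hypothetical and are not part of this PR.

```typescript
// Hypothetical mini-lexicon: word -> toned pinyin.
// A real checker would need a full pronunciation dictionary and a word
// segmenter; both are assumed away in this sketch.
const pinyinOf: Record<string, string> = {
  "胃药": "wèi yào",
  "喂药": "wèi yào",
  "聋子": "lóng zi",
  "笼子": "lóng zi",
  "医院": "yī yuàn",
  "出差": "chū chāi",
};

// Returns true when two *different* words of two or more characters
// share the same pronunciation.
function hasHomonymPair(words: string[]): boolean {
  const seen: Record<string, string> = {}; // pinyin -> first word seen
  for (const w of words) {
    const py = pinyinOf[w];
    if (w.length < 2 || py === undefined) continue;
    const prior = seen[py];
    if (prior !== undefined && prior !== w) return true;
    seen[py] = w;
  }
  return false;
}

console.log(hasHomonymPair(["医院", "胃药", "喂药"])); // true
console.log(hasHomonymPair(["出差"]));                 // false
```

The eval itself only expects the model to answer `是` or `否`; the sketch just shows why a sentence containing both 胃药 and 喂药 should be answered `是`.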

### What makes this a useful eval?

Beginners learning Chinese, whether children or foreign-language
learners, can easily tell whether two different Chinese words have the
same pronunciation. However, GPT's performance on this task is
noticeably poor. Recognizing Chinese homonyms is a fundamental language
skill: it helps in understanding context, assisting language learners,
and catching typos, since the mainstream Chinese input method is based
on Hànyǔ Pīnyīn (pronunciation). GPT-3.5 scored 0.476 on this task
(worse than random guessing), while GPT-4 achieved 0.7 when tested
through a ChatGPT Plus subscription. GPT-4's hallucinations can be
examined further by looking at its explanations of the correct answers,
as shown in the following screenshot.
![CleanShot 2023-06-08 at 00 43
44@2x](https://github.com/openai/evals/assets/14120040/fb0529bf-638d-4c00-a77e-0fba9d82756b)


## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

I manually created a diverse set of tests that uncovered GPT's poor
performance on Chinese pronunciation. This could lead to further
evaluations from different perspectives.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"由于她最近在减肥,所以她今晚上决定不吃鱿鱼了。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他最近在出差。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"玲儿去医院看病,医生给开了胃药,但是她不愿意吃,所以都是她妈妈给她喂药。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"欲穷千里目,更上一层楼。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他给她送回家了"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"现在电视里面演的是一场清朝的殿试。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"这家初创公司发布的新产品原型是一个圆形的扫地机器人。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"这个网站的会员太贵了,比另外一家的贵。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他在农村生活,每天都要自己生火做饭。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他在超级猩猩的徐汇区运动群里面,看到了一个昵称叫星星的女孩。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"华北工业电机厂新的行政楼会在6月20日举行奠基仪式。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"坏人是一个聋子,把男主锁在了笼子里。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"知名医生张教授说过,人的一生其实很短暂。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"老宋忘了晚上有约会,当他赶到的时候女孩已经先走了,他感到很自责。"}],"ideal":["是"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他们在玩的游戏是掼蛋,小明抓完牌了说这把有戏。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"黄土坡派出所的警察在接到报案后,来找小区的保安调取监控。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"要是你不答应,晚上我就饿着。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"生活像一把无情的雕刻刀,改变了我们的样子。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"我没什么功利心,我要的是大量的空闲时间,能读书、看电视、想想事,我已经有了,应该知足了。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他听着雨花隐隐约约地飘落,慢慢地睡着了,雨花穿过窗外轻轻地落下,落到所有的生者和死者身上。"}],"ideal":["否"]}

{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"厦门大学临海而建,内拥穿过芙蓉隧道,6月是这个城市一年中最美的时候,红红绿绿点缀中更显韵味,一定要去看那“凤凰花开的路口”。"}],"ideal":["否"]}
  ```
</details>
what are the side effects of no longer needing escape characters when
passing around message payloads?

## Eval details 📑
### Eval name

GPT Protocol Buffers

### Eval description

Using length-delimited strings at multiple levels, we can build
tag-value messages that do not require escape characters, even when the
messages are nested.
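
As a rough sketch of the idea, a nested message could be built as shown below. It assumes the `(<digit count><decimal length>)` prefix used by the `HexLengthDelimitedInt.serialize` method in this PR's sample data; the names `serializeLength`, `innerMsg`, and `outerMsg` are made up for illustration.

```typescript
// Length prefix: "(" + one hex digit giving the number of decimal digits
// + the decimal length + ")". For example, 3 -> "(13)" and 11 -> "(211)".
function serializeLength(n: number): string {
  if (!isFinite(n) || n < 0 || Math.floor(n) !== n) {
    throw new Error("Length must be a non-negative integer");
  }
  const dec = n.toString(10);
  return "(" + dec.length.toString(16) + dec + ")";
}

// A tag-value message is two length-prefixed strings plus a newline.
function serializeTagValue(tag: string, value: string): string {
  return (
    serializeLength(tag.length) + tag +
    serializeLength(value.length) + value + "\n"
  );
}

// The inner message contains "(", ")" and "\n", yet it is embedded in the
// outer message verbatim: the length prefix says how many characters to read.
const innerMsg = serializeTagValue("a", "b");
const outerMsg = serializeTagValue("msg", innerMsg);

console.log(JSON.stringify(innerMsg)); // "(11)a(11)b\n"
console.log(JSON.stringify(outerMsg)); // "(13)msg(211)(11)a(11)b\n\n"
```

Because the reader never scans for a terminator inside a value, nothing in the payload has to be escaped, no matter how deeply messages are nested.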

### What makes this a useful eval?

As with unified patch diffs, this eval requires GPT to handle offsets
within text payloads reliably and accurately.
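
To illustrate the offset bookkeeping involved, here is a minimal sketch of reading a tag-value pair back out of a payload. It assumes the same `(<digit count><decimal length>)` prefix as the sample data; `readLength`, `readString`, and `readTagValue` are hypothetical names, not the functions submitted in this PR.

```typescript
interface LengthResult {
  value: number;
  nextOffset: number;
}

// Parse "(<hex digit count><decimal length>)" starting at `offset`.
function readLength(input: string, offset: number): LengthResult {
  if (input[offset] !== "(") {
    throw new Error("Expected '(' at offset " + offset);
  }
  const close = input.indexOf(")", offset + 1);
  if (close === -1) {
    throw new Error("Missing ')'");
  }
  // Skip the single hex digit, then parse the decimal length.
  const dec = input.slice(offset + 2, close);
  if (!/^[0-9]+$/.test(dec)) {
    throw new Error("Invalid decimal digits");
  }
  return { value: parseInt(dec, 10), nextOffset: close + 1 };
}

// Read one length-prefixed string and report where the next field starts.
function readString(input: string, offset: number): { text: string; nextOffset: number } {
  const len = readLength(input, offset);
  const end = len.nextOffset + len.value;
  return { text: input.slice(len.nextOffset, end), nextOffset: end };
}

function readTagValue(input: string, offset: number) {
  const tag = readString(input, offset);
  const val = readString(input, tag.nextOffset);
  return { tag: tag.text, value: val.text, nextOffset: val.nextOffset };
}

const payload = "(13)msg(211)(11)a(11)b\n\n";
const parsedOuter = readTagValue(payload, 0);
const parsedInner = readTagValue(parsedOuter.value, 0);

console.log(parsedOuter.tag, parsedOuter.nextOffset); // msg 23
console.log(parsedInner.tag, parsedInner.value);      // a b
```

Every step depends on adding the right length to the right offset, which is the kind of arithmetic over text positions this eval is meant to probe.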

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [ ] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [ ] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [ ] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

Google Protocol Buffers are quite popular. JSON is also quite popular.
My hope is that "GPT Protocol Buffers" finds a better "sweet spot"
between the two approaches.

## Eval structure 🏗️

Your eval should
- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag91\", \"value92\"], [\"tag36\",
\"value13\"], [\"tag11\", \"value50\"], [\"tag88\", \"value28\"],
[\"tag87\",
\"value10\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag91(17)value92\n(15)tag36(17)value13\n(15)tag11(17)value50\n(15)tag88(17)value28\n(15)tag87(17)value10\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag21\", \"value3\"], [\"tag20\",
\"value58\"], [\"tag13\", \"value63\"], [\"tag46\",
\"value78\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag21(16)value3\n(15)tag20(17)value58\n(15)tag13(17)value63\n(15)tag46(17)value78\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag4\", \"value21\"], [\"tag76\",
\"value83\"], [\"tag52\", \"value2\"], [\"tag58\", \"value90\"],
[\"tag47\",
\"value84\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(14)tag4(17)value21\n(15)tag76(17)value83\n(15)tag52(16)value2\n(15)tag58(17)value90\n(15)tag47(17)value84\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag32\", \"value66\"], [\"tag50\",
\"value95\"], [\"tag40\",
\"value87\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(13)\n(15)tag32(17)value66\n(15)tag50(17)value95\n(15)tag40(17)value87\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag13\", \"value69\"], [\"tag29\",
\"value16\"], [\"tag5\", \"value82\"], [\"tag52\",
\"value30\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag13(17)value69\n(15)tag29(17)value16\n(14)tag5(17)value82\n(15)tag52(17)value30\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag78\", \"value38\"], [\"tag81\",
\"value0\"], [\"tag6\", \"value27\"], [\"tag60\", \"value22\"],
[\"tag50\",
\"value38\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag78(17)value38\n(15)tag81(16)value0\n(14)tag6(17)value27\n(15)tag60(17)value22\n(15)tag50(17)value38\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag18\", \"value61\"], [\"tag38\",
\"value68\"], [\"tag33\", \"value65\"], [\"tag64\",
\"value76\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag18(17)value61\n(15)tag38(17)value68\n(15)tag33(17)value65\n(15)tag64(17)value76\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag99\", \"value97\"], [\"tag86\",
\"value95\"], [\"tag15\", \"value79\"], [\"tag19\",
\"value69\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag99(17)value97\n(15)tag86(17)value95\n(15)tag15(17)value79\n(15)tag19(17)value69\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag89\", \"value52\"], [\"tag6\",
\"value79\"], [\"tag71\", \"value64\"], [\"tag3\", \"value62\"],
[\"tag54\",
\"value65\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag89(17)value52\n(14)tag6(17)value79\n(15)tag71(17)value64\n(14)tag3(17)value62\n(15)tag54(17)value65\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag6\", \"value10\"], [\"tag1\",
\"value15\"], [\"tag13\", \"value90\"], [\"tag31\", \"value38\"],
[\"tag68\",
\"value0\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(14)tag6(17)value10\n(14)tag1(17)value15\n(15)tag13(17)value90\n(15)tag31(17)value38\n(15)tag68(16)value0\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nex…
…ls with the Raven Matrices test (openai#1078)

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing the
task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples; we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑
### Eval name
Raven Matrices

### Eval description

This benchmark evaluates the ability of a language model to perform
abstract reasoning using a text-based version of the Raven Matrices
test. The task consists of choosing, from a set of candidates, the
pattern that completes a sequence of eight preceding patterns. We
provide several types of matrices, in either natural-language or
symbolic format, in both multiple-choice and open-ended settings.
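
As a rough illustration of the multiple-choice, natural-language setting, the sketch below assembles one sample in the chat format used by the JSON data in this PR. The helper name `build_sample` and its arguments are hypothetical and not part of the submitted code; the system prompt and message layout are copied from the sample shown under "Eval JSON data".

```python
import json

# System prompt copied from the sample included in this PR.
SYSTEM_PROMPT = (
    "Find the pattern number 9 that completes the sequence. Pick the letter "
    "in front of the correct pattern that logically follows in the sequence "
    "from the answer set. Patterns in the sequence are preceded by a number "
    "from 1 to 8. Patterns in the answer set are preceded by a letter from A "
    "to H. Only return the letter in front of the correct pattern."
)

LETTERS = "ABCDEFGH"


def build_sample(sequence, choices, ideal):
    """Build one multiple-choice sample (hypothetical helper, for illustration).

    sequence: the eight pattern descriptions numbered 1 to 8.
    choices:  the eight candidate descriptions lettered A to H.
    ideal:    the letter of the correct candidate.
    """
    if len(sequence) != 8 or len(choices) != 8:
        raise ValueError("expected 8 sequence patterns and 8 answer choices")
    if len(ideal) != 1 or ideal not in LETTERS:
        raise ValueError("ideal must be a single letter from A to H")

    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Sequence patterns are prefixed with their position, 1 to 8.
    for number, pattern in enumerate(sequence, start=1):
        messages.append({"role": "user", "content": f"{number}. {pattern} "})
    # Answer candidates are prefixed with a letter, A to H.
    for letter, pattern in zip(LETTERS, choices):
        messages.append({"role": "user", "content": f"{letter}. {pattern} "})
    # The final message cues the model to reply with a single letter.
    messages.append({"role": "user", "content": "The answer is "})
    return {"input": messages, "ideal": ideal}


if __name__ == "__main__":
    # Placeholder descriptions; real samples use full pattern sentences.
    demo = build_sample(
        [f"pattern {i}" for i in range(1, 9)],
        [f"candidate {letter}" for letter in LETTERS],
        "A",
    )
    print(json.dumps(demo))  # one line of a JSONL file
```

Each call returns a dictionary that serializes to a single JSONL line; an open-ended variant would presumably omit the lettered candidates, but that layout is not shown in this excerpt.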

### What makes this a useful eval?

Abstract reasoning is a useful task for evaluating the ability of a
language model to extract a pattern from a few examples. Because the
pattern is abstract, the model must find the most general rule that fits
the examples, which tests the generalization capacity of language
models. Current language models do not perform well on abstract
reasoning, and the task remains under-evaluated in the research
literature.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [X] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [X] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [X] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [X] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

Our eval contains an extensive list of high-quality samples from a
challenging and under-evaluated task, with several levels of difficulty
and different formats.

## Eval structure 🏗️

Your eval should
- [X] Check that your data is in `evals/registry/data/{name}`
- [X] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [X] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [X] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [X] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [X] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [X] I have filled out all required fields of this form
- [X] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. On
an image, a large orange circle rotated at 90 degrees. "}, {"role":
"user", "content": "2. On an image, a giant orange pentagon rotated at
90 degrees. "}, {"role": "user", "content": "3. On an image, a small red
triangle rotated at 90 degrees. "}, {"role": "user", "content": "4. On
an image, a small orange circle rotated at 135 degrees. "}, {"role":
"user", "content": "5. On an image, a large orange pentagon rotated at
135 degrees. "}, {"role": "user", "content": "6. On an image, a giant
red triangle rotated at 135 degrees. "}, {"role": "user", "content": "7.
On an image, a giant red circle rotated at -45 degrees. "}, {"role":
"user", "content": "8. On an image, a small red pentagon rotated at -45
degrees. "}, {"role": "user", "content": "A. On an image, a large red
triangle rotated at -45 degrees. "}, {"role": "user", "content": "B. On
an image, a large red circle rotated at -45 degrees. "}, {"role":
"user", "content": "C. On an image, a large red hexagon rotated at -45
degrees. "}, {"role": "user", "content": "D. On an image, a medium red
triangle rotated at -45 degrees. "}, {"role": "user", "content": "E. On
an image, a large orange triangle rotated at -45 degrees. "}, {"role":
"user", "content": "F. On an image, a large red pentagon rotated at -45
degrees. "}, {"role": "user", "content": "G. On an image, a large pink
triangle rotated at -45 degrees. "}, {"role": "user", "content": "H. On
an image, a large lime triangle rotated at -45 degrees. "}, {"role":
"user", "content": "The answer is "}], "ideal": "A"}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. On
an image, a huge purple triangle rotated at 180 degrees in the bottom
right, a small purple triangle rotated at -45 degrees in the top left, a
large purple triangle rotated at 45 degrees in the bottom left. "},
{"role": "user", "content": "2. On an image, a huge pink circle rotated
at 180 degrees in the bottom right, a small pink circle rotated at -45
degrees in the bottom left, a large pink circle rotated at 45 degrees in
the top right. "}, {"role": "user", "content": "3. On an image, a huge
white square rotated at 180 degrees in the bottom right, a small white
square rotated at -45 degrees in the top left, a large white square
rotated at 45 degrees in the top right. "}, {"role": "user", "content":
"4. On an image, a large lime circle rotated at 0 degrees in the bottom
right, a tiny lime circle rotated at 90 degrees in the top left, a giant
lime circle rotated at -45 degrees in the top right. "}, {"role":
"user", "content": "5. On an image, a large green square rotated at 0
degrees in the bottom right, a tiny green square rotated at 90 degrees
in the top left, a giant green square rotated at -45 degrees in the
bottom left. "}, {"role": "user", "content": "6. On an image, a large
cyan triangle rotated at 0 degrees in the bottom right, a tiny cyan
triangle rotated at 90 degrees in the bottom left, a giant cyan triangle
rotated at -45 degrees in the top right. "}, {"role": "user", "content":
"7. On an image, a huge lime square rotated at 135 degrees in the bottom
right, a tiny lime square rotated at 0 degrees in the bottom left, a
tiny lime square rotated at 135 degrees in the top right. "}, {"role":
"user", "content": "8. On an image, a huge green triangle rotated at 135
degrees in the bottom right, a tiny green triangle rotated at 0 degrees
in the top left, a tiny green triangle rotated at 135 degrees in the top
right. "}, {"role": "user", "content": "A. On an image, a huge cyan
pentagon rotated at 135 degrees in the bottom right, a tiny cyan
triangle rotated at 0 degrees in the top left, a tiny cyan triangle
rotated at 135 degrees in the bottom left. "}, {"role": "user",
"content": "B. On an image, a huge cyan circle rotated at 135 degrees in
the top right, a tiny cyan circle rotated at 0 degrees in the bottom
left, a tiny cyan circle rotated at 135 degrees in the top left. "},
{"role": "user", "content": "C. On an image, a huge cyan square rotated
at 135 degrees in the bottom right, a tiny cyan hexagon rotated at 0
degrees in the top left, a tiny cyan hexagon rotated at 135 degrees in
the bottom left. "}, {"role": "user", "content": "D. On an image, a huge
cyan circle rotated at 135 degrees in the top left, a tiny cyan circle
rotated at 0 degrees in the bottom right, a tiny cyan circle rotated at
135 degrees in the top right. "}, {"role": "user", "content": "E. On an
image, a huge cyan circle rotated at 135 degrees in the bottom right, a
tiny cyan circle rotated at 0 degrees in the top left, a tiny cyan
circle rotated at 135 degrees in the bottom left. "}, {"role": "user",
"content": "F. On an image, a huge yellow circle rotated at 135 degrees
in the bottom right, a tiny lime circle rotated at 0 degrees in the top
left, a tiny orange circle rotated at 135 degrees in the bottom left.
"}, {"role": "user", "content": "G. On an image, a huge cyan circle
rotated at 135 degrees in the bottom left, a tiny cyan circle rotated at
0 degrees in the top right, a tiny cyan circle rotated at 135 degrees in
the bottom right. "}, {"role": "user", "content": "H. On an image, a
huge cyan hexagon rotated at 135 degrees in the bottom right, a tiny
cyan pentagon rotated at 0 degrees in the top left, a tiny cyan pentagon
rotated at 135 degrees in the bottom left. "}, {"role": "user",
"content": "The answer is "}], "ideal": "E"}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. On
an image, a small red circle rotated at -135 degrees in the top left.
"}, {"role": "user", "content": "2. On an image, a small red hexagon
rotated at -135 degrees in the top right. "}, {"role": "user",
"content": "3. On an image, a small red triangle rotated at -135 degrees
in the center. "}, {"role": "user", "content": "4. On an image, a giant
cyan hexagon rotated at -135 degrees in the top center. "}, {"role":
"user", "content": "5. On an image, a giant cyan triangle rotated at
-135 degrees in the center left. "}, {"role": "user", "content": "6. On
an image, a giant cyan circle rotated at -135 degrees in the center
right. "}, {"role": "user", "content": "7. On an image, a tiny green
triangle rotated at -45 degrees in the center, a tiny green triangle
rotated at -45 degrees in the bottom left, a tiny green triangle rotated
at -45 degrees in the center left. "}, {"role": "user", "content": "8.
On an image, a tiny green circle rotated at -45 degrees in the bottom
left, a tiny green circle rotated at -45 degrees in the bottom right, a
tiny green circle rotated at -45 degrees in the center right. "},
{"role": "user", "content": "A. On an image, a huge yellow circle
rotated at -45 degrees in the bottom center, a large green square
rotated at 180 degrees in the center left, a small red triangle rotated
at -45 degrees in the top center, a medium pink triangle rotated at -45
degrees in the center, a small green pentagon rotated at 135 degrees in
the bottom right, a giant lime triangle rotated at 180 degrees in the
top left, a large blue pentagon rotated at -90 degrees in the center
right. "}, {"role": "user", "content": "B. On an image, a tiny green
circle rotated at -45 degrees in the bottom right, a tiny green triangle
rotated at -45 degrees in the top center, a tiny green triangle rotated
at -45 degrees in the bottom center. "}, {"role": "user", "content": "C.
On an image, a tiny green triangle rotated at -45 degrees in the bottom
right, a tiny green square rotated at -45 degrees in the top center, a
tiny green circle rotated at -45 degrees in the bottom center. "},
{"role": "user", "content": "D. On an image, a large green hexagon
rotated at -45 degrees in the bottom right, a giant green hexagon
rotated at -45 degrees in the top center, a small green hexagon rotated
at -45 degrees in the bottom center. "}, {"role": "user", "content": "E.
On an image, a huge green hexagon rotated at -45 degrees in the bottom
right, a medium green hexagon rotated at -45 degrees in the top center,
a large green hexagon rotated at -45 degrees in the bottom center. "},
{"role": "user", "content": "F. On an image, a tiny green hexagon
rotated at -45 degrees in the center right, a tiny green hexagon rotated
at -45 degrees in the center, a tiny green hexagon rotated at -45
degrees in the top center. "}, {"role": "user", "content": "G. On an
image, a tiny green pentagon rotated at -45 degrees in the bottom right,
a tiny green circle rotated at -45 degrees in the top center, a tiny
green pentagon rotated at -45 degrees in the bottom center. "}, {"role":
"user", "content": "H. On an image, a tiny green hexagon rotated at -45
degrees in the bottom right, a tiny green hexagon rotated at -45 degrees
in the top center, a tiny green hexagon rotated at -45 degrees in the
bottom center. "}, {"role": "user", "content": "The answer is "}],
"ideal": "H"}

{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. [(D,
B, F, F,)] "}, {"role": "user", "content": "2. [(F, B, D, F,)] "},
{"role": "user", "content": "3. [(B, A, B, F,)] "}, {"role": "user",
"content": "4. [(B, B, F, G,)] "}, {"role": "user", "content": "5. [(D,
B, D, G,)] "}, {"role": "user", "content": "6. [(F, A, B, G,)] "},
{"role": "user", "content": "7. [(F, A, F, C,)] "}, {"role": "user",
"content": "8. [(B, A, D, C,)] "}, {"role": "user", "content": "A. [(D,
A, B, C,)] "}, {"role": "user", "content": "B. [(D, A, F, C,)] "},
{"role": "user", "content": "C. [(D, A, E, C,)] "}, {"role": "user",
"content": "D. [(C, A, B, C,)] "}, {"role": "user", "content": "E. [(D,
B, B, C,)] "}, {"role": "user", "content": "F. [(D, A, D, C,)] "},
{"role": "user", "content": "G. [(D, I, B, C,)] "}, {"role": "user",
"content": "H. [(D, D, B, C,)] "}, {"role": "user", "content": "The
answer is "}], "ideal": "A"}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. [(E,
H, B, H, BR), (B, H, B, C, TL), (D, H, B, E, BL)] "}, {"role": "user",
"content": "2. [(E, I, F, H, BR), (B, I, F, C, BL), (D, I, F, E, TR)]
"}, {"role": "user", "content": "3. [(E, J, C, H, BR), (B, J, C, C, TL),
(D, J, C, E, TR)] "}, {"role": "user", "content": "4. [(D, D, F, D, BR),
(A, D, F, F, TL), (F, D, F, C, TR)] "}, {"role": "user", "content": "5.
[(D, E, C, D, BR), (A, E, C, F, TL), (F, E, C, C, BL)] "}, {"role":
"user", "content": "6. [(D, F, B, D, BR), (A, F, B, F, BL), (F, F, B, C,
TR)] "}, {"role": "user", "content": "7. [(E, D, C, G, BR), (A, D, C, D,
BL), (A, D, C, G, TR)] "}, {"role": "user", "content": "8. [(E, E, B, G,
BR), (A, E, B, D, TL), (A, E, B, G, TR)] "}, {"role": "user", "content":
"A. [(E, F, D, G, BR), (A, F, B, D, TL), (A, F, B, G, BL)] "}, {"role":
"user", "content": "B. [(E, F, F, G, TR), (A, F, F, D, BL), (A, F, F, G,
TL)] "}, {"role": "user", "content": "C. [(E, F, C, G, BR), (A, F, E, D,
TL), (A, F, E, G, BL)] "}, {"role": "user", "content": "D. [(E, F, F, G,
TL), (A, F, F, D, BR), (A, F, F, G, TR)] "}, {"role": "user", "content":
"E. [(E, F, F, G, BR), (A, F, F, D, TL), (A, F, F, G, BL)] "}, {"role":
"user", "content": "F. [(E, C, F, G, BR), (A, D, F, D, TL), (A, B, F, G,
BL)] "}, {"role": "user", "content": "G. [(E, F, F, G, BL), (A, F, F, D,
TR), (A, F, F, G, BR)] "}, {"role": "user", "content": "H. [(E, F, E, G,
BR), (A, F, D, D, TL), (A, F, D, G, BL)] "}, {"role": "user", "content":
"The answer is "}], "ideal": "E"}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Pick the letter in front of the correct pattern
that logically follows in the sequence from the answer set. Patterns in
the sequence are preceded by a number from 1 to 8. Patterns in the
answer set are preceded by a letter from A to H. Only return the letter
in front of the correct pattern."}, {"role": "user", "content": "1. [(B,
A, F, A, TL)] "}, {"role": "user", "content": "2. [(B, A, E, A, TR)] "},
{"role": "user", "content": "3. [(B, A, B, A, C)] "}, {"role": "user",
"content": "4. [(F, F, E, A, TC)] "}, {"role": "user", "content": "5.
[(F, F, B, A, CL)] "}, {"role": "user", "content": "6. [(F, F, F, A,
CR)] "}, {"role": "user", "content": "7. [(A, E, B, C, C), (A, E, B, C,
BL), (A, E, B, C, CL)] "}, {"role": "user", "content": "8. [(A, E, F, C,
BL), (A, E, F, C, BR), (A, E, F, C, CR)] "}, {"role": "user", "content":
"A. [(E, C, F, C, BC), (D, E, C, H, CL), (B, A, B, C, TC), (C, I, B, C,
C), (B, E, D, G, BR), (F, D, B, H, TL), (D, G, D, B, CR)] "}, {"role":
"user", "content": "B. [(A, E, F, C, BR), (A, E, B, C, TC), (A, E, B, C,
BC)] "}, {"role": "user", "content": "C. [(A, E, B, C, BR), (A, E, C, C,
TC), (A, E, F, C, BC)] "}, {"role": "user", "content": "D. [(D, E, E, C,
BR), (F, E, E, C, TC), (B, E, E, C, BC)] "}, {"role": "user", "content":
"E. [(E, E, E, C, BR), (C, E, E, C, TC), (D, E, E, C, BC)] "}, {"role":
"user", "content": "F. [(A, E, E, C, CR), (A, E, E, C, C), (A, E, E, C,
TC)] "}, {"role": "user", "content": "G. [(A, E, D, C, BR), (A, E, F, C,
TC), (A, E, D, C, BC)] "}, {"role": "user", "content": "H. [(A, E, E, C,
BR), (A, E, E, C, TC), (A, E, E, C, BC)] "}, {"role": "user", "content":
"The answer is "}], "ideal": "H"}

{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. On an image, a large
orange circle rotated at 90 degrees. "}, {"role": "user", "content": "2.
On an image, a giant orange pentagon rotated at 90 degrees. "}, {"role":
"user", "content": "3. On an image, a small red triangle rotated at 90
degrees. "}, {"role": "user", "content": "4. On an image, a small orange
circle rotated at 135 degrees. "}, {"role": "user", "content": "5. On an
image, a large orange pentagon rotated at 135 degrees. "}, {"role":
"user", "content": "6. On an image, a giant red triangle rotated at 135
degrees. "}, {"role": "user", "content": "7. On an image, a giant red
circle rotated at -45 degrees. "}, {"role": "user", "content": "8. On an
image, a small red pentagon rotated at -45 degrees. "}, {"role": "user",
"content": "The pattern that logically follows is:\n9. "}], "ideal": "On
an image, a large red triangle rotated at -45 degrees. "}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. On an image, a huge
purple triangle rotated at 180 degrees in the bottom right, a small
purple triangle rotated at -45 degrees in the top left, a large purple
triangle rotated at 45 degrees in the bottom left. "}, {"role": "user",
"content": "2. On an image, a huge pink circle rotated at 180 degrees in
the bottom right, a small pink circle rotated at -45 degrees in the
bottom left, a large pink circle rotated at 45 degrees in the top right.
"}, {"role": "user", "content": "3. On an image, a huge white square
rotated at 180 degrees in the bottom right, a small white square rotated
at -45 degrees in the top left, a large white square rotated at 45
degrees in the top right. "}, {"role": "user", "content": "4. On an
image, a large lime circle rotated at 0 degrees in the bottom right, a
tiny lime circle rotated at 90 degrees in the top left, a giant lime
circle rotated at -45 degrees in the top right. "}, {"role": "user",
"content": "5. On an image, a large green square rotated at 0 degrees in
the bottom right, a tiny green square rotated at 90 degrees in the top
left, a giant green square rotated at -45 degrees in the bottom left.
"}, {"role": "user", "content": "6. On an image, a large cyan triangle
rotated at 0 degrees in the bottom right, a tiny cyan triangle rotated
at 90 degrees in the bottom left, a giant cyan triangle rotated at -45
degrees in the top right. "}, {"role": "user", "content": "7. On an
image, a huge lime square rotated at 135 degrees in the bottom right, a
tiny lime square rotated at 0 degrees in the bottom left, a tiny lime
square rotated at 135 degrees in the top right. "}, {"role": "user",
"content": "8. On an image, a huge green triangle rotated at 135 degrees
in the bottom right, a tiny green triangle rotated at 0 degrees in the
top left, a tiny green triangle rotated at 135 degrees in the top right.
"}, {"role": "user", "content": "The pattern that logically follows
is:\n9. "}], "ideal": "On an image, a huge cyan circle rotated at 135
degrees in the bottom right, a tiny cyan circle rotated at 0 degrees in
the top left, a tiny cyan circle rotated at 135 degrees in the bottom
left. "}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. On an image, a small
red circle rotated at -135 degrees in the top left. "}, {"role": "user",
"content": "2. On an image, a small red hexagon rotated at -135 degrees
in the top right. "}, {"role": "user", "content": "3. On an image, a
small red triangle rotated at -135 degrees in the center. "}, {"role":
"user", "content": "4. On an image, a giant cyan hexagon rotated at -135
degrees in the top center. "}, {"role": "user", "content": "5. On an
image, a giant cyan triangle rotated at -135 degrees in the center left.
"}, {"role": "user", "content": "6. On an image, a giant cyan circle
rotated at -135 degrees in the center right. "}, {"role": "user",
"content": "7. On an image, a tiny green triangle rotated at -45 degrees
in the center, a tiny green triangle rotated at -45 degrees in the
bottom left, a tiny green triangle rotated at -45 degrees in the center
left. "}, {"role": "user", "content": "8. On an image, a tiny green
circle rotated at -45 degrees in the bottom left, a tiny green circle
rotated at -45 degrees in the bottom right, a tiny green circle rotated
at -45 degrees in the center right. "}, {"role": "user", "content": "The
pattern that logically follows is:\n9. "}], "ideal": "On an image, a
tiny green hexagon rotated at -45 degrees in the bottom right, a tiny
green hexagon rotated at -45 degrees in the top center, a tiny green
hexagon rotated at -45 degrees in the bottom center. "}

{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. [(D, B, F, F,)] "},
{"role": "user", "content": "2. [(F, B, D, F,)] "}, {"role": "user",
"content": "3. [(B, A, B, F,)] "}, {"role": "user", "content": "4. [(B,
B, F, G,)] "}, {"role": "user", "content": "5. [(D, B, D, G,)] "},
{"role": "user", "content": "6. [(F, A, B, G,)] "}, {"role": "user",
"content": "7. [(F, A, F, C,)] "}, {"role": "user", "content": "8. [(B,
A, D, C,)] "}, {"role": "user", "content": "The pattern that logically
follows is:\n9. "}], "ideal": "[(D, A, B, C,)] "}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. [(E, H, B, H, BR), (B,
H, B, C, TL), (D, H, B, E, BL)] "}, {"role": "user", "content": "2. [(E,
I, F, H, BR), (B, I, F, C, BL), (D, I, F, E, TR)] "}, {"role": "user",
"content": "3. [(E, J, C, H, BR), (B, J, C, C, TL), (D, J, C, E, TR)]
"}, {"role": "user", "content": "4. [(D, D, F, D, BR), (A, D, F, F, TL),
(F, D, F, C, TR)] "}, {"role": "user", "content": "5. [(D, E, C, D, BR),
(A, E, C, F, TL), (F, E, C, C, BL)] "}, {"role": "user", "content": "6.
[(D, F, B, D, BR), (A, F, B, F, BL), (F, F, B, C, TR)] "}, {"role":
"user", "content": "7. [(E, D, C, G, BR), (A, D, C, D, BL), (A, D, C, G,
TR)] "}, {"role": "user", "content": "8. [(E, E, B, G, BR), (A, E, B, D,
TL), (A, E, B, G, TR)] "}, {"role": "user", "content": "The pattern that
logically follows is:\n9. "}], "ideal": "[(E, F, F, G, BR), (A, F, F, D,
TL), (A, F, F, G, BL)] "}
{"input": [{"role": "system", "content": "Find the pattern number 9 that
completes the sequence. Write the correct pattern with the same format
as in the examples. Patterns in the sequence are preceded by a number
from 1 to 8. "}, {"role": "user", "content": "1. [(B, A, F, A, TL)] "},
{"role": "user", "content": "2. [(B, A, E, A, TR)] "}, {"role": "user",
"content": "3. [(B, A, B, A, C)] "}, {"role": "user", "content": "4.
[(F, F, E, A, TC)] "}, {"role": "user", "content": "5. [(F, F, B, A,
CL)] "}, {"role": "user", "content": "6. [(F, F, F, A, CR)] "}, {"role":
"user", "content": "7. [(A, E, B, C, C), (A, E, B, C, BL), (A, E, B, C,
CL)] "}, {"role": "user", "content": "8. [(A, E, F, C, BL), (A, E, F, C,
BR), (A, E, F, C, CR)] "}, {"role": "user", "content": "The pattern that
logically follows is:\n9. "}], "ideal": "[(A, E, E, C, BR), (A, E, E, C,
TC), (A, E, E, C, BC)] "}
  ```
</details>
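
The multiple-choice samples above all share one shape: eight numbered patterns, eight lettered options, and an `ideal` letter. A minimal sketch of a structural check follows, assuming that shape holds for every record. The helper names `make_choice_sample` and `check_choice_sample` are hypothetical and not part of the Evals repository, and the system prompt is abbreviated.

```python
LETTERS = "ABCDEFGH"

def make_choice_sample(sequence, options, ideal):
    """Assemble one record in the chat format used by the samples above."""
    messages = [{"role": "system",
                 "content": "Only return the letter in front of the correct pattern."}]
    messages += [{"role": "user", "content": f"{i}. {p}"}
                 for i, p in enumerate(sequence, start=1)]
    messages += [{"role": "user", "content": f"{k}. {v}"} for k, v in options.items()]
    messages.append({"role": "user", "content": "The answer is "})
    return {"input": messages, "ideal": ideal}

def check_choice_sample(sample):
    """Return a list of structural problems; an empty list means none were found."""
    users = [m["content"] for m in sample["input"] if m["role"] == "user"]
    numbered = [c for c in users if c[:2] in {f"{i}." for i in range(1, 9)}]
    lettered = {c[0] for c in users if c[:2] in {f"{k}." for k in LETTERS}}
    problems = []
    if len(numbered) != 8:
        problems.append("expected 8 numbered patterns")
    if lettered != set(LETTERS):
        problems.append("expected options A to H")
    if sample["ideal"] not in lettered:
        problems.append("ideal is not one of the offered letters")
    return problems

# Values copied from the fourth sample above.
sequence = ["[(D, B, F, F,)] ", "[(F, B, D, F,)] ", "[(B, A, B, F,)] ",
            "[(B, B, F, G,)] ", "[(D, B, D, G,)] ", "[(F, A, B, G,)] ",
            "[(F, A, F, C,)] ", "[(B, A, D, C,)] "]
options = {"A": "[(D, A, B, C,)] ", "B": "[(D, A, F, C,)] ",
           "C": "[(D, A, E, C,)] ", "D": "[(C, A, B, C,)] ",
           "E": "[(D, B, B, C,)] ", "F": "[(D, A, D, C,)] ",
           "G": "[(D, I, B, C,)] ", "H": "[(D, D, B, C,)] "}
sample = make_choice_sample(sequence, options, "A")
problems = check_choice_sample(sample)
```

In this sample the text of option `A` is the same string that the generative variant of the record lists as its `ideal`, so the two formats can be cross-checked against each other.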
…i#1124)

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

The eval name is **population_span_extraction**.
The eval ID is **population_span_extraction.dev.v0**.

### Eval description

The model is shown abstracts of clinical drug trials and is tasked with
extracting the text spans that specify the population demographics of
each abstract. The demographics may be given in a single span or spread
across multiple separate spans.

A previous version included examples containing 'problem' as part of the
population (as per PICO criteria labeling) as opposed to strictly
population demographics.
We are now resubmitting a different version, with different abstracts,
which contains only demographics annotations.
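
As a minimal sketch of the record format described above, each sample pairs a system instruction and an abstract with an `ideal` answer that quotes the population spans verbatim. The helper names `build_sample` and `spans_in_abstract` are hypothetical and not part of the Evals repository. Joining multiple spans with `", "` is an assumption, since the samples shown here each contain a single span.

```python
import json

IDEAL_PREFIX = ("In the abstract, population demographics are defined "
                "by the following spans: ")

def build_sample(instruction, abstract, spans):
    """Build one chat-format record with `input` and `ideal` keys."""
    # Assumption: multiple spans are quoted and joined with ", ".
    quoted = ", ".join(f"'{s}'" for s in spans)
    return {
        "input": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": abstract},
        ],
        "ideal": IDEAL_PREFIX + quoted,
    }

def spans_in_abstract(abstract, spans):
    """Return True only if every span occurs verbatim in the abstract."""
    return all(s in abstract for s in spans)

# Excerpt of one abstract from the samples in this PR (shortened).
abstract = ("OBJECTIVE: To evaluate the 24-hour efficacy and safety of the "
            "latanoprost-timolol maleate-fixed combination vs latanoprost "
            "therapy in patients with primary open-angle glaucoma.")
spans = [" patients with primary open-angle glaucoma."]

sample = build_sample(
    "Extract the text spans containing information on the Population "
    "Demographic from the following abstract.",
    abstract,
    spans,
)
line = json.dumps(sample)  # one JSONL line
```

Checking that each span occurs verbatim in its abstract catches annotation drift early, because a span that was paraphrased rather than copied can never be matched by an extractive answer.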

### What makes this a useful eval?

The repository specifically asks for "Real-world use cases". Extracting
population spans from clinical trial abstracts is immensely useful to
researchers who have to review and compare large numbers of clinical
drug trials.

The eval dataset is generated with multiple different prompts and
satisfies all further criteria posed by OpenAI.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
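
The two path checks in the list above can be sketched as a small script. This is an illustration only: `expected_paths` and `missing_paths` are hypothetical helpers, and the sketch assumes it is pointed at the root of a local checkout of the repository.

```python
from pathlib import Path

def expected_paths(name, repo_root="."):
    """Return the data directory and YAML file the checklist refers to."""
    root = Path(repo_root)
    return {
        "data": root / "evals" / "registry" / "data" / name,
        "yaml": root / "evals" / "registry" / "evals" / f"{name}.yaml",
    }

def missing_paths(name, repo_root="."):
    """List the labels ('data', 'yaml') whose paths do not exist yet."""
    return [label for label, path in expected_paths(name, repo_root).items()
            if not path.exists()]

if __name__ == "__main__":
    # Eval name taken from this PR; run from the repository root.
    print(missing_paths("population_span_extraction"))
```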

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "I want to know how this
abstract defines the Population Demographics. Please extract the
sections in the abstract that define the demographics."}, {"role":
"user", "content": "Efficacy of the dorzolamide/timolol fixed
combination versus latanoprost in the treatment of ocular hypertension
or glaucoma: combined analysis of pooled data from two large randomized
observer and patient-masked studies.\n\nIn previous analyses of primary
efficacy data from two randomized clinical trials, standard dosing
regimens of the dorzolamide/timolol fixed combination (COSOPT) and
latanoprost (XALATAN) were shown to have equivalent efficacy with regard
to reduction in mean daytime diurnal intraocular pressure (IOP). We
performed additional post hoc analyses of pooled data from these studies
to compare further the efficacy of the two treatments. The studies used
identical 3-month, parallel group, randomized, observer-masked and
patient-masked, multicenter designs. Patients with a baseline IOP > or =
24 mm Hg were randomized to either the 2% dorzolamide/0.5% timolol
combination eye drops twice daily (n = 273) or 0.005% latanoprost eye
drops once daily (n = 271). The IOP measurements were made at 8 AM, 10
AM, 2 PM, and 4 PM at the baseline visit and then on each of the 3
monthly assessment days. The following measures were analyzed on a post
hoc basis: 1) percentages of patients meeting target levels of IOP
reduction; 2) mean IOP reduction in those patients with high IOP (> or
=30 mmHg) at baseline; 3) mean IOP at each of the assessment time points
during a day. A total of 259 patients in the dorzolamide/timolol group
and 268 patients in the latanoprost group were included in the efficacy
analysis. At 3 months, both treatments showed similar efficacy with
regard to the percentages of patients who achieved target levels of IOP
reduction (e.g., 40% IOP reduction in 15% of dorzolamide/timolol
combination patients and 13% of latanoprost patients), mean IOP
reduction in those patients with high IOP at baseline (dorzolamide/
timolol combination, 12.5 mmHg, latanoprost, 12.6 mmHg), and mean IOP at
each time point during the day. By the measures used in this analysis,
the dorzolamide/timolol combination and latanoprost were equally
effective at lowering IOP in patients with ocular hypertension or
glaucoma."}], "ideal": "In the abstract, population demographics are
defined by the following spans: 'Patients with a baseline IOP > or = 24
mm Hg '"}
{"input": [{"role": "system", "content": "Extract the text spans
containing information on the Population Demographic from the following
abstract."}, {"role": "user", "content": "Twenty-four-hour control with
latanoprost-timolol-fixed combination therapy vs latanoprost
therapy.\n\nOBJECTIVE: To evaluate the 24-hour efficacy and safety of
the latanoprost-timolol maleate-fixed combination vs latanoprost therapy
in patients with primary open-angle glaucoma.\nMETHODS: A prospective,
observer-masked, crossover, active-controlled, randomized comparison in
which after a 6-week medicine-free period, patients were randomized to
either latanoprost-timolol-fixed combination therapy or latanoprost
therapy, both dosed once each evening, alone for 8 weeks. Patients were
then switched to the opposite treatment for 8 weeks. At the end of the
washout and treatment periods, a 24-hour diurnal curve was
performed.\nRESULTS: The baseline untreated mean +/- SD diurnal curve in
37 patients who completed the study was 24.2 +/- 2.0 mm Hg. The mean
diurnal curve was 19.2 +/- 2.6 mm Hg for those who received latanoprost
therapy alone and 16.7 +/- 2.1 mm Hg for those who received the fixed
combination therapy (P&lt;.001). The fixed combination therapy also
provided a lower absolute intraocular pressure level (1.5-2.9 mm Hg,
P&lt;.001) and a greater intraocular pressure reduction from the
untreated baseline (P&lt;.001). Stinging was statistically lower with
latanoprost therapy alone (P = .04), but itching was statistically
increased compared with the fixed combination therapy (P =
.04).\nCONCLUSION: The result of this study suggests that the
latanoprost-timolol-fixed combination compared with latanoprost therapy
alone provides improved intraocular pressure reduction over the 24-hour
diurnal curve and for each individual time point in patients with
primary open-angle glaucoma."}], "ideal": "In the abstract, population
demographics are defined by the following spans: ' patients with primary
open-angle glaucoma.'"}
{"input": [{"role": "system", "content": "Extract the text spans
containing information on the Population Demographic from the following
abstract."}, {"role": "user", "content": "A 12-week, randomized,
double-masked, multicenter study of the fixed combination of latanoprost
and timolol in the evening versus the individual components.\n\nPURPOSE:
To compare the efficacy and tolerability of fixed-combination
latanoprost and timolol applied in the evening with the concomitant use
of the individual components.\nDESIGN: Twelve-week, randomized,
double-masked, multicenter study.\nPARTICIPANTS: Five hundred seventeen
randomized patients with ocular hypertension; open-angle, pigmentary, or
exfoliation glaucoma; and baseline (after washout) intraocular pressure
(IOP) levels between 23 and 33 mmHg.\nMETHODS: Patients received either
the fixed combination of latanoprost and timolol once daily in the
evening and a placebo in the morning and evening or the unfixed
combination of latanoprost once daily in the evening and timolol in the
morning and evening. Study visits were at weeks 2, 6, and 12. MAIN
OUTCOME MEASURES: The primary efficacy end point was mean change from
baseline to week 12 in diurnal IOP (mean IOPs of 8 am, 12 pm, and 4 pm).
The fixed combination was considered noninferior to the unfixed
combination if the upper limit of the 95% confidence interval (CI) of
the difference was &lt;1.5 mmHg (analysis of covariance). Adverse events
were recorded at each visit.\nRESULTS: In all, 502 patients were
included in intent-to-treat analyses (fixed combination, n = 255;
unfixed combination, n = 247). For the fixed- and unfixed-combination
groups, mean baseline diurnal IOP levels were 25.4 mmHg and 25.2 mmHg,
respectively, and mean diurnal IOP reductions were 8.7 mmHg and 9.0 mmHg
(between-treatment difference, 0.3 mmHg; 95% CI, -0.1 to 0.7 mmHg; P =
0.15). Both treatments were well tolerated.\nCONCLUSIONS: The fixed
combination of latanoprost and timolol administered once daily in the
evening is not inferior to the unfixed combination of latanoprost once
daily in the evening and timolol twice daily. The fixed combination
provides an effective and well-tolerated alternative to multiple
instillations."}], "ideal": "In the abstract, population demographics
are defined by the following spans: 'patients with ocular hypertension;
open-angle, pigmentary, or exfoliation glaucoma; and baseline (after
washout) intraocular pressure (IOP) levels between 23 and 33 mmHg'"}
{"input": [{"role": "system", "content": "This is from a clinical drug
trial abstract. Extract the parts specifying population demographics."},
{"role": "user", "content": "Efficacy of latanoprost or
fixed-combination latanoprost-timolol in patients switched from a
combination of timolol and a nonprostaglandin medication.\n\nPURPOSE: To
compare latanoprost with the fixed-combination latanoprost-timolol in
glaucoma or ocular hypertension patients switched from a combination
glaucoma therapy with timolol and another nonprostaglandin
medication.\nDESIGN: Prospective randomized clinical trial.\nMETHODS:
Glaucoma or ocular hypertension patients receiving a combined treatment
of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%,
alpha-agonist, or a topical carbonic anhydrase inhibitor) underwent a
30-day washout of their medications. A masked observer then measured
their intraocular pressure (IOP). The subjects were randomized to either
latanoprost or fixed-combination latanoprost-timolol eyedrops to use
once daily at 7 am. The IOP was measured again 30 days after the
patients started using one of the study drugs by the same examiner at
the same time. MAIN OUTCOME MEASURE: Comparison of the study
medications' hypotensive effect.\nRESULTS: Fifty-three eyes (28 in the
latanoprost group and 25 in the latanoprost-timolol group) from 28
patients were included in the study. The IOP reduction was greater in
both study groups compared with the previous combination therapy with
timolol and another nonprostaglandin medication in millimeters of
mercury (7.7+/-2.3 vs. 5.5+/-2.3, P&lt;0.001, for the latanoprost group;
8.5+/-3.5 vs. 6.3+/-2.7, P&lt;0.001, for the latanoprost-timolol group)
and percentage (35.8+/-8.2% vs. 25.6+/-8.9%, P&lt;0.001, for the
latanoprost group; 38.6+/-8.7% vs. 28.6+/-9.0%, P&lt;0.001, for the
latanoprost-timolol group). There was no statistical difference between
latanoprost and fixed-combination latanoprost-timolol in reducing IOP,
in either millimeters of mercury (P = 0.3) or percentage (P =
0.2).\nCONCLUSIONS: Both latanoprost and fixed-combination
latanoprost-timolol may be viable substitutes for timolol and another
nonprostaglandin medication in glaucoma or ocular hypertension
patients."}], "ideal": "In the abstract, population demographics are
defined by the following spans: 'Glaucoma or ocular hypertension
patients receiving a combined treatment of timolol 0.5% and another
nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical
carbonic anhydrase inhibitor)'"}
{"input": [{"role": "system", "content": "The Following text is an
abstract of a clinical drug trial that specifies a population
demographic. I want you to extract the text spans that contain these
informations."}, {"role": "user", "content": "A 6-week, double-masked,
parallel-group study of the efficacy and safety of travoprost 0.004%
compared with latanoprost 0:005%/timolol 0.5% in patients with primary
open-angle glaucoma or ocular hypertension.\n\nOBJECTIVE: The objective
of this study was to directly compare the intraocular pressure
(IOP)-lowering efficacy and safety of travoprost 0.004% eyedrops with
the fixed combination of latanoprost 0.005%/timolol 0.5% eyedrops in
patients with primary open-angle glaucoma or ocular
hypertension.\nMETHODS: This was a randomized, double-masked,
multicenter, parallel-group, active-controlled study. Adult subjects
with open-angle glaucoma (with or without pseudoexfoliation or pigment
dispersion component) or ocular hypertension were eligible to
participate if their IOP was inadequately controlled with &gt; or =4
weeks of beta-blocker monotherapy, as indicated by IOP of 22 to 36 mm Hg
at 9 AM at screening. Patients were randomly assigned in a 1:1 ratio to
receive placebo + travoprost or latanoprost/timolol + placebo. Patients
in the travoprost group administered travoprost at 9 PM and placebo at 9
AM; patients in the latanoprost/timolol group administered
latanoprost/timolol at 9 AM and placebo at 9 PM. IOP measurements were
performed using Goldmann applanation tonometry at 9 AM and 5 PM at the
week-2 and week-6 visits. Both volunteered and elicited reports of
adverse events were collected; all patients who were randomized and
received &gt; or =1 dose of study drug were included in the safety
analysis.\nRESULTS: One hundred ten patients were randomized, of whom
106 patients were evaluable (travoprost, n = 50; latanoprost/timolol, n
= 56). There were no statistically significant differences at baseline
between the treatment groups, based on age group, sex, race, iris color,
or diagnosis. Mean IOP values were not statistically different between
groups at baseline or during treatment. In the pooled results for 9 Am
assessment at weeks 2 and 6, mean (SEM) IOP reductions for travoprost
and latanoprost/timolol were 7.0 (0.5) and 6.4 (0.5) mm Hg, respectively
(P = NS). Adverse events related to therapy were mild in nature, and
there were no statistically significant differences between the 2
treatment groups. The most frequently experienced adverse events in the
travoprost group were ocular hyperemia (9.3%), foreign body sensation
(5.6%), abnormal vision (1.9%), allergic reaction (1.9%), conjunctivitis
(1.9%), dacryocystitis (1.9%), eye discharge (1.9%), eye pruritus
(1.9%), lid edema (1.9%), lid erythema (1.9%), and tearing (1.9%). In
the latanoprost/timolol group, the most frequently experienced adverse
events were cataract (1.8%), dry eyes (1.8%), eye pruritus (1.8%),
foreign body sensation (1.8%), and ocular hyperemia
(1.8%).\nCONCLUSIONS: Mean IOP changes from baseline for travoprost
0.004% and latanoprost 0.005%/timolol 0.5% fixed combination were not
significantly different at follow-up in these patients. Both medications
were well tolerated."}], "ideal": "In the abstract, population
demographics are defined by the following spans: 'in patients with
primary open-angle glaucoma or ocular hypertension.', 'Adult subjects
with open-angle glaucoma (with or without pseudoexfoliation or pigment
dispersion component) or ocular hypertension', 'IOP was inadequately
controlled with &gt; or =4 weeks of beta-blocker monotherapy'"}
{"input": [{"role": "system", "content": "I want to know how this
abstract defines the Population Demographics. Please extract the
sections in the abstract that define the demographics."}, {"role":
"user", "content": "Comparison of the efficacy and safety of travoprost
with a fixed-combination of dorzolamide and timolol in patients with
open-angle glaucoma or ocular hypertension.\n\nPURPOSE: The purpose of
this study was to compare travoprost (TRAV; travoprost 0.004%) and the
fixed-combination of dorzolamide/timolol (DTFC; dorzolamide 2.0%/timolol
maleate 0.5%) ophthalmic solutions for reducing intraocular pressure
(IOP) in patients with primary open-angle glaucoma (OAG) or ocular
hypertension (OHT).\nMETHODS: This was a randomized single masked, study
with parallel controls. The TRAV group (n = 29) dosed once daily at 9:00
PM while the DTFC group (n = 27) dosed twice daily at 9:00 AM and 9:00
PM. IOP was measured at baseline, and following 3 weeks and 6 weeks of
treatment at 8:00 AM, 12:00 PM, 4:00 PM, and 8:00 PM.\nRESULTS: Mean
average IOP reductions from baseline during the course of the day were
7.5 (32.7%) and 7.1 (30.7%) mmHg for TRAV and 4.8 (23.1%) and 4.5
(21.7%) mmHg for DTFC at 3 weeks and 6 weeks, respectively. The greater
IOP reduction for patients receiving TRAV was statistically significant
at both the 3 and 6 week visits when averaged across all four time
points (p &lt; 0.01). The two products were well-tolerated over the
course of the 6 week study. Some factors such as taste perversion were
reported more often in the DTFC group.\nCONCLUSIONS: Travoprost
monotherapy provided better efficacy in terms of IOP reduction and
percentage of IOP reduction compared to dorzolamide 2.0%/timolol maleate
0.5% fixed combination."}], "ideal": "In the abstract, population
demographics are defined by the following spans: 'in patients with
primary open-angle glaucoma (OAG)', 'ocular hypertension (OHT)'"}
{"input": [{"role": "system", "content": "What is the Population
Demographic for the following abstract? Extract the text spans that
define it."}, {"role": "user", "content": "Efficacy and safety of
latanoprost versus travoprost in exfoliative glaucoma
patients.\n\nOBJECTIVE: To evaluate 24-hour intraocular pressure (IOP)
efficacy of latanoprost versus travoprost, each given every evening, in
exfoliative glaucoma patients.\nDESIGN: Prospective, observer-masked,
crossover comparison.\nPARTICIPANTS: Forty patients with exfoliation
glaucoma.\nMETHODS: Patients with a pressure of >24 mmHg were randomized
to latanoprost or travoprost for an 8-week treatment period after a
6-week medicine-free period. Patients were then switched to the opposite
treatment for the second period. At untreated baseline and at the end of
each treatment period the IOP was measured at 6 am, 10 am, 2 pm, 6 pm,
10 pm, and 2 am. MAIN OUTCOME MEASURE: Diurnal IOP.\nRESULTS: The mean
24-hour IOP was 25.1+/-2.5 mmHg at baseline, 17.8+/-2.1 mmHg on
latanoprost, and 17.3+/-2.2 mmHg on travoprost (P = 0.001). Individual
time points were similar between treatments, except at 6 pm when
travoprost provided lower IOP (16.7+/-2.6 vs 17.9+/-2.5 mmHg, P<0.001).
Adverse events showed more conjunctival hyperemia with travoprost (n =
15) than latanoprost (n = 6; P = 0.03).\nCONCLUSIONS: Latanoprost and
travoprost both significantly reduce the 24-hour IOP from baseline in
exfoliative glaucoma, but travoprost may demonstrate a greater
hypotensive efficacy in the late afternoon."}], "ideal": "In the
abstract, population demographics are defined by the following spans:
'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma
patients'"}
{"input": [{"role": "system", "content": "What is the Population
Demographic for the following abstract? Extract the text spans that
define it."}, {"role": "user", "content": "Comparison of the ocular
hypotensive effects of bimatoprost and timolol-dorzolamide combination
in patients with elevated intraocular pressure: a 6-month
study.\n\nPURPOSE: To compare the ocular hypotensive efficacy and safety
of topical bimatoprost and timolol-dorzolamide combination in patients
with primary open-angle glaucoma (POAG) or ocular hypertension during 6
months of treatment.\nMETHODS: A sample of 65 patients with a diagnosis
of POAG or ocular hypertension were randomized to receive either
bimatoprost 0.03% once daily or timolol-dorzolamide combination twice
daily. Study visits occurred at baseline and after 2 weeks and 1, 3 and
6 months of therapy. Intraocular pressure (IOP) measurements were
performed at 12.00 hours at all study visits and also at 08.00 hours and
16.00 hours at baseline and 6-month visits. At each visit, local and
systemic side-effects that occurred during the treatment period were
recorded. Student's t-test was used to compare the differences between
IOP values.\nRESULTS: Differences in IOP between the bimatoprost and
timolol-dorzolamide groups were statistically insignificant at all study
visits (p > 0.05). In the bimatoprost-treated group, the IOP reduction
was 6.2 +/- 1.8 mmHg, whereas it was 6.5 +/- 2.3 mmHg in the
timolol-dorzolamide group after 6 months of treatment. The difference
was not statistically significant (p = 0.48).\nCONCLUSIONS: The
IOP-lowering efficacies of bimatoprost and timolol-dorzolamide
combination were similar over a 6-month follow-up. Both bimatoprost and
the timolol-dorzolamide combination were well tolerated. Bimatoprost can
be used as a longterm monotherapy agent in the treatment of POAG and
ocular hypertension."}], "ideal": "In the abstract, population
demographics are defined by the following spans: 'patients with primary
open-angle glaucoma (POAG) or ocular hypertension'"}
{"input": [{"role": "system", "content": "What is the Population
Demographic for the following abstract? Extract the text spans that
define it."}, {"role": "user", "content": "Comparing the fixed
combination brimonidine-timolol versus fixed combination
dorzolamide-timolol in patients with elevated intraocular
pressure.\n\nPURPOSE: To evaluate the efficacy of fixed combination
brimonidine-timolol (FCBT) versus fixed combination dorzolamide-timolol
(FCDT) given twice daily in patients with primary open angle glaucoma
(POAG) or ocular hypertension (OH).\nDESIGN: Prospective, multicentre,
masked-observer, crossover comparison.\nPARTICIPANTS: Sixteen patients
with POAG and 14 with OH.\nMETHODS: The participants of the study were
washed out from their previous medication and randomized to fixed FCBT
or FCDT for the first 4-week treatment period. Subjects then were washed
for 4 weeks and started on the opposite medication for the second 4-week
period. Intraocular pressure (IOP) was measured with a Goldmann
applanation tonometer at 8:00 a.m., 12:00 noon and 4:00 p.m. at each
baseline and at the end of each treatment period. Unsolicited ocular
adverse events were also recorded. MAIN OUTCOME MEASURES: Comparison of
the IOP lowering effect of FCBT and FCDT.\nRESULTS: The baseline mean
diurnal IOP for all 30 subjects (30 eyes) was 22.9 +/- 1.6 mmHg. Both
fixed combinations significantly reduced IOP compared with baseline (p
&lt; 0.00001). The mean diurnal IOP following 4 weeks of therapy was
15.0 +/- 2.1 mmHg for FCBT and 15.4 +/- 2.1 mmHg for FCDT (p = 0.510).
The mean diurnal IOP reduction was 7.8 +/- 1.9 mmHg for FCBT and 7.4 +/-
1.8 mmHg for FCDT (p = 0.430). Overall, 14 subjects complained about
ocular adverse events: two only for FCBT, seven only for FCDT and five
for both drugs. Although there was no significant difference between the
number of subjects that reported ocular adverse events with FCBT (n = 7)
and FCDT (n = 12) (p = 0.359), FCDT caused more ocular stinging upon
instillation (n = 9) than FCBT (n = 1) (p = 0.027).\nCONCLUSION: This
study suggests that FCBT and FCDT, each given twice daily, have similar
efficacy in patients with POAG or OH."}], "ideal": "In the abstract,
population demographics are defined by the following spans: 'patients
with primary open angle glaucoma (POAG) or ocular hypertension (OH)',
'patients with POAG', 'OH'"}
{"input": [{"role": "system", "content": "I want to know how this
abstract defines the Population Demographics. Please extract the
sections in the abstract that define the demographics."}, {"role":
"user", "content": "A comparison of the safety and intraocular pressure
lowering of bimatoprost/timolol fixed combination versus
latanoprost/timolol fixed combination in patients with open-angle
glaucoma.\n\nPURPOSE: To compare the efficacy and tolerability of a once
daily evening dose of the latanoprost/timolol fixed combination (LTFC)
with that of a once-daily evening dose of the bimatoprost/timolol fixed
combination (BTFC) in patients with open-angle glaucoma with elevated
intraocular pressure (IOP) insufficiently responsive to monotherapy with
prostaglandin analogues/prostamides.\nDESIGN: Prospective, randomized,
evaluator masked, single-center study.\nPARTICIPANTS: 36 patients with a
diagnosis of open-angle glaucoma, with or without pseudoexfoliation, and
inadequate control of IOP, insufficiently responsive to monotherapy with
prostaglandin analogues/prostamides. MAIN OUTCOME MEASURE: The primary
end-points were the change in IOP at 9:00 am from baseline to week 4,
and the difference between treatment groups in the mean diurnal IOP
reduction from baseline to week 4.\nRESULTS: BTFC provided significantly
greater mean diurnal IOP reduction [mean (standard deviation)] 2.8 (0.9)
mmHg, compared with LTFC 2.1 (0.6) mmHg, p = 0.0214. Both treatments
significantly reduced the IOP from baseline at each IOP time-point
measured, p < 0.0001, and for the mean diurnal IOP; p = 0.0049 for the
LTFC, and p < 0.0001 for the BTFC. There were no significant differences
in average hyperemia scores among groups, 1.25 (0.5) vs. 1.62 (0.69), p
= 0.3835, for the LTFC and the BTFC, respectively.\nCONCLUSIONS: The
results of this study showed a significantly higher IOP-lowering effect
of a once-daily evening dose of the BTFC compared to that of a
once-daily evening administration of the LTFC."}], "ideal": "In the
abstract, population demographics are defined by the following spans:
'patients with open-angle glaucoma with elevated intraocular pressure
(IOP) insufficiently responsive to monotherapy with prostaglandin
analogues/prostamides'"}

  ```
</details>
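
The `ideal` strings in the samples above quote spans that are meant to appear verbatim in the abstract, including any leading or trailing whitespace. A minimal Python sketch of a sanity check for that follows. It assumes each JSONL line has the shape shown above and that spans are single-quoted and separated by `', '`. The helper names `parse_spans` and `missing_spans` are hypothetical and are not part of the evals library.

```python
import json

# Fixed sentence that opens every `ideal` string in the samples above.
PREFIX = "In the abstract, population demographics are defined by the following spans:"


def parse_spans(ideal):
    """Return the quoted spans listed after PREFIX (assumed format: 'a', 'b')."""
    if not ideal.startswith(PREFIX):
        raise ValueError("ideal does not start with the expected prefix")
    body = ideal[len(PREFIX):].strip()
    if len(body) < 2 or body[0] != "'" or body[-1] != "'":
        raise ValueError("span list is not wrapped in single quotes")
    # Drop the outer quotes, then split on the quote-comma-quote separator.
    return body[1:-1].split("', '")


def missing_spans(sample):
    """Return the spans from `ideal` that do not occur verbatim in the user message."""
    abstract = next(m["content"] for m in sample["input"] if m["role"] == "user")
    return [span for span in parse_spans(sample["ideal"]) if span not in abstract]


def check_file(path):
    """Yield (line_number, missing) for every sample with at least one unmatched span."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                missing = missing_spans(json.loads(line))
                if missing:
                    yield number, missing
```

A span that `missing_spans` reports is not necessarily wrong; it may only differ from the abstract in whitespace or escaping, which is still worth knowing when the eval is graded by string matching.
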
Minor misspelling fix

---------

Co-authored-by: Alvin Wang
Co-authored-by: Tim
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines; **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee that the
PR will be merged or that GPT-4 access will be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples; we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

korean_dialects

### Eval description

The eval aims to assess the model's ability to identify the specific
South Korean dialect a sentence belongs to.
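
As a rough illustration of the task format, the Python sketch below checks that every sample's `ideal` is one of the six dialect labels named in the system prompt and counts how many samples each label has. It assumes one JSON object per line with an `ideal` field, as in the samples included in this PR. The function name `label_counts` is hypothetical.

```python
import json
from collections import Counter

# The six answer choices listed in the system prompt of each sample.
LABELS = {"강원도", "경상도", "전라도", "제주도", "충청도", "서울"}


def label_counts(jsonl_lines):
    """Count samples per dialect label; raise ValueError on an unexpected label."""
    counts = Counter()
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between samples
        ideal = json.loads(line)["ideal"]
        if ideal not in LABELS:
            raise ValueError(f"line {number}: unexpected label {ideal!r}")
        counts[ideal] += 1
    return counts
```

Passing an open file object works, since iterating a file yields its lines. The per-label counts make it easy to see whether the six dialects are evenly represented.
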

### What makes this a useful eval?

This eval provides the opportunity to understand how well GPT models can
classify South Korean dialects. The dialects within South Korea are
fairly distinct in terms of pronunciation, vocabulary, grammar, and
intonation. Being able to determine the dialect can help in providing
social, cultural, and historical insights.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)
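
The two location checks above can be scripted before opening a PR. The following Python sketch only tests that the expected paths exist relative to a repository checkout; it does not validate their contents. The function names and the `repo_root` parameter are assumptions made for illustration.

```python
from pathlib import Path


def registry_paths(repo_root, name):
    """Build the data directory and YAML paths the checklist expects for an eval."""
    root = Path(repo_root)
    return {
        "data": root / "evals" / "registry" / "data" / name,
        "yaml": root / "evals" / "registry" / "evals" / f"{name}.yaml",
    }


def missing_registry_paths(repo_root, name):
    """Return the labels ('data', 'yaml') whose expected path does not exist."""
    paths = registry_paths(repo_root, name)
    return [label for label, path in paths.items() if not path.exists()]
```

For example, `missing_registry_paths(".", "korean_dialects")` run from the repository root returns an empty list when both locations are present.
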

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"모두 다 오세요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"벌써 열시 반이에요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"창피해서 말도 못해요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"그럼요 가능하죠."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"똑바로 해주세요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"이거 조금 짜."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"간신히 도착했습니다."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"야 이거 좀 별로다."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"와 정말 많다."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"거기 구멍에 잘
끼워보세요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"진짜 피곤해요."}],"ideal":"서울"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"마카 모예."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"하마 열시 반이잖소."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"남새시러운기 왜그르나."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"해봐요 그 될껄?"}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"거 똑떼이 해야될끼라요."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"이기 쫌 짜구워."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"거 간신히 도착했잖소."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"아 매련도 읍싸."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"마이 있잖소."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"구녕에 똑띠 끼워봐요."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"쌔가 빠진다야."}],"ideal":"강원도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"야 싹 다 온나."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"벌써 열시 반이가."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"하모 된다카이."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"단디 해라이."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"이거 쫌 짭다."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"내 강가이 도착했다."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"그 영 파이다."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"억수로 많다."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"구녕에 단디 낑가라이."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"아따 대다."}],"ideal":"경상도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"싹 다 와부쇼."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"폴쎄 열시 반이되브렀네."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"아따 하먼 된당께."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"똑떨어지게 해보랑게."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"포도시 도착했시야."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"야 이거 물짜야 못
쓰것다이."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"아따 겁나게 많네."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"구녁에 잘 찡거보쇼 딱 맞게 그게
맞소?"}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"오메 된 거 걍 오만 삭신이 아퍼
죽겄소이."}],"ideal":"전라도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"거기 다 와봐유."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"벌써 열시 반인겨?"}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"될겨."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"할껴 알 할껴."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"갠시히 왔네."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"글쎄 잘 모르겄어."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"뭐여 뭐가 이렇게
많은겨."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"거기 구멍에 잘 좀
낌어봐."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"진짜 대간햐."}],"ideal":"충청도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"발써 열시 반되수과."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"키여 맞수다게."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"졸바로 해줍써."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"제우 제우 와수다."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"영 벨론게게."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"잘도 하영이 숨게데."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"트멍더래 잘 쫍지라."}],"ideal":"제주도"}
{"input":[{"role":"system","content":"You will be prompted with a Korean
sentence to determine which South Korean dialect the sentence belongs
to. This is multiple choice problem where your answer is one of the
following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the
dialect with no other words or
punctuation."},{"role":"user","content":"잘도 버침게."}],"ideal":"제주도"}
```
</details>
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

NFL Point Combinations

### Eval description

This eval tests the model's ability to calculate all the possible ways
to achieve a specific score by a single team in an NFL game.

### What makes this a useful eval?

This eval is closely related to the coin change problem, which GPT-4
handles very well. However, the models are extremely inaccurate when
asked to apply the same type of problem to a real-life situation (2.5%
accuracy for GPT-3.5-turbo and 12.5% accuracy for GPT-4). It is
important for the model to learn how to derive logic problems from
real-life context.
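
For reference, the coin change connection can be sketched as a short dynamic program. This is only an illustration, not the author's method: the scoring values `(2, 3, 6, 7)` (safety, field goal, touchdown, touchdown plus extra point) are an assumption inferred from the sample answers listed in this PR, and the names `ASSUMED_SCORES` and `count_combinations` are hypothetical.

```python
from typing import Sequence

# Assumption: these scoring values are inferred from the sample answers in
# this PR (safety, field goal, touchdown, touchdown + extra point). The PR
# itself does not state which values the author used.
ASSUMED_SCORES = (2, 3, 6, 7)


def count_combinations(target: int, scores: Sequence[int] = ASSUMED_SCORES) -> int:
    """Count order-independent ways to reach `target` points (coin change)."""
    # ways[t] holds the number of combinations summing to t using the
    # scoring values processed so far.
    ways = [0] * (target + 1)
    ways[0] = 1  # one way to score zero points: no scoring plays
    for value in scores:
        # Iterating values in the outer loop counts combinations,
        # not permutations, so order does not matter.
        for total in range(value, target + 1):
            ways[total] += ways[total - value]
    return ways[target]


if __name__ == "__main__":
    for points in (4, 6, 7, 12, 25, 38):
        print(points, f"[{count_combinations(points)}]")
```

Under the assumed values, the counts for 4, 6, 7, 12, 25, and 38 points agree with the `ideal` fields of the samples in this PR. A different value set would give different counts.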

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 4 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[1]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 6 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[3]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 7 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[2]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 12 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[7]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 25 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[24]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 38 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[68]"}
  ```
</details>
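
The prompts above ask for the final answer as a single number in square brackets. A minimal sketch of how such a completion could be checked is shown here; it is not the grading logic of the eval class used in this PR, and the helper name `extract_final_answer` is hypothetical.

```python
import re
from typing import Optional

# Matches a non-negative integer enclosed in square brackets, e.g. "[24]".
BRACKETED_NUMBER = re.compile(r"\[(\d+)\]")


def extract_final_answer(completion: str) -> Optional[int]:
    """Return the last bracketed integer in `completion`, or None if absent."""
    matches = BRACKETED_NUMBER.findall(completion)
    # Assumption: the final answer is the last bracketed number, since the
    # model is asked to list the combinations before giving its answer.
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    sample = "Two safeties is the only option. Final answer: [1]"
    print(extract_final_answer(sample))
```

Taking the last match is a design choice for this sketch. A stricter check could require exactly one bracketed number in the completion.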
# Thank you for contributing an eval! ♥️

## Eval details 📑
### Eval name
Pantone To Hex

### Eval description
This eval tests the model's ability to convert friendly Pantone color
names to their closest hex counterparts.
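
As a rough illustration of the target format, the sketch below validates a six-digit hex color such as `#FEDD00` (the `ideal` value listed for "Yellow C" in this PR) and splits it into RGB components. The helper name `hex_to_rgb` is hypothetical and is not part of the eval.

```python
import re
from typing import Tuple

# A hex color in the form used by the samples: '#' followed by six hex digits.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def hex_to_rgb(value: str) -> Tuple[int, int, int]:
    """Convert a '#RRGGBB' string into an (R, G, B) tuple of 0-255 integers."""
    if not HEX_COLOR.fullmatch(value):
        raise ValueError(f"not a #RRGGBB hex color: {value!r}")
    # Each channel is two hex digits, starting after the leading '#'.
    red, green, blue = (int(value[i:i + 2], 16) for i in (1, 3, 5))
    return red, green, blue


if __name__ == "__main__":
    print(hex_to_rgb("#FEDD00"))
```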

### What makes this a useful eval?

Pantone colors are something that a lot of nontechnical folks use, and
converting color names like "Neutral Black" to hex is not intuitive.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"Yellow
C"}],"ideal":"#FEDD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"Yellow 012
C"}],"ideal":"#FFD700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Orange 021 C"}],"ideal":"#FE5000"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Warm Red C"}],"ideal":"#F9423A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Red 032 C"}],"ideal":"#EF3340"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Rubine Red C"}],"ideal":"#CE0058"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Rhodamine Red C"}],"ideal":"#E10098"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Purple C"}],"ideal":"#BB29BB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Violet C"}],"ideal":"#440099"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Blue 072 C"}],"ideal":"#10069F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Reflex Blue C"}],"ideal":"#001489"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Process Blue C"}],"ideal":"#0085CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Green C"}],"ideal":"#00AB84"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Black C"}],"ideal":"#2D2926"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Yellow 0131 C"}],"ideal":"#F2F0A1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Red 0331 C"}],"ideal":"#FCAEBB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Magenta 0521 C"}],"ideal":"#F1B2DC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Violet 0631 C"}],"ideal":"#BF9BDE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Blue 0821 C"}],"ideal":"#74D1EA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Green 0921 C"}],"ideal":"#9DE7D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Black 0961 C"}],"ideal":"#9E978E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"801 C"}],"ideal":"#009ACE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"802 C"}],"ideal":"#44D62C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"803 C"}],"ideal":"#FFE900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"804 C"}],"ideal":"#FFAA4D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"805 C"}],"ideal":"#FF7276"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"806 C"}],"ideal":"#FF3EB5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"807 C"}],"ideal":"#EA27C2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"871 C"}],"ideal":"#84754E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"872 C"}],"ideal":"#85714D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"873 C"}],"ideal":"#866D4B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"874 C"}],"ideal":"#8B6F4E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"875 C"}],"ideal":"#87674F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"876 C"}],"ideal":"#8B634B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"877 C"}],"ideal":"#8A8D8F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Yellow C"}],"ideal":"#FFD900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Orange C"}],"ideal":"#FF5E00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Red C"}],"ideal":"#F93822"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Strong Red C"}],"ideal":"#CE0056"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Pink C"}],"ideal":"#D62598"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Purple C"}],"ideal":"#4E008E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Dark Blue C"}],"ideal":"#00239C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Blue C"}],"ideal":"#0084CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Green C"}],"ideal":"#00B08B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Neutral Black C"}],"ideal":"#222223"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"100 C"}],"ideal":"#F6EB61"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"101 C"}],"ideal":"#F7EA48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"102 C"}],"ideal":"#FCE300"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"103 C"}],"ideal":"#C5A900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"104 C"}],"ideal":"#AF9800"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"105 C"}],"ideal":"#897A27"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7401 C"}],"ideal":"#F5E1A4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7402 C"}],"ideal":"#ECD898"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7403 C"}],"ideal":"#EED484"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7404 C"}],"ideal":"#F4DA40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7405 C"}],"ideal":"#F2CD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7406 C"}],"ideal":"#F1C400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7407 C"}],"ideal":"#CBA052"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"106 C"}],"ideal":"#F9E547"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"107 C"}],"ideal":"#FBE122"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"108 C"}],"ideal":"#FEDB00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"109 C"}],"ideal":"#FFD100"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"110 C"}],"ideal":"#DAAA00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"111 C"}],"ideal":"#AA8A00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"112 C"}],"ideal":"#9C8412"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"113 C"}],"ideal":"#FAE053"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"114 C"}],"ideal":"#FBDD40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"115 C"}],"ideal":"#FDDA24"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"116 C"}],"ideal":"#FFCD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"117 C"}],"ideal":"#C99700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"118 C"}],"ideal":"#AC8400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"119 C"}],"ideal":"#897322"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"127 C"}],"ideal":"#F3DD6D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"128 C"}],"ideal":"#F3D54E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"129 C"}],"ideal":"#F3D03E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"130 C"}],"ideal":"#F2A900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"131 C"}],"ideal":"#CC8A00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"132 C"}],"ideal":"#A07400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"133 C"}],"ideal":"#6C571B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1205 C"}],"ideal":"#F8E08E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1215 C"}],"ideal":"#FBD872"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1225 C"}],"ideal":"#FFC845"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1235 C"}],"ideal":"#FFB81C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1245 C"}],"ideal":"#C69214"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1255 C"}],"ideal":"#AD841F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1265 C"}],"ideal":"#886B25"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"120 C"}],"ideal":"#FBDB65"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"121 C"}],"ideal":"#FDD757"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"122 C"}],"ideal":"#FED141"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"123 C"}],"ideal":"#FFC72C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"124 C"}],"ideal":"#EAAA00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"125 C"}],"ideal":"#B58500"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"126 C"}],"ideal":"#9A7611"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7548 C"}],"ideal":"#FFC600"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7549 C"}],"ideal":"#FFB500"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7550 C"}],"ideal":"#D19000"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7551 C"}],"ideal":"#B47E00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7552 C"}],"ideal":"#73531D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7553 C"}],"ideal":"#5A4522"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7554 C"}],"ideal":"#4B3D2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7555 C"}],"ideal":"#D29F13"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7556 C"}],"ideal":"#B78B20"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7557 C"}],"ideal":"#9F7D23"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7558 C"}],"ideal":"#967126"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7559 C"}],"ideal":"#8F6A2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7560 C"}],"ideal":"#7D622E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7561 C"}],"ideal":"#6C5D34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"134 C"}],"ideal":"#FDD26E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"135 C"}],"ideal":"#FFC658"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"136 C"}],"ideal":"#FFBF3F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"137 C"}],"ideal":"#FFA300"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"138 C"}],"ideal":"#DE7C00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"139 C"}],"ideal":"#AF6D04"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"140 C"}],"ideal":"#74531C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1345 C"}],"ideal":"#FDD086"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1355 C"}],"ideal":"#FFC56E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1365 C"}],"ideal":"#FFB549"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1375 C"}],"ideal":"#FF9E1B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1385 C"}],"ideal":"#D57800"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1395 C"}],"ideal":"#996017"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1405 C"}],"ideal":"#6E4C1E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"141 C"}],"ideal":"#F2C75C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"142 C"}],"ideal":"#F1BE48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"143 C"}],"ideal":"#F1B434"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"144 C"}],"ideal":"#ED8B00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"145 C"}],"ideal":"#CF7F00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"146 C"}],"ideal":"#A76D11"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"147 C"}],"ideal":"#715C2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7408 C"}],"ideal":"#F6BE00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7409 C"}],"ideal":"#F0B323"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7410 C"}],"ideal":"#FEAD77"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7411 C"}],"ideal":"#E6A65D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7412 C"}],"ideal":"#D38235"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7413 C"}],"ideal":"#DC8633"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7414 C"}],"ideal":"#C16C18"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7562 C"}],"ideal":"#BD9B60"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7563 C"}],"ideal":"#D69A2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7564 C"}],"ideal":"#DB8A06"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7565 C"}],"ideal":"#CD7925"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7566 C"}],"ideal":"#AD6433"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7567 C"}],"ideal":"#89532F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7568 C"}],"ideal":"#775135"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7569 C"}],"ideal":"#D78825"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7570 C"}],"ideal":"#D3832B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7571 C"}],"ideal":"#C67D30"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7572 C"}],"ideal":"#B67233"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7573 C"}],"ideal":"#A7662B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7574 C"}],"ideal":"#9E6A38"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7575 C"}],"ideal":"#835D32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"712 C"}],"ideal":"#FCC89B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"713 C"}],"ideal":"#FDBE87"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"714 C"}],"ideal":"#FDAA63"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"715 C"}],"ideal":"#F68D2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"716 C"}],"ideal":"#EA7600"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"717 C"}],"ideal":"#D45D00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"718 C"}],"ideal":"#BE4D00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"148 C"}],"ideal":"#FECB8B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"149 C"}],"ideal":"#FFC27B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"150 C"}],"ideal":"#FFB25B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"151 C"}],"ideal":"#FF8200"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"152 C"}],"ideal":"#E57200"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"153 C"}],"ideal":"#BE6A14"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"154 C"}],"ideal":"#9B5A1A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"155 C"}],"ideal":"#EFD19F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"156 C"}],"ideal":"#EFBE7D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"157 C"}],"ideal":"#ECA154"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"158 C"}],"ideal":"#E87722"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"159 C"}],"ideal":"#CB6015"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"160 C"}],"ideal":"#A1561C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"161 C"}],"ideal":"#603D20"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1485 C"}],"ideal":"#FFAE62"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1495 C"}],"ideal":"#FF8F1C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1505 C"}],"ideal":"#FF6900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1525 C"}],"ideal":"#B94700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1535 C"}],"ideal":"#94450B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1545 C"}],"ideal":"#653819"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1555 C"}],"ideal":"#FFB990"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1565 C"}],"ideal":"#FFA06A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1575 C"}],"ideal":"#FF7F32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1585 C"}],"ideal":"#FF6A13"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1595 C"}],"ideal":"#D86018"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1605 C"}],"ideal":"#A65523"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1615 C"}],"ideal":"#8B4720"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"162 C"}],"ideal":"#FFBE9F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"163 C"}],"ideal":"#FF9D6E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"164 C"}],"ideal":"#FF7F41"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"165 C"}],"ideal":"#FF671F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"166 C"}],"ideal":"#E35205"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"167 C"}],"ideal":"#BE531C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"168 C"}],"ideal":"#73381D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7576 C"}],"ideal":"#DB864E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7577 C"}],"ideal":"#E07E3C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7578 C"}],"ideal":"#DC6B2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7579 C"}],"ideal":"#DC582A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7580 C"}],"ideal":"#C05131"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7581 C"}],"ideal":"#864A33"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7582 C"}],"ideal":"#674736"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1625 C"}],"ideal":"#FFA38B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1635 C"}],"ideal":"#FF8D6D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1645 C"}],"ideal":"#FF6A39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1655 C"}],"ideal":"#FC4C02"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1665 C"}],"ideal":"#DC4405"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1675 C"}],"ideal":"#A9431E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1685 C"}],"ideal":"#833921"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"169 C"}],"ideal":"#FFB3AB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"170 C"}],"ideal":"#FF8674"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"171 C"}],"ideal":"#FF5C39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"172 C"}],"ideal":"#FA4616"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"173 C"}],"ideal":"#CF4520"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"174 C"}],"ideal":"#963821"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"175 C"}],"ideal":"#6B3529"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7583 C"}],"ideal":"#C4622D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7584 C"}],"ideal":"#BA5826"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7585 C"}],"ideal":"#AF5C37"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7586 C"}],"ideal":"#9E5330"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7587 C"}],"ideal":"#924C2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7588 C"}],"ideal":"#7B4D35"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7589 C"}],"ideal":"#5C4738"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7590 C"}],"ideal":"#D4B59E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7591 C"}],"ideal":"#C07D59"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7592 C"}],"ideal":"#B15533"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7593 C"}],"ideal":"#9D432C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7594 C"}],"ideal":"#7C3A2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7595 C"}],"ideal":"#6B3D2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7596 C"}],"ideal":"#5C3D31"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7597 C"}],"ideal":"#D14124"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7598 C"}],"ideal":"#BD472A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7599 C"}],"ideal":"#B33D26"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7600 C"}],"ideal":"#8D3F2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7601 C"}],"ideal":"#83412C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7602 C"}],"ideal":"#7B4931"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7603 C"}],"ideal":"#674230"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7604 C"}],"ideal":"#E4D5D3"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7605 C"}],"ideal":"#E1BBB4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7606 C"}],"ideal":"#D6938A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7607 C"}],"ideal":"#C26E60"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7608 C"}],"ideal":"#A4493D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7609 C"}],"ideal":"#823B34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7610 C"}],"ideal":"#683431"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7611 C"}],"ideal":"#DDBCB0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7612 C"}],"ideal":"#CA9A8E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7613 C"}],"ideal":"#BC8A7E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7614 C"}],"ideal":"#A37F74"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7615 C"}],"ideal":"#866761"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7616 C"}],"ideal":"#6B4C4C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7617 C"}],"ideal":"#583D3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7520 C"}],"ideal":"#EABEB0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7521 C"}],"ideal":"#C09C83"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7522 C"}],"ideal":"#B46A55"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7523 C"}],"ideal":"#AB5C57"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7524 C"}],"ideal":"#A45248"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7525 C"}],"ideal":"#9A6A4F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7526 C"}],"ideal":"#8A391B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"489 C"}],"ideal":"#ECC3B2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"488 C"}],"ideal":"#ECBAA8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"487 C"}],"ideal":"#EAA794"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"486 C"}],"ideal":"#E8927C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"485 C"}],"ideal":"#DA291C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"484 C"}],"ideal":"#9A3324"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"483 C"}],"ideal":"#653024"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"176 C"}],"ideal":"#FFB1BB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"177 C"}],"ideal":"#FF808B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"178 C"}],"ideal":"#FF585D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"179 C"}],"ideal":"#E03C31"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"180 C"}],"ideal":"#BE3A34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"181 C"}],"ideal":"#81312F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1765 C"}],"ideal":"#FFA3B5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1775 C"}],"ideal":"#FF8DA1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1785 C"}],"ideal":"#F8485E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1788 C"}],"ideal":"#EE2737"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1795 C"}],"ideal":"#D22630"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1805 C"}],"ideal":"#AF272F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1815 C"}],"ideal":"#7C2529"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1767 C"}],"ideal":"#FCAFC0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1777 C"}],"ideal":"#FB637E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1787 C"}],"ideal":"#F4364C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1797 C"}],"ideal":"#CB333B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1807 C"}],"ideal":"#A4343A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1817 C"}],"ideal":"#643335"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7618 C"}],"ideal":"#C66E4E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7619 C"}],"ideal":"#C04C36"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7620 C"}],"ideal":"#B7312C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7621 C"}],"ideal":"#AB2328"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7622 C"}],"ideal":"#93272C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7623 C"}],"ideal":"#8A2A2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7624 C"}],"ideal":"#802F2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7625 C"}],"ideal":"#E1523D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7626 C"}],"ideal":"#C63527"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7627 C"}],"ideal":"#A72B2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7628 C"}],"ideal":"#9E2A2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7629 C"}],"ideal":"#6D3332"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7630 C"}],"ideal":"#633231"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7631 C"}],"ideal":"#572D2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7415 C"}],"ideal":"#E6BAA8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7416 C"}],"ideal":"#E56A54"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7417 C"}],"ideal":"#E04E39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7418 C"}],"ideal":"#CD545B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7419 C"}],"ideal":"#B04A5A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7420 C"}],"ideal":"#9B2242"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7421 C"}],"ideal":"#651D32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"182 C"}],"ideal":"#FABBCB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"183 C"}],"ideal":"#FC9BB3"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"184 C"}],"ideal":"#F65275"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"185 C"}],"ideal":"#E4002B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"186 C"}],"ideal":"#C8102E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"187 C"}],"ideal":"#A6192E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"188 C"}],"ideal":"#76232F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"196 C"}],"ideal":"#ECC7CD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"197 C"}],"ideal":"#E89CAE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"198 C"}],"ideal":"#DF4661"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"199 C"}],"ideal":"#D50032"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"200 C"}],"ideal":"#BA0C2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"201 C"}],"ideal":"#9D2235"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"202 C"}],"ideal":"#862633"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"189 C"}],"ideal":"#F8A3BC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"190 C"}],"ideal":"#F67599"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"191 C"}],"ideal":"#EF426F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"192 C"}],"ideal":"#E40046"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"193 C"}],"ideal":"#BF0D3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"194 C"}],"ideal":"#9B2743"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"195 C"}],"ideal":"#782F40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1895 C"}],"ideal":"#F5B6CD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1905 C"}],"ideal":"#F59BBB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1915 C"}],"ideal":"#EF4A81"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1925 C"}],"ideal":"#E0004D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1935 C"}],"ideal":"#C5003E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1945 C"}],"ideal":"#A6093D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1955 C"}],"ideal":"#8A1538"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"705 C"}],"ideal":"#F5DADF"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"706 C"}],"ideal":"#F7CED7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"707 C"}],"ideal":"#F9B5C4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"708 C"}],"ideal":"#F890A5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"709 C"}],"ideal":"#EF6079"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"710 C"}],"ideal":"#E03E52"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"711 C"}],"ideal":"#CB2C30"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"698 C"}],"ideal":"#F2D4D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"699 C"}],"ideal":"#F4C3CC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"700 C"}],"ideal":"#F2ACB9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"701 C"}],"ideal":"#E68699"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"702 C"}],"ideal":"#D25B73"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"703 C"}],"ideal":"#B83A4B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"704 C"}],"ideal":"#9E2A2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"203 C"}],"ideal":"#ECB3CB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"204 C"}],"ideal":"#E782A9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"205 C"}],"ideal":"#E0457B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"206 C"}],"ideal":"#CE0037"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"207 C"}],"ideal":"#A50034"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"208 C"}],"ideal":"#861F41"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"209 C"}],"ideal":"#6F263D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"210 C"}],"ideal":"#F99FC9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"211 C"}],"ideal":"#F57EB6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"212 C"}],"ideal":"#F04E98"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"213 C"}],"ideal":"#E31C79"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"214 C"}],"ideal":"#CE0F69"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"215 C"}],"ideal":"#AC145A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"216 C"}],"ideal":"#7D2248"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7422 C"}],"ideal":"#F4CDD4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7423 C"}],"ideal":"#E06287"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7424 C"}],"ideal":"#E24585"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7425 C"}],"ideal":"#B52555"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7426 C"}],"ideal":"#A4123F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7427 C"}],"ideal":"#971B2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7428 C"}],"ideal":"#6A2C3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7632 C"}],"ideal":"#D6C9CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7633 C"}],"ideal":"#C4A4A7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7634 C"}],"ideal":"#C16784"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7635 C"}],"ideal":"#C63663"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7636 C"}],"ideal":"#BC204B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7637 C"}],"ideal":"#912F46"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7638 C"}],"ideal":"#7E2D40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"217 C"}],"ideal":"#EABEDB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"218 C"}],"ideal":"#E56DB1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"219 C"}],"ideal":"#DA1884"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"220 C"}],"ideal":"#A50050"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"221 C"}],"ideal":"#910048"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"222 C"}],"ideal":"#6C1D45"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7639 C"}],"ideal":"#936D73"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7640 C"}],"ideal":"#934054"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7641 C"}],"ideal":"#8E2C48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7642 C"}],"ideal":"#732E4A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7643 C"}],"ideal":"#672E45"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7644 C"}],"ideal":"#582D40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7645 C"}],"ideal":"#502B3A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"223 C"}],"ideal":"#EF95CF"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"224 C"}],"ideal":"#EB6FBD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"225 C"}],"ideal":"#DF1995"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"226 C"}],"ideal":"#D0006F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"227 C"}],"ideal":"#AA0061"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"228 C"}],"ideal":"#890C58"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"229 C"}],"ideal":"#672146"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"230 C"}],"ideal":"#F4A6D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"231 C"}],"ideal":"#F277C6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"232 C"}],"ideal":"#E93CAC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"233 C"}],"ideal":"#C6007E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"234 C"}],"ideal":"#A20067"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"235 C"}],"ideal":"#840B55"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"670 C"}],"ideal":"#EAD3E2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"671 C"}],"ideal":"#E6BCD8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"672 C"}],"ideal":"#DFA0C9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"673 C"}],"ideal":"#D986BA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"674 C"}],"ideal":"#C6579A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"675 C"}],"ideal":"#AE2573"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"676 C"}],"ideal":"#960051"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"677 C"}],"ideal":"#E5CEDB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"678 C"}],"ideal":"#E3C8D8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"679 C"}],"ideal":"#DEBED2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"680 C"}],"ideal":"#C996B6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"681 C"}],"ideal":"#B06C96"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"682 C"}],"ideal":"#994878"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"683 C"}],"ideal":"#7C2855"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"684 C"}],"ideal":"#E4C6D4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"685 C"}],"ideal":"#DCB6C9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"686 C"}],"ideal":"#D0A1BA"}
{"input":[{"role"…
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples. We hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

"Job Title to SOC Title Classifier"

### Eval description

This evaluation involves a machine learning model trained to classify
job titles into their relevant Standard Occupational Classification
(SOC) title from the Bureau of Labor Statistics (BLS). The model uses
historical job title data and associated SOC titles to accurately
predict the SOC title for any given job title.
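
As a rough illustration of the sample format this description implies,
the sketch below builds chat-style samples from (job title, SOC title)
pairs. The prompt wording, `make_sample`, and `to_jsonl` are hypothetical
names chosen for this example and are not taken from the submitted data
files; the two pairs come from the samples listed in this PR.

```python
import json

# Hypothetical system prompt, paraphrased for illustration only.
SYSTEM_PROMPT = (
    "You are an expert in Standard Occupational Classification (SOC) titles "
    "issued by the Bureau of Labor Statistics. When given a job title, "
    "respond with the correct BLS title."
)


def make_sample(job_title: str, soc_title: str) -> dict:
    """Build one chat-format sample: system prompt, user question, ideal answer."""
    return {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"What is the SOC title for the job title '{job_title}'?",
            },
        ],
        # Strip stray whitespace so string comparison against the answer is not thrown off.
        "ideal": soc_title.strip(),
    }


def to_jsonl(pairs) -> str:
    """Serialize (job_title, soc_title) pairs as one JSON object per line."""
    return "\n".join(json.dumps(make_sample(job, soc)) for job, soc in pairs)


if __name__ == "__main__":
    print(to_jsonl([
        ("Bread Baker", "Bakers"),
        ("Golf Course Ranger", "Amusement and Recreation Attendants"),
    ]))
```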

### What makes this a useful eval?

This evaluation is valuable because an accurate SOC title links a job
title to occupation-level data. Once job titles are classified into
their relevant SOC titles, we can access and leverage related data from
resources like O*NET, BLS, and census data. This is particularly useful
in labor market analyses, economic research, HR analytics, and other
fields. With an accurate SOC title classification, we can also study
trends, make predictions, and generate insights about various
occupations, which could benefit job seekers, employers, policymakers,
and researchers alike.


## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.
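
To make the "correct answer" criterion above concrete, here is a minimal
sketch of scoring completions against ideal answers by normalized string
comparison. It is an assumption for illustration, not the grading logic
of any existing eval class; `normalize`, `is_match`, and `accuracy` are
hypothetical helpers.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so formatting differences are not scored as errors."""
    return " ".join(text.split()).casefold()


def is_match(completion: str, ideal: str) -> bool:
    """True when the completion equals the ideal answer after normalization."""
    return normalize(completion) == normalize(ideal)


def accuracy(completions, ideals) -> float:
    """Fraction of completions that match their ideal answer."""
    if len(completions) != len(ideals):
        raise ValueError("completions and ideals must have the same length")
    if not ideals:
        return 0.0
    hits = sum(is_match(c, i) for c, i in zip(completions, ideals))
    return hits / len(ideals)


if __name__ == "__main__":
    # The leading space in the first ideal answer is ignored by normalize().
    print(accuracy(["Bakers", "Chefs"], [" Bakers", "Tour Guides and Escorts"]))
```

Normalizing both sides is a design choice: it keeps a leading space or a
case difference in an ideal answer from being counted as a wrong
classification.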

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)
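
Before placing a samples file under `evals/registry/data/{name}`, it can
help to check that every line is a complete JSON object with the
expected fields. The sketch below is one hedged way to do that;
`validate_samples` is a hypothetical helper, and it assumes each sample
has an `input` list of role/content messages and a single string
`ideal`.

```python
import json


def validate_samples(jsonl_text: str) -> list[str]:
    """Return human-readable problems; an empty list means every line passed."""
    problems = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(sample, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        messages = sample.get("input")
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: 'input' must be a non-empty list")
        elif any(
            not isinstance(m, dict) or "role" not in m or "content" not in m
            for m in messages
        ):
            problems.append(f"line {lineno}: each message needs 'role' and 'content'")
        ideal = sample.get("ideal")
        if not isinstance(ideal, str) or not ideal:
            problems.append(f"line {lineno}: 'ideal' must be a non-empty string")
        elif ideal != ideal.strip():
            problems.append(f"line {lineno}: 'ideal' has leading or trailing whitespace")
    return problems


if __name__ == "__main__":
    demo = '{"input": [{"role": "user", "content": "q"}], "ideal": " Bakers"}'
    for problem in validate_samples(demo):
        print(problem)
```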

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data available under the same MIT license as this repository. You
must have adequate rights to upload any data used in an Eval. OpenAI
reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Metal Worker'"}], "ideal": " Metal Workers and Plastic Workers, All Other"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Bread Baker'"}], "ideal": " Bakers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Malt Liquors Sales Supervisor'"}], "ideal": " First-Line Supervisors of Non-Retail Sales Workers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Duck Driver'"}], "ideal": " Tour Guides and Escorts"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Architect Specialist'"}], "ideal": " Marine Engineers and Naval Architects"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Golf Course Ranger'"}], "ideal": " Amusement and Recreation Attendants"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Sewing Supervisor'"}], "ideal": " First-Line Supervisors of Production and Operating Workers"}
{"input": [{"role": "system", "content": "You are an expert in Standard
Occupation Code (SOC) labor classifications issued by the Bureau of
Labor Statistics. When give a job title to classify, respond with the
correct BLS Title classification"}, {"role": "user", "content": "What is
the SOC Code for the job title Screener and Blender'"}], "ideal": "
Mixing and Blending Machine Setters, Operators, and Tenders"}
{"input": [{"role": "system", "content": "You are an expert in Standard
Occupation Code (SOC) labor classifications issued by the Bureau of
Labor Statistics. When give a job title to classify, respond with the
correct BLS Title classification"}, {"role": "user", "content": "What is
the SOC Code for the job title Field Marketing Representative'"}],
"ideal": " Sales Engineers"}
{"input": [{"role": "system", "content": "You are an expert in Standard
Occupation Code (SOC) labor classifications issued by the Bureau of
Labor Statistics. When give a job title to classify, respond with the
correct BLS Title classification"}, {"role": "user", "content": "What is
the SOC Code for the job title Surveyor'"}], "ideal": " Surveyors"}
  ```
</details>

# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

korean-phonetics

### Eval description
The eval assesses the model's proficiency in identifying phonetic
transcriptions of Korean words. The model is given [word, phonetic
transcription] pairs, and accuracy is measured with the `Match` eval
class. The phonetic transcriptions were taken from one of the most
commonly used online dictionaries, by Naver: [Naver Korean
Dictionary](https://ko.dict.naver.com)
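
As a rough illustration of how a `Match`-style check could score one of these samples, the sketch below compares a completion against the ideal answer by prefix. The helper name `matches_ideal` is hypothetical, and the prefix rule is an assumption about how the basic `Match` class behaves, not a copy of its implementation.

```python
import json


def matches_ideal(completion: str, ideal) -> bool:
    """Return True if the completion starts with any ideal answer (assumed Match rule)."""
    ideals = [ideal] if isinstance(ideal, str) else list(ideal)
    return any(completion.startswith(candidate) for candidate in ideals)


# One sample in the same shape as the JSONL lines in this PR (system prompt omitted).
sample = json.loads('{"input": [{"role": "user", "content": "값, 갑"}], "ideal": "yes"}')

print(matches_ideal("yes", sample["ideal"]))  # True
print(matches_ideal("no", sample["ideal"]))   # False
```

Because the system prompt asks for exactly 'yes' or 'no', a prefix comparison like this is enough to separate correct from incorrect answers.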

### What makes this a useful eval?
Accurately representing and recognizing phonetic transcription of Korean
words is important for several reasons:

1. Pronunciation Accuracy: Phonetic transcription helps in accurately
representing the sounds of Korean words. By understanding and
recognizing the correct phonetic transcription, learners can ensure they
pronounce the words correctly, which is crucial for effective
communication in Korean.
2. Language Standardization: Phonetic transcription plays a role in
standardizing the pronunciation of Korean words. By adhering to a
consistent system, it helps maintain clarity and avoids
misinterpretation of words, especially in educational materials,
dictionaries, and linguistic research.
3. Linguistic Analysis: For linguists and researchers studying the
Korean language, phonetic transcription provides a precise way to
analyze and compare different speech sounds. It aids in phonological
studies, dialect research, and language documentation.

In summary, accurate representation and recognition of phonetic
transcription in Korean contribute to improved pronunciation, effective
language learning, better communication, standardization, and linguistic
analysis of the language.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [X] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [X] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [X] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [X] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [X] Check that your data is in `evals/registry/data/{name}`
- [X] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [X] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [X] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [X] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [X] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [X] I have filled out all required fields of this form
- [X] I have used **Git LFS** for the Eval JSON data
- [X] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"여덟, 여덜"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"값, 갑"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"닭, 닥"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"앉아, 안자"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"젊어, 절머"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a
phonetic alphabet called hangul. You will be given a pair of Korean
words. Is the second word the correct phonetic transcription of the
first word? Answer with exactly one of the following: 'yes' or 'no'.
Don't add anything else to the response."}, {"role": "user", "content":
"겉옷, 거돋"}], "ideal": "yes"}
  ```
</details>

Co-authored-by: Lena H <[email protected]>
Return bootstrap acc std
# Thank you for contributing an eval! ♥️

## Eval details 📑

### Eval name

path_enclosed_area

### Eval description

This eval tests the model's ability to calculate the total area enclosed
by a path walked on a flat plane; the path moves only north, south,
east, or west. These problems are extremely simple for any human, but
the model has a lot of difficulty with them.

The paths were hand-constructed to test a variety of scenarios,
including:
- One path with multiple discrete enclosed areas
- One path with zero enclosed areas
- Irrelevant segmentation of an enclosed area into smaller enclosed
areas
- Going back and forth across the same segment
- Different initial directions
- Symmetrical and asymmetrical paths
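
The expected answers for paths like these can be checked mechanically. The sketch below is one possible approach, not necessarily how the eval's answers were produced: it treats every unit segment walked as a wall, flood-fills the grid from outside the path, and counts the unit squares the fill cannot reach. `parse_path` and `enclosed_area` are hypothetical helper names, and the code assumes every move is a whole number of miles.

```python
import re

DIRS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def parse_path(text):
    """Extract (distance, direction) steps from a prompt such as '5 miles north, 3 miles east'."""
    pattern = r"(\d+)\s+miles?\s+(north|south|east|west)"
    return [(int(n), d) for n, d in re.findall(pattern, text)]


def enclosed_area(steps):
    """Count unit squares that cannot be reached from outside the walked path."""
    x = y = 0
    points = [(0, 0)]
    walls = set()  # each unit segment walked, stored as an unordered pair of endpoints
    for distance, direction in steps:
        dx, dy = DIRS[direction]
        for _ in range(distance):
            nx, ny = x + dx, y + dy
            walls.add(frozenset(((x, y), (nx, ny))))
            x, y = nx, ny
            points.append((x, y))

    # Pad the bounding box by one cell so the outside forms a connected ring.
    min_x = min(p[0] for p in points) - 1
    max_x = max(p[0] for p in points) + 1
    min_y = min(p[1] for p in points) - 1
    max_y = max(p[1] for p in points) + 1

    # A cell (cx, cy) is the unit square whose lower-left corner is (cx, cy).
    start = (min_x, min_y)
    seen = {start}
    stack = [start]
    while stack:
        cx, cy = stack.pop()
        neighbours = (
            ((cx + 1, cy), ((cx + 1, cy), (cx + 1, cy + 1))),  # cross right edge
            ((cx - 1, cy), ((cx, cy), (cx, cy + 1))),          # cross left edge
            ((cx, cy + 1), ((cx, cy + 1), (cx + 1, cy + 1))),  # cross top edge
            ((cx, cy - 1), ((cx, cy), (cx + 1, cy))),          # cross bottom edge
        )
        for cell, edge in neighbours:
            inside_box = min_x <= cell[0] < max_x and min_y <= cell[1] < max_y
            if inside_box and cell not in seen and frozenset(edge) not in walls:
                seen.add(cell)
                stack.append(cell)

    total_cells = (max_x - min_x) * (max_y - min_y)
    return total_cells - len(seen)


square = "I walk 1 mile south, 1 mile east, 1 mile north, 1 mile west."
print(enclosed_area(parse_path(square)))  # 1
```

Segments that are walked twice or that dangle off a loop add no area, which covers the back-and-forth and zero-area scenarios listed above.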

### What makes this a useful eval?

This kind of geometric reasoning and calculation is important for simple
tasks across mathematics, game design, engineering, and various other
fields.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "On a flat plane, I walk 5
miles north, 3 miles east, 2 miles south, 2 miles east, 4 miles south, 6
miles west. What is the area, in square miles, of the area that is
completely enclosed by my path? If no area is completely enclosed, then
there are 0 square miles enclosed. If multiple discrete areas are
enclosed, then please sum their areas for the final answer. Explain your
reasoning step-by-step, and then provide your final answer in the exact
following format surrounded by brackets: \"[X square miles]\" where X is
the integer number of total square miles enclosed by the path, and where
miles should be written as mile if X=1."}], "ideal": "[0 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 3
miles east, 3 miles south, 2 miles east, 3 miles west, 1 mile east, 5
miles north, 4 miles west. What is the area, in square miles, of the
area that is completely enclosed by my path? If no area is completely
enclosed, then there are 0 square miles enclosed. If multiple discrete
areas are enclosed, then please sum their areas for the final answer.
Explain your reasoning step-by-step, and then provide your final answer
in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[0 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 5
miles north, 3 miles east, 2 miles south, 1 mile east, 2 miles south, 5
miles west. What is the area, in square miles, of the area that is
completely enclosed by my path? If no area is completely enclosed, then
there are 0 square miles enclosed. If multiple discrete areas are
enclosed, then please sum their areas for the final answer. Explain your
reasoning step-by-step, and then provide your final answer in the exact
following format surrounded by brackets: \"[X square miles]\" where X is
the integer number of total square miles enclosed by the path, and where
miles should be written as mile if X=1."}], "ideal": "[14 square
miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 2
miles south, 3 miles west, 2 miles north, 2 miles east, 2 miles south, 1
mile west, 2 miles north, 2 miles east. What is the area, in square
miles, of the area that is completely enclosed by my path? If no area is
completely enclosed, then there are 0 square miles enclosed. If multiple
discrete areas are enclosed, then please sum their areas for the final
answer. Explain your reasoning step-by-step, and then provide your final
answer in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[6 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 3
miles south, 2 miles east, 1 mile north, 3 miles west, 1 mile north, 2
miles east, 4 miles south. What is the area, in square miles, of the
area that is completely enclosed by my path? If no area is completely
enclosed, then there are 0 square miles enclosed. If multiple discrete
areas are enclosed, then please sum their areas for the final answer.
Explain your reasoning step-by-step, and then provide your final answer
in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[4 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
west, 1 mile north, 1 mile east, 1 mile south, 1 mile east, 1 mile
north, 1 mile west, 2 miles north. What is the area, in square miles, of
the area that is completely enclosed by my path? If no area is
completely enclosed, then there are 0 square miles enclosed. If multiple
discrete areas are enclosed, then please sum their areas for the final
answer. Explain your reasoning step-by-step, and then provide your final
answer in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[2 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
south, 1 mile east, 1 mile south, 1 mile east, 1 mile south, 1 mile
east, 2 miles north, 1 mile east, 1 mile north, 4 miles west. What is
the area, in square miles, of the area that is completely enclosed by my
path? If no area is completely enclosed, then there are 0 square miles
enclosed. If multiple discrete areas are enclosed, then please sum their
areas for the final answer. Explain your reasoning step-by-step, and
then provide your final answer in the exact following format surrounded
by brackets: \"[X square miles]\" where X is the integer number of total
square miles enclosed by the path, and where miles should be written as
mile if X=1."}], "ideal": "[7 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
west, 1 mile south, 1 mile north, 2 miles east, 1 mile north, 3 miles
west, 3 miles south, 1 mile east, 1 mile south, 1 mile west, 2 miles
north. What is the area, in square miles, of the area that is completely
enclosed by my path? If no area is completely enclosed, then there are 0
square miles enclosed. If multiple discrete areas are enclosed, then
please sum their areas for the final answer. Explain your reasoning
step-by-step, and then provide your final answer in the exact following
format surrounded by brackets: \"[X square miles]\" where X is the
integer number of total square miles enclosed by the path, and where
miles should be written as mile if X=1."}], "ideal": "[1 square mile]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
north, 2 miles east, 1 mile north, 1 mile east, 3 miles south, 2 miles
west, 2 miles north. What is the area, in square miles, of the area that
is completely enclosed by my path? If no area is completely enclosed,
then there are 0 square miles enclosed. If multiple discrete areas are
enclosed, then please sum their areas for the final answer. Explain your
reasoning step-by-step, and then provide your final answer in the exact
following format surrounded by brackets: \"[X square miles]\" where X is
the integer number of total square miles enclosed by the path, and where
miles should be written as mile if X=1."}], "ideal": "[5 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
east, 1 mile north, 1 mile west, 1 mile north, 1 mile east, 1 mile
north, 1 mile west, 1 mile north, 1 mile east, 5 miles south. What is
the area, in square miles, of the area that is completely enclosed by my
path? If no area is completely enclosed, then there are 0 square miles
enclosed. If multiple discrete areas are enclosed, then please sum their
areas for the final answer. Explain your reasoning step-by-step, and
then provide your final answer in the exact following format surrounded
by brackets: \"[X square miles]\" where X is the integer number of total
square miles enclosed by the path, and where miles should be written as
mile if X=1."}], "ideal": "[2 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 5
miles south, 5 miles east, 4 miles north, 4 miles west, 3 miles south, 3
miles east, 2 miles north, 2 miles west, 1 mile south, 1 mile east, 3
miles north. What is the area, in square miles, of the area that is
completely enclosed by my path? If no area is completely enclosed, then
there are 0 square miles enclosed. If multiple discrete areas are
enclosed, then please sum their areas for the final answer. Explain your
reasoning step-by-step, and then provide your final answer in the exact
following format surrounded by brackets: \"[X square miles]\" where X is
the integer number of total square miles enclosed by the path, and where
miles should be written as mile if X=1."}], "ideal": "[8 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
west, 1 mile south, 1 mile north, 1 mile south, 1 mile east, 1 mile
south, 1 mile west, 1 mile south, 2 miles east, 3 miles north. What is
the area, in square miles, of the area that is completely enclosed by my
path? If no area is completely enclosed, then there are 0 square miles
enclosed. If multiple discrete areas are enclosed, then please sum their
areas for the final answer. Explain your reasoning step-by-step, and
then provide your final answer in the exact following format surrounded
by brackets: \"[X square miles]\" where X is the integer number of total
square miles enclosed by the path, and where miles should be written as
mile if X=1."}], "ideal": "[0 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 4
miles north, 2 miles south, 2 miles west, 4 miles east, 1 mile west, 1
mile north, 2 miles west, 2 miles south, 1 mile east. What is the area,
in square miles, of the area that is completely enclosed by my path? If
no area is completely enclosed, then there are 0 square miles enclosed.
If multiple discrete areas are enclosed, then please sum their areas for
the final answer. Explain your reasoning step-by-step, and then provide
your final answer in the exact following format surrounded by brackets:
\"[X square miles]\" where X is the integer number of total square miles
enclosed by the path, and where miles should be written as mile if
X=1."}], "ideal": "[3 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 2
miles south, 1 mile east, 1 mile north, 1 mile south, 1 mile east, 1
mile north, 1 mile south, 1 mile east, 2 miles north, 1 mile east, 1
mile south, 3 miles west. What is the area, in square miles, of the area
that is completely enclosed by my path? If no area is completely
enclosed, then there are 0 square miles enclosed. If multiple discrete
areas are enclosed, then please sum their areas for the final answer.
Explain your reasoning step-by-step, and then provide your final answer
in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[3 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 3
miles east, 1 mile south, 1 mile west, 1 mile south, 1 mile east, 1 mile
south, 3 miles west, 1 mile north, 1 mile east, 1 mile north, 1 mile
west, 2 miles north, 1 mile west. What is the area, in square miles, of
the area that is completely enclosed by my path? If no area is
completely enclosed, then there are 0 square miles enclosed. If multiple
discrete areas are enclosed, then please sum their areas for the final
answer. Explain your reasoning step-by-step, and then provide your final
answer in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[7 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
south, 1 mile east, 1 mile north, 2 miles west, 2 miles south, 3 miles
east, 2 miles north, 1 mile west. What is the area, in square miles, of
the area that is completely enclosed by my path? If no area is
completely enclosed, then there are 0 square miles enclosed. If multiple
discrete areas are enclosed, then please sum their areas for the final
answer. Explain your reasoning step-by-step, and then provide your final
answer in the exact following format surrounded by brackets: \"[X square
miles]\" where X is the integer number of total square miles enclosed by
the path, and where miles should be written as mile if X=1."}], "ideal":
"[6 square miles]"}
{"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile
south, 1 mile east, 1 mile north, 1 mile west. What is the area, in
square miles, of the area that is completely enclosed by my path? If no
area is completely enclosed, then there are 0 square miles enclosed. If
multiple discrete areas are enclosed, then please sum their areas for
the final answer. Explain your reasoning step-by-step, and then provide
your final answer in the exact following format surrounded by brackets:
\"[X square miles]\" where X is the integer number of total square miles
enclosed by the path, and where miles should be written as mile if
X=1."}], "ideal": "[1 square mile]"}
  ```
</details>

Co-authored-by: Ahmed Allawi <[email protected]>
# Thank you for contributing an eval! ♥️

## Eval details 📑
### Eval name
Consensus Summary

### Eval description

Tests the model's ability to produce a scientific consensus in response
to a scientific inquiry, using the provided claims.

### What makes this a useful eval?

This is a useful eval because it evaluates the model's ability to
produce a scientific consensus in response to a given set of claims.
This is important because scientific consensus is the result of multiple
studies and data that may or may not support the same conclusion. A
model that can accurately produce scientific consensus can help in
making informed decisions and policies based on scientific evidence.
Hence, evaluating a model's ability to produce a scientific consensus
using the Consensus Summary eval can be useful in assessing its
reliability and potential for practical applications.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data 

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: Two doses of mRNA covid-19 vaccines
were observed to be highly effective against symptomatic infection and
severe outcomes.\nclaim: COVID-19 vaccines currently authorized in the
United States are highly effective in preventing COVID-19-associated
hospitalizations in older adults.\nclaim: In summary, vaccines are a
powerful tool that can be used to control the COVID-19 pandemic, with
high efficacy and tolerable ADRs.\nclaim: Conclusion Overall, we
conclude that vaccination against COVID-19 in patients with active
malignancies using activated and inactivated vaccines is a safe and
tolerable procedure that is also accompanied by a high efficacy.\nclaim:
COVID-19 vaccines provide good protection against COVID-19 presentation
at primary care/outpatient level, particularly among fully vaccinated
individuals.\nquestion: are covid-19 vaccines effective?"}], "ideal":
"Summary: Covid-19 vaccines are highly effective at protecting against
infection and hospitalization."}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: Lower zinc is a hallmark of
depression, while increments in serum zinc and attenuation of the
immune-inflammatory response during treatment appear to play a role in
the clinical efficacy of sertraline.\nclaim: An increase in dietary zinc
and higher plasma zinc levels may reduce the risk of depressive
symptoms.\nclaim: Although decreased zinc levels have been implicated in
the genesis of depression in animal models and in major depressive
disorder in humans, this study provides the first evidence of a role for
zinc in depression in people with dementia and highlights zinc
metabolism as a therapeutic target.\nclaim: The results of this study
show that long-term intake of zinc may modulate symptoms of
depression.\nclaim: The reported results indicated that the serum zinc
level might be a marker of depression as a state (state marker) in
treatment responsive patients.\nquestion: can zinc help treat
depression?"}], "ideal": "Summary: All of these studies suggest that low
zinc levels are a marker of depression and that intake of zinc may have
the ability to help reduce symptoms of depression"}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: The findings suggest that the
following characteristics of the founder significantly influence the
success potential of an incubated venture: entrepreneurial personality,
motivation for starting the venture, managerial skills, and approach
towards innovation.\nclaim: Using a sample of 384 entrepreneurs selected
from the two leading business districts in Uganda, we observe that
optimism is the component of psychological capital that significantly
moderates the relationship between startup capital and entrepreneurial
success.\nclaim: Both startup capital and psychological capital are
significant predictors of entrepreneurial success; however,
psychological capital is the better predictor.\nclaim: Entrepreneurially
self\u2010efficacious founder/managers may help improve the performance
of very young firms but such benefits dissipate over time.\nclaim: This
finding indicates that the entrepreneurial team\u2019s startup
experience plays stronger roles in venturing profitable startups when
the amount of financial resources and initial firm size are small;
however, the team\u2019s startup experience and intangible resources
have positive interaction effects on new-born startups\u2019
profitability.\nquestion: what predicts success as a startup
founder?"}], "ideal": "Summary: Things like entrepreneurial personality,
motivation for starting the venture, managerial skills, previous
start-up experience, startup and psychological capital and optimism all
predict success as a startup founder"}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: While homelessness is ultimately the
result of a severe and chronic shortage of affordable housing, creating
accessible, safe, pet-friendly shelter and safe haven options and
instituting a smoother, more transparent process for moving from the
streets could substantially reduce street homelessness.\nclaim: - To
prevent the revolving door to homelessness, it is necessary to remove
the barriers that hinder access to normal health resources which are
experienced by people suffering from social exclusion, while
implementing ongoing support programmes for homeless people or those at
risk of homelessness, which primarily deal with health issues.\nclaim:
We conclude that overcoming homelessness requires policies and practices
that give a greater focus to non-material aspects of homelessness
through an emphasis on empowerment, self-respect and autonomy.\nclaim:
This finding suggests that homelessness can be reduced by appropriate
clinical interventions if housing is available.\nclaim: For homelessness
prevention, systematic and outreach social medical care before and
during homelessness should be provided.\nquestion: What are effective
ways to prevent homelessness?"}], "ideal": "Summary: Ways to prevent
homelessness include creating accessible, safe shelter and safe haven
options, removing barriers to health resources, giving a greater focus
to non-material aspects of homelessness, and providing systematic and
outreach social medical care."}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: While homelessness is ultimately the
result of a severe and chronic shortage of affordable housing, creating
accessible, safe, pet-friendly shelter and safe haven options and
instituting a smoother, more transparent process for moving from the
streets could substantially reduce street homelessness.\nclaim: - To
prevent the revolving door to homelessness, it is necessary to remove
the barriers that hinder access to normal health resources which are
experienced by people suffering from social exclusion, while
implementing ongoing support programmes for homeless people or those at
risk of homelessness, which primarily deal with health issues.\nclaim:
We conclude that overcoming homelessness requires policies and practices
that give a greater focus to non-material aspects of homelessness
through an emphasis on empowerment, self-respect and autonomy.\nclaim:
This finding suggests that homelessness can be reduced by appropriate
clinical interventions if housing is available.\nclaim: For homelessness
prevention, systematic and outreach social medical care before and
during homelessness should be provided.\nquestion: How to prevent
homelessness?"}], "ideal": "Summary: Ways to prevent homelessness
include creating accessible, safe shelter and safe haven options,
removing barriers to health resources, giving a greater focus to
non-material aspects of homelessness, and providing systematic and
outreach social medical care."}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: The findings revealed that the
factor that contributes the most to entrepreneurship intention is Locus
of control, followed by Need of Achievement and Subjective
Norms.\nclaim: It was found that entrepreneurial skill, environmental
factors and entrepreneurial orientation have a positive influence on
entrepreneurial intention.\nclaim: The findings indicate that
entrepreneurial motivation has a significant correlation with
entrepreneurial intention and its three determinants, social valuation
of entrepreneurship, having entrepreneurial role models, knowledge of
entrepreneurial support and perceived barriers to starting a
business.\nclaim: Research finding revealed that entrepreneurial
intention is indirectly affected by entrepreneurship education, meaning
that students\u2019 entrepreneurial motivation and attitude are two
important mediating variables.\nclaim: Findings confirm the influence of
individual and socio-cultural factors on entrepreneurial
intention.\nquestion: What are the factors of entrepreneurship
intention"}], "ideal": "Summary: Studies find that intrinsic factors,
such as entrepreneurial skill and motivation, as well as extrinsic
variables, such as the environmental support of entrepreneurship,
mediate entrepreneurship intention."}
{"input": [{"role": "system", "content": "Generate a brief answer using
only the provided claims, with no personal opinions or outside
knowledge. If there is no answer based on the claims, write 'N-A'."},
{"role": "user", "content": "claim: The results show that digital
agriculture is able to help users to increase productivity in a
sustainable way.\nclaim: Digital agriculture technologies continue the
centralization of economic knowledge and power as they facilitate the
transformation of vast territories into \u201coperational
landscapes\u201d that provide the material, energy, and labor for a
rapidly expanding urban system.\nclaim: The digital agriculture system
is an effective tool for insurance industry to use to develop a
dynamical business plan for the changing climate.\nclaim: The technical
fitting-out of agriculture in the digital economy should be considered
as a set of measures to prepare the industry for the production of
high-quality products, which implies the use of digital technologies
that minimize human participation in the production process.\nclaim:
Consequently, the initial Mobile-based Information System evolved into a
Digital Knowledge Ecosystem that can predict current production
situation in near real enabling government agencies to dynamically
adjust the incentives offered to farmers for growing different types of
crops to achieve sustainable agriculture production through crop
diversification.\nquestion: What is digital agriculture?"}], "ideal":
"Summary: N-A"}
  ```
</details>
danesherbs and others added 13 commits January 3, 2024 10:45
**What:** Adds support for `gpt-3.5-turbo-16k` to
`n_ctx_from_model_name`.
**Why:** Currently `n_ctx_from_model_name` returns 4096 for
`gpt-3.5-turbo-16k`.
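
A minimal sketch of how a lookup like this can go wrong and how it can be fixed, assuming the lookup is prefix-based (the PR text does not say so). The table, the function name `n_ctx_sketch`, and the 16384 value are illustrative assumptions, not the repository's code. Matching the longest prefix first keeps `gpt-3.5-turbo` from shadowing `gpt-3.5-turbo-16k`.

```python
from typing import Optional

# Illustrative table only. 4096 is the value mentioned in the PR text;
# 16384 is an assumed value for the 16k variant.
CONTEXT_SIZES = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
}


def n_ctx_sketch(model_name: str) -> Optional[int]:
    """Return the context size of the longest matching prefix, or None."""
    # Longest prefixes are tried first so that more specific names win.
    for prefix in sorted(CONTEXT_SIZES, key=len, reverse=True):
        if model_name.startswith(prefix):
            return CONTEXT_SIZES[prefix]
    return None
```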

Co-authored-by: Ian McKenzie <[email protected]>
**What:** Adds a recorder for function calls made by models.
**Why:** Currently function calls can be logged using `record_event`, but
it'd be convenient for function calls to be logged consistently.
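
As a rough illustration of the idea, a dedicated helper can wrap the generic event call so every function call is logged with the same fields. `ToyRecorder`, `record_function_call`, and the field names below are hypothetical stand-ins, not the recorder API added by this commit.

```python
from typing import Any, Dict, List


class ToyRecorder:
    """Hypothetical stand-in for a recorder; not the evals API."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def record_event(self, type: str, data: Dict[str, Any]) -> None:
        # Generic logging: callers choose the event type and payload shape.
        self.events.append({"type": type, "data": data})

    def record_function_call(
        self, name: str, arguments: Dict[str, Any], return_value: Any
    ) -> None:
        # Dedicated helper: fixes the event type and payload keys so that
        # all function calls are recorded in one consistent shape.
        self.record_event(
            "function_call",
            {"name": name, "arguments": arguments, "return_value": return_value},
        )
```
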
Simple change to fix openai#1394.
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

japanese_prime_minister

### Eval description

Tests whether the model can calculate the number of days in office of
successive Japanese prime ministers and rank them by total days in
office.

### What makes this a useful eval?

The model mostly manages to calculate tenure, but fails when asked to
rank by it. Ranking seems to be in demand for many different things.
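
To make the task concrete, here is a small sketch of the computation the eval asks for: sum the days across all of a person's terms, then rank by the total. The names and dates are placeholders, not real data, and counting both the first and last day of a term is an assumption about the convention.

```python
from datetime import date

# Placeholder terms (hypothetical names and dates, not real data).
terms = [
    ("Leader A", date(2000, 1, 1), date(2001, 1, 1)),
    ("Leader B", date(2001, 1, 1), date(2001, 6, 30)),
    ("Leader A", date(2003, 1, 1), date(2003, 12, 31)),
]


def rank_by_total_days(terms):
    """Return (name, total_days) pairs sorted from longest to shortest."""
    totals = {}
    for name, start, end in terms:
        # Assumption: a term counts both its first and its last day.
        totals[name] = totals.get(name, 0) + (end - start).days + 1
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The ranking step depends on getting every total right first, which is one reason ranking can fail even when individual tenure calculations mostly succeed.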

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `mypy`, `black`,
`isort`, `autoflake` and `ruff` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"},
{"role": "user", "content": "通算在籍日数が1番目に長い総理大臣"}], "ideal": "安倍晋三"}
{"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"},
{"role": "user", "content": "通算在籍日数が2番目に長い総理大臣"}], "ideal": "桂太郎"}
{"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"},
{"role": "user", "content": "通算在籍日数が3番目に長い総理大臣"}], "ideal": "佐藤栄作"}
{"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"},
{"role": "user", "content": "通算在籍日数が4番目に長い総理大臣"}], "ideal": "伊藤博文"}
{"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"},
{"role": "user", "content": "通算在籍日数が5番目に長い総理大臣"}], "ideal": "吉田茂"}
  ```
</details>
With this improvement we now have a 0-shot performance of 59.6%
(averaged over 3 eval runs) on the MMMU validation set, which beats the
56.8% reported in the [MMMU paper](https://arxiv.org/pdf/2311.16502.pdf).
In [the previous PR](openai#1405) adding
the Theory of Mind eval, the `evals/registry/evals/theory_of_mind.yaml`
was mistakenly not added, so the eval couldn't be run. This PR adds this
file.

Test with:
```
oaieval gpt-3.5-turbo theory_of_mind
```
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow
the guidelines below will result in the PR being closed automatically**.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing
the task.

We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON
files, so please make sure that you move the JSON file to Git LFS before
submitting a PR. Details on how to use Git LFS are available
[here](https://git-lfs.com).

## Eval details 📑

### Eval name

icelandic-sentences-gec - Grammatical error correction for Icelandic
sentences

### Eval description

The eval contains Icelandic sentences with and without grammatical
errors, spelling errors or other linguistic errors. There are a total of
200 sentences, 100 with errors and 100 where these same errors have been
corrected. The model then predicts whether a particular sentence
contains an error or not, and accuracy is measured.
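
A minimal sketch of how accuracy could be measured for this setup, assuming exact matching against a list of accepted answers per sample (as in the `ideal` lists shown in the samples). The function name and the toy predictions are illustrative, and the eval class actually used is not stated here.

```python
def accuracy(predictions, ideals):
    """Fraction of predictions that exactly match one accepted answer."""
    correct = sum(
        prediction.strip() in accepted
        for prediction, accepted in zip(predictions, ideals)
    )
    return correct / len(ideals)


# Toy example: accepted answers per sample and hypothetical model outputs.
ideals = [["rangt", "Rangt"], ["rétt", "Rétt"], ["rétt", "Rétt"]]
predictions = ["Rangt", "rangt", "rétt"]
```

With 100 erroneous and 100 corrected sentences, a model that always gives the same answer would score 50% under this measure.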

### What makes this a useful eval?

This is a good measure of the ability of a model to correct grammatical
errors in the Icelandic language. The sentences contain errors which go
against Icelandic language standards, and which a language expert of
Icelandic would correct before a text is published. In addition, it can
serve to measure the general linguistic competence of Icelandic. The
sentences are sourced from the web, and the test set of the Icelandic
Error Corpus (IceEC), which can be freely used.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please
document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and the high volume of submissions, we will
not be able to accept all submissions and thus not grant everyone who
opens a PR GPT-4 access. We know this is disappointing, but we hope to
set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access be
granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `mypy`, `black`,
`isort`, `autoflake` and `ruff` are running when I commit and push

Failure to fill out all required fields will result in the PR being
closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:

<details>
  <summary>View evals in JSON</summary>

  ### Eval
  ```jsonl
{"input": [{"role": "system", "content": "Þú ert sérfræðingur í
íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur
villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu,
stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu
'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að
gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role":
"user", "content": "Leiksviðið átti svo hug hans allann við heimkomuna,
og árið 1961 hófst sjónvarpsferill hans."}], "ideal": ["rangt",
"Rangt"]}
{"input": [{"role": "system", "content": "Þú ert sérfræðingur í
íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur
villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu,
stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu
'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að
gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role":
"user", "content": "Baráttusamtök frumbyggja í Hondúras, sem Caceres
átti þátt í að stofna, fagnaði dómsúrskurðinum í gær og sagði hann sigur
fyrir þjóðir Hondúras."}], "ideal": ["rangt", "Rangt"]}
{"input": [{"role": "system", "content": "Þú ert sérfræðingur í
íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur
villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu,
stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu
'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að
gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role":
"user", "content": "Sú var naumast býsperrt."}], "ideal": ["rangt",
"Rangt"]}
{"input": [{"role": "system", "content": "Þú ert sérfræðingur í
íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur
villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu,
stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu
'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að
gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role":
"user", "content": "Fólk er beðið um að fylgjast vel með veðurspám þar
sem breytingar gætu orðið þegar nær dregur."}], "ideal": ["rétt",
"Rétt"]}
{"input": [{"role": "system", "content": "Þú ert sérfræðingur í
íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur
villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu,
stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu
'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að
gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role":
"user", "content": "Gjaldmiðlasamningunum var ætlað að tryggja að Exista
gæti keypt gjaldeyri á fyrir fram ákveðnum dagsetningum á fyrir fram
ákveðnu gengi svo að félagið gæti greitt af skuldum sínum í erlendri
mynt með þeim hagnaði sem til varð í íslenskum krónum eins og segir í
grein Lýðs."}], "ideal": ["rétt", "Rétt"]}
  ```
</details>
(Not an eval)

**One-line summary**: Pre-commit hooks were failing. I identified the
main cause, and then fixed all secondary pre-commit issues. I only
changed the logic in one place, `oaievalset.py`.

I was having issues with type-hinting and identified that the old
`typings` directory was causing the `from openai import OpenAI` import
to register as an error. I decided to go through and fix all the issues
that appeared in `pre-commit run --all-files`.

NOTE: 
- I changed the logic in `oaievalset.py` by adding a `continue`
statement if an `eval` or `eval.key` was missing.
- As far as I can tell, this should basically never happen, but skipping
is the correct behavior if it does.
- Another option would be to assert that `eval` and `eval.key` are not
`None` but forcing an error here doesn't match what I interpret as
intended behavior.
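
The two options in the note above can be sketched as follows. `EvalSpecSketch` and `collect_keys` are hypothetical names for illustration only; this is not the code in `oaievalset.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalSpecSketch:
    """Hypothetical stand-in for an eval entry with an optional key."""

    key: Optional[str]


def collect_keys(specs: List[Optional[EvalSpecSketch]]) -> List[str]:
    keys: List[str] = []
    for spec in specs:
        # Skip incomplete entries instead of raising. This also narrows the
        # Optional types, which is what the type checker needs.
        if spec is None or spec.key is None:
            continue
        keys.append(spec.key)
    return keys
```

Replacing the `continue` with `assert spec is not None and spec.key is not None` would satisfy the type checker as well, but it would turn a missing entry into a hard failure.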

The manual work involved was mainly:

1. Deleting the `typings` directory, which was interfering with `openai`
type-hints (such as `from openai import OpenAI`)
2. Fixing type issues in `oaievalset.py`.
3. Moving the `client =
OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))` line below all the
imports.
4. Breaking lines of length >767 into smaller chunks using line
continuation.
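
As a rough illustration of step 4, a long string literal can be split without changing its value. The PR does not say which form of line continuation was used, so both common forms are shown here as assumptions; the prompt text is made up for the example.

```python
# The original, overly long single-line literal (made-up text).
single_line = "Answer 'correct' if the sentence has no errors, and 'incorrect' otherwise."

# Option 1: adjacent literals inside parentheses are joined into one string.
prompt_parenthesized = (
    "Answer 'correct' if the sentence has no errors, "
    "and 'incorrect' otherwise."
)

# Option 2: explicit backslash continuation between adjacent literals.
prompt_backslash = "Answer 'correct' if the sentence has no errors, " \
    "and 'incorrect' otherwise."
```

Both forms produce exactly the same string as the single-line version, so the change only affects line length and not behavior.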

Thus this PR is broken into three parts:

1. Deleting `typings` (first commit)
2. Manually cleaning up issues (middle commits)
3. Applying autofixes from the pre-commit hooks (last commit)
* zhishu completion function

* trial implementation of table_extract tasks

* bugfixes and add retrieve_native completion_fn

* add fuzzy_compare for table content

* add fuzzy_normalize for table headers

* add uni-finder completion_fn and separated format tests (json/csv)

* basic mlops loggers

* bugfixes on example showcase

* add rag to openai native completion_fns

* add RAG for match, modelgraded_classify, table_extract evals

* add scipaper_tag2mol, scipaper_hasmol, scipaper_targets and markush2mol evals

* add Chemistry evalset

* bugfixes

* table comparison with self-defined index

* fix table extraction with detailed csv text processing and edit-distance comparison

* fix match_field compare logic to edit-distance

* fixes on data and details for good scipaper_affinity performance

* update uni_finder api with pdf_parse_mode

* update Zhishu completion_fn with common chat (no file_link) support

* split test sets into general_chemistry and drug_discovery

* fix Zhishu for mocked GPT-4

* move --mlops option into llmreport entrypoint
Linmj-Judy self-assigned this Feb 27, 2024
Naplessss and others added 16 commits March 1, 2024 01:21
add various functions, including:
same_triplets()
pick_same_turples_in_pred()
same_turples()
pick_same_turples_in_pred()
entity_match()
pick_most_similar_entity_in_pred()
macro_f1_score_2()
macro_f1_score_3()
add evals for GDAS task
add eval task for B5CDR
add eval task for B5CDR
add eval task for B5CDR
add eval task for DDI