{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":614490633,"defaultBranch":"main","name":"evals_OpenAI","ownerLogin":"Arijit1000","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-03-15T17:32:36.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/73710898?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1678901564.822033","currentOid":""},"activityList":{"items":[{"before":"4a105ae89fc5dac016b0da184cd0661f98dc3ddd","after":"d3dc89042ddee879a68a326fdb37716ee518640c","ref":"refs/heads/main","pushedAt":"2024-06-10T06:21:24.000Z","pushType":"push","commitsCount":33,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Release 3.0.1 (#1525)\n\nRelease 3.0.1","shortMessageHtmlLink":"Release 3.0.1 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2272703206\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1525\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1525/hovercard\" href=\"https://github.com/openai/evals/pull/1525\">openai#1525</a>)"}},{"before":"0108dd7e76d5f8e07f333d24ad268530eba4b315","after":"4a105ae89fc5dac016b0da184cd0661f98dc3ddd","ref":"refs/heads/main","pushedAt":"2024-02-09T16:29:36.000Z","pushType":"push","commitsCount":23,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Updates for Solvers (#1461)\n\nWe provide an update to our Solvers infrastructure\r\n- Add a new README to onboard users wanting to work with solvers (beta)\r\n- Creating a separate folder for registration: `evals/registry/solvers`\r\n- Refactoring previous solver code to support reusability: NestedSolvers\r\nallow you to chain multiple solvers\r\n- New solvers: FewShotSolver, SelfConsistencySolver,\r\nOpenAIAssistantsSolver\r\n- A defaults.yaml for commonly reusable solvers\r\n- Change abstract method for Solver action from `__call__` to `_solver`\r\nso that task state is immutable\r\n\r\n---------\r\n\r\nCo-authored-by: johny-b <33967107+johny-b@users.noreply.github.com>\r\nCo-authored-by: ojaffe <ojaffe@users.noreply.github.com>\r\nCo-authored-by: Andrei Alexandru <inwaves@users.noreply.github.com>","shortMessageHtmlLink":"Updates for Solvers (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2101787432\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1461\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1461/hovercard\" href=\"https://github.com/openai/evals/pull/1461\">openai#1461</a>)"}},{"before":"305b237cdb3884c7ddb6a5d12cb184a83551fcba","after":"0108dd7e76d5f8e07f333d24ad268530eba4b315","ref":"refs/heads/main","pushedAt":"2023-12-15T22:09:23.000Z","pushType":"push","commitsCount":58,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Ballots v2 (#1390)\n\nThis is an update to the Ballots eval which includes\r\n\r\n- A better, cleaned, dataset\r\n- Improved prompting\r\n- Clearer README\r\n\r\n---------\r\n\r\nCo-authored-by: ojaffe <oliver.jaffe@hotmail.co.uk>","shortMessageHtmlLink":"Ballots v2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1965817933\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1390\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1390/hovercard\" href=\"https://github.com/openai/evals/pull/1390\">openai#1390</a>)"}},{"before":"1df05834aa52ee4265c226bf7fe85850b25df15c","after":"305b237cdb3884c7ddb6a5d12cb184a83551fcba","ref":"refs/heads/main","pushedAt":"2023-07-26T20:23:23.612Z","pushType":"push","commitsCount":37,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Irrelevant negative diversion (#1318)\n\nTests the model's reasoning ability in face of a negative diversion\r\n(e.g. \"However, ...\") with irrelevant information.\r\n\r\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, **failure to follow\r\nthe guidelines below will result in the PR being closed automatically**.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access be granted. 🚨\r\n\r\n**PLEASE READ THIS**:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject it since GPT-4 is already capable of completing\r\nthe task.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\nAlso, please note that we're using **Git LFS** for storing the JSON\r\nfiles, so please make sure that you move the JSON file to Git LFS before\r\nsubmitting a PR. Details on how to use Git LFS are available\r\n[here](https://git-lfs.com).\r\n\r\n## Eval details 📑\r\n\r\n### Eval name\r\n\r\nirrelevant-negative-diversion\r\n\r\n### Eval description\r\n\r\nThe eval tests the model's ability to reason. It has been demonstrated\r\nthat a negative diversion (e.g. \"However\", \"Despite that\", \"That being\r\nsaid\") can lead the model to a wrong conclusion. Even when the negative\r\ndiversion contains more or less irrelevant information (e.g. \"However,\r\nthey often squabbled as children.\")\r\n\r\n### What makes this a useful eval?\r\n\r\nI have tested GPT-4 through ChatGPT and can see that it often gets these\r\nwrong. It's a little bit random. Where all the samples have the ideal\r\nanswer of \"yes\", ChatGPT would often say No or more often say that it\r\nwas unable to conclude.\r\n\r\nThe prompt is asking the model to choose \"yes\" or \"no\" according to what\r\nis most reasonable.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high-quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your YAML is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (<https://platform.openai.com/docs/usage-policies>).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the commits on the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgment\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and the high volume of submissions, we will\r\nnot be able to accept all submissions and thus not grant everyone who\r\nopens a PR GPT-4 access. We know this is disappointing, but we hope to\r\nset the right expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access be\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields of this form\r\n- [x] I have used **Git LFS** for the Eval JSON data\r\n- [ ] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data\r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. However, Anna and Briana squabbled often as\r\nchildren. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. But, Anna and Briana squabbled often as\r\nchildren. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. Unfortunately, Anna and Briana squabbled often\r\nas children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Sadly, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Regrettably, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. But regrettably, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Even though, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Despite, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. That being said, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n  ```\r\n</details>","shortMessageHtmlLink":"Irrelevant negative diversion (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1805380910\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1318\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1318/hovercard\" href=\"https://github.com/openai/evals/pull/1318\">openai#1318</a>)"}},{"before":"1df05834aa52ee4265c226bf7fe85850b25df15c","after":"305b237cdb3884c7ddb6a5d12cb184a83551fcba","ref":"refs/heads/main","pushedAt":"2023-07-26T20:23:23.000Z","pushType":"push","commitsCount":37,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Irrelevant negative diversion (#1318)\n\nTests the model's reasoning ability in face of a negative diversion\r\n(e.g. \"However, ...\") with irrelevant information.\r\n\r\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, **failure to follow\r\nthe guidelines below will result in the PR being closed automatically**.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access be granted. 🚨\r\n\r\n**PLEASE READ THIS**:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject it since GPT-4 is already capable of completing\r\nthe task.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\nAlso, please note that we're using **Git LFS** for storing the JSON\r\nfiles, so please make sure that you move the JSON file to Git LFS before\r\nsubmitting a PR. Details on how to use Git LFS are available\r\n[here](https://git-lfs.com).\r\n\r\n## Eval details 📑\r\n\r\n### Eval name\r\n\r\nirrelevant-negative-diversion\r\n\r\n### Eval description\r\n\r\nThe eval tests the model's ability to reason. It has been demonstrated\r\nthat a negative diversion (e.g. \"However\", \"Despite that\", \"That being\r\nsaid\") can lead the model to a wrong conclusion. Even when the negative\r\ndiversion contains more or less irrelevant information (e.g. \"However,\r\nthey often squabbled as children.\")\r\n\r\n### What makes this a useful eval?\r\n\r\nI have tested GPT-4 through ChatGPT and can see that it often gets these\r\nwrong. It's a little bit random. Where all the samples have the ideal\r\nanswer of \"yes\", ChatGPT would often say No or more often say that it\r\nwas unable to conclude.\r\n\r\nThe prompt is asking the model to choose \"yes\" or \"no\" according to what\r\nis most reasonable.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high-quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your YAML is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (<https://platform.openai.com/docs/usage-policies>).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the commits on the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgment\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and the high volume of submissions, we will\r\nnot be able to accept all submissions and thus not grant everyone who\r\nopens a PR GPT-4 access. We know this is disappointing, but we hope to\r\nset the right expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access be\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields of this form\r\n- [x] I have used **Git LFS** for the Eval JSON data\r\n- [ ] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data\r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. However, Anna and Briana squabbled often as\r\nchildren. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. But, Anna and Briana squabbled often as\r\nchildren. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. Unfortunately, Anna and Briana squabbled often\r\nas children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Sadly, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Regrettably, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. But regrettably, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Even though, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Despite, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. That being said, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n  ```\r\n</details>","shortMessageHtmlLink":"Irrelevant negative diversion (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1805380910\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1318\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1318/hovercard\" href=\"https://github.com/openai/evals/pull/1318\">openai#1318</a>)"}},{"before":"170dfd886c0704588461af075393cc20cfb0480f","after":"1df05834aa52ee4265c226bf7fe85850b25df15c","ref":"refs/heads/main","pushedAt":"2023-07-05T22:18:02.830Z","pushType":"push","commitsCount":339,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"probabilities-word-problems (#941)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\n## Eval details 📑\r\n### Eval name\r\nprobabilities-word-problems\r\n\r\n### Eval description\r\n\r\nTest the model's ability to calculate probabilities given word problems.\r\n\r\n### What makes this a useful eval?\r\n\r\nThe ability of GPT models to solve probability word problems\r\ndemonstrates their capacity to reason through complex natural language\r\ntasks, requiring a strong understanding of probability and statistics.\r\nAs probability problems often require complex thought processes to\r\nsolve, the ability to accurately answer them highlights the model's\r\ncapability to learn and apply such concepts, similar to humans. Given\r\nthe frequent need to calculate probabilities in many contexts,\r\nreplicating this ability is crucial.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [] (Ignore if not submitting code) I have run `pip install pre-commit;\r\npre-commit install` and have verified that `black`, `isort`, and\r\n`autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a country club, 7% smoke cigars,\r\n28% smoke cigarettes and 5% smoke both. What percentage smoke neither\r\ncigars nor cigarettes?\"}],\"ideal\":\"70%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"During a visit to a primary care\r\nphysician’s office, the probability of having neither lab work nor\r\nreferral to a specialist is 0.21. Of those coming to that office, the\r\nprobability of having lab work is 0.41 and the probability of having a\r\nreferral is 0.53. What is the probability of having both lab work and a\r\nreferral?\"}],\"ideal\":\"15%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"For a certain car; blue, black,\r\nwhite and green are in demand. Three successive orders are placed for\r\ncars of this style. Find the probability that P( 2 Blue\r\n).\"}],\"ideal\":\"14%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"The chance of rain on a given day\r\nis 0.22. The chance of rain and thick clouds on a a given day is 0.11.\r\nThe chance of neither rain nor thick clouds on a given day is 0.55. What\r\nis the chance of thick clouds on a given day?\"}],\"ideal\":\"34%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A box contains 6 white and 4 red\r\nballs. We randomly (without replacement) draw two balls from the box.\r\nWhat is the probability that the second ball is red, given that the\r\nfirst ball is white?\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bowl A contains three red and two\r\nwhite chips, and bowl B contains four red and three white chips. A chip\r\nis drawn at random from bowl A and transferred to bowl B. Compute the\r\nprobability of then drawing a red chip from bowl B.\"}],\"ideal\":\"58%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A researcher finds that, of 982 men\r\nwho died in 2002, 221 died from some heart disease. Also, of the 982\r\nmen, 334 had at least one parent who had some heart disease. Of the\r\nlatter 334 men, 111 died from some heart disease. A man is selected from\r\nthe group of 982. Giventhat neither of his parents had some heart\r\ndisease, find the conditional probability that this man died of some\r\nheart disease.\"}],\"ideal\":\"17%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nwho are required to take a math course are also required to take a stat\r\ncourse?\"}],\"ideal\":\"38%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nrequired to take a stat course are also required to take a math\r\ncourse?\"}],\"ideal\":\"75%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"You buy a lottery ticket every day\r\nfor the next 5 consecutive days. The probability you win for each ticket\r\nis 0.20. What is the probability of having two winning tickets and three\r\nlosing tickets?\"}],\"ideal\":\"20%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Methods A and B are available for\r\nteaching a skill. The failure rate for A is 30%, and for B, 10%. B is\r\nmore expensive and is only used 20% of the time. A worker is taught the\r\nskill by one of the two methods but fails to learn it correctly. What is\r\nthe probability they were taught by A?\"}],\"ideal\":\"92%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly one player is\r\nsuccessful.\"}],\"ideal\":\"29%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly two players make\r\na field goal (i.e., one misses).\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Find\r\nthe probability P(G) that a seed selected at random from the mixed seeds\r\nwill germinate.\"}],\"ideal\":\"79%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Given\r\nthat a seed germinates, find the probability that the seed was purchased\r\nfrom supplier A.\"}],\"ideal\":\"43%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A test indicates the presence of a\r\nparticular disease 90% of the time when the disease is present and the\r\npresence of the disease 2% of the time when the disease is not present.\r\nIf 0.5% of the population has the disease, calculate the conditional\r\nprobability that a person selected at random has the disease if the test\r\nindicates the presence of the disease.\"}],\"ideal\":\"18%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Two processes of a company produce\r\nrolls of materials: The rolls of Process I are 3% defective and the\r\nrolls of Process II are 1% defective. Process I produces 60% of the\r\ncompany’s output, Process II 40%. A roll is selected at random from the\r\ntotal output. Given that this roll is defective, what is the conditional\r\nprobability that it is from Process I?\"}],\"ideal\":\"82%\"}\r\n```\r\n</details>","shortMessageHtmlLink":"probabilities-word-problems (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1700517791\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/941\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/941/hovercard\" href=\"https://github.com/openai/evals/pull/941\">openai#941</a>)"}},{"before":"170dfd886c0704588461af075393cc20cfb0480f","after":"1df05834aa52ee4265c226bf7fe85850b25df15c","ref":"refs/heads/main","pushedAt":"2023-07-05T22:18:02.000Z","pushType":"push","commitsCount":339,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"probabilities-word-problems (#941)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\n## Eval details 📑\r\n### Eval name\r\nprobabilities-word-problems\r\n\r\n### Eval description\r\n\r\nTest the model's ability to calculate probabilities given word problems.\r\n\r\n### What makes this a useful eval?\r\n\r\nThe ability of GPT models to solve probability word problems\r\ndemonstrates their capacity to reason through complex natural language\r\ntasks, requiring a strong understanding of probability and statistics.\r\nAs probability problems often require complex thought processes to\r\nsolve, the ability to accurately answer them highlights the model's\r\ncapability to learn and apply such concepts, similar to humans. Given\r\nthe frequent need to calculate probabilities in many contexts,\r\nreplicating this ability is crucial.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [] (Ignore if not submitting code) I have run `pip install pre-commit;\r\npre-commit install` and have verified that `black`, `isort`, and\r\n`autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a country club, 7% smoke cigars,\r\n28% smoke cigarettes and 5% smoke both. What percentage smoke neither\r\ncigars nor cigarettes?\"}],\"ideal\":\"70%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"During a visit to a primary care\r\nphysician’s office, the probability of having neither lab work nor\r\nreferral to a specialist is 0.21. Of those coming to that office, the\r\nprobability of having lab work is 0.41 and the probability of having a\r\nreferral is 0.53. What is the probability of having both lab work and a\r\nreferral?\"}],\"ideal\":\"15%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"For a certain car; blue, black,\r\nwhite and green are in demand. Three successive orders are placed for\r\ncars of this style. Find the probability that P( 2 Blue\r\n).\"}],\"ideal\":\"14%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"The chance of rain on a given day\r\nis 0.22. The chance of rain and thick clouds on a a given day is 0.11.\r\nThe chance of neither rain nor thick clouds on a given day is 0.55. What\r\nis the chance of thick clouds on a given day?\"}],\"ideal\":\"34%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A box contains 6 white and 4 red\r\nballs. We randomly (without replacement) draw two balls from the box.\r\nWhat is the probability that the second ball is red, given that the\r\nfirst ball is white?\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bowl A contains three red and two\r\nwhite chips, and bowl B contains four red and three white chips. A chip\r\nis drawn at random from bowl A and transferred to bowl B. Compute the\r\nprobability of then drawing a red chip from bowl B.\"}],\"ideal\":\"58%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A researcher finds that, of 982 men\r\nwho died in 2002, 221 died from some heart disease. Also, of the 982\r\nmen, 334 had at least one parent who had some heart disease. Of the\r\nlatter 334 men, 111 died from some heart disease. A man is selected from\r\nthe group of 982. Giventhat neither of his parents had some heart\r\ndisease, find the conditional probability that this man died of some\r\nheart disease.\"}],\"ideal\":\"17%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nwho are required to take a math course are also required to take a stat\r\ncourse?\"}],\"ideal\":\"38%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nrequired to take a stat course are also required to take a math\r\ncourse?\"}],\"ideal\":\"75%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"You buy a lottery ticket every day\r\nfor the next 5 consecutive days. The probability you win for each ticket\r\nis 0.20. What is the probability of having two winning tickets and three\r\nlosing tickets?\"}],\"ideal\":\"20%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Methods A and B are available for\r\nteaching a skill. The failure rate for A is 30%, and for B, 10%. B is\r\nmore expensive and is only used 20% of the time. A worker is taught the\r\nskill by one of the two methods but fails to learn it correctly. What is\r\nthe probability they were taught by A?\"}],\"ideal\":\"92%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly one player is\r\nsuccessful.\"}],\"ideal\":\"29%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly two players make\r\na field goal (i.e., one misses).\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Find\r\nthe probability P(G) that a seed selected at random from the mixed seeds\r\nwill germinate.\"}],\"ideal\":\"79%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Given\r\nthat a seed germinates, find the probability that the seed was purchased\r\nfrom supplier A.\"}],\"ideal\":\"43%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A test indicates the presence of a\r\nparticular disease 90% of the time when the disease is present and the\r\npresence of the disease 2% of the time when the disease is not present.\r\nIf 0.5% of the population has the disease, calculate the conditional\r\nprobability that a person selected at random has the disease if the test\r\nindicates the presence of the disease.\"}],\"ideal\":\"18%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Two processes of a company produce\r\nrolls of materials: The rolls of Process I are 3% defective and the\r\nrolls of Process II are 1% defective. Process I produces 60% of the\r\ncompany’s output, Process II 40%. A roll is selected at random from the\r\ntotal output. Given that this roll is defective, what is the conditional\r\nprobability that it is from Process I?\"}],\"ideal\":\"82%\"}\r\n```\r\n</details>","shortMessageHtmlLink":"probabilities-word-problems (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1700517791\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/941\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/941/hovercard\" href=\"https://github.com/openai/evals/pull/941\">openai#941</a>)"}},{"before":"eaee0ec09e6e78c1a2ccab76f5c942c7dba224a0","after":"170dfd886c0704588461af075393cc20cfb0480f","ref":"refs/heads/main","pushedAt":"2023-05-11T14:06:22.173Z","pushType":"push","commitsCount":3,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[Eval] An array of Liar Paradox-based evals (#883)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\n## Eval details 📑\r\n### Eval name\r\nlogic-liar-paradox\r\n\r\n### Eval description\r\n\r\nAn array of Liar Paradox-based evals, examining the model's proficiency\r\nin navigating linguistic nuances and logical reasoning within\r\nself-referential statements.\r\n\r\n### What makes this a useful eval?\r\n\r\nThis eval is particularly useful because it delves into complex, nuanced\r\nlogical concepts and self-referential statements, which have\r\nhistorically posed challenges for AI models. By exploring various\r\ncontexts, alternative logical frameworks, and modifications to\r\nstatements, this eval helps assess the model's ability to adapt to\r\ndifferent perspectives, grasp subtleties in language, and engage in\r\nflexible reasoning. The ability to understand and navigate paradoxes is\r\nan essential aspect of human-like reasoning, and improving an AI model's\r\nperformance in this area would significantly enhance its overall\r\nusefulness and reliability in real-world applications. Additionally,\r\nshowcasing the model's improved proficiency in handling paradoxes would\r\nnot only make for a compelling marketing angle (as paradoxes are\r\nunderstood by a much broader range of people than other difficult tasks\r\nsuch as pure maths or quantum mechanics) but it would also demonstrate\r\nthe progress made in AI's capacity to think and reason more like humans.\r\nIt also adds paradox-absorbing crumple zones.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n- [x] Addresses complex logical reasoning: The eval focuses on AI's\r\nability to comprehend and navigate paradoxes, self-referential\r\nstatements, and context switching, which are important aspects of\r\nhuman-like reasoning. By testing the model's proficiency in these areas,\r\nwe can identify areas for improvement and work towards enhancing AI's\r\noverall capacity to think and reason more like humans.\r\n- [x] Demonstrates adaptability and flexibility: The eval showcases the\r\nmodel's ability to switch between contexts, alter premises, and engage\r\nwith different dimensions of inferred logic. This will help assess the\r\nmodel's adaptability and flexibility in diverse real-world situations,\r\nmaking it more reliable and useful.\r\n- [x] Contributes to AI safety and understanding: By identifying the\r\nmodel's weaknesses and limitations in handling paradoxes and complex\r\nlogical constructs, the eval can contribute to AI safety and enable\r\nresearchers to better understand the challenges faced by large language\r\nmodels in these areas.\r\n- [x] Engaging and appealing: An eval that delves into paradoxes and\r\ncomplex thought exercises is not only intellectually stimulating but\r\nalso adds an appealing element to showcase the model's capabilities,\r\nmaking it more attractive for both researchers and end-users.\r\n\r\n### Unique eval value\r\n\r\n- [x] Encourages creativity and lateral thinking: The eval, by focusing\r\non paradoxes and complex logical constructs, encourages both the AI and\r\nits developers to think creatively and approach problem-solving from\r\nunconventional angles. This can lead to the discovery of novel solutions\r\nand a better understanding of the model's capabilities.\r\n- [x] Aligns with human values and expectations: An AI that can\r\nsuccessfully navigate paradoxes and complex logic is more likely to\r\nalign with human values and expectations. By addressing these challenges\r\nin the eval, we strive to develop AI systems that understand and respect\r\nthe nuances of human thought and communication.\r\n- [x] Addresses a broad range of applications: Improved reasoning and\r\ncontext-switching abilities can have a significant impact on various AI\r\napplications, including natural language understanding, decision-making,\r\nand problem-solving in domains such as law, philosophy, ethics, and\r\nmore.\r\n- [x] Fosters interdisciplinary collaboration: The exploration of\r\nparadoxes and complex logic often draws upon insights from multiple\r\ndisciplines, including philosophy, linguistics, psychology, and computer\r\nscience. This eval can help foster interdisciplinary collaboration,\r\nleading to richer and more diverse perspectives on AI development and\r\nits potential impact on society.\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [ ] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"Is the statement \\\"This sentence\r\nis true\\\" a paradox?\"}],\"ideal\":\"No, \\\"This sentence is true\\\" is not a\r\nparadox; it is a tautology.\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"Does the statement \\\"This sentence\r\nis not true\\\" create a paradox?\"}],\"ideal\":\"Yes, \\\"This sentence is not\r\ntrue\\\" creates a paradox, as it is a variant of the Liar Paradox.\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"What is the difference between a\r\nparadox and a tautology?\"}],\"ideal\":\"A paradox is a statement that leads\r\nto a self-contradictory or logically unacceptable conclusion, while a\r\ntautology is a statement that is always true by its logical form.\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"Can the Liar Paradox be resolved\r\nby assuming that sentences can have both true and false\r\nvalues?\"}],\"ideal\":\"No, the Liar Paradox cannot be resolved by assuming\r\nthat sentences can have both true and false values, as this would lead\r\nto a different kind of paradox called the \\\"Dialetheism Paradox.\\\"\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"Consider the statement \\\"This\r\nsentence is neither true nor false.\\\" Is this statement an example of\r\nthe Liar Paradox?\"}],\"ideal\":\"This statement, \\\"This sentence is neither\r\ntrue nor false,\\\" is not an example of the Liar Paradox, but it is a\r\nsimilar paradox known as the 'truth-teller paradox' or the 'strengthened\r\nliar paradox.' It creates a paradoxical situation because if the\r\nstatement is true, then it is neither true nor false, which contradicts\r\nits truth. If the statement is false, then it is not the case that it is\r\nneither true nor false, which implies that it is either true or false,\r\nagain leading to a contradiction. The paradox arises due to\r\nself-reference and the inability to assign a consistent truth value to\r\nthe statement.\"}\r\n  ```\r\n</details>","shortMessageHtmlLink":"[Eval] An array of Liar Paradox-based evals (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1690604730\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/883\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/883/hovercard\" href=\"https://github.com/openai/evals/pull/883\">openai#883</a>)"}},{"before":"a6fe832e7456ce82337f3f355b859c317a94c280","after":"eaee0ec09e6e78c1a2ccab76f5c942c7dba224a0","ref":"refs/heads/main","pushedAt":"2023-05-06T20:08:04.411Z","pushType":"push","commitsCount":48,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[evals] added pause/unpause to record.py (#898)\n\n- this allows more flexibly controlling when PromptFn etc records logs\r\nor not\r\n\r\n```\r\nwith recorder.as_default_recorder():\r\n  evals.record.pause() \r\n  # ... any record_event() is skipped\r\n  evals.record.unpause()\r\n```","shortMessageHtmlLink":"[evals] added pause/unpause to record.py (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1693276925\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/898\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/898/hovercard\" href=\"https://github.com/openai/evals/pull/898\">openai#898</a>)"}},{"before":"f7ebbe8ae9cd1a94e061bdbd116127f130f4ed4c","after":"a6fe832e7456ce82337f3f355b859c317a94c280","ref":"refs/heads/main","pushedAt":"2023-04-17T15:16:48.187Z","pushType":"push","commitsCount":26,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[Evals] Update JSON validator to use match check (#682)","shortMessageHtmlLink":"[Evals] Update JSON validator to use match check (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1669083484\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/682\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/682/hovercard\" href=\"https://github.com/openai/evals/pull/682\">openai#682</a>)"}},{"before":"e8c09c867aa4f46cd409a40795cfdfd3d349032a","after":"f7ebbe8ae9cd1a94e061bdbd116127f130f4ed4c","ref":"refs/heads/main","pushedAt":"2023-04-09T21:46:22.267Z","pushType":"push","commitsCount":2,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[evals] added format() to ModelGradedSpec (#597)\n\n- 'in_message' and 'out_message' formatting for modelgraded evals\r\n- factored out append_answer_prompt function","shortMessageHtmlLink":"[evals] added format() to ModelGradedSpec (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1656703325\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/597\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/597/hovercard\" href=\"https://github.com/openai/evals/pull/597\">openai#597</a>)"}},{"before":"2486f7e3a3247ccc388c37f7459736af13a2dca1","after":"e8c09c867aa4f46cd409a40795cfdfd3d349032a","ref":"refs/heads/main","pushedAt":"2023-04-04T16:47:42.834Z","pushType":"push","commitsCount":1,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[evals] refactored modelgraded eval (#578)\n\n- moved functions to modelgraded/classify_utils.py\r\n- defined ModelGradedSpec and moved to modelgraded/base.py\r\n- unified interface on registry.py\r\n- other misc refactoring","shortMessageHtmlLink":"[evals] refactored modelgraded eval (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1653157424\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/578\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/578/hovercard\" href=\"https://github.com/openai/evals/pull/578\">openai#578</a>)"}},{"before":"db4932283f07da66062b3412ce002d9ec5a072a5","after":"2486f7e3a3247ccc388c37f7459736af13a2dca1","ref":"refs/heads/main","pushedAt":"2023-03-30T07:12:37.175Z","pushType":"push","commitsCount":73,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"[evals] added eval_model flag to modelgraded eval (#519)","shortMessageHtmlLink":"[evals] added eval_model flag to modelgraded eval (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1646845032\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/519\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/519/hovercard\" href=\"https://github.com/openai/evals/pull/519\">openai#519</a>)"}},{"before":"a7fe8e0ac5c4e2b71975bef5db10d73f64996b0f","after":"db4932283f07da66062b3412ce002d9ec5a072a5","ref":"refs/heads/main","pushedAt":"2023-03-17T06:10:26.128Z","pushType":"push","commitsCount":24,"pusher":{"login":"Arijit1000","name":null,"path":"/Arijit1000","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/73710898?s=80&v=4"},"commit":{"message":"Merge pull request #260 from openai/sg-japan\n\n[evals] added multilingual example and support","shortMessageHtmlLink":"Merge pull request <a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1628263478\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/260\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/260/hovercard\" href=\"https://github.com/openai/evals/pull/260\">openai#260</a> from openai/sg-japan"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEYMNsBgA","startCursor":null,"endCursor":null}},"title":"Activity · Arijit1000/evals_OpenAI"}