{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":616860927,"defaultBranch":"main","name":"evals","ownerLogin":"msapaydin","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-03-21T08:30:35.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/4745554?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1679387443.443904","currentOid":""},"activityList":{"items":[{"before":"f118fca38e3dde127d2eba44374806f581b0da1c","after":"305b237cdb3884c7ddb6a5d12cb184a83551fcba","ref":"refs/heads/main","pushedAt":"2023-07-25T03:21:37.499Z","pushType":"push","commitsCount":514,"pusher":{"login":"msapaydin","name":"Mehmet Serkan Apaydın","path":"/msapaydin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4745554?s=80&v=4"},"commit":{"message":"Irrelevant negative diversion (#1318)\n\nTests the model's reasoning ability in face of a negative diversion\r\n(e.g. \"However, ...\") with irrelevant information.\r\n\r\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, **failure to follow\r\nthe guidelines below will result in the PR being closed automatically**.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access be granted. 🚨\r\n\r\n**PLEASE READ THIS**:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject it since GPT-4 is already capable of completing\r\nthe task.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. 
**Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\nAlso, please note that we're using **Git LFS** for storing the JSON\r\nfiles, so please make sure that you move the JSON file to Git LFS before\r\nsubmitting a PR. Details on how to use Git LFS are available\r\n[here](https://git-lfs.com).\r\n\r\n## Eval details 📑\r\n\r\n### Eval name\r\n\r\nirrelevant-negative-diversion\r\n\r\n### Eval description\r\n\r\nThe eval tests the model's ability to reason. It has been demonstrated\r\nthat a negative diversion (e.g. \"However\", \"Despite that\", \"That being\r\nsaid\") can lead the model to a wrong conclusion. Even when the negative\r\ndiversion contains more or less irrelevant information (e.g. \"However,\r\nthey often squabbled as children.\")\r\n\r\n### What makes this a useful eval?\r\n\r\nI have tested GPT-4 through ChatGPT and can see that it often gets these\r\nwrong. It's a little bit random. Where all the samples have the ideal\r\nanswer of \"yes\", ChatGPT would often say No or more often say that it\r\nwas unable to conclude.\r\n\r\nThe prompt is asking the model to choose \"yes\" or \"no\" according to what\r\nis most reasonable.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. 
For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high-quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your YAML is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (<https://platform.openai.com/docs/usage-policies>).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. 
Access will be given to the email\r\naddress associated with the commits on the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgment\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and the high volume of submissions, we will\r\nnot be able to accept all submissions and thus not grant everyone who\r\nopens a PR GPT-4 access. We know this is disappointing, but we hope to\r\nset the right expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access be\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields of this form\r\n- [x] I have used **Git LFS** for the Eval JSON data\r\n- [ ] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data\r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. However, Anna and Briana squabbled often as\r\nchildren. 
Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. But, Anna and Briana squabbled often as\r\nchildren. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can end world poverty,\r\nif Anna gives Briana $10. Unfortunately, Anna and Briana squabbled often\r\nas children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Sadly, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. Regrettably, Anna and Briana\r\nsquabbled often as children. 
Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce child\r\nmortality, if Anna gives Briana $10. But regrettably, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Even though, Anna and Briana\r\nsquabbled often as children. Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. Despite, Anna and Briana squabbled\r\noften as children. Will Anna give Briana the money?\"}],\"ideal\":\"Yes\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"You are a helpful\r\nassistant.\"},{\"role\":\"user\",\"content\":\"Answer only yes or no to the\r\nfollowing question in accordance with what seems most\r\nreasonable.\\r\\n\\r\\nAnna is convinced that Briana can reduce congenital\r\nblindness, if Anna gives Briana $10. That being said, Anna and Briana\r\nsquabbled often as children. 
Will Anna give Briana the\r\nmoney?\"}],\"ideal\":\"Yes\"}\r\n  ```\r\n</details>","shortMessageHtmlLink":"Irrelevant negative diversion (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1805380910\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1318\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1318/hovercard\" href=\"https://github.com/openai/evals/pull/1318\">openai#1318</a>)"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAADXFXCOQA","startCursor":null,"endCursor":null}},"title":"Activity · msapaydin/evals"}