{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":628436998,"defaultBranch":"main","name":"evals","ownerLogin":"yuiseki","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-04-16T00:01:38.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/25507?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1704841941.0","currentOid":""},"activityList":{"items":[{"before":"c66b5c1337cf2b65b72045bcdcfaeeacc0eafad2","after":"82ec660eedb7a1d0a8fb787910cfb7fe7108dec6","ref":"refs/heads/main","pushedAt":"2024-02-28T23:43:32.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Suppress 'HTTP/1.1 200 OK' logs from openai library (#1468)\n\nSince [the `openai-python` library\r\nupdate](https://github.com/openai/evals/pull/1444), eval runs are\r\ngetting flooded with excessive \"HTTP/1.1. 
200 OK\" logs from the openai\r\nlibrary:\r\n```\r\njunshern@JunSherns-MacBook-Pro ⚒ oaieval gpt-3.5-turbo 2d_movement\r\n[2024-02-15 12:22:08,549] [registry.py:262] Loading registry from /Users/junshern/projects/oss_evals/evals/evals/registry/evals\r\n[2024-02-15 12:22:08,898] [registry.py:262] Loading registry from /Users/junshern/.evals/evals\r\n[2024-02-15 12:22:08,900] [oaieval.py:211] Run started: 240215042208OCODJ2NY\r\n[2024-02-15 12:22:08,949] [data.py:94] Fetching /Users/junshern/projects/oss_evals/evals/evals/registry/data/2d_movement/samples.jsonl\r\n[2024-02-15 12:22:08,949] [eval.py:36] Evaluating 100 samples\r\n[2024-02-15 12:22:08,955] [eval.py:144] Running in threaded mode with 10 threads!\r\n 0%| | 0/100 [00:00openai#1468)"}},{"before":"2981e6544a4d9512ff4d3998d483673a7d03db2e","after":"c66b5c1337cf2b65b72045bcdcfaeeacc0eafad2","ref":"refs/heads/main","pushedAt":"2024-01-12T23:08:09.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Fix formatting/typing so pre-commit hooks pass (#1451)\n\n(Not an eval)\r\n\r\n**One-line summary**: Pre-commit hooks were failing. I identified the\r\nmain cause, and then fixed all secondary pre-commit issues. I only\r\nchanged the logic in one place, `oiaevalset.py`.\r\n\r\nI was having issues with type-hinting and identified that the old\r\n`typings` directory was causing the `from openai import OpenAI` import\r\nto register as an error. 
I decided to go through and fix all the issues\r\nthat appeared in `pre-commit run --all-files`.\r\n\r\nNOTE: \r\n- I changed the logic in `oaievalset.py` by adding a `continue`\r\nstatement if an `eval` or `eval.key` was missing.\r\n- As far as I can tell this should basically never happen, but is\r\ncorrect behavior.\r\n- Another option would be to assert that `eval` and `eval.key` are not\r\n`None` but forcing an error here doesn't match what I interpret as\r\nintended behavior.\r\n\r\nThe manual work involved was mainly:\r\n\r\n1. Deleting the `typings` directory, which was interfering with `openai`\r\ntype-hints (such as `from openai import OpenAI`)\r\n2. Fixing type issues in `oaievalset.py`.\r\n3. Moving the `client =\r\nOpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))` line below all the\r\nimports.\r\n4. Breaking lines of length >767 into smaller chunks using line\r\ncontinuation.\r\n\r\nThus this PR is broken into three parts:\r\n\r\n1. Deleting `typings` (first commit)\r\n2. Manually cleaning up issues (middle commits)\r\n3. 
Applying autofixes from the pre-commit hooks (last commit)","shortMessageHtmlLink":"Fix formatting/typing so pre-commit hooks pass (openai#1451)"}},{"before":"bccb32410918a5c30cf248ed55ed87ddb3c65847","after":null,"ref":"refs/heads/renovate/configure","pushedAt":"2024-01-09T23:12:21.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"renovate[bot]","name":null,"path":"/apps/renovate","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/2740?s=80&v=4"}},{"before":null,"after":"bccb32410918a5c30cf248ed55ed87ddb3c65847","ref":"refs/heads/renovate/configure","pushedAt":"2024-01-08T13:00:59.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"renovate[bot]","name":null,"path":"/apps/renovate","primaryAvatarUrl":"https://avatars.githubusercontent.com/in/2740?s=80&v=4"},"commit":{"message":"Add renovate.json","shortMessageHtmlLink":"Add renovate.json"}},{"before":"a53f52b3c104542ec1953d26a4bd551685da8bb4","after":"2981e6544a4d9512ff4d3998d483673a7d03db2e","ref":"refs/heads/main","pushedAt":"2024-01-08T13:00:45.000Z","pushType":"push","commitsCount":37,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Improve MMMU performance with prompt engineering (#1450)\n\nWith this improvement we now have a 0-shot performance of 59.6%\r\n(averaged over 3 eval runs) on the MMMU validation set, which beats the\r\n56.8% reported in the [MMMU paper](https://arxiv.org/pdf/2311.16502.pdf)","shortMessageHtmlLink":"Improve MMMU performance with prompt engineering (openai#1450)"}},{"before":"dd96814dd96bd64f3098afca8dc873aa8d8ce4c8","after":"a53f52b3c104542ec1953d26a4bd551685da8bb4","ref":"refs/heads/main","pushedAt":"2023-11-11T00:01:35.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Add 
new Solvers framework (#1397)\n\n# Solvers\r\n\r\nIn this PR, we introduce a new abstraction called \"Solvers\" as an\r\nintermediary interface between an Eval and a CompletionFn.\r\n\r\n## Motivation\r\nThis addresses some difficulties we previously had:\r\n- We want to be able to easily run and compare different kinds of model\r\nscaffolding approaches against a given Eval.\r\n- The current interface for CompletionFns requires users to pass a\r\n**prompt** to the CompletionFn, which encourages the eval designer to\r\nwrite a prompt that often privileges a particular kind of model over\r\nothers and often locks-in the scaffolding approach. e.g. If developing\r\nwith ChatCompletion models, the resulting prompt will usually work best\r\nfor ChatCompletion models.\r\n- It’s technically possible for eval designers to write solver-agnostic\r\nprompts, but the string format is hard to parse and reshape into new\r\nprompts. To enable flexibility, you want to provide instructions,\r\ninputs, previous interactions, and other task data separately rather\r\nthan just a single string.\r\n\r\n## Solution\r\n- In our proposed approach, we clearly separate the responsibilities of\r\ndefining the rules, inputs, and metrics for a task (the \"Eval\") from the\r\nresponsibility of solving the task (the \"Solver\").\r\n- An Eval's responsibility is to construct a structured TaskState object\r\ncontaining all the necessary information for the eval, but the Eval\r\nitself is unopinionated about how that information should be used. In\r\nother words, the Eval should be agnostic to the Solver that attempts it.\r\n- A Solver receives the TaskState object and decides how to use that\r\ninformation -- e.g. concatenating it into a prompt and passing that\r\nprompt into a CompletionFn. 
Note that a Solver can generate its response\r\nin any way, and may call any number of CompletionFn's, wait for human\r\ninput, or generate a response from a programmatic bot without any models\r\ninvolved.\r\n- When the Solver is done, it returns a SolverResult to be judged by the\r\nEval.\r\n\r\n## What's new\r\n- We introduce a `Solver` class that inherits from `CompletionFn`. This\r\nlooks largely the same as a CompletionFn except that its input is a\r\nstructured TaskState object instead of a plain string prompt.\r\n- Along with the Solver base class, we also introduce a variety of\r\nSolvers that are useful for various models including a HumanCLISolver,\r\nOpenAIChatCompletionSolver, OpenAICompletionSolver, and more!\r\n- We introduce a `SolverEval` class that inherits from `Eval`, which\r\nshould be used by any eval that wants to use solvers. Key features:\r\n- Allows us to be explicit about what kind of eval we're building, and\r\nenforces checks on the input completion_fn to see if it is a compatible\r\nSolver.\r\n- Creates a new copy of the solver for each run of `eval_sample`, to\r\nallow for stateful solvers (e.g. agents with memory) without interfering\r\nwith other sample runs.\r\n- Add new generic `MatchWithSolvers` class which is similar to a `Match`\r\nEval class but uses SolverEval instead.\r\n\r\n## Usage and Compatibility\r\nAs before, once a new SolverEval and Solver have been registered to\r\n`evals/registry/evals` and `evals/registry/completion_fns` respectively,\r\none can run an eval with:\r\n```bash\r\noaieval \r\n```\r\nwhere `` is a Solver and `` is a SolverEval.\r\n\r\nIn general, Solvers are not compatible with plain Evals, and SolverEvals\r\nare not compatible with plain CompletionFns (since the passing of the\r\nTaskState object is a breaking change on the interface). 
That said, we\r\nprovide wrappers for the common `OpenAICompletionFn` and\r\n`OpenAIChatCompletionFn` so that users can use these simple model-based\r\ncompletion_fns with SolverEvals out-of-the-box:\r\n```bash\r\noaieval gpt-4 \r\n```","shortMessageHtmlLink":"Add new Solvers framework (openai#1397)"}},{"before":"e49868e550babb7b1c5b4223c9b7a14511bf114d","after":"dd96814dd96bd64f3098afca8dc873aa8d8ce4c8","ref":"refs/heads/main","pushedAt":"2023-10-09T02:46:22.000Z","pushType":"push","commitsCount":4,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Adding ruff, running pre-commit hooks, small fixes and documentation (#1303)\n\nThis doesn't contribute an Eval but slightly improves the developer\r\nexperience for contributors.","shortMessageHtmlLink":"Adding ruff, running pre-commit hooks, small fixes and documentation (o…"}},{"before":"aff5e9ce09abc461b0fd3a768afd629da923ab90","after":"e49868e550babb7b1c5b4223c9b7a14511bf114d","ref":"refs/heads/main","pushedAt":"2023-09-21T12:59:24.000Z","pushType":"push","commitsCount":65,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Amend contribution statement for make_me_say (#1361)","shortMessageHtmlLink":"Amend contribution statement for make_me_say (openai#1361)"}},{"before":"1df05834aa52ee4265c226bf7fe85850b25df15c","after":"aff5e9ce09abc461b0fd3a768afd629da923ab90","ref":"refs/heads/main","pushedAt":"2023-07-07T23:29:38.828Z","pushType":"push","commitsCount":1,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Bug fix: gpt-4-base runs with ChatCompletion (#1300)\n\nIf you run an eval with `gpt-4-base` you get the following error:\r\n```\r\nopenai.error.InvalidRequestError: This is not a chat 
model and thus not supported in the v1/chat/completions endpoint. Did you mean to use v1/completions?\r\n```\r\n\r\nExample: run `oaieval gpt-4-base,gpt-4 multiturn` on\r\n[commit](https://github.com/openai/evals/commit/413402ecc5115a21710acbd4b844c3668052c874)\r\n\r\n---\r\n\r\nWith this fix, you can run evals with `gpt-4-base` without the error. \r\n\r\nExample: run `oaieval gpt-4-base,gpt-4`\r\n[commit](https://github.com/openai/evals/commit/e1230bdd82e15a4eeaea5f7ae726924bab72631d)","shortMessageHtmlLink":"Bug fix: gpt-4-base runs with ChatCompletion (openai#1300)"}},{"before":"1df05834aa52ee4265c226bf7fe85850b25df15c","after":"aff5e9ce09abc461b0fd3a768afd629da923ab90","ref":"refs/heads/main","pushedAt":"2023-07-07T23:29:38.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Bug fix: gpt-4-base runs with ChatCompletion (#1300)\n\nIf you run an eval with `gpt-4-base` you get the following error:\r\n```\r\nopenai.error.InvalidRequestError: This is not a chat model and thus not supported in the v1/chat/completions endpoint. Did you mean to use v1/completions?\r\n```\r\n\r\nExample: run `oaieval gpt-4-base,gpt-4 multiturn` on\r\n[commit](https://github.com/openai/evals/commit/413402ecc5115a21710acbd4b844c3668052c874)\r\n\r\n---\r\n\r\nWith this fix, you can run evals with `gpt-4-base` without the error. 
\r\n\r\nExample: run `oaieval gpt-4-base,gpt-4`\r\n[commit](https://github.com/openai/evals/commit/e1230bdd82e15a4eeaea5f7ae726924bab72631d)","shortMessageHtmlLink":"Bug fix: gpt-4-base runs with ChatCompletion (openai#1300)"}},{"before":"36c2c742650a1c7cade757255c8a496af6dd18d5","after":"1df05834aa52ee4265c226bf7fe85850b25df15c","ref":"refs/heads/main","pushedAt":"2023-07-05T06:27:21.264Z","pushType":"push","commitsCount":143,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"probabilities-word-problems (#941)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. 
**Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\n## Eval details 📑\r\n### Eval name\r\nprobabilities-word-problems\r\n\r\n### Eval description\r\n\r\nTest the model's ability to calculate probabilities given word problems.\r\n\r\n### What makes this a useful eval?\r\n\r\nThe ability of GPT models to solve probability word problems\r\ndemonstrates their capacity to reason through complex natural language\r\ntasks, requiring a strong understanding of probability and statistics.\r\nAs probability problems often require complex thought processes to\r\nsolve, the ability to accurately answer them highlights the model's\r\ncapability to learn and apply such concepts, similar to humans. Given\r\nthe frequent need to calculate probabilities in many contexts,\r\nreplicating this ability is crucial.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. 
This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. 
Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [] (Ignore if not submitting code) I have run `pip install pre-commit;\r\npre-commit install` and have verified that `black`, `isort`, and\r\n`autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a country club, 7% smoke cigars,\r\n28% smoke cigarettes and 5% smoke both. What percentage smoke neither\r\ncigars nor cigarettes?\"}],\"ideal\":\"70%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"During a visit to a primary care\r\nphysician’s office, the probability of having neither lab work nor\r\nreferral to a specialist is 0.21. Of those coming to that office, the\r\nprobability of having lab work is 0.41 and the probability of having a\r\nreferral is 0.53. What is the probability of having both lab work and a\r\nreferral?\"}],\"ideal\":\"15%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"For a certain car; blue, black,\r\nwhite and green are in demand. Three successive orders are placed for\r\ncars of this style. Find the probability that P( 2 Blue\r\n).\"}],\"ideal\":\"14%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"The chance of rain on a given day\r\nis 0.22. The chance of rain and thick clouds on a a given day is 0.11.\r\nThe chance of neither rain nor thick clouds on a given day is 0.55. What\r\nis the chance of thick clouds on a given day?\"}],\"ideal\":\"34%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A box contains 6 white and 4 red\r\nballs. 
We randomly (without replacement) draw two balls from the box.\r\nWhat is the probability that the second ball is red, given that the\r\nfirst ball is white?\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bowl A contains three red and two\r\nwhite chips, and bowl B contains four red and three white chips. A chip\r\nis drawn at random from bowl A and transferred to bowl B. Compute the\r\nprobability of then drawing a red chip from bowl B.\"}],\"ideal\":\"58%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A researcher finds that, of 982 men\r\nwho died in 2002, 221 died from some heart disease. Also, of the 982\r\nmen, 334 had at least one parent who had some heart disease. Of the\r\nlatter 334 men, 111 died from some heart disease. A man is selected from\r\nthe group of 982. Giventhat neither of his parents had some heart\r\ndisease, find the conditional probability that this man died of some\r\nheart disease.\"}],\"ideal\":\"17%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nwho are required to take a math course are also required to take a stat\r\ncourse?\"}],\"ideal\":\"38%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nrequired to take a stat course are also required to take a math\r\ncourse?\"}],\"ideal\":\"75%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"You buy a lottery ticket every day\r\nfor the next 5 consecutive days. The probability you win for each ticket\r\nis 0.20. What is the probability of having two winning tickets and three\r\nlosing tickets?\"}],\"ideal\":\"20%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Methods A and B are available for\r\nteaching a skill. The failure rate for A is 30%, and for B, 10%. B is\r\nmore expensive and is only used 20% of the time. A worker is taught the\r\nskill by one of the two methods but fails to learn it correctly. What is\r\nthe probability they were taught by A?\"}],\"ideal\":\"92%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly one player is\r\nsuccessful.\"}],\"ideal\":\"29%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly two players make\r\na field goal (i.e., one misses).\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Find\r\nthe probability P(G) that a seed selected at random from the mixed seeds\r\nwill germinate.\"}],\"ideal\":\"79%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Given\r\nthat a seed germinates, find the probability that the seed was purchased\r\nfrom supplier A.\"}],\"ideal\":\"43%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A test indicates the presence of a\r\nparticular disease 90% of the time when the disease is present and the\r\npresence of the disease 2% of the time when the disease is not present.\r\nIf 0.5% of the population has the disease, calculate the conditional\r\nprobability that a person selected at random has the disease if the test\r\nindicates the presence of the disease.\"}],\"ideal\":\"18%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Two processes of a company produce\r\nrolls of materials: The rolls of Process I are 3% defective and the\r\nrolls of Process II are 1% defective. Process I produces 60% of the\r\ncompany’s output, Process II 40%. A roll is selected at random from the\r\ntotal output. Given that this roll is defective, what is the conditional\r\nprobability that it is from Process I?\"}],\"ideal\":\"82%\"}\r\n```\r\n
","shortMessageHtmlLink":"probabilities-word-problems (openai#941)"}},{"before":"36c2c742650a1c7cade757255c8a496af6dd18d5","after":"1df05834aa52ee4265c226bf7fe85850b25df15c","ref":"refs/heads/main","pushedAt":"2023-07-05T06:27:21.000Z","pushType":"push","commitsCount":143,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"probabilities-word-problems (#941)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. 
**Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\n## Eval details 📑\r\n### Eval name\r\nprobabilities-word-problems\r\n\r\n### Eval description\r\n\r\nTest the model's ability to calculate probabilities given word problems.\r\n\r\n### What makes this a useful eval?\r\n\r\nThe ability of GPT models to solve probability word problems\r\ndemonstrates their capacity to reason through complex natural language\r\ntasks, requiring a strong understanding of probability and statistics.\r\nAs probability problems often require complex thought processes to\r\nsolve, the ability to accurately answer them highlights the model's\r\ncapability to learn and apply such concepts, similar to humans. Given\r\nthe frequent need to calculate probabilities in many contexts,\r\nreplicating this ability is crucial.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. 
This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] **Include at least 15 high quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. 
Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [] (Ignore if not submitting code) I have run `pip install pre-commit;\r\npre-commit install` and have verified that `black`, `isort`, and\r\n`autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a country club, 7% smoke cigars,\r\n28% smoke cigarettes and 5% smoke both. What percentage smoke neither\r\ncigars nor cigarettes?\"}],\"ideal\":\"70%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"During a visit to a primary care\r\nphysician’s office, the probability of having neither lab work nor\r\nreferral to a specialist is 0.21. Of those coming to that office, the\r\nprobability of having lab work is 0.41 and the probability of having a\r\nreferral is 0.53. What is the probability of having both lab work and a\r\nreferral?\"}],\"ideal\":\"15%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"For a certain car; blue, black,\r\nwhite and green are in demand. Three successive orders are placed for\r\ncars of this style. Find the probability that P( 2 Blue\r\n).\"}],\"ideal\":\"14%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"The chance of rain on a given day\r\nis 0.22. The chance of rain and thick clouds on a a given day is 0.11.\r\nThe chance of neither rain nor thick clouds on a given day is 0.55. What\r\nis the chance of thick clouds on a given day?\"}],\"ideal\":\"34%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A box contains 6 white and 4 red\r\nballs. 
We randomly (without replacement) draw two balls from the box.\r\nWhat is the probability that the second ball is red, given that the\r\nfirst ball is white?\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bowl A contains three red and two\r\nwhite chips, and bowl B contains four red and three white chips. A chip\r\nis drawn at random from bowl A and transferred to bowl B. Compute the\r\nprobability of then drawing a red chip from bowl B.\"}],\"ideal\":\"58%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A researcher finds that, of 982 men\r\nwho died in 2002, 221 died from some heart disease. Also, of the 982\r\nmen, 334 had at least one parent who had some heart disease. Of the\r\nlatter 334 men, 111 died from some heart disease. A man is selected from\r\nthe group of 982. Giventhat neither of his parents had some heart\r\ndisease, find the conditional probability that this man died of some\r\nheart disease.\"}],\"ideal\":\"17%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nwho are required to take a math course are also required to take a stat\r\ncourse?\"}],\"ideal\":\"38%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"At a certain college, 80% of\r\nstudents are required to take a math course, 40% are required to take a\r\nstatistics course and 30% are required to take both. What % of students\r\nrequired to take a stat course are also required to take a math\r\ncourse?\"}],\"ideal\":\"75%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"You buy a lottery ticket every day\r\nfor the next 5 consecutive days. The probability you win for each ticket\r\nis 0.20. What is the probability of having two winning tickets and three\r\nlosing tickets?\"}],\"ideal\":\"20%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Methods A and B are available for\r\nteaching a skill. The failure rate for A is 30%, and for B, 10%. B is\r\nmore expensive and is only used 20% of the time. A worker is taught the\r\nskill by one of the two methods but fails to learn it correctly. What is\r\nthe probability they were taught by A?\"}],\"ideal\":\"92%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly one player is\r\nsuccessful.\"}],\"ideal\":\"29%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Each of three football players will\r\nattempt to kick a field goal from the 25-yard line. Let A i denote the\r\nevent that the field goal is made by player i, i = 1, 2, 3. Assume that\r\nA1 , A2 , A3 are mutually independent and that P(A1 ) = 0.5, P(A2 ) =\r\n0.7, P(A3 ) = 0.6. Compute the probability that exactly two players make\r\na field goal (i.e., one misses).\"}],\"ideal\":\"44%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Find\r\nthe probability P(G) that a seed selected at random from the mixed seeds\r\nwill germinate.\"}],\"ideal\":\"79%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Bean seeds from supplier A have an\r\n85% germination rate and those from supplier B have a 75% germination\r\nrate. A seed-packaging company purchases 40% of its bean seeds from\r\nsupplier A and 60% from supplier B and mixes these seeds together. Given\r\nthat a seed germinates, find the probability that the seed was purchased\r\nfrom supplier A.\"}],\"ideal\":\"43%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. 
DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"A test indicates the presence of a\r\nparticular disease 90% of the time when the disease is present and the\r\npresence of the disease 2% of the time when the disease is not present.\r\nIf 0.5% of the population has the disease, calculate the conditional\r\nprobability that a person selected at random has the disease if the test\r\nindicates the presence of the disease.\"}],\"ideal\":\"18%\"}\r\n{\"input\":[{\"role\":\"system\",\"content\":\"REPLY WITH THE FINAL NUMBER IN\r\nPERCENT ROUNDED TO THE NEAREST WHOLE NUMBER. DO NOT\r\nEXPLAIN.\"},{\"role\":\"user\",\"content\":\"Two processes of a company produce\r\nrolls of materials: The rolls of Process I are 3% defective and the\r\nrolls of Process II are 1% defective. Process I produces 60% of the\r\ncompany’s output, Process II 40%. A roll is selected at random from the\r\ntotal output. Given that this roll is defective, what is the conditional\r\nprobability that it is from Process I?\"}],\"ideal\":\"82%\"}\r\n```\r\n
","shortMessageHtmlLink":"probabilities-word-problems (openai#941)"}},{"before":"88f2d30c10cf439ab1fa5570d0387af21eb38759","after":"36c2c742650a1c7cade757255c8a496af6dd18d5","ref":"refs/heads/main","pushedAt":"2023-06-02T23:00:54.952Z","pushType":"push","commitsCount":190,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[unit test] Adding unit test for metrics.get_accuracy (#224)\n\nAdding a unit test to get the ball rolling, starting with metrics since\r\nthey are fundamental to evaluating performance. :) It would be great to\r\nadd some more tests when building out more, and also enable CI (e.g.,\r\nvia GitHub actions).\r\n\r\nThis also fixes an unused param to `get_bootstrap_accuracy_std`.","shortMessageHtmlLink":"[unit test] Adding unit test for metrics.get_accuracy (openai#224)"}},{"before":"88f2d30c10cf439ab1fa5570d0387af21eb38759","after":"36c2c742650a1c7cade757255c8a496af6dd18d5","ref":"refs/heads/main","pushedAt":"2023-06-02T23:00:54.889Z","pushType":"push","commitsCount":190,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[unit test] Adding unit test for metrics.get_accuracy (#224)\n\nAdding a unit test to get the ball rolling, starting with metrics since\r\nthey are fundamental to evaluating performance. 
:) It would be great to\r\nadd some more tests when building out more, and also enable CI (e.g.,\r\nvia GitHub actions).\r\n\r\nThis also fixes an unused param to `get_bootstrap_accuracy_std`.","shortMessageHtmlLink":"[unit test] Adding unit test for metrics.get_accuracy (openai#224)"}},{"before":"eaee0ec09e6e78c1a2ccab76f5c942c7dba224a0","after":"88f2d30c10cf439ab1fa5570d0387af21eb38759","ref":"refs/heads/main","pushedAt":"2023-05-15T21:09:24.583Z","pushType":"push","commitsCount":9,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"windows event viewer categorization eval (#169)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. 
We encourage partial PRs with\r\n~5-10 examples that we can then run the evals on and share the results\r\nwith you so you know how your eval does with GPT-4 before writing all\r\n100 examples.\r\n\r\n## Eval details 📑\r\n### Eval name\r\nevent-categories\r\n\r\n### Eval description\r\n\r\nAscertain the criticality level of an event within the Windows Event\r\nViewer by analyzing the event message.\r\n\r\n### What makes this a useful eval?\r\n\r\nEstablishing the criticality level of an event prior to deployment is\r\nessential for Windows developers and system administrators in a\r\ncorporate setting.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. 
This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] Include at least 100 high quality examples (it is okay to only\r\ncontribute 5-10 meaningful examples and have us test them with GPT-4\r\nbefore adding all 100)\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\nIn conjunction with Windows event logs, enabling GPT-4 to interpret\r\nevents within log files across an enterprise environment will accelerate\r\nthe mean time to resolution for engineering teams.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\nDistinct events were extracted from an Azure Windows virtual machine,\r\nreflecting authentic real-world situations.\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.jsonl`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. 
OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [x] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"The\r\naccess history in hive\r\n\\\\??\\\\C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Microsoft\\\\Windows\\\\UsrClass.dat\r\nwas cleared updating 2757 keys and creating 459 modified pages.\" }],\r\n\"ideal\": \"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"The\r\naccess history in hive \\\\??\\\\C:\\\\Users\\\\user\\\\ntuser.dat was cleared\r\nupdating 5722 keys and creating 793 modified pages.\" }], \"ideal\":\r\n\"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"The\r\naccess history in hive\r\n\\\\??\\\\C:\\\\Windows\\\\System32\\\\SMI\\\\Store\\\\Machine\\\\SCHEMA.DAT was cleared\r\nupdating 15939 keys and creating 2709 modified pages.\" }], \"ideal\":\r\n\"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"The\r\nRdAgent service terminated unexpectedly. It has done this 1 time(s). 
The\r\nfollowing corrective action will be taken in 0 milliseconds: Restart the\r\nservice.\" }], \"ideal\": \"Error\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\":\r\n\"Installation Successful: Windows successfully installed the following\r\nupdate: Security Intelligence Update for Microsoft Defender Antivirus -\r\nKB2267602 (Version 1.383.1751.0)\" }], \"ideal\": \"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\":\r\n\"Installation Started: Windows has started installing the following\r\nupdate: Security Intelligence Update for Microsoft Defender Antivirus -\r\nKB2267602 (Version 1.383.1751.0)\" }], \"ideal\": \"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"The\r\naccess history in hive\r\n\\\\??\\\\C:\\\\Windows\\\\ServiceProfiles\\\\NetworkService\\\\AppData\\\\Local\\\\Microsoft\\\\Windows\\\\DeliveryOptimization\\\\State\\\\dosvcState.dat\r\nwas cleared updating 9 keys and creating 2 modified pages.\" }], \"ideal\":\r\n\"Information\" }\r\n{ \"input\": [{ \"role\": \"system\", \"content\": \"You are a Systems\r\nAdministrator for a large company. Answer questions accurately and label\r\neach question as Information, Warning, or Error. 
You will be graded on\r\nyour ability to answer questions\" }, { \"role\": \"user\", \"content\": \"While\r\ncanceling job 'PreSignInSettingsConfigJSON', BITS was unable to remove\r\nsome temporary files. To recover disk space, delete the files listed\r\nbelow. The job ID was {750fa50e-df1d-434f-b01b-baf677b3a558}.\r\nC:\\\\Users\\\\ADMINI~1\\\\AppData\\\\Local\\\\Temp\\\\BITDF79.tmp \" }], \"ideal\":\r\n\"Warning\" }\r\n ```\r\n
\r\n\r\n---------\r\n\r\nCo-authored-by: JavierSantanaNYC <28825631+JavierSantanaNYC@users.noreply.github.com>","shortMessageHtmlLink":"windows event viewer categorization eval (openai#169)"}},{"before":"00414586a26e13993e8828c82f067a0564bd612e","after":"eaee0ec09e6e78c1a2ccab76f5c942c7dba224a0","ref":"refs/heads/main","pushedAt":"2023-05-07T00:39:31.338Z","pushType":"push","commitsCount":1,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[evals] added pause/unpause to record.py (#898)\n\n- this allows more flexibly controlling when PromptFn etc records logs\r\nor not\r\n\r\n```\r\nwith recorder.as_default_recorder():\r\n evals.record.pause() \r\n # ... any record_event() is skipped\r\n evals.record.unpause()\r\n```","shortMessageHtmlLink":"[evals] added pause/unpause to record.py (openai#898)"}},{"before":"4a56d8f4f4e5500a5b9aa2e813a0fc669a5b21bd","after":"00414586a26e13993e8828c82f067a0564bd612e","ref":"refs/heads/main","pushedAt":"2023-05-03T21:35:46.388Z","pushType":"push","commitsCount":5,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[evals] fixes a few bugs in modelgraded (#891)\n\n- fixed an important classify() bug: mg.prompt -> prompt\r\n- fixed PromptFn's input kwargs rendering to use \"for_completion=False\"\r\n(for modelgraded eval)\r\n- added record_event() to record.py\r\n- changed classify() return info format to be more consistent with\r\nrecord_sampling() format","shortMessageHtmlLink":"[evals] fixes a few bugs in modelgraded 
(openai#891)"}},{"before":"4ee127c3595498ba860b7cd1438ce3e1b97c3afa","after":"4a56d8f4f4e5500a5b9aa2e813a0fc669a5b21bd","ref":"refs/heads/main","pushedAt":"2023-04-26T21:41:10.443Z","pushType":"push","commitsCount":2,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[evals] fixed unittest error (#828)","shortMessageHtmlLink":"[evals] fixed unittest error (openai#828)"}},{"before":"24dae81ae06ebc70808690c7a147f2710e3e54bf","after":"4ee127c3595498ba860b7cd1438ce3e1b97c3afa","ref":"refs/heads/main","pushedAt":"2023-04-25T00:50:56.981Z","pushType":"push","commitsCount":4,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[evals] simplify and extend modelgraded evals (#804)\n\n- removed expand_args option in modelgraded_spec yaml, and only accept\r\nmodelgraded_spec_args in eval yaml. only allow one args set per eval.\r\nThis is to simplify codes and improve clarity.\r\n- refactored modelgraded eval codes to be more modular\r\n- removed eval_completion_fn arg in ModelGradedClassify. make it\r\nimplicit in completion_fns.","shortMessageHtmlLink":"[evals] simplify and extend modelgraded evals (openai#804)"}},{"before":"038f5f8c82857abb96a69d406831b492fafcc677","after":"24dae81ae06ebc70808690c7a147f2710e3e54bf","ref":"refs/heads/main","pushedAt":"2023-04-23T00:19:50.752Z","pushType":"push","commitsCount":7,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"Compare countries by area (#623)\n\n# Thank you for contributing an eval! 
♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. We encourage partial PRs with\r\n~5-10 examples that we can then run the evals on and share the results\r\nwith you so you know how your eval does with GPT-4 before writing all\r\n100 examples.\r\n\r\n## Eval details 📑\r\n### Eval name\r\nCompare countries by area\r\n\r\n### Eval description\r\n\r\nTest the model's ability to determine which country has the largest area\r\n\r\n### What makes this a useful eval?\r\n\r\nThe model should be able to factually determine which country has the\r\nlargest area based on accurate facts.\r\nIn this eval we use The World\r\nFactbook (https://www.cia.gov/the-world-factbook/field/area/country-comparison)\r\nthat is prepared by the CIA for the use of U.S. government officials,\r\nand four countries with similar sizes are compared to each other.\r\nSpecifically, four countries adjacent to each other in area ranking are\r\npresented as one option, and the dataset includes data for countries\r\nranked 1\\~4, 2\\~5, ... 100\\~103. However, we excluded countries where\r\nsources and interpretations could cause fluctuations in rankings (e.g.,\r\nChina and Pakistan). 
This data set achieved less than 40% accuracy for\r\nboth gpt-4 and gpt-3.5-turbo, and the results tend to be worse for\r\ncomparisons between countries with smaller areas.\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] Include at least 100 high quality examples (it is okay to only\r\ncontribute 5-10 meaningful examples and have us test them with GPT-4\r\nbefore adding all 100)\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. 
You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. 
We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [x] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Russia, Canada, United\r\nStates, Brazil\"}], \"ideal\": \"Russia\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Canada, United States,\r\nBrazil, Australia\"}], \"ideal\": \"Canada\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. United States, Brazil,\r\nAustralia, India\"}], \"ideal\": \"United States\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Brazil, Australia, India,\r\nArgentina\"}], \"ideal\": \"Brazil\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Australia, India, Argentina,\r\nKazakhstan\"}], \"ideal\": \"Australia\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. 
Answer the name of the country with the largest area\r\namong the given countries. Do not explain. India, Argentina, Kazakhstan,\r\nAlgeria\"}], \"ideal\": \"India\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Argentina, Kazakhstan,\r\nAlgeria, Democratic Republic of the Congo\"}], \"ideal\": \"Argentina\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"You are a helpful\r\nassistant.\"}, {\"role\": \"user\", \"content\": \"You are presented with\r\nseveral countries. Answer the name of the country with the largest area\r\namong the given countries. Do not explain. Kazakhstan, Algeria,\r\nDemocratic Republic of the Congo, Saudi Arabia\"}], \"ideal\":\r\n\"Kazakhstan\"}\r\n ```\r\n
\r\n\r\n---------\r\n\r\nCo-authored-by: 乾陽平 ","shortMessageHtmlLink":"Compare countries by area (openai#623)"}},{"before":"db79fbb6783060d57db6cbe63bbbb13f8c6a2f87","after":"038f5f8c82857abb96a69d406831b492fafcc677","ref":"refs/heads/main","pushedAt":"2023-04-22T11:24:19.968Z","pushType":"push","commitsCount":27,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"added useage of Optional lib since it's already imported. (#150)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. 
We encourage partial PRs with\r\n~5-10 examples that we can then run the evals on and share the results\r\nwith you so you know how your eval does with GPT-4 before writing all\r\n100 examples.\r\n\r\n## Eval details 📑\r\n### Eval name\r\n[Insert Eval name here]\r\n\r\n### Eval description\r\n\r\n[Insert a short description of what your eval does here]\r\n\r\n### What makes this a useful eval?\r\n\r\n[Insert why this eval is worth including and any additional context]\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [ ] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [ ] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [ ] Includes good signal around what is the right behavior. 
This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [ ] Include at least 100 high-quality examples (it is okay to only\r\ncontribute 5-10 meaningful examples and have us test them with GPT-4\r\nbefore adding all 100)\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data available under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. 
Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [x] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n INSERT_EVAL_HERE\r\n ```\r\n
","shortMessageHtmlLink":"added useage of Optional lib since it's already imported. (openai#150)"}},{"before":"a6fe832e7456ce82337f3f355b859c317a94c280","after":"db79fbb6783060d57db6cbe63bbbb13f8c6a2f87","ref":"refs/heads/main","pushedAt":"2023-04-17T22:44:29.351Z","pushType":"push","commitsCount":2,"pusher":{"login":"yuiseki","name":"yuiseki","path":"/yuiseki","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/25507?s=80&v=4"},"commit":{"message":"[Evals] Add choice of completion fn to args for modelgraded evals (#709)\n\nWe want to support setting the completion function used as the evaluator\r\nin modelgraded evals.","shortMessageHtmlLink":"[Evals] Add choice of completion fn to args for modelgraded evals (op…"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAECCyvsAA","startCursor":null,"endCursor":null}},"title":"Activity · yuiseki/evals"}