{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":618988081,"defaultBranch":"main","name":"evals","ownerLogin":"rahul-nath","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-03-25T23:27:19.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/5932305?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1679786846.643695","currentOid":""},"activityList":{"items":[{"before":"b928cd443f96e818781d9dce67b2ecfe83d13bf1","after":"f7ebbe8ae9cd1a94e061bdbd116127f130f4ed4c","ref":"refs/heads/main","pushedAt":"2023-04-10T17:49:56.214Z","pushType":"push","commitsCount":1,"pusher":{"login":"rahul-nath","name":"Rahul Nath","path":"/rahul-nath","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5932305?s=80&v=4"},"commit":{"message":"[evals] added format() to ModelGradedSpec (#597)\n\n- 'in_message' and 'out_message' formatting for modelgraded evals\r\n- factored out append_answer_prompt function","shortMessageHtmlLink":"[evals] added format() to ModelGradedSpec (openai#597)"}},{"before":"2486f7e3a3247ccc388c37f7459736af13a2dca1","after":"b928cd443f96e818781d9dce67b2ecfe83d13bf1","ref":"refs/heads/main","pushedAt":"2023-04-05T16:56:32.790Z","pushType":"push","commitsCount":2,"pusher":{"login":"rahul-nath","name":"Rahul Nath","path":"/rahul-nath","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5932305?s=80&v=4"},"commit":{"message":"Test generation hallucination in Russian (50% accuracy) (#157)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, __failure to follow\r\nthe guidelines below will result in the PR being closed automatically__.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access granted. 🚨\r\n\r\n__PLEASE READ THIS__:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. 
We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject since GPT-4 is already capable of completing the\r\ntask.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. We encourage partial PR's with\r\n~5-10 example that we can then run the evals on and share the results\r\nwith you so you know how your eval does with GPT-4 before writing all\r\n100 examples.\r\n\r\n## Eval details 📑\r\n### Eval name\r\nRucola\r\n\r\n### Eval description\r\n\r\nSlice of Rucola dataset.\r\nClassify sentences in two categories: acceptable vs generation\r\nhallucination\r\n\r\n### What makes this a useful eval?\r\n\r\n- Most eval tasks are in English, useful to have some multilingual tasks\r\n- Useful to explicitly check for hallucinations\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [x] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [x] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [x] Includes good signal around what is the right behavior. 
This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [x] Include at least 100 high quality examples (it is okay to only\r\ncontribute 5-10 meaningful examples and have us test them with GPT-4\r\nbefore adding all 100)\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n- [x] Check that your data is in `evals/registry/data/{name}`\r\n- [x] Check that your yaml is registered at\r\n`evals/registry/evals/{name}.jsonl`\r\n- [x] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (https://platform.openai.com/docs/usage-policies).\r\n\r\n- [x] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. 
Access will be given to the email\r\naddress associated with the merged pull request.\r\n\r\n- [x] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgement\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and high volume of submissions, we will not\r\nbe able to accept all submissions and thus not grant everyone who opens\r\na PR GPT-4 access. We know this is disappointing, but we hope to set the\r\nright expectation before you open this PR.\r\n\r\n- [x] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [x] I have filled out all required fields in the evals PR form\r\n- [x] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `black`, `isort`,\r\nand `autoflake` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data \r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n
\r\n View evals in JSON\r\n\r\n ### Eval\r\n ```jsonl\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"К\r\nсожалению, я кричал на вас.\"}], \"ideal\": \"Acceptable\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Только в\r\n1000 метрах от порта, в 2 км от центра города и в 20 минутах от\r\nаэропорта Манисеса и Конвенционного центра.\"}], \"ideal\":\r\n\"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Не\r\nспособный пережить смерть матери Норман вырвался из тела женщины,\r\nмумифицируя труп, нередко обращаясь на останки, как на живого\r\nчеловека.\"}], \"ideal\": \"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Почему вы\r\nпродали свой дом?\"}], \"ideal\": \"Acceptable\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Зиму\r\nпришлось выдавать себя за ребенка с редким кожным заболеванием, в том\r\nчисле у девочки.\"}], \"ideal\": \"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. 
Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Победитель\r\nчемпионата Румынии с ФК \\\"Униря\\\" стал победителем чемпионата\r\n2008-09.\"}], \"ideal\": \"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Программное\r\nобеспечение включает в себя зависимость в предварительно интегрированном\r\nсамоконтейнерном устройстве, так что пользователям больше не придется\r\nбеспокоиться о решении зависимости программного обеспечения.\"}],\r\n\"ideal\": \"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Защита\r\nконфиденциальности От произвольного и незаконного вмешательства в\r\nконфиденциальность защищена Четвёртой поправкой в Конституцию и\r\nфедеральными законами.\"}], \"ideal\": \"Hallucination\"}\r\n{\"input\": [{\"role\": \"system\", \"content\": \"Is it an acceptable sentence\r\nor result of language model hallucination?. Keep it short, respond\r\nAcceptable or Hallucination.\"}, {\"role\": \"user\", \"content\": \"Сообщение с\r\nним просто пешком.\"}], \"ideal\": \"Hallucination\"}\r\n ```\r\n
","shortMessageHtmlLink":"Test generation hallucination in Russian (50% accuracy) (openai#157)"}},{"before":"882a2af06748c7ce037648cf1a3538e98bdc5c93","after":"2486f7e3a3247ccc388c37f7459736af13a2dca1","ref":"refs/heads/main","pushedAt":"2023-03-31T12:47:23.084Z","pushType":"push","commitsCount":34,"pusher":{"login":"rahul-nath","name":"Rahul Nath","path":"/rahul-nath","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/5932305?s=80&v=4"},"commit":{"message":"[evals] added eval_model flag to modelgraded eval (#519)","shortMessageHtmlLink":"[evals] added eval_model flag to modelgraded eval (openai#519)"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAADFa1g-gA","startCursor":null,"endCursor":null}},"title":"Activity · rahul-nath/evals"}