# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PRs with ~5-10 examples that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.

## Eval details 📑

### Eval name

Illinois Law Cases

### Eval description

This eval tests the model's ability to correctly identify the case conclusion for Trespassing, Battery, Assault, and Self-Defense claims. The dataset consists of 88 historical Illinois cases representing 112 legal claims. Some cases have multiple claims, each coded as a separate test case.

### What makes this a useful eval?

This assesses the model's ability to correctly categorize these historical cases. Correctly identifying these outcomes shows the model's capacity for a strong understanding of law. The GPT-3.5-turbo model currently achieves an accuracy of 0.45.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval.
In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

This work includes data from the Illinois Intentional Tort Qualitative Dataset, which was compiled by the Qualitative Reasoning Group at Northwestern University. The dataset is freely available under the Creative Commons Attribution 4.0 license from https://www.qrg.northwestern.edu/Resources/caselawcorpus.html.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
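The two registry locations in the checklist above can be checked before opening a PR. The following is a minimal sketch, not part of the Evals tooling: the helper names `registry_paths` and `missing_registry_paths` are hypothetical, and the eval name passed in is whatever `{name}` the contributor chose.

```python
from pathlib import Path


def registry_paths(repo_root, name):
    """Build the two registry locations named in the checklist for eval `name`."""
    root = Path(repo_root)
    return {
        # Directory expected to hold the eval's data files.
        "data": root / "evals" / "registry" / "data" / name,
        # YAML file expected to register the eval.
        "yaml": root / "evals" / "registry" / "evals" / f"{name}.yaml",
    }


def missing_registry_paths(repo_root, name):
    """Return the checklist paths that do not exist under `repo_root`."""
    paths = registry_paths(repo_root, name)
    missing = []
    if not paths["data"].is_dir():
        missing.append(paths["data"])
    if not paths["yaml"].is_file():
        missing.append(paths["yaml"])
    return missing


if __name__ == "__main__":
    # "my-eval" is a placeholder name, used only for illustration.
    for path in missing_registry_paths(".", "my-eval"):
        print(f"missing: {path}")
```

Running the script from the repository root prints one line per missing location and nothing when both are present.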
## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input": [{"role": "system", "content": "You will be presented with a court claim and the criminal charge. Your role is to assess the case and return either \"Positive\" or \"Negative\" corresponding to the case conclusion for the criminal charge. Do not explain."}, {"role": "system", "content": "In 2007, the Cocrofts obtained a loan for $386,750 from Countrywide Bank, FSB secured by a mortgage on the home they already owned in Country Club Hills, Illinois. The loan closed on April 17, 2007. At the time of the closing, Countrywide improperly failed to inform [the Cocrofts] of the real source of funding for their loan. Plaintiffs also contend that Countrywide violated TILA by failing to inform them that they had three days to rescind the loan and by failing to disclose the total sale price and itemize the amount financed, as well as by failing to make unspecified prepayment disclosures. The Cocrofts claim that Countrywide understated the total finance charges for the loan by more than $5,000. Plaintiffs claim that they learned of Countrywide's misrepresentations in June 2009. They decided to exercise their right under the provisions of TILA to rescind the loan. On July 1, 2009, they mailed notice to that effect to BA, the successor to Countrywide, and to MERS. The Cocrofts do not say what if anything happened as a result of those notices, but on September 29, lawyers working for HSBC contacted them and stated that HSBC was ready to begin foreclosure. HSBC claimed that it was the trustee of a trust that included their loan. The Cocrofts, however, contend that the transfer of their loan into the trust was defective. They sent HSBC's lawyers two cease and desist letters, notifying HSBC that they had rescinded the loan.
They allege that after receiving one of the cease and desist letters, HSBC informed them that it had no interest in the loan and that they needed to contact the loan's servicer, Roundpoint Mortgage. Plaintiffs also sent a copy of the rescission documents to BAC, which they identify as the actual servicer of the loan. HSBC brought a foreclosure action in Illinois state court on January 19, 2010. [From below:] defendants unlawfully entered [the plaintiffs'] home by conducting a self-help eviction of the plaintiffs and changing the locks on their home in August 2008. At the time, [plaintiffs] had made arrangements to rent the property in the short term and then to sell it, and defendants' actions disrupted the sale."}, {"role": "user", "content": "Trespass"}], "ideal": "Positive"} {"input": [{"role": "system", "content": "You will be presented with a court claim and the criminal charge. Your role is to assess the case and return either \"Positive\" or \"Negative\" corresponding to the case conclusion for the criminal charge. Do not explain."}, {"role": "system", "content": "Defendants, on January 15, 1915, with force and arms broke and entered the close of the plaintiff, to-wit, the southeast quarter of the northeast quarter of section 16, township 4, south, range 3, west, in Pike county, Illinois, and cut down and destroyed 500 hedge trees and a certain fence belonging to plaintiff situated on said land. Defendants cut down the south half of a hedge fence which for many years prior to February, 1915, stood upon the line between the southeast quarter of the northeast quarter of said section 16 (hereinafter referred to as the east forty) and the southwest quarter of the northeast quarter of said section 16 (hereinafter referred to as the west forty). On and prior to February 13, 1866, both of these forty-acre tracts belonged to a man named Teadrow. 
On February 13, 1866, Teadrow conveyed the west forty to Benjamin Newman, and on February 15, 1866, conveyed the east forty to Oliver P. Rice. When these conveyances were made there was a hedge fence on the north half of the line and a wooden fence on the south half of the line between the two tracts. In 1868 Benjamin Newman, the owner of the west forty, removed the wooden fence and set out a hedge fence on the south half of the line between the two tracts. Thereafter, during the separate ownership of the tracts, Banjamin Newman trimmed and otherwise cared for the hedge fence on the south half of the line and Rice trimmed and looked after the hedge fence on the north half of the line. In December, 1888, Rice conveyed the southeast quarter of the northeast quarter of said section 16 to Benjamin Newman, the latter thereby becoming the owner of both tracts. Thereafter, during the ownership of both tracts by Benjamin Newman, he required the tenants of the west forty to take care of the south half and the tenants of the east forty to take care of the north half of the hedge fence on the line between the two tracts. On June 22, 1904, Benjamin Newman executed and delivered to his daughter, F. Eva Newman, the plaintiff, who has since married J. O. Conklin, a warranty deed, conveying to her two hundred acres of land, including the southeast quarter of the northeast quarter of said section 16, referred to herein as the east forty, but not including the tract referred to herein as the west forty. This deed contained the provision that 'this deed is not to take effect until after the death of the grantor, Benjamin Newman.' The wife of Benjamin Newman, who is still living, did not join in the conveyance. 
At the same time plaintiff executed and delivered to her father the following written instrument signed by her: 'Whereas Benjamin Newman has this day conveyed to me certain tracts and parcels of land in Pike county, Illinois, to take effect after his death, I hereby agree to pay the taxes on said land in lieu of all rents that I would otherwise have to pay, (this does not affect any rent that is now due,) and in consideration of my paying said taxes I am to receive all the rents, profits, etc., that may accrue on said land.' When the conveyance was made to plaintiff the tract in controversy known as the east forty was in the possession of Joseph Gifford as tenant and the west forth was in the possession of John B. Newman, a grandson of Benjamin Newman, as tenant of Benjamin Newman. When [the plaitiff's] father delivered the deed of June 22, 1904, he took her upon the east forty and told her and Gifford, the tenant, that he was placing her in full possession of the tract; that she was to receive all the rents and profits from the land and was to keep up the repairs and pay the taxes; that she was to have the south half of the fence on the line between the east forth and the west forty and was to keep up that part of the fence, and that George Newman, his son, to whom he then intended to deed the west forty, should keep up the north half of the fence, and that thereafter plaintiff and her tenants kept the south half of the fence in repair while the tenants in possession of the west forty made repairs to the north half of the fence. During the month of January, 1915, a controversy arose between plaintiff and Defendants regarding the ownership of the hedge fence, each party claiming the south half of the fence. 
During the month of February, 1915, Defendants, over the protest of plaintiff, cut down the south half of the hedge fence on the line between the east forty and the west forty and erected a wire fence in the place thereof."}, {"role": "user", "content": "Trespass"}], "ideal": "Positive"} {"input": [{"role": "system", "content": "You will be presented with a court claim and the criminal charge. Your role is to assess the case and return either \"Positive\" or \"Negative\" corresponding to the case conclusion for the criminal charge. Do not explain."}, {"role": "system", "content": "The city of O'Fallon installed a sewer system in about 1926. In 1961, due to backups into homes serviced by the system, the city built an overflow outlet on East Madison Street. The overflow was to relieve pressure in the sewer system during periods of heavy rainfall; it proved successful in preventing backups into nearby homes. However, when water escaped through the overflow, raw sewage was also discharged into an open ditch that flowed into a neighboring pond. In December 1974, the city of O'Fallon closed the overflow. On January 10, 1975, and on subsequent dates, sewage backed up into the plaintiff's house following heavy rainfall. The January 10 backup was the worst, causing water to accumulate in the plaintiff's finished basement to a height of 25 1/2 inches. The lower level of the plaintiff's modern, ranch-style home contained a family room with fireplace and built-in bookshelves, bedroom, closet, bath and utility room with washer, dryer, furnace and water heater. The walls were watermarked, and the tile floor was damaged, as were the furnishings, appliances and many irreplaceable family items such as family photographs and slides. The lower level of the house was virtually unusable for a year, and the plaintiff had to expend considerable time, effort and money in repairing the floor, repainting the walls, and replacing and removing damaged personalty. 
The city knew the blocking of the overflow would cause some backup, although they were not aware that it would be as severe as it was. From January until April or May 1975, when the city reopened the overflow, the city attempted to alleviate the pressure in the sewer system by pumping the water from the sewers into open ditches during periods of heavy rain. The defendant used either large or small pumps, depending upon the amount of water in the system. The backups into Mrs. Dial's home ended after the overflow was reopened in April or May 1975."}, {"role": "user", "content": "Trespass"}], "ideal": "Positive"} {"input": [{"role": "system", "content": "You will be presented with a court claim and the criminal charge. Your role is to assess the case and return either \"Positive\" or \"Negative\" corresponding to the case conclusion for the criminal charge. Do not explain."}, {"role": "system", "content": "the plaintiff, his wife, Beatrice, and daughters, Aurora and Angela, lived at 313 East Marquette Street in Ottawa. The lot upon which their home was located was eighty-eight feet wide and one hundred thirty feet deep. The home of the defendant was immediately east and adjoining the Galvan lot, and their residences were about forty feet apart and separated by a hedge fence. According to the testimony of the plaintiff, he, the plaintiff, arrived at his home about five o'clock on Friday afternoon, June 19, 1953, from his work as a bricklayer's helper. After he had had his evening meal, he left home about seven o'clock, paid a coal bill to a Mr. Burke, and then he and Burke went to a tavern where they remained an hour and a half, during which time the plaintiff drank two bottles of beer. Mr. Burke went home, and the plaintiff proceeded to another tavern and remained there until after midnight. At the second tavern he had four or five bottles of beer. He than proceeded to another tavern, where he remained for fifteen minutes, and had a glass of beer there. 
He then proceeded homeward, entering his lot at the rear, and singing as he went along. Sitting upon the steps of the back porch of his home were his wife and daughter, Angela, and when the plaintiff arrived there he stopped singing. He refused his wife's suggestion to go into the house and go to bed but sat down on the porch step, took his shoes, socks, and hat off, cursed the mosquitoes, laid down on the ground under a pear tree three or four feet from the southeast corner of the steps of the rear porch and immediately went to sleep. The plaintiff's wife and daughter, Angela, remained on the porch steps after the plaintiff had laid down under the pear tree. About fifteen minutes after he had gone to sleep, the daughter observed the defendant coming very slowly through the hedge from his property onto the Galvan premises. He had a knife in his hand and, without a word, proceeded to cut the prostrate body of the plaintiff. The other daughter of the plaintiff, Aurora, was in the house asleep but was awakened by her sister and ran to the yard and saw the defendant 'slashing' at her father with a knife. She called to the defendant to stop and ran for help. Police officers arrived shortly thereafter, and they testified that they found the plaintiff lying on the ground about six feet from the porch of his home all covered with blood and with his hat and a pair of shoes and socks lying next to his body. The blood was all in one place and in the form of a pool near the pear tree. An ambulance was called, and the plaintiff was removed to the Ryburn-King Hospital."}, {"role": "user", "content": "Battery"}], "ideal": "Positive"} {"input": [{"role": "system", "content": "You will be presented with a court claim and the criminal charge. Your role is to assess the case and return either \"Positive\" or \"Negative\" corresponding to the case conclusion for the criminal charge. 
Do not explain."}, {"role": "system", "content": "Since September 2000, plaintiff regularly visited a patient at the Illinois Department of Human Services Treatment and Detention Facility ('Facility') in Jolict, Illinois. From May 4, 2005 to May 11, 2005, plaintiff was subjected to patdown searches by defendant Martin, a Security Therapist Aid II at the Facility, in which defendant Martin placed her fingers in plaintiff's vaginal area and required plaintiff to remove her shoes prior to being allowed to visit the patient. Such searches occurred at least four times during the aforementioned time period. After plaintiff's complaints to Bernard Akpan, an Exec. 11 at the Facility, and defendant Strock, the Assistant Security Director of the Facility, and facility patient Brad Lieberman's complaints to defendant Budz, Director of the Facility, defendant Sanders, Security Director of the Facility, and defendant Strock, plaintiff was no longer required to submit to patdown searches by defendant Martin. Rather, plaintiff's visits were preceded by a Rapiscan scan of her person. According to plaintiff's complaint, a Rapiscan machine is an electronic screening device used to scan a person's entire body. 'These machines produce a naked image of the person and can also produce evidence of highly sensitive details such as the following: mastectomies, colostomy appliances, penile implants, catheter tubes, and the size of a person's breasts and genitals' From May 17, 2005 to June 19, 2005, plaintiff was subjected to 20 to 25 Rapiscan scans. Plaintiff's complaint further alleges that other Facility staff members were allowed to view her scanned image, her scanned image was not erased from the machine, and staff members viewed her image hours after she was scanned, all without her consent. 
Additionally, while later told that she should have had the choice between the Rapiscan scan or a physical patdown prior to visiting a patient, plaintiff was never informed of such a choice during the two months she underwent the Rapiscan scans."}, {"role": "user", "content": "Assault"}], "ideal": "Positive"}
```

</details>
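Each sample above is one JSON object with an `input` list of chat messages and an `ideal` label of `Positive` or `Negative`. A minimal sketch of a pre-submission shape check follows, assuming one sample per line in the real data file. The function names `validate_sample` and `validate_jsonl` are hypothetical and not part of the Evals library, and the set of allowed roles is an assumption (the samples shown use only `system` and `user`).

```python
import json

# Labels named in the system prompt of the samples above.
ALLOWED_LABELS = {"Positive", "Negative"}
# Assumed set of chat roles; the samples above use only "system" and "user".
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample):
    """Return a list of problems found in one parsed sample (empty if none)."""
    if not isinstance(sample, dict):
        return ["sample is not a JSON object"]
    problems = []
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of messages")
    else:
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"message {index} is not an object")
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append(f"message {index} has an unexpected role")
            if not isinstance(message.get("content"), str):
                problems.append(f"message {index} has no string 'content'")
    if sample.get("ideal") not in ALLOWED_LABELS:
        problems.append("'ideal' must be 'Positive' or 'Negative'")
    return problems


def validate_jsonl(text):
    """Check JSONL text; return {line_number: problems} for bad lines only."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as error:
            report[number] = [f"invalid JSON: {error.msg}"]
            continue
        problems = validate_sample(sample)
        if problems:
            report[number] = problems
    return report
```

An empty dictionary from `validate_jsonl` means every non-blank line parsed and had the expected shape. The check says nothing about whether the labels themselves are correct.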