Commit 2924dd4: NAACL 2019 Repo
jayded committed Apr 4, 2019
Showing 3,606 changed files with 2,661,699 additions and 0 deletions.
14 changes: 14 additions & 0 deletions .gitignore
*.ini
*.pth
*.swp
*.pyc
__pycache__/
*.iml
.idea
/venv/
# excluded because of size
embeddings/
*.log
logs
..gitignore.un~
.gitignore~
21 changes: 21 additions & 0 deletions LICENSE
MIT License

Copyright (c) 2019 Eric Lehman, Jay DeYoung, Regina Barzilay, Byron C. Wallace

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
74 changes: 74 additions & 0 deletions README.md
# evidence-inference

Data and code from our NAACL 2019 paper, "Inferring Which Medical Treatments Work from Reports of Clinical Trials". This work concerns inferring the results reported in clinical trials from text.

The dataset consists of biomedical articles describing randomized control trials (RCTs) that compare multiple treatments. Each of these articles has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effect of aspirin, as compared to placebo, on the duration of headaches. For the purposes of this task, we assume that a given article reports that the intervention of interest either significantly increased, significantly decreased, or had no significant effect on the outcome, relative to the comparator.

The dataset could be used to automatically extract the results of a given RCT, which would enable readers to discover the effectiveness of different treatments without needing to read the full paper.

### Citation

Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. Inferring Which Medical Treatments Work from Reports of Clinical Trials. In NAACL (2019).

When citing this project, please use the following bibtex citation:

    @inproceedings{TBD,
        title = {{Inferring Which Medical Treatments Work from Reports of Clinical Trials}},
        author = {Lehman, Eric and DeYoung, Jay and Barzilay, Regina and Wallace, Byron C.},
        booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL)},
        year = {2019}
    }


## Randomized Control Trials (RCTs), Prompts, and Answers
There are three main types of data for this project, each of which will be described in depth in the following sections.

### RCTs
In this project, we use texts from RCTs, or randomized control trials. These are articles that directly compare two or more treatments. For example, a given article might assess the effectiveness of ibuprofen in relieving headaches compared to another treatment, such as tylenol. These papers often compare multiple treatments (e.g., ibuprofen and tylenol) and report their effects with respect to various outcomes (e.g., headaches, pain).

### Prompts
A prompt is of the form: "With respect to *outcome*, characterize the reported difference between patients receiving *intervention* and those receiving *comparator*." The prompt has three fill-in-the-blank slots, each of which maps directly onto the RCT. For instance, using the example described in the RCTs section, we get:
* **Outcome** = 'number of headaches'
* **Intervention** = 'ibuprofen'
* **Comparator** = 'tylenol'
* "With respect to *number of headaches*, characterize the reported difference between patients receiving *ibuprofen* and those receiving *tylenol*"

A given article might have 10+ of these comparisons within it. For example, if the RCT article also compared *ibuprofen* and *tylenol* with respect to *side effects*, this could also be used as a prompt.
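
As a concrete illustration, a prompt is simply the template with its three slots filled in. The helper below is a hypothetical sketch (not part of the released code):

```python
# Hypothetical sketch: a prompt is the fixed template with the outcome,
# intervention, and comparator filled in.
PROMPT_TEMPLATE = (
    "With respect to {outcome}, characterize the reported difference "
    "between patients receiving {intervention} and those receiving {comparator}."
)

def build_prompt(outcome: str, intervention: str, comparator: str) -> str:
    return PROMPT_TEMPLATE.format(
        outcome=outcome, intervention=intervention, comparator=comparator
    )

print(build_prompt("number of headaches", "ibuprofen", "tylenol"))
```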

### Answers
Given a prompt, we must characterize the relationship between the two treatments with respect to the outcome. Let us use the prompt described previously:
* "With respect to *number of headaches*, characterize the reported difference between patients receiving *ibuprofen* and those receiving *tylenol*"

There are three answers we could give: 'significantly increased', 'significantly decreased', or 'no significant difference'. Take, for example, three sentences that *could* appear in an article, each of which would lead to a different answer.
1. **Significantly increased**: "Ibuprofen relieved 60 headaches, while tylenol relieved 120; therefore ibuprofen is worse than tylenol for reducing the number of headaches (p < 0.05)."
* This counts as 'significantly increased', since ibuprofen effectively *increases* the chance of having a headache if you use it instead of tylenol: more people benefited from tylenol than from ibuprofen.
2. **Significantly decreased**: "Ibuprofen reduced twice as many headaches as tylenol, and therefore relieved a greater number of headaches (p < 0.05)."
* This counts as 'significantly decreased', since ibuprofen **decreased** the number of headaches relative to tylenol.
3. **No significant difference**: "Ibuprofen relieved more headaches than tylenol, but the difference was not statistically significant."
* We only care about statistical significance here. In this case there is no statistically significant difference between the two treatments, warranting an answer of 'no significant difference'.

As an answer, we would submit two things (a minimal sketch of such a record follows this list):
1. The answer (significantly increased / significantly decreased / no significant difference).
2. A quote from the text that supports our answer (one of the sentences described above).
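
The sketch below is a hypothetical representation of a single submitted answer; the -1/0/1 encoding matches the 'Label Code' column described in annotations/README.md, and the dictionary layout itself is illustrative only:

```python
# Hypothetical sketch of one submitted answer: a label plus a verbatim
# supporting quote. The integer encoding mirrors the 'Label Code' column.
LABEL_CODES = {
    "significantly decreased": -1,
    "no significant difference": 0,
    "significantly increased": 1,
}

answer = {
    "label": "no significant difference",
    "evidence": (
        "Ibuprofen relieved more headaches than tylenol, "
        "but the difference was not statistically significant."
    ),
}
answer["label_code"] = LABEL_CODES[answer["label"]]
print(answer)
```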


## Process Description
Gathering the data involves three main stages: prompt generation, annotation, and verification. We hire M.D.s from Upwork, each of whom works on only one of these stages. We use Flask and AWS to host the servers on which the M.D.s work.

### Prompt Generation
A prompt generator is hired to look at a set of articles taken from the PubMed Central Open Access subset. Each of these articles is an RCT comparing multiple treatments with respect to various outcomes. Prompt generators look for outcome-intervention-comparator triplets that fill in the following sentence: "With respect to outcome, characterize the reported difference between patients receiving intervention and those receiving comparator." To find these prompts, prompt generators generally look for sentences describing the actual result of the comparison. We therefore ask prompt generators not only to select an answer to the prompt (how the intervention relates to the outcome relative to the comparator), but also to record the reasoning behind their answer. The answer is one of 'significantly increased', 'significantly decreased', or 'no significant difference', while the reasoning is a direct quote from the text. For each article, the prompt generator attempts to find at most five unique prompts.

The prompt generator instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/prompt-gen-instruction/templates/instructions.html

### Annotator
An annotator is given an article and a prompt. The answer is one of 'significantly increased', 'significantly decreased', or 'no significant difference', while the reasoning is a direct quote from the text. The annotator only has access to the prompt and the article, and therefore must search the article for the evidence and the answer. If the prompt is incoherent or simply invalid, it can be marked as such. Annotators first attempt to find the answer in the abstract; if it is not available there, they may look at the remaining sections of the article.

The annotator instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/annotation-instruction-written/templates/instructions.html

### Verifier
The verifier is given the prompt, the article, the reasoning and answer of the annotator, and the reasoning and answer of the prompt generator. However, both reasoning/answer pairs are presented as if they came from annotators. This ensures that the verifier does not favor the prompt generator's answer on the assumption that it is more likely to be accurate. The verifier determines whether each answer and its reasoning are valid. Similarly, the verifier determines whether the prompt itself is valid.

The verifier instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/Verification-Instructions/instructions.html

A link to the description of the data can be found here: https://github.com/jayded/evidence-inference/tree/master/annotations

10 changes: 10 additions & 0 deletions SETUP.md
Download embeddings to "embeddings" from
http://evexdb.org/pmresources/vec-space-models/PubMed-w2v.bin

You should create a virtualenv satisfying requirements.txt. We recommend using
conda. You will need to install a very recent PyTorch (1.0 or later).

Experiments were run on a mix of 1080Tis, K40ms, K80ms, and P100s.
You should be able to (approximately) reproduce the main experiments via the
programs in `scripts/paper/` (you may wish to modify the code to run multiple
trials). The main results should finish in fewer than 10 hours.
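
A quick way to sanity-check the downloaded embeddings is sketched below. It assumes gensim is installed; gensim is not a stated project dependency, just a convenient reader for the word2vec binary format.

```python
# Sketch: sanity-check the downloaded PubMed embeddings (gensim assumed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "embeddings/PubMed-w2v.bin", binary=True
)
print(vectors.vector_size)                      # dimensionality of the vectors
print(vectors.most_similar("aspirin", topn=5))  # nearest neighbours, if the token exists
```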
28 changes: 28 additions & 0 deletions annotations/README.md
## File Types
There are 6 different files contained in this zip file:
- The file annotations_merged.csv is a combination of the data found in points 2. and 3.
- The file annotations_pilot_run.csv contains data annotated and verified by doctors, but with prompts generated by @elehman16 (an undergraduate at Northeastern University).
- The file annotations_doctor_generated.csv contains data annotated and verified solely by doctors. This file only has information concerning the valid answers and reasonings for a specific promptID.
- The file prompts_merged.csv is a combination of the data found in points 5. and 6.
- The file prompts_pilot_run.csv contains prompts generated by @elehman16. This file only contains the promptID and prompt information.
- The file prompts_doctor_generated.csv contains prompts generated solely by doctors.

## File Description (Annotations):
- Annotations: The annotation files consist of the following headings: UserID, PromptID, PMCID, Valid Label, Valid Reasoning, Label, Annotations, Label Code, In Abstract, Start Evidence, End Evidence.
- UserID: An ID identifying which doctor produced the 'Label' and 'Annotations' columns for this row.
- PromptID: Identifies which prompt the doctor is answering. The PromptID also appears in the prompt CSV files, where a lookup can be used to find the corresponding outcome/intervention/comparator.
- PMCID: This is the ID that we use to identify the articles. To find the corresponding article, construct "PMC" + PMCID + '.nxml' and search within the xml_files folder.
- Valid Label: This value is either 0 or 1 and indicates whether the verifier certifies the multiple-choice response (label) of the annotator. '0' indicates rejection, while '1' indicates acceptance.
- Valid Reasoning: This value is either 0 or 1 and indicates whether the verifier certifies the annotator's supporting reasoning. '0' indicates rejection, while '1' indicates acceptance.
- Label: This is a string value: 'significantly increased', 'significantly decreased', 'no significant difference', or 'invalid prompt'. It corresponds to the annotator's response to the given PromptID.
- Annotations: This value is a sequence of text segments, delimited by ",". It consists of the portions of the text that the annotator cited as justification for the label they selected.
- Label Code: This is simply an integer version of the label. '0' corresponds to 'no significant difference', '1' corresponds to 'significantly increased', and '-1' corresponds to 'significantly decreased'.
- In Abstract: This column contains '0' or '1'. It reads '1' if the annotator found the answer in the abstract, and '0' if he or she used more than the abstract in order to answer the question.
- Start Evidence: The character index in the text at which the "reasoning" (evidence) for this row starts (inclusive).
- End Evidence: The character index in the text at which the "reasoning" (evidence) for this row ends (also inclusive).
- Note: Some prompts will have 2 answers from a single doctor, because the doctor might cite 2 different pieces of evidence. To properly identify these cases, look for rows with the same 'PromptID' and the same 'UserID'.

## File Description (Prompts):
- PromptID: As previously stated, this is an ID given to this specific row, which comprises the PMCID, outcome, intervention, and comparator.
- PMCID: This is the ID that we use to identify the articles. To find the corresponding article, construct "PMC" + PMCID + '.nxml' and search within the xml_files folder.
- Outcome/Intervention/Comparator: These columns hold the fill-in-the-blank inputs for the prompt: "With respect to outcome, characterize the reported difference between patients receiving intervention and those receiving comparator." (A minimal loading sketch follows this list.)
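
As a rough illustration of how these files fit together, the sketch below joins the merged annotation and prompt CSVs and reconstructs each article's filename. It assumes pandas is available, that the script is run from this directory, and that the column names match the casing used in this README.

```python
# Sketch: join answers to their prompts using the columns described above
# (pandas assumed; exact column casing assumed from this README).
import pandas as pd

annotations = pd.read_csv("annotations_merged.csv")
prompts = pd.read_csv("prompts_merged.csv")

# A prompt can have several evidence rows from the same doctor
# (same PromptID and UserID), so this join may yield multiple rows per prompt.
merged = annotations.merge(prompts, on=["PromptID", "PMCID"], how="left")

# Reconstruct the article filename from the PMCID.
merged["article_file"] = "PMC" + merged["PMCID"].astype(str) + ".nxml"

print(merged[["PromptID", "Outcome", "Intervention", "Comparator", "Label"]].head())
```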