Post naacl fixups #1

jayded · 2019-06-05T19:20:03Z

Include plain-text documents for user convenience
Minor code cleanup
Update READMEs
Update instructions for experiment reproduction

This patch does two things:
1. Changes the text extraction so that the abstract is included only once.
2. Updates the corresponding start/end spans to match the new extraction, so that article_text[start:end] equals the span shown in the annotations_merged file.
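The span invariant described above can be checked mechanically. The following is a minimal sketch, not the repository's code: the function name and the `start`, `end`, and `text` keys are hypothetical, since the actual column names of the annotations_merged file are not shown here.

```python
def find_span_mismatches(article_text, annotations):
    """Return the annotations whose offsets do not reproduce their text.

    Each annotation is assumed (hypothetically) to be a dict with
    integer "start"/"end" offsets and the annotated "text".
    """
    mismatches = []
    for ann in annotations:
        start, end = ann["start"], ann["end"]
        # The invariant: slicing the article by the offsets yields the span.
        if article_text[start:end] != ann["text"]:
            mismatches.append(ann)
    return mismatches


# Toy data for illustration only.
article_text = "Abstract. Aspirin reduced mortality. Body text follows."
annotations = [
    {"start": 10, "end": 36, "text": "Aspirin reduced mortality."},
    {"start": 0, "end": 8, "text": "Aspirin"},  # deliberately wrong offsets
]
bad = find_span_mismatches(article_text, annotations)
```

A check of this kind could be run over every article after re-extracting text, so that stale offsets are reported rather than silently used.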

This commit makes three major fixes:
1. Wraps the call to the external library used by the heuristic in a try/except block, because that library crashes for no apparent reason.
2. Fixes the Python script that launches the experiments so that it runs scan-net; previously it ran scan-net-ico instead.
3. Temporarily disables the GPU code for the LR. Some GPUs are able to load the data while others are not, so the GPU path stays disabled until batching is implemented.
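The first fix follows a common defensive pattern. The sketch below illustrates that pattern only; `safe_heuristic` and `flaky` are hypothetical names, and the real heuristic and the external library it calls are not shown in this pull request.

```python
import logging

logger = logging.getLogger(__name__)


def safe_heuristic(heuristic, document, default=None):
    """Run heuristic(document) and return `default` if it raises."""
    try:
        return heuristic(document)
    except Exception:
        # Broad on purpose: the failure mode of the library is unknown.
        logger.exception("Heuristic failed; falling back to default")
        return default


def flaky(doc):
    """Stand-in for a library call that sometimes crashes."""
    if not doc:
        raise RuntimeError("library crashed")
    return doc.upper()
```

Logging the exception instead of discarding it keeps the crash visible, which matters when the cause is not yet understood.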

This commit updates all of the files listed in the additional file section, which still had old spans. None of the changes affect any modeling or preprocessing steps; they are purely for data distribution purposes.

The previous commit enabled users to extract plaintext versions of the XML files. However, there was no way to extract section information from the plaintext article. With this commit, the function that extracts the plaintext also returns a dictionary mapping section titles to start/end coordinates.
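As an illustration of what such a return value might look like, here is a minimal sketch. It assumes a simplified, flat layout of `<sec>` elements that each hold a `<title>` and `<p>` children; the function name and this layout are assumptions, and the repository's actual parser and XML schema may differ. Nested sections and repeated titles are not handled.

```python
import xml.etree.ElementTree as ET


def extract_text_and_sections(xml_string):
    """Return (plaintext, {section title: (start, end)}).

    Offsets index into the returned plaintext, so that
    plaintext[start:end] is the title plus the section body.
    """
    root = ET.fromstring(xml_string)
    pieces = []
    sections = {}
    offset = 0
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").strip()
        # itertext() flattens inline markup such as <b> inside a paragraph.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        body = "\n".join([title] + paragraphs)
        sections[title] = (offset, offset + len(body))
        pieces.append(body)
        offset += len(body) + 1  # +1 for the newline that joins sections
    return "\n".join(pieces), sections


# Toy document for illustration only.
xml_doc = (
    "<article>"
    "<sec><title>Methods</title><p>We enrolled 10 patients.</p></sec>"
    "<sec><title>Results</title><p>Nine <b>improved</b>.</p></sec>"
    "</article>"
)
text, sections = extract_text_and_sections(xml_doc)
```

Computing the offsets while the plaintext is being assembled, rather than searching for titles afterwards, avoids ambiguity when a title string also occurs in body text.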

In this commit, we do three things:
1. Update the README to reflect the rest of the changes in the commit.
2. Add plain-text versions of all the XML files. These have similar names, but with a different extension.
3. Update annotations_merged.csv so that the offsets of evidence spans point into these plaintext files.
We realize that this makes the additional_file/*.csv files out of date. We will update those files as soon as possible.

jayded and others added 16 commits June 5, 2019 13:21

Add missing classes to requirements.txt

37c1673

Add note about requiring a MySQL installation

162889d

Add missing break statement for attention pretraining

9286961

Add note about using pytorch nightly

039c5b1

Make the heuristics/baselines executable

9f9dddb

Fixing pathing issues for the run_baseline .sh file. Similar changes …

11aa855

…to the path were made in the python version of the file as well.

remove unnecessary file for statistic calculation

7cf8b05

This commit removes a file that is no longer useful; its sole purpose was to compile statistics and do simple calculations on the data.

add option for reading plaintext version of xmls

6ff88e2

This commit adds an option that lets users set a variable to extract plaintext versions of the XML files. The spans are not yet correct for the plaintext; this will be fixed in a future commit.

Allow conversion of annotations to json format

900b0da

Separate and update READMEs

1f977ec

Use PubMed nxml by default

ce90d65

jayded self-assigned this Jun 5, 2019

ensure extraction of offsets despite malformed xml

f8da2aa

With uneven XML files, the code previously broke when attempting to parse section offsets. With this change, the offsets are extracted even when the XML is malformed.

jayded merged commit ea1769d into master Jun 6, 2019

jayded deleted the post-naacl-fixups branch June 6, 2019 11:20
