Several fixes #14

rafspiny · 2018-01-18T10:03:26Z

Hi!
Thanks for the great work you keep doing. I am using this tool to maintain an up-to-date DB on which to run a set of analyses.

While using it, I discovered a couple of unexpected behaviours, which I reported. Most of them are issues that I fixed in my fork.
Two of them are more considerations than real issues. Feel free to discuss or ignore them.

I am making this pull request in case you agree with the fixes I applied in my fork. Otherwise, I will keep using my branch until the tool gets another fix for these issues.

Again, many thanks for your work. It is much appreciated.

- The abstract contained some XML tags. I assumed that was not intended, so I got rid of them and joined just the text of the various abstract parts for a specific article.
Fix MrMimic#8
- The pub_date month field was not read properly. It can be expressed either as a number like '02' or as text like 'Feb'.
Fix MrMimic#9
- Citation owner and citation status from the medlinecitation tag were not being extracted correctly.
Fix MrMimic#10
- Apparently an article can have multiple languages specified in the XML. For the sake of cleanliness of the generated DB, I made sure to report all the languages.
Fix MrMimic#11
- Apparently an article can have multiple citationsubset values specified in the XML. For the sake of cleanliness of the generated DB, I made sure to report all the values.
Fix MrMimic#12
- The Mesh descriptor regex is not capturing long IDs. There are some descriptive_ui values like D000074606 for 'Smoking prevention'.
Fix MrMimic#13
- The DB field type in the medline_comments_corrections table is not long enough. Increased from 20 to 25. Ran against a couple of xml files and it seems to be solid. It may need further testing though. For the moment
Fix MrMimic#7
- Making sure to employ utf8 when extracting the xml file from the gz archive
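A few of the parsing fixes listed above can be illustrated with a minimal sketch. This is not the code from the fork: the names `normalize_month`, `DESCRIPTOR_UI`, `abstract_text`, and `SAMPLE` are hypothetical, and the sample XML is a hand-written fragment that only assumes the PubMed element names `AbstractText` and `DescriptorName`. It also assumes that a descriptor UI is the letter D followed by digits, as in D000074606.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table: three-letter month abbreviation -> month number.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def normalize_month(raw):
    """Return the month as a two-digit string, or None if it is not recognized.

    Accepts both numeric ('02') and textual ('Feb') forms.
    """
    raw = raw.strip()
    if raw.isdecimal():
        month = int(raw)
    else:
        month = MONTHS.get(raw[:3].title())
    if month is None or not 1 <= month <= 12:
        return None
    return f"{month:02d}"


# Assumption: a descriptor UI is 'D' followed by any number of digits,
# so nothing here limits the ID to a fixed length.
DESCRIPTOR_UI = re.compile(r'UI="(D\d+)"')


def abstract_text(article_xml):
    """Join the plain text of every AbstractText element, dropping inline tags."""
    root = ET.fromstring(article_xml)
    parts = ("".join(node.itertext()).strip()
             for node in root.iter("AbstractText"))
    return " ".join(part for part in parts if part)


# Hand-written sample, not taken from a real PubMed file.
SAMPLE = (
    "<Article><Abstract>"
    '<AbstractText Label="AIM">Some <i>italic</i> text.</AbstractText>'
    "<AbstractText>More text.</AbstractText>"
    "</Abstract>"
    '<DescriptorName UI="D000074606">Smoking Prevention</DescriptorName>'
    "</Article>"
)
```

With this sketch, `normalize_month("02")` and `normalize_month("Feb")` give the same value, `DESCRIPTOR_UI.findall(SAMPLE)` picks up the long ID, and `abstract_text(SAMPLE)` returns the abstract without the `<i>` markup.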

rafspiny added 7 commits January 10, 2018 18:45

Regex to catch pmid longer than 4 in building a citation network

1434a54

Making sure to insert all the items in the DB. Remaining items were not being added. Refactored the code to have more control over query execution and log generation.

53578da

Further decoding in utf-8 causes the creation of artifacts like \xc3\xa0 for Università. Since the MySQL DB is in utf8mb4, python3 supports utf8 natively, and the compressed xml files from pubmed are in UTF8, there is no need to force an encode.

d19855d

Forgot the improved regex for mesh ids

bae147a

Merge remote-tracking branch 'upstream/master'

4787652

Forgot to comment debug code

a346818

Always build absolute paths

rafspiny force-pushed the master branch from 2e92902 to a346818 Compare January 18, 2018 16:46

MrMimic merged commit 81ca46d into MrMimic:master Jan 20, 2018
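
For readers wondering about the encoding change in commit d19855d and the utf8 extraction item, the following is a minimal sketch of the idea, not the code that was merged. The function name `read_xml_text` and the in-memory sample are hypothetical. It assumes the archive holds UTF-8 XML, decodes it exactly once while decompressing, and shows how an extra forced encode can leak escape sequences such as \xc3\xa0 into the stored text.

```python
import gzip
import io


def read_xml_text(path_or_file):
    """Decompress a gzipped XML file and decode it as UTF-8 exactly once."""
    # gzip.open accepts a path or a binary file object; mode "rt" yields str.
    with gzip.open(path_or_file, mode="rt", encoding="utf-8") as handle:
        return handle.read()


# Build a small gzipped sample in memory (hypothetical data, only for the demo).
raw_xml = "<Affiliation>Università</Affiliation>"
buffer = io.BytesIO(gzip.compress(raw_xml.encode("utf-8")))

# Decoded once: the text is an ordinary Python 3 str and can go to the DB as-is.
text = read_xml_text(buffer)

# Forcing another encode and then treating the bytes as text produces
# the kind of artifact described in the commit message.
artifact = str("Università".encode("utf-8"))
```

Here `text` still contains `Università`, while `artifact` contains the literal characters `\xc3\xa0` in place of the accented letter.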
