Several fixes #14

Merged
merged 7 commits into from
Jan 20, 2018

rafspiny
Contributor

Hi!
Thanks for the great work you keep doing. I am using this tool to maintain an up-to-date DB on which to run a set of analyses.

While using it, I discovered a couple of unexpected behaviours, which I reported. Most of them are issues that I fixed in my fork.
Two of them are more considerations than real issues. Feel free to discuss or ignore.

I am making this pull request in case you agree with the fixes I applied in my fork. Otherwise, I will keep using my branch until the tool gets another fix for these issues.

Again, many thanks for your work. It is much appreciated.

…ot being added. Refactored the code to have more control over query execution and log generation.
…a0 for Università. Since the MySQL DB is in utf8mb4, python3 supports utf8 natively, and the compressed xml files from pubmed are in UTF8, there is no need to force an encode.
- The abstract contained some XML tags. I assumed that was not intended, so I removed them and joined only the text of the abstract parts for each article. Fix MrMimic#8
- The pub_date month field was not read properly. It can be either expressed as a number '02' or as text like 'Feb'. Fix MrMimic#9
- Citation owner and citation status from the medlinecitation tag were not being extracted correctly. Fix MrMimic#10
- Apparently an article can have multiple languages specified in the XML. For the sake of cleanliness of the generated DB I made sure to report all the languages. Fix MrMimic#11
- Apparently an article can have multiple citationsubset values specified in the XML. For the sake of cleanliness of the generated DB I made sure to report all the values. Fix MrMimic#12
- The Mesh descriptor regex was not capturing long IDs. There are some descriptive_ui values like D000074606 for 'Smoking prevention'. Fix MrMimic#13
- The DB field type in the medline_comments_corrections table was not long enough. Increased from 20 to 25. Ran against a couple of xml files and it seems to be solid. It may need further testing though. For the moment, Fix MrMimic#7
- Making sure to employ utf8 when extracting the xml file from the gz archive
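
To illustrate two of the parsing fixes listed above (the month that can appear as '02' or as 'Feb', and the abstract whose inline XML tags should be dropped), here is a minimal sketch. It is not the code from this pull request: the function names `normalize_month` and `abstract_text` are hypothetical, and the use of `xml.etree.ElementTree` with an `AbstractText` element is an assumption about how the PubMed XML might be read.

```python
import xml.etree.ElementTree as ET

# English month abbreviations in calendar order (assumed spelling: 'Jan', 'Feb', ...).
MONTH_ABBREVIATIONS = (
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
)


def normalize_month(raw):
    """Return a zero-padded month string ('01'..'12'), or None if unparseable.

    Hypothetical helper: accepts a number such as '02' or text such as 'Feb'.
    """
    value = raw.strip()
    if value.isdecimal():
        number = int(value)
    else:
        key = value[:3].lower()
        if key not in MONTH_ABBREVIATIONS:
            return None
        number = MONTH_ABBREVIATIONS.index(key) + 1
    if not 1 <= number <= 12:
        return None
    return f"{number:02d}"


def abstract_text(abstract_xml):
    """Join the plain text of every AbstractText element, dropping inline tags.

    Hypothetical helper: the element name is an assumption about the input.
    """
    root = ET.fromstring(abstract_xml)
    parts = ("".join(node.itertext()).strip() for node in root.iter("AbstractText"))
    return " ".join(part for part in parts if part)
```

Returning `None` for an unrecognised month leaves the decision about a fallback value to the caller, instead of writing a guessed month into the DB.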
Always build absolute paths
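
The last two commits (reading the gz archive as utf8 and always building absolute paths) could look roughly like the sketch below. The helper names `absolute_path` and `read_gz_as_utf8` are hypothetical and are not taken from the repository; only standard-library calls are used.

```python
import gzip
import os


def absolute_path(*parts):
    """Join path components and return the result as an absolute path."""
    return os.path.abspath(os.path.join(*parts))


def read_gz_as_utf8(path):
    """Decompress a .gz file and decode its content explicitly as UTF-8."""
    # mode="rt" opens the archive in text mode; the explicit encoding avoids
    # depending on the platform's default locale.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return handle.read()
```

Passing the encoding explicitly matters mostly on systems whose default locale is not UTF-8, where names such as 'Università' could otherwise be decoded incorrectly.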