Script that conducts probability matching between lobbying and manufacture #41
Conversation
Linked lobbying data to drugmaker file and added linking key field "company_key".
```r
library(stringdist)
library(stats)

companies_drugs <- read.csv("U:/Medicaid_Drug/drugdata_clean.csv", stringsAsFactors = FALSE)
```
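As a rough sketch of what stringdist-based fuzzy matching looks like (toy vectors and an assumed Jaro-Winkler method choice, not the exact code in this PR):

```r
# Illustrative only: toy name vectors, not the project's real data.
library(stringdist)

lobbying_names <- c("PFIZER INC", "MERCK & CO")
maker_names    <- c("PFIZER", "MERK")

# Pairwise Jaro-Winkler distances; rows = lobbying names, cols = maker names.
d <- stringdistmatrix(lobbying_names, maker_names, method = "jw")

# Index of the closest maker name for each lobbying name.
best_match <- apply(d, 1, which.min)
matched <- data.frame(lobbying = lobbying_names, maker = maker_names[best_match])
```

Each lobbying name is paired with whichever manufacturer name has the smallest distance; the real script would then threshold those distances before accepting a match.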
I'm not an expert in R, so I'll defer to others here; however, IMHO we should try to use relative file paths if we can, so that people can run scripts without hitting environment-specific errors. As written, this code only works if you have the data files in a U:/Medicaid_Drug directory.
Again, not sure how best to handle this, but my Pythonic approach would be to download from data.world via code into a local directory (maybe even the current directory) and open the downloaded file with read.csv().
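One way to sketch that download-then-read pattern in R (the helper name and the dataset URL are placeholders, not part of this project):

```r
# Hypothetical helper: download a CSV into the working directory once and
# read it with a relative path, so the script isn't tied to a drive letter.
fetch_csv <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # skip the download on later runs
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# Placeholder URL; the real link would come from the data.world dataset page:
# companies_drugs <- fetch_csv("https://query.data.world/<dataset>.csv",
#                              "drugdata_clean.csv")
```

Caching the file locally also means the script keeps working offline once the data has been fetched.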
FWIW, data.world makes it unnecessary to download and then read: it has a single link that you can just put in the read.csv() statement. :)
Yep, or they have a query client now as well if that's easier.
That is a good idea; I can fix that! Just a moment.
Fixed the inbound data paths!
I think the new commit should fix it; I tested it on my machine and all worked fine.
This looks great; I'm learning a lot! I just have a couple of potential improvements and a couple of questions to make sure I understand.
```r
l_cut_name <- gsub(" INC| LTD| CORPORATION| CORP| USA| US| COMPANY| & CO| CO| PLC| LLC|\\.|\\,| PHARMACEUTICAL| PHARMACEUTICALS| PHARMA", "", l_names2[,"l_names"])
l_names3 <- as.data.frame(cbind(l_names2, l_cut_name))
```
Should " AND " get added to the list? Looking at Eli Lilly, for example. Might help a tad.
Hmm, no special reason not to include it, but I don't recall seeing "AND" in any of the actual names, and I don't think its presence would create enough variation between two strings to reduce the match probability below our value thresholds. The only issue is that adding it now would change the order of the matches, requiring a whole new review of the matches.
Ah, gotcha. I didn't think about it requiring a whole redo! The thing that made me think of it was seeing "ELI LILLY AND COMPANY" somewhere, but you're right, definitely not worth redoing it all.
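For future reference, one way to strip connector words like "AND" without worrying about alternation order inside a single gsub() pattern is to drop whole tokens instead of substrings. A hypothetical helper with an assumed stop-word list, not the approach this script uses:

```r
# Hypothetical token-based cleaner (assumed noise-word list, not the script's).
corporate_noise <- c("AND", "COMPANY", "CO", "INC", "CORP", "LTD", "PHARMA")

clean_name <- function(x) {
  tokens <- strsplit(toupper(x), "\\s+")[[1]]
  paste(tokens[!tokens %in% corporate_noise], collapse = " ")
}

clean_name("Eli Lilly and Company")  # "ELI LILLY"
```

Because each token is matched exactly, "AND" can be added to the list without reordering anything or leaving stray glued-together words behind.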
Yeah, unfortunately, there's not really any good way to do that matching in edge cases without human eyeballs reviewing to find the good and bad ones, unless a certain amount of error is acceptable for the project. I don't want any false positives or missed matches here, so I think the manual labor is worth it.
```r
cd_list_table <- as.data.frame(cbind(cd_list, key_list[1:length(cd_list)]))
companies_drugs_keyed <- merge(companies_drugs_keyed, cd_list_table, by.x = c("LABELER.NAME"), by.y = c("cd_list"), all = TRUE)
companies_drugs_keyed$company_key <- ifelse(is.na(companies_drugs_keyed$company_key), as.numeric(as.character(companies_drugs_keyed$V2)), companies_drugs_keyed$company_key)
```
Can you help me understand where the 117 and 1240 come from? I'm wondering if they could be made more reproducible vs hard-coded, or if it's just a random starting point.
Ah, that is an area I could use a reference for. It doesn't actually matter, because of the way I'm suffixing the keys, but when I was working I made the decision about key suffixes on the fly. The values came from the endpoint where the sequence had left off, fwiw. I'll adjust that; this makes sense.
Ah, unique() was what I was missing when I tried to figure it out! Perfect.
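A sketch of deriving the keys from the data itself rather than from hard-coded starting values (toy names; the real script's columns and key scheme may differ):

```r
# Toy input standing in for the labeler-name column.
labeler_names <- c("PFIZER", "MERCK", "PFIZER", "NOVARTIS")

# One key per distinct name, numbered from 1; rerunning on the same data
# always yields the same keys, with no hard-coded starting point.
key_table <- data.frame(
  LABELER.NAME = unique(labeler_names),
  company_key  = seq_along(unique(labeler_names))
)
```

Since unique() preserves first-appearance order, the keys are reproducible for a fixed input file, which addresses the "where do 117 and 1240 come from" question.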
I think this script is ready to go, @skirmer! Would you mind pulling the master branch again since we updated the org structure yesterday? Sorry about the extra step; it'll be very helpful in the long run. This can just go in the same
@skirmer Looks like you still need to delete
Ah, right, I see that; just a folder mixup on my end. On it!
Is that better? (Learning so much!)
Look at that organizational beauty. Merging!
I have the two output files as well; will post to data.world.