Script that conducts probability matching between lobbying and manufacture #41

Merged: 7 commits, Feb 10, 2017

skirmer commented Feb 8, 2017

I have the two output files as well; I'll post them to data.world.

Stephanie Kirmer added 2 commits February 8, 2017 12:47
Linked lobbying data to drugmaker file and added linking key field "company_key".
library(stringdist)
library(stats)

companies_drugs <- read.csv("U://Medicaid_Drug/drugdata_clean.csv", stringsAsFactors = FALSE)

Contributor

I'm not an expert in R, so I'll defer to others here; however, IMHO we should try to use relative file paths if we can so that people can run scripts without hitting environment-specific errors. That is to say, this code only works if you have the data files in a U://Medicaid_Drug directory.

Again, not sure how best to handle this, but my Pythonic approach would be to download from data.world via code into a local directory (maybe even the current directory) and open the downloaded file with read.csv().

Member Author

FWIW, data.world makes it unnecessary to download and then read: it provides a single link that you can put directly in the read.csv() call. :)

Contributor

Yep, or they have a query client now as well if that's easier.
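
A minimal sketch of the two options discussed in this thread, a relative path or a direct link passed to read.csv(). The helper name `read_drug_data` and the `data/` folder are hypothetical, not taken from the script; any data.world link would have to come from the dataset page itself.

```r
# Hypothetical helper: read the cleaned drug file from a location that does
# not depend on one person's drive mapping.
# Assumption: by default the CSV sits in a "data" folder relative to the
# working directory; this layout is illustrative only.
read_drug_data <- function(source = file.path("data", "drugdata_clean.csv")) {
  # read.csv() accepts a local path or a complete URL, so the same call
  # covers both the relative-path and the direct-link approach.
  read.csv(source, stringsAsFactors = FALSE)
}
```

Calling `read_drug_data()` with no argument uses the relative path; passing a link string instead reads straight from that address.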

skirmer commented Feb 8, 2017

That is a good idea- I can fix that! Just a moment.

Fixed the inbound data paths!

skirmer commented Feb 8, 2017

I think the new commit should fix it- I tested it on my machine and all worked fine.

jenniferthompson left a comment

This looks great - I'm learning a lot! I just have a couple of potential improvements and a couple of questions to make sure I understand.


l_cut_name <- gsub(" INC| LTD| CORPORATION| CORP| USA| US| COMPANY| & CO| CO| PLC| LLC|\\.|\\,| PHARMACEUTICAL| PHARMACEUTICALS| PHARMA","", l_names2[,"l_names"])
l_names3 <- as.data.frame(cbind(l_names2, l_cut_name))

Contributor

Should " AND " get added to the list? Looking at Eli Lilly, for example. Might help a tad.

Member Author

Hmm, no special reason not to include it, but I don't recall seeing "AND" in any of the actual names, and I don't think its presence would create enough variation between two strings to push the match probability below our thresholds. The only issue is that adding it now would change the order of the matches, which would require a whole new review of the matches.

Contributor

Ah, gotcha. I didn't think about it requiring a whole redo! The thing that made me think of it was seeing "ELI LILLY AND COMPANY" somewhere, but you're right, definitely not worth redoing it all.
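
For anyone revisiting the name cleanup later, here is a rough sketch of how " AND " could be folded into the suffix stripping. It is not the script's gsub() call: the function name `strip_suffixes` is hypothetical, and the word-boundary anchors, the `&` removal, and the whitespace cleanup are assumptions added for illustration. As noted above, changing the pattern would change the match order and require a fresh manual review.

```r
# Hypothetical variant of the suffix stripping quoted above.
# Assumptions (not in the original script): whole-word matching via \\b,
# "AND" added to the list, and leftover whitespace collapsed.
strip_suffixes <- function(x) {
  pattern <- paste0(
    "\\b(INC|LTD|CORPORATION|CORP|USA|US|COMPANY|CO|PLC|LLC|",
    "PHARMACEUTICALS|PHARMACEUTICAL|PHARMA|AND)\\b|[.,&]"
  )
  out <- gsub(pattern, "", toupper(x), perl = TRUE)
  # Collapse runs of whitespace left behind by the removals, then trim.
  trimws(gsub("[[:space:]]+", " ", out))
}

strip_suffixes(c("ELI LILLY AND COMPANY", "Eli Lilly & Co."))
```

With whole-word matching, a token such as " CO" is only removed when it stands alone, so it cannot clip the start of a longer word.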

Member Author

Yeah, unfortunately, there's not really any good way to do that matching in edge cases without human eyeballs reviewing the results to find the good and bad ones, unless a certain amount of error is acceptable for the project. I don't want any false positives or missed matches here, so I think the manual labor is worth it.
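
To illustrate the "thresholds plus human review" workflow described here, a small sketch follows. It uses base R's `utils::adist()` (Levenshtein distance) rather than the stringdist package the script loads, and the sample names and the 0.6 / 0.95 cutoffs are made up for the example; the script's actual method and thresholds are not shown in this thread.

```r
# Illustrative names only; not taken from the lobbying or drug files.
lobby_names <- c("ELI LILLY", "PFIZER", "MERCK")
drug_names  <- c("ELI LILY", "PFIZER", "MYLAN")

# Edit distance between every lobbying name (rows) and drug name (columns),
# scaled to a 0-1 similarity by the longer string's length.
d       <- utils::adist(lobby_names, drug_names)
max_len <- outer(nchar(lobby_names), nchar(drug_names), pmax)
sim     <- 1 - d / max_len

# Keep the best candidate per lobbying name.
best <- apply(sim, 1, which.max)
candidates <- data.frame(
  lobby_name = lobby_names,
  drug_name  = drug_names[best],
  similarity = sim[cbind(seq_along(lobby_names), best)],
  stringsAsFactors = FALSE
)

# Hypothetical cutoffs: accept near-exact matches, reject weak ones,
# and send everything in between to a person for review.
candidates$status <- cut(
  candidates$similarity,
  breaks = c(-Inf, 0.6, 0.95, Inf),
  labels = c("reject", "review", "accept"),
  right  = FALSE
)
```

Only the rows marked "review" would need eyeballs under this scheme; how wide that band should be depends on how much error the project can tolerate.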

cd_list_table <- as.data.frame(cbind(cd_list, key_list[1:length(cd_list)]))
companies_drugs_keyed <- merge(companies_drugs_keyed, cd_list_table, by.x=c("LABELER.NAME"), by.y=c("cd_list"), all=T)
companies_drugs_keyed$company_key <- ifelse(is.na(companies_drugs_keyed$company_key), as.numeric(as.character(companies_drugs_keyed$V2)), companies_drugs_keyed$company_key)

Contributor

Can you help me understand where the 117 and 1240 come from? I'm wondering if they could be made more reproducible vs hard-coded, or if it's just a random starting point.

Member Author

Ah, that is an area I could use a reference for. It doesn't actually matter, because of the way I'm suffixing the keys, but when I was working I made the decision about key suffixes on the fly. The values came from the endpoint where the sequence had left off, fwiw. I'll adjust that, this makes sense.

Contributor

Ah, unique() was what I was missing when I tried to figure it out! Perfect.
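
One way to avoid hard-coded starting values, sketched under assumptions: new keys continue from the largest key already assigned, and each distinct unmatched name gets one key via unique(). The function name `assign_new_keys` is hypothetical, and the plain numeric key scheme ignores the key suffixing mentioned above, so treat it as an outline rather than the script's logic.

```r
# Hypothetical helper: fill missing company keys without hard-coded offsets.
# Assumption: keys are plain numbers and names are never NA.
assign_new_keys <- function(df, name_col = "LABELER.NAME", key_col = "company_key") {
  missing   <- is.na(df[[key_col]])
  unmatched <- unique(df[[name_col]][missing])

  # Start after the largest existing key, or at 0 if no keys exist yet.
  start <- if (all(missing)) 0 else max(df[[key_col]], na.rm = TRUE)

  # One new key per distinct unmatched name, looked up by name below.
  new_keys <- setNames(start + seq_along(unmatched), unmatched)
  df[[key_col]][missing] <- unname(new_keys[df[[name_col]][missing]])
  df
}
```

Because the starting point is derived from the data, rerunning the script after the upstream matches change still yields non-overlapping keys.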

@jenniferthompson
Contributor

I think this script is ready to go @skirmer! Would you mind pulling the master branch again since we updated the org structure yesterday? Sorry about the extra step - it'll be very helpful in the long run. This can just go in the same /R/datawrangling/manufacturers directory with your other scripts.

@mattgawarecki
Contributor

@skirmer Looks like you still need to delete manufacturers/ in your repo and commit the deletion. I'm still seeing it show up in the file listing.

skirmer commented Feb 10, 2017

Ah, right, I see that- just a folder mixup on my end. On it!

skirmer commented Feb 10, 2017

Is that better? (Learning so much!)

@jenniferthompson
Contributor

Look at that organizational beauty. Merging!

@jenniferthompson jenniferthompson merged commit 2563f03 into Data4Democracy:master Feb 10, 2017