Script that conducts probability matching between lobbying and manufacture #41
Conversation
Linked lobbying data to drugmaker file and added linking key field "company_key".
```r
library(stringdist)
library(stats)

companies_drugs <- read.csv("U:/Medicaid_Drug/drugdata_clean.csv", stringsAsFactors = FALSE)
```
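As a rough sketch of what stringdist-based fuzzy matching looks like (toy vectors and an assumed Jaro-Winkler method choice, not the exact code in this PR):

```r
# Illustrative only: toy name vectors, not the project's real data.
library(stringdist)

lobbying_names <- c("PFIZER INC", "MERCK & CO")
maker_names    <- c("PFIZER", "MERK")

# Pairwise Jaro-Winkler distances; rows = lobbying names, cols = maker names.
d <- stringdistmatrix(lobbying_names, maker_names, method = "jw")

# Index of the closest maker name for each lobbying name.
best_match <- apply(d, 1, which.min)
matched <- data.frame(lobbying = lobbying_names, maker = maker_names[best_match])
```

Each lobbying name is paired with whichever manufacturer name has the smallest distance; the real script would then threshold those distances before accepting a match.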
I'm not an expert in R, so I'll defer to others here; however, IMHO we should try to use relative file paths if we can, so that people can run scripts without hitting environment-specific errors. As written, this code only works if you have the data files in a U:/Medicaid_Drug directory.
Again, not sure how best to handle this, but my Pythonic approach would be to download from data.world via code into a local directory (maybe even the current directory) and open the downloaded file with read.csv().
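One way to sketch that download-then-read pattern in R (the helper name and the dataset URL are placeholders, not part of this project):

```r
# Hypothetical helper: download a CSV into the working directory once and
# read it with a relative path, so the script isn't tied to a drive letter.
fetch_csv <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # skip the download on later runs
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# Placeholder URL; the real link would come from the data.world dataset page:
# companies_drugs <- fetch_csv("https://query.data.world/<dataset>.csv",
#                              "drugdata_clean.csv")
```

Caching the file locally also means the script keeps working offline once the data has been fetched.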
FWIW, data.world makes it unnecessary to download and then read: it has a single link that you can just put in the read.csv() statement. :)
Yep, or they have a query client now as well if that's easier.
That is a good idea; I can fix that! Just a moment.
Fixed the inbound data paths!
I think the new commit should fix it; I tested it on my machine and all worked fine.
This looks great; I'm learning a lot! I just have a couple of potential improvements and a couple of questions to make sure I understand.
```r
l_cut_name <- gsub(" INC| LTD| CORPORATION| CORP| USA| US| COMPANY| & CO| CO| PLC| LLC|\\.|\\,| PHARMACEUTICAL| PHARMACEUTICALS| PHARMA", "", l_names2[,"l_names"])
l_names3 <- as.data.frame(cbind(l_names2, l_cut_name))
```
Should " AND " get added to the list? Looking at Eli Lilly, for example. Might help a tad.
Hmm, no special reason not to include it, but I don't recall seeing "AND" in any of the actual names, and I don't think its presence would create enough variation between two strings to reduce the match probability below our value thresholds. The only issue is that adding it now would change the order of the matches, requiring a whole new review of the matches.
Ah, gotcha. I didn't think about it requiring a whole redo! The thing that made me think of it was seeing "ELI LILLY AND COMPANY" somewhere, but you're right, definitely not worth redoing it all.
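For future reference, one way to strip connector words like "AND" without worrying about alternation order inside a single gsub() pattern is to drop whole tokens instead of substrings. A hypothetical helper with an assumed stop-word list, not the approach this script uses:

```r
# Hypothetical token-based cleaner (assumed noise-word list, not the script's).
corporate_noise <- c("AND", "COMPANY", "CO", "INC", "CORP", "LTD", "PHARMA")

clean_name <- function(x) {
  tokens <- strsplit(toupper(x), "\\s+")[[1]]
  paste(tokens[!tokens %in% corporate_noise], collapse = " ")
}

clean_name("Eli Lilly and Company")  # "ELI LILLY"
```

Because each token is matched exactly, "AND" can be added to the list without reordering anything or leaving stray glued-together words behind.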
Yeah, unfortunately, there's not really any good way to do that matching in edge cases without human eyeballs reviewing to find the good and bad ones, unless a certain amount of error is acceptable for the project. I don't want any false positives or missed matches here, so I think the manual labor is worth it.
```r
cd_list_table <- as.data.frame(cbind(cd_list, key_list[1:length(cd_list)]))
companies_drugs_keyed <- merge(companies_drugs_keyed, cd_list_table, by.x = c("LABELER.NAME"), by.y = c("cd_list"), all = TRUE)
companies_drugs_keyed$company_key <- ifelse(is.na(companies_drugs_keyed$company_key), as.numeric(as.character(companies_drugs_keyed$V2)), companies_drugs_keyed$company_key)
```
Can you help me understand where the 117 and 1240 come from? I'm wondering if they could be made more reproducible vs hard-coded, or if it's just a random starting point.
Ah, that is an area I could use a reference for. It doesn't actually matter, because of the way I'm suffixing the keys, but when I was working I made the decision about key suffixes on the fly. The values came from the endpoint where the sequence had left off, fwiw. I'll adjust that; this makes sense.
Ah, unique() was what I was missing when I tried to figure it out! Perfect.
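A sketch of deriving the keys from the data itself rather than from hard-coded starting values (toy names; the real script's columns and key scheme may differ):

```r
# Toy input standing in for the labeler-name column.
labeler_names <- c("PFIZER", "MERCK", "PFIZER", "NOVARTIS")

# One key per distinct name, numbered from 1; rerunning on the same data
# always yields the same keys, with no hard-coded starting point.
key_table <- data.frame(
  LABELER.NAME = unique(labeler_names),
  company_key  = seq_along(unique(labeler_names))
)
```

Since unique() preserves first-appearance order, the keys are reproducible for a fixed input file, which addresses the "where do 117 and 1240 come from" question.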
I think this script is ready to go, @skirmer! Would you mind pulling the master branch again since we updated the org structure yesterday? Sorry about the extra step; it'll be very helpful in the long run. This can just go in the same
@skirmer Looks like you still need to delete
Ah, right, I see that; just a folder mixup on my end. On it!
Is that better? (Learning so much!)
Look at that organizational beauty. Merging!
I have the two output files as well; will post to data.world.