Issue #70 work branch #79

Open · wants to merge 10 commits into `master`
60 changes: 60 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.Rmd
---
title: "Analyzing the FDA_NDC_Product Dataset"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reshape2)
library(plyr)
library(data.world)
```

## Addressing Issue #70 Questions

Read in the FDA_NDC data and Medicare spending data...
```{r}
ds <- 'https://data.world/data4democracy/drug-spending'
ndc.dat <- data.world::query( data.world::qry_sql( 'SELECT proprietaryname, nonproprietaryname, substancename, productid FROM fda_ndc_product_tidy' ) , dataset=ds )
spd.dat <- data.world::query( data.world::qry_sql( 'SELECT brand_name, generic_name FROM spending_part_d_2011to2015_tidy' ) , dataset=ds )

```


### How many drugs have multiple active ingredients?
Let's see how many drugs in `fda_ndc_product_tidy` have more than one active ingredient:
```{r}
pid.agg.ndc <- dcast(ndc.dat, productid ~ ., fun.aggregate = length) #count rows per productid
names( pid.agg.ndc ) <- c('productid', 'occurrences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurrences > 1, 1] )
```

```{r}
pid.agg.ndc <- dcast(ndc.dat, proprietaryname ~ ., fun.aggregate = length) #count rows per brand name
names( pid.agg.ndc ) <- c('propname', 'occurrences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurrences > 1, 1] )
```
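A note on the chunks above: with no explicit aggregation function, `dcast()` falls back to counting rows with `length()` (and prints a message), so passing `fun.aggregate = length` just makes that explicit. The same per-name counts can be sanity-checked with base R's `table()` (a sketch that needs no `reshape2`):

```{r}
#Equivalent duplicate count with base R
occ <- table( ndc.dat$proprietaryname )
length( occ[ occ > 1 ] )
```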

How about drugs that also appear in the `spending_201x` dataset?
```{r}
#Restrict the brand-name counts above to drugs that also appear in the spending data
pid.agg.matched <- pid.agg.ndc[ pid.agg.ndc[ ,1] %in% spd.dat$brand_name, ]
length( pid.agg.matched[ pid.agg.matched[ ,2] > 1, 1] )
```

### How many of the drugs in total can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
Now we join the spending and FDA_NDC datasets. `join()` matches columns with the same name across the two data frames, so we first need to rename `brand_name` to `proprietaryname` in `spd.dat`.
```{r}
names(spd.dat)[1] <- 'proprietaryname' #brand_name is the first column returned by the query
ndc.spd.join <- join(spd.dat, ndc.dat, type="inner", by=c('proprietaryname'))
dim(ndc.spd.join)
```

What's the relationship between `generic_name` and `nonproprietaryname`, and also `substancename`?
```{r}
sum( ndc.spd.join$generic_name == ndc.spd.join$nonproprietaryname )
sum( ndc.spd.join$generic_name[ !is.na(ndc.spd.join$substancename) ] == ndc.spd.join$substancename[ !is.na(ndc.spd.join$substancename) ] )
```

It looks like there's a good number of matches between the two.
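Finally, as a quick sanity check on the inner join (a sketch using the objects defined above), we can count the distinct brand names that survived it:

```{r}
#Distinct brand names retained by the inner join
length( unique( ndc.spd.join$proprietaryname ) )
```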
293 changes: 293 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.html


202,799 changes: 202,799 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/fda_ndc_product.csv


3,584 changes: 3,584 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/spending_2011.csv


113,158 changes: 113,158 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/FDA_NDC_Product.csv


Empty file.
202,799 changes: 202,799 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/fda_ndc_product.csv


180 changes: 180 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/tidy_fda_ndc.R
#IMPORT LIBRARIES
library(reshape)

#READ-IN CSV AND SET HANDY VARIABLES
dat <- read.csv('FDA_NDC_Product.csv')
names(dat) <- tolower(names(dat))

#uncomment to speed up code runtime by like 100x
#dat <- dat[1:1000,]

untidy.names <- c('nonproprietaryname', 'substancename', 'active_numerator_strength', 'active_ingred_unit')
tidy.names <- names( subset( dat[1,], select = -c(nonproprietaryname, substancename, active_numerator_strength, active_ingred_unit) ) )

#Count num of drugs w/ multiple active ingreds
act.ingreds <- strsplit( unlist(sapply( dat$active_ingred_unit, as.character )), c(', |; |,|;| and ') )
num.act.ingreds <- unlist( lapply( act.ingreds, length ) )
#length(num.act.ingreds[ num.act.ingreds > 1])

#PRELIMS FOR TIDYING

#Initialize a data frame that we will add rows to. If we make the first row of the DF an integer vector it seems to handle rbinding character rows better
#>>Not sure why that is and couldn't find something simpler, the rest of the script is similarly hacky
tidy.dat <- as.data.frame( matrix( 1:(dim(dat)[2]) , ncol=dim(dat)[2] ) )
names(tidy.dat) <- c( tidy.names, untidy.names )

#These are going to hold which rows in the tidy.dat flagged an error as they were processed
np.mis <- c()
ai.mis <- c()

#Passing some data values to tolower() tossed weird, corner-case errors => wrapped tolower() in a generic error catcher which simply passes back original char vector
#>>if anything goes wrong
safe.lower <- function(x){
result <- tryCatch({
sapply( x, tolower)
},
error = function(e){
return( as.character(x) )
})
return( result )
}
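#>>Hypothetical usage example: safe.lower(c('Aspirin', 'IBUPROFEN')) lower-cases each element;
#>>if tolower() errors on an exotic encoding, the original vector comes back via as.character()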


#Load up Civ V or something b/c you are going to be waiting for a little while...
print('starting split...')
for (i in 1:dim(dat)[1] ){
#Pull the row to-be-split
row <- dat[i,]

#Subsection the elements that we need to Tidy from the ones we don't
tidy.elems <- row[,tidy.names]
untidy.elems <- row[,untidy.names] #NB: these are the elements we will be Tidying

	#This basically greps through a character vector searching for matches to the pattern ', |; |,|;| and '
	#>>NB: the | symbol reads as "OR" for grep, so we are matching on ', ' OR '; ' OR ',' OR ';' OR ' and ', which were the only joining symbols I could find
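	#>>e.g. strsplit('aspirin; caffeine and codeine', ', |; |,|;| and ')[[1]]
	#>>     returns c('aspirin', 'caffeine', 'codeine')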
split.elems <- strsplit( sapply( untidy.elems, as.character ), c(', |; |,|;| and ') )

#>>split.elems is a list-of-lists, each list elem is itself a list of the split strings

	N <- length( split.elems[[1]] ) #>> N gives us the number of nonproprietarynames that we need to convert to new rows

	#Let's hold on to which rows mismatch in their number of nonproprietarynames vs. substancenames
if( length(split.elems[[1]]) != length(split.elems[[2]]) ){
print('nonproprietaryname mismatch!')
np.mis <- append(np.mis, dim(tidy.dat)[1] )
}

#Also check to make sure that the number of active_ingreds and number of active_numers match up
#>>Spoiler alert: they all do!
if( length(split.elems[[3]]) != length(split.elems[[4]]) ){
print('active_ingred mismatch!')
ai.mis <- append( ai.mis, dim(tidy.dat)[1] )
}

#Okay now we start assembling and appending the new rows from the split.elems
for (j in 1:N){
new.row <- unlist( c( unlist( sapply( tidy.elems, safe.lower ) ), split.elems[[1]][j], split.elems[[2]][j], split.elems[[3]][j], split.elems[[4]][j] ) )

#If all of the untidy elements were NA then new.row will only have 14 elements, so needs to be rebuilt
if (length(new.row) == 14){
new.row <- c( new.row, NA, NA, NA, NA )
}

#Append the new.row to tidy.dat
names(new.row) <- c( tidy.names, untidy.names )
tidy.dat <- rbind( tidy.dat, sapply( new.row, safe.lower) )
}
}
#Some nonproprietarynames still start with an "and " for some reason, so let's clean them up in the tidied data
to.fix <- !is.na(tidy.dat$nonproprietaryname) & substring(tidy.dat$nonproprietaryname, 1, 4) == 'and '
tidy.dat$nonproprietaryname[to.fix] <- substring(tidy.dat$nonproprietaryname[to.fix], 5)

#Save tidy data just in case of a crash or something
write.csv(tidy.dat, 'fda_ndc_product.csv', row.names = FALSE)

#Scrape off that first integer row
tidy.dat <- tidy.dat[2:dim(tidy.dat)[1], ]

#Convert date columns to ISO format
to.iso <- function(x) {
date <- as.character(x)

if ( is.na( date ) ){
return(date)
}

else {
iso.date <- paste( substring(date,1,4), substring(date,5,6), substring(date,7), sep = '-')
return(iso.date)
}
}
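#e.g. to.iso('20150301') returns '2015-03-01'; NAs pass through untouched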

tidy.dat$startmarketingdate <- sapply( tidy.dat$startmarketingdate, to.iso )
tidy.dat$endmarketingdate <- sapply( tidy.dat$endmarketingdate, to.iso )

#See which rows threw errors in the splitting process
tidy.dat[ np.mis[sample(1:length(np.mis),10)], untidy.names]
length(ai.mis) #length is 0

#Check if any strings weren't split
np.missed <- grep( ' and ', tidy.dat[,untidy.names[1]] )
sn.missed <- grep( ' and ', tidy.dat[,untidy.names[2]] )

act_num.missed <- grep( ' and |, |; | , | ; ', tidy.dat[,untidy.names[3]] )
act_ingred.missed <- grep( ' and |, |; | , | ; ', tidy.dat[,untidy.names[4]] )
#No hanging ', ', ' and ' or '; ' joins in "active_numerator_strength" or "active_ingred_unit" => hopefully everything has an unambiguous dosage
#Only unusual joins (i.e. ', ' or '; ') in "nonproprietaryname" or "substancename" are things like "sennosides a and b", which have unambiguous dosage, so I'm assuming
#it's meant to imply that both substances are present in some canonical ratio

np.sn.missed <- tidy.dat$nonproprietaryname == tidy.dat$substancename
name.mismatch <- tidy.dat[!np.sn.missed, untidy.names]

#Let's see what some random rows of the name mismatches look like...
name.mismatch[sample( 1:dim(name.mismatch)[1], 100 ), untidy.names[1:2]]
#again just a lot of sloppy entry mismatch stuff, so we're probably good

#Save the final tidy data
write.csv(tidy.dat, 'fda_ndc_product.csv', row.names = FALSE)

###################################################################
#TRASH
###################################################################
#split.column <- function( col.name, dat ) {
# hold <- dat
# dat.col <- hold[,col.name]
#
# split.col <- strsplit( tolower( as.character( dat.col ) ), c( ', |; | and |,|;' ) )
#
# var.nums <- sapply( split.col, length )
# max.iter <- max( var.nums )
#
# if ( max.iter == 1) {
# return(hold)
# }
#
# for( i in 1:max.iter ){
# new.name <- paste( col.name, as.character( i ), sep='' )
# hold[ ,new.name] <- sapply( split.col, '[', i)
# }
#
# return( hold[ , names(hold) != col.name ] )
#}
#
#
#split.dat <- small.dat
#for (s in untidy.names){
# print(s)
# split.dat <- split.column(s, split.dat)
#}
#
#to.copy <- split.dat[ !is.na(split.dat$nonproprietaryname2), ]
#
#to.copy$nonproprietaryname1 <- to.copy$nonproprietaryname2
#to.copy$substancename1 <- to.copy$substancename2
#to.copy$active_numerator_strength1 <- to.copy$active_numerator_strength2
#to.copy$active_ingred_unit1 <- to.copy$active_ingred_unit2
#
#split.dat <- rbind(split.dat, to.copy)
#split.dat <- split.dat[ , c(1:15,17,19,21)]
150 changes: 150 additions & 0 deletions datadictionaries/#README.md#
## Data Central: How to Contribute, Sources, and Data Dictionaries

As our work continues to expand, this will be a central repository to document summaries, sources,
and field names for all our data sets. Data is housed in our [repo on data.world](https://data.world/data4democracy/drug-spending).

### How Do I Contribute Data?

---

We're glad you asked!

If you have a data source that would help with our [objectives](../docs/objectives.md),
we'd be grateful to have it. Here's an overview of how to most effectively contribute. Please join
the discussion on our [Slack channel](https://datafordemocracy.slack.com/messages/drug-spending/) -
our group would love to work with you. (If you're not already in the Data for Democracy Slack team,
you'll need an invitation - more info [here](https://github.com/Data4Democracy/read-this-first).)

1. [Tidy the data](https://en.wikipedia.org/wiki/Tidy_data), using `lower_snake_case` for variable
and file names and ISO format (YYYY-MM-DD) for dates. Also keep in mind these [best practices](https://docs.google.com/document/d/1p5A2DQ5gFC7XVKNVDw_ifKnycv_j1udmqY1M0rjbcxo/edit) from data.world. We prefer CSV format; [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) format is also very useful (feel free to add both).
1. Fork this repo and request to be a [contributor to our dataset on data.world](https://data.world/data4democracy/drug-spending/contributors), if you haven't already.
1. Submit a pull request to this repo including the following:
* In either [`python/datawrangling`](../python/datawrangling) or [`R/datawrangling`](../R/datawrangling), as appropriate, add any script(s) you used to scrape, tidy, etc. (If you have multiple scripts, feel free to create a subdirectory.) Be specific when you name the scripts and directories - eg, `scrape_druglist_from_genomejp.py` is better than `drugscraping.py`.
* In `/datadictionaries`, add a data dictionary for your data source named `[datasource].md`. We have a [data dictionary template](TEMPLATE.md); for more specifics, check out the other dictionaries available in this folder.
* Edit this README with a short overview of your dataset.
1. Once the PR is reviewed by our maintainers and merged, upload your final data set to data.world and label it "clean data" (click on Edit). Add a link to the data dictionary in the Description field. *(If you'd rather not join data.world, a maintainer can do this as well. It's a fun place, though!)*
- If you'd like to add the raw data as well (eg, XLSX files), feel free; make sure to label it "raw data."
- Bonus points: Edit the info for each field in your data.world dataset with a detailed description.
1. Submit a PR to update this overview file (this can be done by you or maintainers).
1. Receive our grateful thanks, likely including emoji.
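As a minimal sketch of step 1 (hypothetical column names, base R only), renaming columns to `lower_snake_case` and converting dates to ISO format might look like:

```r
# Hypothetical example: rename columns to lower_snake_case and convert dates to ISO format
df <- data.frame(DrugName = 'Aspirin', StartDate = '20150301', stringsAsFactors = FALSE)
names(df) <- c('drug_name', 'start_date')
df$start_date <- format(as.Date(df$start_date, '%Y%m%d'), '%Y-%m-%d')
df$start_date  # "2015-03-01"
```

From there, `write.csv(df, 'my_dataset.csv', row.names = FALSE)` produces the CSV to upload.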

### Overview of Currently Available Datasets

All datasets are available in our [repo on data.world](https://data.world/data4democracy/drug-spending). If individual datasets can be queried, direct links are included.

---

#### 1. [Medicare Part D Spending Data, 2011-2015](https://data.world/data4democracy/drug-spending/query/?query=--+Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%2FMethods+%28Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%29%0ASELECT+%2A+FROM+%60Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%2FMethods%60)

###### Formats: XLSX (original); CSV, feather (tidied)
###### Original Source: US Centers for Medicare and Medicaid Services ([CMS.gov](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/Downloads/Part_D_All_Drugs_2015.zip))

This is the data that initially inspired our project.

The Excel file contains aggregate data for total and average spending by Medicare and by consumers,
as well as total and average number of claims, for each brand name drug by year. Generic names are
also included.

In our data.world repo, the original file has been tidied and split into one dataset per year,
available in both .csv and .feather format; these are titled, for example, `spending-2011.feather`.
We also have a `feather` file containing solely the unique brand names + generic names included in
all five years of data (`drugnames.feather`).

Links to full data dictionaries:
[2011](part-d_spending_2011.md)
[2012](part-d_spending_2012.md)
[2013](part-d_spending_2013.md)
[2014](part-d_spending_2014.md)
[2015](part-d_spending_2015.md)

---

#### 2. [ATC Codes](https://data.world/data4democracy/drug-spending/query/?query=--+atc-codes.csv%2Fatc-codes+%28atc-codes.csv%29%0ASELECT+%2A+FROM+%60atc-codes.csv%2Fatc-codes%60+LIMIT+5000)

###### Formats: KEG (original); CSV (tidied)
###### Original Source: www.genome.jp

The [Anatomical Therapeutic Chemical Classification System](https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System), maintained by the WHO, is used to classify drugs based on both the organ or system on which they act and their therapeutic, pharmacological and chemical properties. Procuring the codes from WHO is prohibitively expensive; our dataset is scraped from www.genome.jp.

Link to full data dictionary [in progress]

---

#### 3. FDA-Approved Drugs

###### Formats: JSON
###### Original Source: [Center Watch](https://www.centerwatch.com/drug-information/fda-approved-drugs/therapeutic-areas)

This dataset contains a list of FDA-approved drugs, their approval date, manufacturer, and specific
purpose.

---

#### 4. [Drug Uses](https://data.world/data4democracy/drug-spending/query/?query=--+drug_uses.csv%2Fdrug_uses+%28drug_uses.csv%29%0ASELECT+%2A+FROM+%60drug_uses.csv%2Fdrug_uses%60+LIMIT+5000)

###### Formats: CSV, feather
###### Original Source: n/a

This is a first pass at a crosswalk between the ATC codes and Medicare Part D spending data. Work to
finalize this is welcome!

Link to full data dictionary [in progress]

---

#### 5. [Cleaned manufacturer data](https://data.world/data4democracy/drug-spending/query/?query=--+drugdata_clean.csv%2Fdrugdata_clean+%28drugdata_clean.csv%29%0ASELECT+%2A+FROM+%60drugdata_clean.csv%2Fdrugdata_clean%60+LIMIT+5000)

###### Formats: CSV
###### Original Source: CMS.gov

This dataset contains the information you'd need to link specific drugs and their dosages to the manufacturer - helpful for creating a path from Medicaid spending to lobbying efforts. Brand name and generic or descriptive names are both offered, as well as dosage and package size. Further, there are identifying codes for each drug (HCPCS and NDC).

---

#### 6. Medical Expenditure Panel Survey *(too large for direct query link)*

###### Formats: zip, CSV, feather
###### Original Source: meps.ahrq.gov

I'll need Alex to write this one, and/or I'll look at it later.

Link to full data dictionary [in progress]

---

#### 7. [Pharmaceutical Lobbying Transactions](https://data.world/data4democracy/drug-spending/query/?query=--+Pharma_Lobby.csv%2FPharma_Lobby+%28Pharma_Lobby.csv%29%0ASELECT+%2A+FROM+%60Pharma_Lobby.csv%2FPharma_Lobby%60+LIMIT+5000)

###### Formats: CSV
###### Original Source: [OpenSecrets](https://www.opensecrets.org/lobby/indusclient.php?id=h04&year=2016)

OpenSecrets has data on lobbying transactions from pharmaceutical companies and their subsidiaries, totaled by year.

Link to full data dictionary [in progress]

---

#### 8. [USP Drug Classification](https://data.world/data4democracy/drug-spending/query/?query=--+usp_drug_classification.csv%2Fusp_drug_classification+%28usp_drug_classification.csv%29%0ASELECT+%2A+FROM+%60usp_drug_classification.csv%2Fusp_drug_classification%60)

###### Formats: text, CSV
###### Original Source: [KEGG](https://www.genome.jp/kegg-bin/get_htext?htext=br08302.keg) ("USP drug classification" in the drop-down menu)

The US Pharmacopeial Convention Drug Classification system. Contains category and class information on outpatient
drugs available in the US market. TBD if data also contains information on Part D eligible
drugs only, though it seems like it likely doesn't: "The USP DC is intended to be complementary to
the [USP MMG](https://www.usp.org/usp-healthcare-professionals/usp-medicare-model-guidelines) and
is developed with similar guiding principles, taxonomy, and structure of the USP Categories and Classes."

Link to full data dictionary: [usp_drug_classification.md](usp_drug_classification.md)

---

#### 9. FDA NDC Product
###### Formats: CSV
###### Original Source: [FDA](https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm)

This dataset provides a link between drug names or ID numbers and the active ingredients contained in each drug.

This dataset was created under the Drug Listing Act of 1972, which requires drug manufacturers and distributors to provide a full list of currently marketed drugs. The information is submitted to the FDA by the labeler (manufacturer, repackager, or distributor). It seems that inclusion in the NDC directory does not mean that the drug is FDA approved. The dataset includes information about the active ingredients in a drug and their dosing, who produces the drug, and the pharmacological mechanism by which it acts.

Link to full data dictionary [in progress]
