Issue #70 work branch #79

Open · wants to merge 10 commits into `master`
60 changes: 60 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.Rmd
---
title: "Analyzing the FDA_NDC_Product Dataset"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reshape2)
library(plyr)
library(data.world)
```

## Addressing Issue #70 Questions

Read in the FDA_NDC data and Medicare spending data...
```{r}
ds <- 'https://data.world/data4democracy/drug-spending'
ndc.dat <- data.world::query( data.world::qry_sql( 'SELECT proprietaryname, nonproprietaryname, substancename, productid FROM fda_ndc_product_tidy' ) , dataset=ds )
spd.dat <- data.world::query( data.world::qry_sql( 'SELECT brand_name, generic_name FROM spending_part_d_2011to2015_tidy' ) , dataset=ds )

```


### How many drugs have multiple active ingredients?
Let's see how many drugs in `fda_ndc_product_tidy` have more than one active ingredient:
```{r}
pid.agg.ndc <- dcast(ndc.dat, productid ~ ., fun.aggregate = length) #count rows per productid
names( pid.agg.ndc ) <- c('productid', 'occurrences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurrences > 1, 1] )
```

```{r}
pid.agg.ndc <- dcast(ndc.dat, proprietaryname ~ ., fun.aggregate = length) #count rows per brand name
names( pid.agg.ndc ) <- c('propname', 'occurrences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurrences > 1, 1] )
```
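A note on the chunks above: with no explicit aggregation function, `dcast()` falls back to counting rows with `length()` (and prints a message), so passing `fun.aggregate = length` just makes that explicit. The same per-name counts can be sanity-checked with base R's `table()` (a sketch that needs no `reshape2`):

```{r}
#Equivalent duplicate count with base R
occ <- table( ndc.dat$proprietaryname )
length( occ[ occ > 1 ] )
```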

How about drugs that also appear in the `spending_201x` dataset?
```{r}
#Restrict the brand-name counts above to drugs that also appear in the spending data
pid.agg.matched <- pid.agg.ndc[ pid.agg.ndc[ ,1] %in% spd.dat$brand_name, ]
length( pid.agg.matched[ pid.agg.matched[ ,2] > 1, 1] )
```

### How many of the drugs in total can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
Now we join the spending and FDA_NDC datasets. `join()` matches columns with the same name across the two data frames, so we first need to rename `brand_name` to `proprietaryname` in `spd.dat`.
```{r}
names(spd.dat)[1] <- 'proprietaryname' #brand_name is the first column returned by the query
ndc.spd.join <- join(spd.dat, ndc.dat, type="inner", by=c('proprietaryname'))
dim(ndc.spd.join)
```

What's the relationship between `generic_name` and `nonproprietaryname`, and also `substancename`?
```{r}
sum( ndc.spd.join$generic_name == ndc.spd.join$nonproprietaryname )
sum( ndc.spd.join$generic_name[ !is.na(ndc.spd.join$substancename) ] == ndc.spd.join$substancename[ !is.na(ndc.spd.join$substancename) ] )
```

It looks like there's a good number of matches between the two.
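Finally, as a quick sanity check on the inner join (a sketch using the objects defined above), we can count the distinct brand names that survived it:

```{r}
#Distinct brand names retained by the inner join
length( unique( ndc.spd.join$proprietaryname ) )
```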
293 changes: 293 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.html


202,799 changes: 202,799 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/fda_ndc_product.csv


3,584 changes: 3,584 additions & 0 deletions R/analysis-vis/explore_fda_nda_product/spending_2011.csv


113,158 changes: 113,158 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/FDA_NDC_Product.csv


Empty file.
202,799 changes: 202,799 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/fda_ndc_product.csv


180 changes: 180 additions & 0 deletions R/datawrangling/FDA_NDC_tidy/tidy_fda_ndc.R
#IMPORT LIBRARIES
library(reshape)

#READ-IN CSV AND SET HANDY VARIABLES
dat <- read.csv('FDA_NDC_Product.csv')
names(dat) <- tolower(names(dat))

#uncomment to speed up code runtime by like 100x
#dat <- dat[1:1000,]

untidy.names <- c('nonproprietaryname', 'substancename', 'active_numerator_strength', 'active_ingred_unit')
tidy.names <- names( subset( dat[1,], select = -c(nonproprietaryname, substancename, active_numerator_strength, active_ingred_unit) ) )

#Count num of drugs w/ multiple active ingreds
act.ingreds <- strsplit( unlist(sapply( dat$active_ingred_unit, as.character )), c(', |; |,|;| and ') )
num.act.ingreds <- unlist( lapply( act.ingreds, length ) )
#length(num.act.ingreds[ num.act.ingreds > 1])

#PRELIMS FOR TIDYING

#Initialize a data frame that we will add rows to. If we make the first row of the DF an integer vector it seems to handle rbinding character rows better
#>>Not sure why that is and couldn't find something simpler, the rest of the script is similarly hacky
tidy.dat <- as.data.frame( matrix( 1:(dim(dat)[2]) , ncol=dim(dat)[2] ) )
names(tidy.dat) <- c( tidy.names, untidy.names )

#These are going to hold which rows in the tidy.dat flagged an error as they were processed
np.mis <- c()
ai.mis <- c()

#Passing some data values to tolower() tossed weird, corner-case errors => wrapped tolower() in a generic error catcher which simply passes back original char vector
#>>if anything goes wrong
safe.lower <- function(x){
result <- tryCatch({
sapply( x, tolower)
},
error = function(e){
return( as.character(x) )
})
return( result )
}
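#>>Hypothetical usage example: safe.lower(c('Aspirin', 'IBUPROFEN')) lower-cases each element;
#>>if tolower() errors on an exotic encoding, the original vector comes back via as.character()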


#Load up Civ V or something b/c you are going to be waiting for a little while...
print('starting split...')
for (i in 1:dim(dat)[1] ){
#Pull the row to-be-split
row <- dat[i,]

#Subsection the elements that we need to Tidy from the ones we don't
tidy.elems <- row[,tidy.names]
untidy.elems <- row[,untidy.names] #NB: these are the elements we will be Tidying

	#This basically greps through a character vector searching for matches to the pattern ', |; |,|;| and '
	#>>NB: the | symbol reads as "OR" for grep, so we are matching on ', ' OR '; ' OR ',' OR ';' OR ' and ', which were the only joining symbols I could find
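	#>>e.g. strsplit('aspirin; caffeine and codeine', ', |; |,|;| and ')[[1]]
	#>>     returns c('aspirin', 'caffeine', 'codeine')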
split.elems <- strsplit( sapply( untidy.elems, as.character ), c(', |; |,|;| and ') )

#>>split.elems is a list-of-lists, each list elem is itself a list of the split strings

	N <- length( split.elems[[1]] ) #>> N gives us the number of nonproprietarynames that we need to convert to new rows

	#Let's hold on to which rows mismatch in their number of nonproprietarynames vs. substancenames
if( length(split.elems[[1]]) != length(split.elems[[2]]) ){
print('nonproprietaryname mismatch!')
np.mis <- append(np.mis, dim(tidy.dat)[1] )
}

#Also check to make sure that the number of active_ingreds and number of active_numers match up
#>>Spoiler alert: they all do!
if( length(split.elems[[3]]) != length(split.elems[[4]]) ){
print('active_ingred mismatch!')
ai.mis <- append( ai.mis, dim(tidy.dat)[1] )
}

#Okay now we start assembling and appending the new rows from the split.elems
for (j in 1:N){
new.row <- unlist( c( unlist( sapply( tidy.elems, safe.lower ) ), split.elems[[1]][j], split.elems[[2]][j], split.elems[[3]][j], split.elems[[4]][j] ) )

#If all of the untidy elements were NA then new.row will only have 14 elements, so needs to be rebuilt
if (length(new.row) == 14){
new.row <- c( new.row, NA, NA, NA, NA )
}

#Append the new.row to tidy.dat
names(new.row) <- c( tidy.names, untidy.names )
tidy.dat <- rbind( tidy.dat, sapply( new.row, safe.lower) )
}
}
#Some nonproprietarynames still start with an "and " for some reason, so let's clean them up in the tidied data
to.fix <- !is.na(tidy.dat$nonproprietaryname) & substring(tidy.dat$nonproprietaryname, 1, 4) == 'and '
tidy.dat$nonproprietaryname[to.fix] <- substring(tidy.dat$nonproprietaryname[to.fix], 5)

#Save tidy data just in case of a crash or something
write.csv(tidy.dat, 'fda_ndc_product.csv', row.names = FALSE)

#Scrape off that first integer row
tidy.dat <- tidy.dat[2:dim(tidy.dat)[1], ]

#Convert date columns to ISO format
to.iso <- function(x) {
date <- as.character(x)

if ( is.na( date ) ){
return(date)
}

else {
iso.date <- paste( substring(date,1,4), substring(date,5,6), substring(date,7), sep = '-')
return(iso.date)
}
}
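#e.g. to.iso('20150301') returns '2015-03-01'; NAs pass through untouched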

tidy.dat$startmarketingdate <- sapply( tidy.dat$startmarketingdate, to.iso )
tidy.dat$endmarketingdate <- sapply( tidy.dat$endmarketingdate, to.iso )

#See which rows threw errors in the splitting process
tidy.dat[ np.mis[sample(1:length(np.mis),10)], untidy.names]
length(ai.mis) #length is 0

#Check if any strings weren't split
np.missed <- grep( ' and ', tidy.dat[,untidy.names[1]] )
sn.missed <- grep( ' and ', tidy.dat[,untidy.names[2]] )

act_num.missed <- grep( ' and |, |; | , | ; ', tidy.dat[,untidy.names[3]] )
act_ingred.missed <- grep( ' and |, |; | , | ; ', tidy.dat[,untidy.names[4]] )
#No hanging ', ', ' and ' or '; ' joins in "active_numerator_strength" or "active_ingred_unit" => hopefully everything has an unambiguous dosage
#Only unusual joins (i.e. ', ' or '; ') in "nonproprietaryname" or "substancename" are things like "sennosides a and b", which have unambiguous dosage, so I'm assuming
#it's meant to imply that both substances are present in some canonical ratio

np.sn.missed <- tidy.dat$nonproprietaryname == tidy.dat$substancename
name.mismatch <- tidy.dat[!np.sn.missed, untidy.names]

#Let's see what some random rows of the name mismatches look like...
name.mismatch[sample( 1:dim(name.mismatch)[1], 100 ), untidy.names[1:2]]
#again just a lot of sloppy entry mismatch stuff, so we're probably good

#Save the final tidy data
write.csv(tidy.dat, 'fda_ndc_product.csv', row.names = FALSE)

###################################################################
#TRASH
###################################################################
#split.column <- function( col.name, dat ) {
# hold <- dat
# dat.col <- hold[,col.name]
#
# split.col <- strsplit( tolower( as.character( dat.col ) ), c( ', |; | and |,|;' ) )
#
# var.nums <- sapply( split.col, length )
# max.iter <- max( var.nums )
#
# if ( max.iter == 1) {
# return(hold)
# }
#
# for( i in 1:max.iter ){
# new.name <- paste( col.name, as.character( i ), sep='' )
# hold[ ,new.name] <- sapply( split.col, '[', i)
# }
#
# return( hold[ , names(hold) != col.name ] )
#}
#
#
#split.dat <- small.dat
#for (s in untidy.names){
# print(s)
# split.dat <- split.column(s, split.dat)
#}
#
#to.copy <- split.dat[ !is.na(split.dat$nonproprietaryname2), ]
#
#to.copy$nonproprietaryname1 <- to.copy$nonproprietaryname2
#to.copy$substancename1 <- to.copy$substancename2
#to.copy$active_numerator_strength1 <- to.copy$active_numerator_strength2
#to.copy$active_ingred_unit1 <- to.copy$active_ingred_unit2
#
#split.dat <- rbind(split.dat, to.copy)
#split.dat <- split.dat[ , c(1:15,17,19,21)]
150 changes: 150 additions & 0 deletions datadictionaries/#README.md#
## Data Central: How to Contribute, Sources, and Data Dictionaries

As our work continues to expand, this will be a central repository to document summaries, sources,
and field names for all our data sets. Data is housed in our [repo on data.world](https://data.world/data4democracy/drug-spending).

### How Do I Contribute Data?

---

We're glad you asked!

If you have a data source that would help with our [objectives](../docs/objectives.md),
we'd be grateful to have it. Here's an overview of how to most effectively contribute. Please join
the discussion on our [Slack channel](https://datafordemocracy.slack.com/messages/drug-spending/) -
our group would love to work with you. (If you're not already in the Data for Democracy Slack team,
you'll need an invitation - more info [here](https://github.com/Data4Democracy/read-this-first).)

1. [Tidy the data](https://en.wikipedia.org/wiki/Tidy_data), using `lower_snake_case` for variable
and file names and ISO format (YYYY-MM-DD) for dates. Also keep in mind these [best practices](https://docs.google.com/document/d/1p5A2DQ5gFC7XVKNVDw_ifKnycv_j1udmqY1M0rjbcxo/edit) from data.world. We prefer CSV format; [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) format is also very useful (feel free to add both).
1. Fork this repo and request to be a [contributor to our dataset on data.world](https://data.world/data4democracy/drug-spending/contributors), if you haven't already.
1. Submit a pull request to this repo including the following:
* In either [`python/datawrangling`](../python/datawrangling) or [`R/datawrangling`](../R/datawrangling), as appropriate, add any script(s) you used to scrape, tidy, etc. (If you have multiple scripts, feel free to create a subdirectory.) Be specific when you name the scripts and directories - eg, `scrape_druglist_from_genomejp.py` is better than `drugscraping.py`.
* In `/datadictionaries`, add a data dictionary for your data source named `[datasource].md`. We have a [data dictionary template](TEMPLATE.md); for more specifics, check out the other dictionaries available in this folder.
* Edit this README with a short overview of your dataset.
1. Once the PR is reviewed by our maintainers and merged, upload your final data set to data.world and label it "clean data" (click on Edit). Add a link to the data dictionary in the Description field. *(If you'd rather not join data.world, a maintainer can do this as well. It's a fun place, though!)*
- If you'd like to add the raw data as well (eg, XLSX files), feel free; make sure to label it "raw data."
- Bonus points: Edit the info for each field in your data.world dataset with a detailed description.
1. Submit a PR to update this overview file (this can be done by you or maintainers).
1. Receive our grateful thanks, likely including emoji.
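As a minimal sketch of step 1 (hypothetical column names, base R only), renaming columns to `lower_snake_case` and converting dates to ISO format might look like:

```r
# Hypothetical example: rename columns to lower_snake_case and convert dates to ISO format
df <- data.frame(DrugName = 'Aspirin', StartDate = '20150301', stringsAsFactors = FALSE)
names(df) <- c('drug_name', 'start_date')
df$start_date <- format(as.Date(df$start_date, '%Y%m%d'), '%Y-%m-%d')
df$start_date  # "2015-03-01"
```

From there, `write.csv(df, 'my_dataset.csv', row.names = FALSE)` produces the CSV to upload.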

### Overview of Currently Available Datasets

All datasets are available in our [repo on data.world](https://data.world/data4democracy/drug-spending). If individual datasets can be queried, direct links are included.

---

#### 1. [Medicare Part D Spending Data, 2011-2015](https://data.world/data4democracy/drug-spending/query/?query=--+Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%2FMethods+%28Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%29%0ASELECT+%2A+FROM+%60Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx%2FMethods%60)

###### Formats: XLSX (original); CSV, feather (tidied)
###### Original Source: US Centers for Medicare and Medicaid Services ([CMS.gov](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/Downloads/Part_D_All_Drugs_2015.zip))

This is the data that initially inspired our project.

The Excel file contains aggregate data for total and average spending by Medicare and by consumers,
as well as total and average number of claims, for each brand name drug by year. Generic names are
also included.

In our data.world repo, the original file has been tidied and split into one dataset per year,
available in both .csv and .feather format; these are titled, for example, `spending-2011.feather`.
We also have a `feather` file containing solely the unique brand names + generic names included in
all five years of data (`drugnames.feather`).

Links to full data dictionaries:
[2011](part-d_spending_2011.md)
[2012](part-d_spending_2012.md)
[2013](part-d_spending_2013.md)
[2014](part-d_spending_2014.md)
[2015](part-d_spending_2015.md)

---

#### 2. [ATC Codes](https://data.world/data4democracy/drug-spending/query/?query=--+atc-codes.csv%2Fatc-codes+%28atc-codes.csv%29%0ASELECT+%2A+FROM+%60atc-codes.csv%2Fatc-codes%60+LIMIT+5000)

###### Formats: KEG (original); CSV (tidied)
###### Original Source: www.genome.jp

The [Anatomical Therapeutic Chemical Classification System](https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System), maintained by the WHO, is used to classify drugs based on both the organ or system on which they act and their therapeutic, pharmacological and chemical properties. Procuring the codes from WHO is prohibitively expensive; our dataset is scraped from www.genome.jp.

Link to full data dictionary [in progress]

---

#### 3. FDA-Approved Drugs

###### Formats: JSON
###### Original Source: [Center Watch](https://www.centerwatch.com/drug-information/fda-approved-drugs/therapeutic-areas)

This dataset contains a list of FDA-approved drugs, their approval date, manufacturer, and specific
purpose.

---

#### 4. [Drug Uses](https://data.world/data4democracy/drug-spending/query/?query=--+drug_uses.csv%2Fdrug_uses+%28drug_uses.csv%29%0ASELECT+%2A+FROM+%60drug_uses.csv%2Fdrug_uses%60+LIMIT+5000)

###### Formats: CSV, feather
###### Original Source: n/a

This is a first pass at a crosswalk between the ATC codes and Medicare Part D spending data. Work to
finalize this is welcome!

Link to full data dictionary [in progress]

---

#### 5. [Cleaned manufacturer data](https://data.world/data4democracy/drug-spending/query/?query=--+drugdata_clean.csv%2Fdrugdata_clean+%28drugdata_clean.csv%29%0ASELECT+%2A+FROM+%60drugdata_clean.csv%2Fdrugdata_clean%60+LIMIT+5000)

###### Formats: CSV
###### Original Source: CMS.gov

This dataset contains the information you'd need to link specific drugs and their dosages to the manufacturer - helpful for creating a path from Medicaid spending to lobbying efforts. Brand name and generic or descriptive names are both offered, as well as dosage and package size. Further, there are identifying codes for each drug (HCPCS and NDC).

---

#### 6. Medical Expenditure Panel Survey *(too large for direct query link)*

###### Formats: zip, CSV, feather
###### Original Source: meps.ahrq.gov

I'll need Alex to write this one, and/or I'll look at it later.

Link to full data dictionary [in progress]

---

#### 7. [Pharmaceutical Lobbying Transactions](https://data.world/data4democracy/drug-spending/query/?query=--+Pharma_Lobby.csv%2FPharma_Lobby+%28Pharma_Lobby.csv%29%0ASELECT+%2A+FROM+%60Pharma_Lobby.csv%2FPharma_Lobby%60+LIMIT+5000)

###### Formats: CSV
###### Original Source: [OpenSecrets](https://www.opensecrets.org/lobby/indusclient.php?id=h04&year=2016)

OpenSecrets has data on lobbying transactions from pharmaceutical companies and their subsidiaries, totaled by year.

Link to full data dictionary [in progress]

---

#### 8. [USP Drug Classification](https://data.world/data4democracy/drug-spending/query/?query=--+usp_drug_classification.csv%2Fusp_drug_classification+%28usp_drug_classification.csv%29%0ASELECT+%2A+FROM+%60usp_drug_classification.csv%2Fusp_drug_classification%60)

###### Formats: text, CSV
###### Original Source: [KEGG](https://www.genome.jp/kegg-bin/get_htext?htext=br08302.keg) ("USP drug classification" in the drop-down menu)

The US Pharmacopeial Convention Drug Classification system. Contains category and class information on outpatient
drugs available in the US market. TBD if data also contains information on Part D eligible
drugs only, though it seems like it likely doesn't: "The USP DC is intended to be complementary to
the [USP MMG](https://www.usp.org/usp-healthcare-professionals/usp-medicare-model-guidelines) and
is developed with similar guiding principles, taxonomy, and structure of the USP Categories and Classes."

Link to full data dictionary: [usp_drug_classification.md](usp_drug_classification.md)

---

#### 9. FDA NDC Product
###### Formats: CSV
###### Original Source: [FDA](https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm)

This dataset provides a link between drug names or ID numbers and the active ingredients contained in each drug.

This dataset was created under the Drug Listing Act of 1972, which requires drug manufacturers and distributors to provide a full list of currently marketed drugs. The information is submitted to the FDA by the labeler (manufacturer, repackager, or distributor). It seems that inclusion in the NDC directory does not mean that the drug is FDA approved. The dataset includes information about the active ingredients in a drug and their dosing, who produces the drug, and the pharmacological mechanism by which it acts.

Link to full data dictionary [in progress]
