Issue #70 work branch #79

Open
wants to merge 10 commits into
base: master
added data.world references to exploration Rmd
[email protected] committed Feb 6, 2018
commit 541428fb49b46d90f7942ddae073b11d4d5b31dd
104 changes: 22 additions & 82 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.Rmd
@@ -6,115 +6,55 @@ output: html_document
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reshape2)
library(plyr)
library(data.world)
```

##Addressing Issue #70 Questions

Read in the FDA_NDC data and Medicare spending data...
```{r}
ndc.dat <- read.csv('fda_ndc_product.csv')
spd.dat <- read.csv('spending_2011.csv')
````

What's in there?
```{r}
print('FDA_NDC names:')
names(ndc.dat)
print('')
print('spending_2011 names:')
names(spd.dat)
````


###Preliminaries
Every entry in the original dataset corresponded to a unique **productid**...
```{r}
length( unique(ndc.dat$productid) )
````
ds <- 'https://data.world/data4democracy/drug-spending'
ndc.dat <- data.world::query( data.world::qry_sql( 'SELECT proprietaryname, nonproprietaryname, substancename, productid FROM fda_ndc_product_tidy' ) , dataset=ds )
spd.dat <- data.world::query( data.world::qry_sql( 'SELECT brand_name, generic_name FROM spending_part_d_2011to2015_tidy' ) , dataset=ds )

...and there are a couple fewer **productndcs**. It seems like the **productid** would be the best key for cross-referencing the datasets.
```{r}
length( unique(ndc.dat$productndc) )
length( unique(ndc.dat$productid) ) - length( unique(ndc.dat$productndc) )
````

For reference, there are far fewer **nonproprietarynames** and **substancenames** than **productids**.
```{r}
length( unique(ndc.dat$nonproprietaryname) )
length( unique(ndc.dat$substancename) )
````

###How many drugs have multiple active ingredients?
Let's see how many drugs have more than one active ingredient:
Let's see how many drugs in total in the `fda_ndc_product_tidy` dataset have more than one active ingredient:
```{r}
pid.agg.ndc <- dcast(ndc.dat, productid ~ .)
names( pid.agg.ndc ) <- c('productid', 'occurences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurences > 1, 1] )
````

###How many (total) of the drugs can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
A few columns of the spending dataset need to be converted to lowercase in order to match the FDA_NDC data. To do this I'm going to wrap the tolower() function in an error catcher, as was done to lowercase the FDA_NDC dataset during tidying, so that things won't get derailed by weird corner cases.
```{r}
safe.lower <- function(x){
  # Lowercase each element; if tolower() errors, fall back to a plain character conversion
  result <- tryCatch({
    sapply( x, tolower)
  },
  error = function(e){
    return( as.character(x) )
  })
  return( result )
}

upper.cols <- c('drugname_brand', 'drugname_generic')
spd.dat[, upper.cols] <- sapply( spd.dat[, upper.cols], safe.lower )

head(spd.dat)
````

How many matches are there between spending_2011 and FDA_NDC?
```{r}
sum( spd.dat$drugname_brand %in% ndc.dat$proprietaryname)
pid.agg.ndc <- dcast(ndc.dat, proprietaryname ~ .)
names( pid.agg.ndc ) <- c('propname', 'occurences')
print('About 26.5k:')
length( pid.agg.ndc[ pid.agg.ndc$occurences > 1, 1] )
````


###How many of the drugs found in the Medicare spending datasets have multiple active ingredients?
Let's match the **proprietaryname** column to the **drugname_brand** column. If there are any matches I'm going to pull the corresponding **productids** and hold them in a vector.
How about drugs that match the `spending_201x` dataset?
```{r}
match.ids <- ndc.dat[ ndc.dat$proprietaryname %in% spd.dat$drugname_brand, c('productid') ]
match.ids <- as.character( match.ids )
length(match.ids)
````
pid.agg

How many unique match ids are there?
```{r}
length(unique(match.ids))
````

Since there are about 32.5k total **match.ids** and 29.5k unique **match.ids** then there are around 3k spending_2011 **drugname_brands** that have multiple active ingredients.
###How many total of the drugs can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
`join()` matches on columns that share a name across the two data frames, so before joining the spending and FDA_NDC datasets we first need to rename `drugname_brand` to `proprietaryname` in `spd.dat`.
```{r}
length(match.ids) - length(unique(match.ids))
names(spd.dat)[2] <- 'proprietaryname'
ndc.spd.join <- join(spd.dat, ndc.dat, type="inner", by=c('proprietaryname'))
dim(ndc.spd.join)
````
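
A minimal sketch of the same idea on toy data, assuming base R's `merge()` in place of `plyr::join()`. The data frames `spd` and `ndc` and every value in them are made up for illustration; they are not rows from the real datasets. With `by.x`/`by.y` the brand column does not have to be renamed before joining.

```r
# Toy stand-ins for the two tables; all values are invented for illustration.
spd <- data.frame(
  drugname_brand   = c("LIPITOR", "Advair Diskus", "NOTADRUG"),
  drugname_generic = c("ATORVASTATIN CALCIUM", "FLUTICASONE/SALMETEROL", "NONE"),
  stringsAsFactors = FALSE
)
ndc <- data.frame(
  proprietaryname    = c("lipitor", "advair diskus", "advair diskus"),
  nonproprietaryname = c("atorvastatin calcium",
                         "fluticasone propionate and salmeterol",
                         "fluticasone propionate and salmeterol"),
  productid          = c("id-1", "id-2", "id-3"),
  stringsAsFactors   = FALSE
)

# Lowercase the spending names so they compare equal to the FDA names.
spd$drugname_brand   <- tolower(spd$drugname_brand)
spd$drugname_generic <- tolower(spd$drugname_generic)

# Inner join on brand name; by.x/by.y name the key column in each table.
joined <- merge(spd, ndc, by.x = "drugname_brand", by.y = "proprietaryname")

# Rows in the join, and how many of them also agree on the generic name.
n.joined        <- nrow(joined)
n.generic.match <- sum(joined$drugname_generic == joined$nonproprietaryname)
print(n.joined)
print(n.generic.match)
```

A brand that appears on several FDA rows yields one joined row per FDA row, which is why the joined row count can exceed the number of spending rows.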

How well does **drugname_generic** match against the **nonproprietaryname** column? There are 1741 of the generic drug names in the **nonproprietaryname** column, and almost as many in the **substancename** column. A good share of those are associated with a **drugname_brand** that also matches in the **proprietaryname** column.
What's the relationship between `drugname_generic` and `nonproprietaryname` and also `substancename`?
```{r}
#Number of drugname_generics that match in nonproprietaryname
sum(spd.dat$drugname_generic %in% ndc.dat$nonproprietaryname)
sum( ndc.spd.join$drugname_generic == ndc.spd.join$nonproprietaryname )
sum( ndc.spd.join$drugname_generic[ !is.na(ndc.spd.join$substancename) ] == ndc.spd.join$substancename[ !is.na(ndc.spd.join$substancename) ] )
`````

#Number of drugname_generics that match in substancename
sum(spd.dat$drugname_generic %in% ndc.dat$substancename)

#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a nonproprietaryname
sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$nonproprietaryname )

#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a substancename
sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$substancename )
````

I'm going to create a subvector of **match.ids** which contains only ids that occur more than once, to see if the spending data **drugname_generic** also contains multiple names.
```{r}
id.occs <- as.data.frame( table( match.ids ) )
mult.act.ingd <- id.occs[ id.occs$Freq > 1, ]$match.ids
````

(IN PROGRESS)
It looks like there's a good number of matches between the two.
7 changes: 3 additions & 4 deletions R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.html
@@ -231,12 +231,13 @@ <h3>How many (total) of the drugs can be matched between the Medicare spending d
</div>
<div id="how-many-of-the-drugs-found-in-the-medicare-spending-datasets-have-multiple-active-ingredients" class="section level3">
<h3>How many of the drugs found in the Medicare spending datasets have multiple active ingredients?</h3>
<p>Let’s join the spending and FDA_NDC datasets</p>
<p>Let’s match the <strong>proprietaryname</strong> column to the <strong>drugname_brand</strong> column. If there are any matches I’m going to pull the corresponding <strong>productids</strong> and hold them in a vector.</p>
<pre class="r"><code>match.ids &lt;- ndc.dat[ ndc.dat$proprietaryname %in% spd.dat$drugname_brand, c('productid') ]
match.ids &lt;- as.character( match.ids )
length(match.ids)</code></pre>
<pre><code>## [1] 32567</code></pre>
<p>How many unique match ids are there?</p>
<p>How many unique match ids are there? Testing the <code>proprietaryname</code></p>
<pre class="r"><code>length(unique(match.ids))</code></pre>
<pre><code>## [1] 29585</code></pre>
<p>Since there are about 32.5k total <strong>match.ids</strong> and 29.5k unique <strong>match.ids</strong> then there are around 3k spending_2011 <strong>drugname_brands</strong> that have multiple active ingredients.</p>
@@ -255,9 +256,7 @@ <h3>How many of the drugs found in the Medicare spending datasets have multiple
<pre class="r"><code>#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a substancename
sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$substancename )</code></pre>
<pre><code>## [1] 1292</code></pre>
<p>I’m going to create a subvector of <strong>match.ids</strong> which contains only ids that occur more than once, to see if the spending data <strong>drugname_generic</strong> also contains multiple names.</p>
<pre class="r"><code>id.occs &lt;- as.data.frame( table( match.ids ) )
mult.act.ingd &lt;- id.occs[ id.occs$Freq &gt; 1, ]$match.ids</code></pre>
<pre class="r"><code>#ndc.spd.join &lt;- join(spd.dat, ndc.dat, type=&quot;inner&quot;, by=c('proprietaryname'))</code></pre>
<p>(IN PROGRESS)</p>
</div>
</div>