added data.world references to exploration Rmd

Data4Democracy · proof-by-accident · Jan 16, 2018 · Jan 18, 2018
commit 541428fb49b46d90f7942ddae073b11d4d5b31dd
diff --git a/R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.Rmd b/R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.Rmd
@@ -6,115 +6,55 @@ output: html_document
 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 library(reshape2)
+library(plyr)
+library(data.world)
 ```
 
 ##Addressing Issue #70 Questions
 
 Read in the FDA_NDC data and Medicare spending data...
 ```{r}
-ndc.dat <- read.csv('fda_ndc_product.csv')
-spd.dat <- read.csv('spending_2011.csv')
-````
-
-What's in there?
-```{r}
-print('FDA_NDC names:')
-names(ndc.dat)
-print('')
-print('spending_2011 names:')
-names(spd.dat)
-````
-
-
-###Preliminaries
-Every entry in the original dataset corresponded to a unique **productid**...
-```{r}
-length( unique(ndc.dat$productid) )
-````
+ds <- 'https://data.world/data4democracy/drug-spending'
+ndc.dat <- data.world::query( data.world::qry_sql( 'SELECT proprietaryname, nonproprietaryname, substancename, productid FROM fda_ndc_product_tidy' ) , dataset=ds )
+spd.dat <- data.world::query( data.world::qry_sql( 'SELECT brand_name, generic_name FROM spending_part_d_2011to2015_tidy' ) , dataset=ds )
 
-...and there are a couple fewer **productndcs**.  It seems like the **productid** would be the best way through which to cross-reference the datasets.
-```{r}
-length( unique(ndc.dat$productndc) )
-length( unique(ndc.dat$productid) ) - length( unique(ndc.dat$productndc) )
 ````
 
-For reference there are far fewer **nonproprietarynames** and **substancenames** than **productids**
-```{r}
-length( unique(ndc.dat$nonproprietaryname) )
-length( unique(ndc.dat$substancename) )
-````
 
 ###How many drugs have multiple active ingredients?
-Let's see how many drugs have more than one active ingredient:
+Let's see how many drugs in total in the `fda_ndc_product_tidy` table have more than one active ingredient:
 ```{r}
 pid.agg.ndc <- dcast(ndc.dat, productid ~ .)
 names( pid.agg.ndc ) <- c('productid', 'occurences')
 print('About 26.5k:')
 length( pid.agg.ndc[ pid.agg.ndc$occurences > 1, 1] )
 ````
 
-###How many (total) of the drugs can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
-The spending dataset needs to be converted to lowercase for a few columns, in order to match with the FDA_NDC data.  To do this I'm going to wrap the tolower() function in an error catcher, as this was used to lowercase the FDA_NDC dataset during tidying and things won't get derailed by weird cornercases.
 ```{r}
-safe.lower <- function(x){
-    result <- tryCatch({
-        sapply( x, tolower)
-    },
-    error = function(e){
-        return( as.character(x) )
-    })
-    return( result )
-    }
-
-upper.cols <- c('drugname_brand', 'drugname_generic')
-spd.dat[, upper.cols] <- sapply( spd.dat[, upper.cols], safe.lower )
-
-head(spd.dat)
-````
-
-How many matches are there between speeding_2011 and FDA_NDC?
-```{r}
-sum( spd.dat$drugname_brand %in% ndc.dat$proprietaryname)
+pid.agg.ndc <- dcast(ndc.dat, proprietaryname ~ .)
+names( pid.agg.ndc ) <- c('propname', 'occurences')
+print('About 26.5k:')
+length( pid.agg.ndc[ pid.agg.ndc$occurences > 1, 1] )
 ````
 
-
-###How many of the drugs found in the Medicare spending datasets have multiple active ingredients?
-Let's match the **proprietaryname** column to the **drugname_brand** column.  If there are any matches I'm going to pull the corresponding **productids** and hold them in a vector.
+How about drugs that match the `spending_201x` dataset?
 ```{r}
-match.ids <- ndc.dat[ ndc.dat$proprietaryname %in% spd.dat$drugname_brand, c('productid') ]
-match.ids <- as.character( match.ids )
-length(match.ids)
-````
+pid.agg
 
-How many unique match ids are there?
-```{r}
-length(unique(match.ids))
 ````
 
-Since there are about 32.5k total **match.ids** and 29.5k unique **match.ids** then there are around 3k spending_2011 **drugname_brands** that have multiple active ingredients.
+###How many drugs in total can be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
+To join the spending and FDA_NDC datasets, note that `join()` matches columns with the same name in the two data frames, so we first need to rename `brand_name` to `proprietaryname` in `spd.dat`.
 ```{r}
-length(match.ids) - length(unique(match.ids))
+names(spd.dat)[ names(spd.dat) == 'brand_name' ] <- 'proprietaryname'
+ndc.spd.join <- join(spd.dat, ndc.dat, type="inner", by=c('proprietaryname'))
+dim(ndc.spd.join)
 ````
 
-How well does **drugname_generic** match against the **nonproprietaryname** column?  There 1741 of the generic drugnames in the **nonproprietaryname** column, and almost as many in the **substancename** column.  A pretty good amount of those are associated with a **drugname_brand** that also matches in the **proprietaryname** column.
+How often does `generic_name` agree with `nonproprietaryname`, and with `substancename`?
 ```{r}
-#Number of drugname_generics that match in nonproprietaryname
-sum(spd.dat$drugname_generic %in% ndc.dat$nonproprietaryname)
+sum( ndc.spd.join$generic_name == ndc.spd.join$nonproprietaryname, na.rm=TRUE )
+sum( ndc.spd.join$generic_name[ !is.na(ndc.spd.join$substancename) ] == ndc.spd.join$substancename[ !is.na(ndc.spd.join$substancename) ] )
+`````
 
-#Number of drugname_generics that match in substancename
-sum(spd.dat$drugname_generic %in% ndc.dat$substancename)
-
-#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a nonproprietaryname
-sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$nonproprietaryname )
-
-#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a substancename
-sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$substancename )
-````
-
-I'm going to create a subvector of **match.ids** which contains only ids that occur more than once, to see if the spending data **drugname_generic** also contains multiple names.
-```{r}
-id.occs <- as.data.frame( table( match.ids ) )
-mult.act.ingd <- id.occs[ id.occs$Freq > 1, ]$match.ids
-````
-
-(IN PROGRESS)
+It looks like there's a good number of matches between the two.
diff --git a/R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.html b/R/analysis-vis/explore_fda_nda_product/anlyz_fda_ndc.html
@@ -231,12 +231,13 @@ <h3>How many (total) of the drugs can be matched between the Medicare spending d
 </div>
 <div id="how-many-of-the-drugs-found-in-the-medicare-spending-datasets-have-multiple-active-ingredients" class="section level3">
 <h3>How many of the drugs found in the Medicare spending datasets have multiple active ingredients?</h3>
+<p>Let’s join the spending and FDA_NDC datasets</p>
 <p>Let’s match the <strong>proprietaryname</strong> column to the <strong>drugname_brand</strong> column. If there are any matches I’m going to pull the corresponding <strong>productids</strong> and hold them in a vector.</p>
 <pre class="r"><code>match.ids &lt;- ndc.dat[ ndc.dat$proprietaryname %in% spd.dat$drugname_brand, c('productid') ]
 match.ids &lt;- as.character( match.ids )
 length(match.ids)</code></pre>
 <pre><code>## [1] 32567</code></pre>
-<p>How many unique match ids are there?</p>
+<p>How many unique match ids are there? Testing the <code>proprietaryname</code></p>
 <pre class="r"><code>length(unique(match.ids))</code></pre>
 <pre><code>## [1] 29585</code></pre>
 <p>Since there are about 32.5k total <strong>match.ids</strong> and 29.5k unique <strong>match.ids</strong> then there are around 3k spending_2011 <strong>drugname_brands</strong> that have multiple active ingredients.</p>
@@ -255,9 +256,7 @@ <h3>How many of the drugs found in the Medicare spending datasets have multiple
 <pre class="r"><code>#Number of drugname_brands that match in proprietaryname that also share a row with a drugname_generic that matches a substancename
 sum(spd.dat[ spd.dat$drugname_brand %in% ndc.dat$proprietaryname, c('drugname_generic') ] %in% ndc.dat$substancename )</code></pre>
 <pre><code>## [1] 1292</code></pre>
-<p>I’m going to create a subvector of <strong>match.ids</strong> which contains only ids that occur more than once, to see if the spending data <strong>drugname_generic</strong> also contains multiple names.</p>
-<pre class="r"><code>id.occs &lt;- as.data.frame( table( match.ids ) )
-mult.act.ingd &lt;- id.occs[ id.occs$Freq &gt; 1, ]$match.ids</code></pre>
+<pre class="r"><code>#ndc.spd.join &lt;- join(spd.dat, ndc.dat, type=&quot;inner&quot;, by=c('proprietaryname'))</code></pre>
 <p>(IN PROGRESS)</p>
 </div>
 </div>
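
Note on the join step in `anlyz_fda_ndc.Rmd`: the rename-then-join pattern can be checked on a small example before running it against data.world. The sketch below is only illustrative. The data frames `spd.toy` and `ndc.toy` and their values are invented, the column names `brand_name` and `generic_name` are assumed from the `SELECT` in the spending query, and base R `merge()` is swapped in for `plyr::join()` so the snippet has no package dependencies.

```r
# Toy stand-ins for the two query results; all values are invented for illustration.
spd.toy <- data.frame(
    brand_name   = c('aspirin', 'lipitor', 'madeupdrug'),
    generic_name = c('aspirin', 'atorvastatin calcium', 'nothing'),
    stringsAsFactors = FALSE
)
ndc.toy <- data.frame(
    proprietaryname    = c('aspirin', 'lipitor', 'lipitor'),
    nonproprietaryname = c('aspirin', 'atorvastatin calcium', 'atorvastatin'),
    stringsAsFactors = FALSE
)

# Rename by column name rather than by position, so the result does not
# depend on the order of the columns in the SELECT.
names(spd.toy)[ names(spd.toy) == 'brand_name' ] <- 'proprietaryname'

# Base R inner join (used here in place of plyr::join; merge() defaults to all = FALSE).
toy.join <- merge(spd.toy, ndc.toy, by = 'proprietaryname')

# Count rows where the generic name agrees with the nonproprietary name.
# na.rm = TRUE keeps a single missing value from turning the whole sum into NA.
n.rows  <- nrow(toy.join)
n.match <- sum( toy.join$generic_name == toy.join$nonproprietaryname, na.rm = TRUE )
```

Renaming by name is a deliberate choice here: a positional rename such as `names(x)[2] <- ...` silently targets a different column if the query's column order changes.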