The bigvis package provides tools for exploratory data analysis of large datasets (10-100 million observations). The aim is for most operations to take less than 5 seconds on commodity hardware, even for 100 million data points.
Since bigvis is not currently available on CRAN, the easiest way to try it out is to:
```R
# install.packages("devtools")
devtools::install_github("hadley/bigvis")
```
The bigvis package is structured around the following workflow:

- `bin()` and `condense()` to get a compact summary of the data
- if the estimates are rough, you might want to `smooth()`; see `best_h()` and `rmse_cvs()` to figure out a good starting bandwidth
- if you're working with counts, you might want to `standardise()`
- visualise the results with `autoplot()` (you'll need to load ggplot2 to use this), as in the sketch below
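To make that workflow concrete, here is a minimal end-to-end sketch on simulated data. The bin width (0.1) and bandwidth (0.25) are illustrative guesses rather than recommended values, and the exact argument names may differ slightly in the installed version, so check the package documentation.

```R
library(bigvis)
library(ggplot2)  # needed for autoplot()

# Simulate a reasonably large numeric vector
x <- rnorm(1e6)

# 1. Bin and condense to a compact summary (counts per bin);
#    0.1 is an illustrative bin width
xsum <- condense(bin(x, 0.1))

# 2. Smooth the rough estimates; h = 0.25 is a guessed bandwidth,
#    and best_h() / rmse_cvs() can suggest a better starting value
xsmu <- smooth(xsum, h = 0.25)

# 3. Visualise the smoothed summary
autoplot(xsmu)
```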
Bigvis also provides a number of standard statistics efficiently implemented on weighted/binned data: `weighted.median()`, `weighted.IQR()`, `weighted.var()`, `weighted.sd()`, `weighted.ecdf()` and `weighted.quantile()`.
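As a quick illustration of how these might be called, the sketch below assumes a `weighted.mean()`-style `(x, w)` argument order; the exact signatures may differ, so consult the help pages.

```R
library(bigvis)

# Values and their weights, e.g. bin centres and bin counts
# taken from a condensed summary
x <- c(1, 2, 3, 4, 5)
w <- c(10, 20, 40, 20, 10)

# Weighted centre and spread, assuming an (x, w) argument order
weighted.median(x, w)
weighted.IQR(x, w)
weighted.sd(x, w)
```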
This package wouldn't be possible without:

- the fantastic Rcpp package, which makes it amazingly easy to integrate R and C++
- JJ Allaire and Carlos Scheidegger, who have indefatigably answered my many C++ questions
- the generous support of Revolution Analytics, who supported the early development
- Yue Hu, who implemented a proof of concept that showed it might be possible to work with this much data in R