This package provides a collection of custom ggplot2
extensions - Geoms
, Stats
, Coords
, Themes
, etc. The name grcdr
is a contraction of Graphs with Code and Data in R and is a reference to the Graphs with Code and Data blog, to which this package is loosely related.
You can install grcdr
from github with:
# install.packages("devtools")
devtools::install_github("bdilday/grcdr")
The below will use dplyr
and ggplot2
library(grcdr)
## load basics
library(dplyr)
library(ggplot2)
## set the theme
ggplot2::theme_set(theme_minimal(base_size = 14))
geom_excursion
plots running quantities as connected scatter plots. It requires an x
and y
aesthetic and also a t
aesthetic to give the ordering (i.e. "time")
Here's some example data provided with the package that gives team-level stats for 4 MLB teams (Cleveland 1999, New York Yankees 1998, Houston Astros 2017, Milwaukee Brewers 1982).
# load some example data from the package
teams_df = read.csv(system.file("extdata/team_stats.csv", package = "grcdr"),
stringsAsFactors = FALSE)
teams_df$game_date = as.Date(teams_df$game_date)
head(teams_df, 2)
#> k game_key game_source game_date game_number site_key
#> 1 CLE_1999 ANA199904060 evt 1999-04-06 0 ANA01
#> 2 CLE_1999 ANA199904070 evt 1999-04-07 0 ANA01
#> season_phase team_alignment team_key opponent_key r_g r_w r_l r_t b_g
#> 1 R 0 CLE ANA 1 0 1 0 1
#> 2 R 0 CLE ANA 1 1 0 0 1
#> b_pa b_ab b_r b_h b_tb b_2b b_3b b_hr b_hr4 b_rbi b_gw b_bb b_ibb b_so
#> 1 41 36 5 10 16 1 1 1 0 4 NA 4 0 6
#> 2 47 40 9 13 17 4 0 0 0 9 NA 6 0 7
#> b_gdp b_hp b_sh b_sf b_sb b_cs b_xi b_lob p_g p_gs p_cg p_sho p_csho
#> 1 0 0 1 0 0 0 0 9 1 1 0 0 0
#> 2 1 1 0 0 2 0 0 11 1 1 0 0 0
#> p_gf p_w p_l p_sv p_out p_tbf p_ab p_r season season_game_number
#> 1 1 0 1 0 24 35 32 6 1999 1
#> 2 1 1 0 0 27 34 31 1 1999 2
We can plot runs scored (b_r
) on the x-axis and runs allowed (p_r
) on the y-axis. The time coordinate is the season game number.
base_plot = teams_df %>%
ggplot() + labs(x="runs scored", y="runs allowed")
By default the data aren't averaged (the run length is 1)
p = base_plot +
geom_excursion(aes(x=b_r, y=p_r, t=season_game_number)) +
facet_wrap(~k)
print(p)
The run_length
parameter controls how many items get summed. The following uses run_length = 10
p = base_plot +
geom_excursion(aes(x=b_r, y=p_r, t=season_game_number), run_length = 10) +
facet_wrap(~k)
print(p)
If you give x_weight
or y_weight
aesthetics then weighted averages are computed instead of sums. Passing _weight = 1
therefore results in straight averages.
p = base_plot +
geom_excursion(aes(x=b_r, y=p_r, t=season_game_number, x_weight=1, y_weight=1), run_length = 10) +
facet_wrap(~k)
print(p)
Additionally, the time ordering can be changed
set.seed(101)
random_idx = sample(1:nrow(teams_df), nrow(teams_df))
# order by random
p = teams_df %>% cbind.data.frame(random_idx=random_idx) %>%
ggplot() + labs(x="runs scored", y="runs allowed") +
geom_excursion(aes(x=b_r, y=p_r, t=random_idx,
x_weight=1, y_weight=1),
run_length = 10) +
facet_wrap(~k)
print(p)
stat_run
is a lower-level utility than geom_excursion
. It can change the geom
(from path
) and also can plot running line charts in addition to connected scatterplots.
Here I set the y aesthetic but not x, which is effectively a line chart.
p = teams_df %>% ggplot() +
stat_run(aes(y=b_r, t=season_game_number)) +
facet_wrap(~k) + labs(x="game number", y="runs scored")
print(p)
However, unlike a traditional line chart, I can average or sum the y variable
p = teams_df %>% ggplot() +
stat_run(aes(y=b_r, t=season_game_number), run_length = 10) +
facet_wrap(~k) + labs(x="game number", y="runs scored")
print(p)
The default is to generate running totals by using the cumulative sum function and taking differences. A different cumulative aggregation function can be specified, however. Note that following is for illustration and that the run_fun doesn't correspond to any particularly useful quantity(that I'm aware of, anyway).
fun_with_cumsum_fun = function(x) {
cumsum(x * (x - 1) * sin(x / 5 * pi))
}
p = teams_df %>% ggplot() +
stat_run(aes(y=p_r, t=season_game_number),
run_length = 10,
y_run_fun = fun_with_cumsum_fun) +
facet_wrap(~k) + labs(x="game number", y="runs scored")
print(p)
Setting the x
aesthetic reproduces a geom_excursion
p = teams_df %>% ggplot() +
stat_run(aes(x=b_r, y=p_r, t=season_game_number), run_length = 10) +
facet_wrap(~k) + labs(x="runs scored", y="runs allowed")
print(p)
If the time coordinate is missing values, we can fill in the corresponding x and y.
censored_df = teams_df %>% filter(season_game_number < 40 | season_game_number >60)
p = censored_df %>%
ggplot() +
stat_run(aes(y=b_r, t=season_game_number, x=season_game_number)) + facet_wrap(~k)
print(p)
p = censored_df %>%
ggplot() +
stat_run(aes(y=b_r, t=season_game_number, x=season_game_number),
run_length = 10) +
facet_wrap(~k)
print(p)
p = censored_df %>%
ggplot() +
stat_run(aes(y=b_r, t=season_game_number, x=season_game_number),
run_length = 10, y_run_fill_value = 20, x_run_fill_value = 50) +
facet_wrap(~k)
print(p)
The geom_excursion
layer forces a path
Geom
, but the lower-level stat_run
layer can change the Geom
. For example it can use polygon
(although unclear what the interpretation is)
p = teams_df %>%
ggplot() +
stat_run(aes(x=b_r, y=p_r, t=game_date), geom='polygon') +
facet_wrap(~k)
print(p)
This stat applies dimensionality reduction using multi-dimensional scaling. As of this writing the available algorithms are principal components analysis (pca
) or t-distributed stochastic neighbor embedding (tsne
). The variables to use in the dimensionality reduction are passed in the aesthetics x#
where #
is an arbitrary integer. The default geom
is GeomPoint
.
set.seed(101)
df1 = data.frame(x1 = rnorm(100))
for (i in 2:10) {
k = sprintf("x%d", i)
df1[,k] = rnorm(100)
}
# now, for the last 25 add a constant to create two well separated groups
df1[75:100, ] = df1[75:100,] + 2
Use only 2 variables
set.seed(101)
p = df1 %>% ggplot(aes(x1=x1, x2=x2)) +
stat_mds(mds_method = "pca")
print(p)
Use them all
set.seed(101)
p = df1 %>% ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5,
x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) +
stat_mds(mds_method = "pca")
print(p)
Use them all and label them
set.seed(101)
p = df1 %>% mutate(rn=row_number()) %>%
ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5,
x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) +
stat_mds(mds_method = "pca", geom="text", aes(label=rn))
print(p)
Apply t-SNE. This requires the Rtsne
package.
set.seed(101)
p = df1 %>% mutate(rn=row_number()) %>%
ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5,
x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) +
stat_mds(mds_method = "tsne", geom="text", aes(label=rn))
print(p)
#> Loading required package: Rtsne
This geom implements a tail scatter plot. It is inspired by the xenographics project. The x
and y
aesthetics are points in a two-d plane. Subsequent variables are passed in aesthetics named x#
where x is an arbitrary integer. They do not need to start at 1
, however, the order will be interpreted lexigraphically. The x#
variables are mapped to lines extending at an angle of -(15 + 30 * i)
degrees. This means that variables trying to use 12 or more variables in addition to x
and y
is not supported at this time and will result in lines that overlap.
Some simulated data
set.seed(101)
df1 = data.frame(x1 = rnorm(100), x2 = rnorm(100))
df1$x3 = with(df1, x1**2 + abs(x2))
df1$x4 = 100 * df1$x1 ** 2
# make a categorical var
df1$g = factor(sample(c(0,1), 100, replace = TRUE))
Plot with geom_tailscatter
p = df1 %>%
ggplot(aes(x=x1, y=x2, x3=x3, x4=x4)) +
geom_tailscatter(size=2)
print(p)
The parameter tail_scale
controls the length of the tail lines
p = df1 %>%
ggplot(aes(x=x1, y=x2, x3=x3, x4=x4)) +
geom_tailscatter(size=2, tail_scale = 0.5)
print(p)
Color by group
p = df1 %>%
ggplot(aes(x=x1, y=x2, x3=x3, x4=x4, color=g)) +
geom_tailscatter(size=2)
print(p)
tsne_linked
is an htmlwidget
. It takes a data set, projects it into 2-dimensions using the t-SNE
algorithm, and then plots a 2-d scatter plot. The points in the scatter plot are linked to a bar graph that shows the values of the coordinates that went into the t-SNE
calculation. The scatter plot uses a Voronoi tessellation to make the mouse-over highlighting smoother.
simulated data
set.seed(101)
df1 = data.frame(x1 = rnorm(100), x2 = rnorm(100))
df1$x3 = rnorm(100)
df1$x4 = rnorm(100)
df1$id = row.names(df1)
df1$g = ifelse(df1$x1 > 0, 1, 0)
tsne_coords = c("x1", "x2", "x3", "x4")
tsne_linked(df1,
tsne_coords = tsne_coords,
label_var = "id",
group_var = "g")