-
Notifications
You must be signed in to change notification settings - Fork 66
/
17-poisson-negbin-glmm.Rmd
364 lines (270 loc) · 14.1 KB
/
17-poisson-negbin-glmm.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
# Count-data GLMMs
# Goals
- Learn to fit Poisson and negative binomial GLMMs with various packages in R
- Learn how to criticize GLMM fit
- Learn how to check for overdispersion in GLMMs
- Learn how to interpret and plot output from count GLMMs
# Data
We are going to work with one of the data sets from:
Artelle, K. A., S. C. Anderson, J. D. Reynolds, A. B. Cooper, P. C. Paquet, and C. T. Darimont. 2016. Ecology of conflict: marine food supply affects human-wildlife interactions on land. Scientific Reports 6:25936.
The data set represents the number of grizzly bears that were killed because of conflict with humans (`late_kills`) across various Grizzly Bear Population Units (`gbpu`). There are a number of possible predictors. First, the annual predictors: an index of salmon biomass, the number of conflict kills in the previous 3 years, the numeric year, and the mean annual temperature. Then a number of group-level predictors: mean temperature, mean annual precipitation, human population, and the number of grizzly bears in the unit. There is an offset term for usable area for grizzly bears in the population unit so that the number of kills is scaled to be per available area. These predictors have been transformed (e.g. log), centered, and scaled by 2 times their standard deviations to make the magnitude of their effects approximately comparable. This is a slightly simplified version of the data set and model used in the paper.
Let's read in the data and look at it:
```{r}
library(tidyverse)
d <- readRDS("data/raw/conflict-salmon-for-modelling.rds") %>%
select(late_kills,
gbpu,
salmon_biomass_geo_mean,
prev_3yr_conflict_scaled,
year_centered,
log_humanpop_scaled,
log_grizzly_pop_est_scaled,
gbpu_usable_area) %>%
mutate(
year_scaled = year_centered / (2 * sd(year_centered)),
gbpu_usable_area = gbpu_usable_area / 1e9) %>%
rename(salmon_scaled = salmon_biomass_geo_mean)
head(d)
```
We are going to use the below formula a number of times. So let's store it in a variable we will call `f`.
```{r}
f <- late_kills ~
salmon_scaled +
prev_3yr_conflict_scaled +
year_scaled +
log_humanpop_scaled +
log_grizzly_pop_est_scaled +
offset(log(gbpu_usable_area))
```
Let's look at the data in a few ways.
```{r}
ggplot(d, aes(year_centered, late_kills, color = salmon_scaled)) +
geom_point() +
facet_wrap(~gbpu)
```
```{r}
ggplot(d, aes(salmon_scaled, log(late_kills+0.5), color = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0))
```
```{r}
ggplot(d, aes(salmon_scaled, log((late_kills+0.5)/gbpu_usable_area),
color = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0)) +
geom_smooth(method = "lm", se = FALSE)
```
```{r}
ggplot(d, aes(salmon_scaled, log(late_kills+0.5), color = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0), alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, colour = "black")
```
```{r}
ggplot(d, aes(salmon_scaled, log(late_kills+0.5),
color = gbpu, group = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0),
alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, alpha = 0.7, colour = "black")
```
```{r}
ggplot(d, aes(salmon_scaled, late_kills, color = gbpu, group = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0),
alpha = 0.3) +
geom_smooth(method = "glm", method.args = c(family = poisson), se = FALSE, colour = "black")
```
```{r}
ggplot(d, aes(salmon_scaled, late_kills, color = gbpu)) +
geom_point(position = position_jitter(height = 0.4, width = 0),
alpha = 0.3) +
geom_smooth(method = "glm", method.args = c(family = poisson), se = FALSE,
colour = "black")
```
Are there other useful ways you can think of looking at the data?
# Considering various response distributions
We can get a feeling for what distribution might work best by plotting the mean and variance for each group level:
```{r}
dg <- group_by(d, gbpu) %>%
summarise(m = mean(late_kills), v = var(late_kills))
ggplot(dg, aes(m, v)) +
geom_smooth(method = "lm", formula = y ~ x - 1, se = F, color = "blue") +
geom_smooth(method = "lm",
formula = y ~ I(x^2) + offset(x) - 1, colour = "red", se = F) +
geom_abline(intercept = 0, slope = 1, lty = 2) +
geom_point()
```
In the above plot, the dashed line represents the Poisson (one-to-one), the blue line represents the NB1 negative binomial parameterization, and the red line represents the NB2 negative binomial parameterization (the more usual one we have been working with previously).
The NB2 looks like it fits the smaller values the best, although the NB1 better represents some of the larger values. We can tell that the Poisson doesn't represent the data very well. The variance is growing faster as the mean increases than it allows. Nonetheless, we will start off with the simpler Poisson distribution as a learning exercise and build from there.
# Modeling
Let's start by using the lme4 package as we have before. Instead of using the `lmer` function we will use the `glmer` function. (Although `lmer` will work too.)
We have count data. So one obvious place to start is with the Poisson distribution and a log link.
```{r}
library(lme4)
f_lme4 <- paste(c(f, "+ (1 | gbpu)"), collapse = " ") %>% as.formula()
f_lme4
m_glmer <- glmer(f_lme4, data = d, family = poisson(link = "log"))
summary(m_glmer, correlation = FALSE)
```
We can look at these default plots, but the default residual plot is not that helpful here.
```{r}
plot(m_glmer)
lattice::dotplot(ranef(m_glmer, condVar = TRUE))
```
We can also extract fixed effects, random effects, and their standard errors with:
```{r}
fixef(m_glmer)
arm::se.fixef(m_glmer)
ranef(m_glmer)
arm::se.ranef(m_glmer)
```
There are 3 kinds of confidence intervals we can get out of lme4, but only one that will run quickly right now.
```{r}
confint(m_glmer, method = "Wald")
# confint(m_glmer, method = "profile")
# confint(m_glmer, method = "boot")
```
One useful summary function is
```{r}
coef(summary(m_glmer))
```
# Checking for overdispersion
Now we should check for overdispersion. We can do that by comparing the residual deviance and the sum of the pearson residuals squared with the residual degrees of freedom. These should be approximately the same. If the degrees of freedom are considerably smaller then we have evidence of over dispersion. If the residual deviance and sum of Pearson residuals squared disagree, then use the Pearson residuals. See Ben Bolker's worked examples here for more details <https://rpubs.com/bbolker/glmmchapter>.
We can see the degrees of freedom from the output of `summary()`. And we can calculate the sum of the Pearson residuals squared:
```{r}
sum(residuals(m_glmer, "pearson")^2)
aods3::gof(m_glmer) # alternative shortcut
```
So it looks like we have some overdispersion here.
We can try dealing with this by adding an observation-level random intercept or by switching to a negative binomial distribution.
Theoretically, we could fit a negative binomial GLMM with the function `lme4::glmer.nb`, but the following fails to converge for me.
```{r, eval=FALSE}
m_glmer_nb <- glmer.nb(f_lme4, data = d)
```
We could try removing the offset term (or simplified by removing any other terms). Removing the offset term did not help with convergence for me.
So instead, let's start by adding an observation-level random intercept. This simply requires making a dummy variable with a level for each row of data:
```{r}
d$obs <- seq_len(nrow(d))
```
We then fit the model:
```{r}
f_lme4_obs <- paste(c(f, "+ (1 | gbpu) + (1 | obs)"), collapse = " ") %>%
as.formula()
m_lme4_obs <- glmer(f_lme4_obs, data = d, family = poisson(link = "log"))
summary(m_lme4_obs)
```
# MASS package penalized quasi-likelihood version
We can try the function `glmmPQL` in the package MASS, which lets us fit the model with something called penalized quasi-likelihood. "Penalized quasi-likelihood" doubles as a great conservation starter at cocktail parties.
```{r}
m_pql <- MASS::glmmPQL(f, random = ~ 1 | gbpu/obs, data = d,
family = poisson(link = "log"))
summary(m_pql)
```
This one converges. In the background the function is using nlme. Therefore, we have to work with the object as if it was from nlme. For example:
```{r}
nlme::intervals(m_pql, which = "fixed")
```
# glmmTMB with negative binomial distributions
It's always a good idea to fit GLMMs with multiple methods if possible to ensure the conclusions are not sensitive to the optimization method or that your model has not converged. These are not nearly as simple as GLMs or linear-mixed effect models!
One of the greatest things since sliced bread is the TMB package for R. TMB normally requires you to write your own model template, but the package glmmTMB comes with a number of pre-written and tested GLMMs.
TMB, and therefore glmmTMB, is *fast* and very robust. <https://github.com/glmmTMB>
```{r}
library(glmmTMB)
m_tmb_pois <- glmmTMB(f_lme4, data = d, family = poisson(link = "log"))
summary(m_tmb_pois)
m_tmb_obs <- glmmTMB(f_lme4_obs, data = d, family = poisson(link = "log"))
summary(m_tmb_obs)
m_tmb_nb <- glmmTMB(f_lme4, data = d, family = nbinom2(link = "log"))
summary(m_tmb_nb)
m_tmb_nb1 <- glmmTMB(f_lme4, data = d, family = nbinom1(link = "log"))
summary(m_tmb_nb1)
bbmle::AICtab(m_tmb_pois, m_tmb_nb, m_tmb_nb1)
```
So the NB2 parameterization has the lowest AIC.
The residuals are hard to interpret still:
```{r}
plot(log(fitted(m_tmb_nb)), residuals(m_tmb_nb, type = "response"))
plot(log(fitted(m_tmb_nb)), residuals(m_tmb_nb, type = "pearson"))
plot(log(fitted(m_tmb_nb)), sqrt(abs(residuals(m_tmb_nb, type = "pearson"))))
```
Let's plot the residuals by group:
```{r}
d$res_nb <- residuals(m_tmb_nb, type = "response")
ggplot(d, aes(gbpu, res_nb)) + geom_boxplot() + coord_flip()
```
# Brief introduction to a Bayesian Stan version
If we want to accurately explore the full uncertainty on all parameters, including what we have been calling random effects, then we can fit a Bayesian version. The new package rstanarm makes this relatively simple with a function that mimics the function `lme4::lmer` but fits a Stan model in the background. Stan is a programming language and software implementation of the Bayesian Hamiltonian Monte Carlo No-U-Turn Sampler (not a good cocktail party topic). Any more than that is well beyond the focus of this workshop.
Warning: If you are considering using a similar model yourself you will have to do some considerable reading first or find somebody to work with. Note that because this is a Bayesian model you have to choose prior distributions on your parameters. rstanarm has carefully thought out priors that apply to most conditions, but it's important that you understand what these are and whether they make sense for your data.
```{r stan, warning=FALSE}
library(rstanarm)
options(mc.cores = parallel::detectCores()) # setup parallel processing
mstan_nb <- rstanarm::stan_glmer.nb(f_lme4, data = d, chains = 4, iter = 400)
summary(mstan_nb)
rstan::stan_plot(mstan_nb)
rstan::stan_trace(mstan_nb)
rstan::stan_hist(mstan_nb)
ggplot2::theme_set(theme_grey()) # reset theme
save(mstan_nb, file = "data/generated/stan-models.rda")
```
```{r}
load("data/generated/stan-models.rda")
stan_est <- broom.mixed::tidyMCMC(mstan_nb, conf.int = TRUE)
stan_est %>% filter(grepl("gbpu", term)) %>%
ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
geom_pointrange() +
coord_flip()
stan_re_int <- stan_est %>% filter(grepl("gbpu", term))
qqnorm(stan_re_int$estimate)
qqline(stan_re_int$estimate)
```
# Comparing estimates across methods
Let's compare confidence intervals:
```{r compare}
get_tmb_cis <- function(x, name) {
ci <- confint(x)
ci <- ci[!grepl("Std.Dev|sigma", row.names(ci)), ]
ci <- data.frame(ci)
names(ci) <- c("conf.low", "conf.high", "estimate")
ci$model <- name
ci$term <- names(fixef(x)$cond)
ci
}
cis <- list()
cis[[1]] <- get_tmb_cis(m_tmb_pois, "TMB Poisson")
cis[[2]] <- get_tmb_cis(m_tmb_obs, "TMB Poisson w obs RE")
cis[[3]] <- get_tmb_cis(m_tmb_nb, "TMB Negative binomial (NB2)")
cis[[4]] <- get_tmb_cis(m_tmb_nb1, "TMB Negative binomial (NB1)")
cis[[5]] <- broom.mixed::tidy(m_glmer, conf.int = TRUE) %>%
dplyr::select(conf.low, conf.high, estimate, term) %>%
mutate(model = "lme4::glmer Poisson") %>%
filter(term != "sd__(Intercept)")
cis[[6]] <- stan_est %>%
dplyr::select(conf.low, conf.high, estimate, term) %>%
filter(!grepl("Intercept", term), term != "reciprocal_dispersion") %>%
mutate(model = "Stan Negative binomial (NB2)")
cis_df <- bind_rows(cis) %>% filter(term != "(Intercept)")
```
```{r plot-comparison}
ggplot(cis_df, aes(term, estimate,
ymin = conf.low, ymax = conf.high, colour = model)) +
geom_pointrange(position = position_dodge(width = 0.5)) +
coord_flip() +
geom_hline(yintercept = 0, linetype = 2)
filter(cis_df, term == "salmon_scaled") %>%
ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high,
colour = model)) +
geom_pointrange(position = position_dodge(width = 0.5)) +
coord_flip() +
geom_hline(yintercept = 0, linetype = 2)
```
What do you notice about the various models?
# Additional information
Very useful worked examples of GLMMs by Ben Bolker:
<https://rpubs.com/bbolker/glmmchapter>
Using observation-level random effects for overdispersion:
Harrison, X. A. 2014. Using observation-level random effects to model overdispersion in count data in ecology and evolution. PeerJ 2:e616. https://peerj.com/articles/616/
My favourite introduction to Bayesian modelling book:
Hobbs, N.T., Hooten, M.B. 2015. Bayesian Models: A Statistical Primer for Ecologists.
A good introduction to Bayesian data analysis, although the current version uses WinBUGS/JAGS:
Gelman, A., and J. Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.
The Bayesian data analysis bible, but not an easy read:
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2014. Bayesian Data Analysis. Chapman & Hall, Boca Raton, FL.
A great recent and modern introduction to Bayesian modeling with Stan:
McElreath, R. 2016. Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
Stan: <http:https://mc-stan.org/>