---
title: "Approximate leave-future-out cross-validation for Bayesian time series models"
author: "Paul-Christian Bürkner $^{1*}$, Jonah Gabry $^2$, & Aki Vehtari $^1$"
date: |
  $^1$ Department of Computer Science, Aalto University, Finland\break
  $^2$ Applied Statistics Center and ISERP, Columbia University, USA \break
  $^*$ Corresponding author, Email: [email protected]
abstract: |
  One of the common goals of time series analysis is to use the observed series
  to inform predictions for future observations. In the absence of any actual
  new data to predict, cross-validation can be used to estimate a model's future
  predictive accuracy, for instance, for the purpose of model comparison or
  selection. Exact cross-validation for Bayesian models is often
  computationally expensive, but approximate cross-validation methods have been
  developed, most notably methods for leave-one-out cross-validation (LOO-CV).
  If the actual prediction task is to predict the future given the past, LOO-CV
  provides an overly optimistic estimate because the information from future
  observations is available to influence predictions of the past. To properly
  account for the time series structure, we can use
  leave-future-out cross-validation (LFO-CV). Like exact LOO-CV, exact LFO-CV
  requires refitting the model many times to different subsets of the data.
  Using Pareto smoothed importance sampling, we propose a method for
  approximating exact LFO-CV that drastically reduces the computational costs
  while also providing informative diagnostics about the quality of the
  approximation.\linebreak\linebreak
  Keywords: Time Series Analysis, Cross-Validation, Bayesian Inference, Pareto Smoothed Importance Sampling
lang: en-US
class: man
# figsintext: true
numbersections: true
encoding: UTF-8
bibliography: LFO-CV
biblio-style: apalike
output:
  bookdown::pdf_document2:
    citation_package: natbib
    keep_tex: true
    toc: false
header-includes:
  - \usepackage{amsmath}
  - \usepackage[utf8]{inputenc}
  - \usepackage[T1]{fontenc}
  - \usepackage{setspace}
  - \onehalfspacing
  # - \DeclareMathOperator{\istar}{i^\star}
  # - \setcitestyle{round}
  - \newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}
editor_options:
  chunk_output_type: console
---
```{r setup, cache = FALSE, include = FALSE}
knitr::opts_chunk$set(
  cache = TRUE,
  echo = FALSE,
  warning = FALSE,
  message = FALSE
)
options(knitr.kable.NA = '')
```
```{r packages, cache = FALSE, include = FALSE}
library(knitr)
library(kableExtra)
library(latex2exp)
library(tidyverse)
library(brms)
library(loo)
source("sim_functions.R")
# set ggplot theme
theme_set(bayesplot::theme_default())
# set rstan options
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = max(1, parallel::detectCores() - 1))
```
```{r functions}
fmt <- function(x, digits = 1, ...) {
  format(x, digits = digits, nsmall = digits, ...)
}
```
# Introduction
A wide range of statistical models for time series have been developed, finding
applications in industry and nearly all empirical sciences
[e.g., see @brockwell2002; @hamilton1994].
One common goal of a time series analysis is to use the observed
series to inform predictions for future time points. In this paper we will
assume a Bayesian approach to time series modeling, in which case if it is
possible to sample from the posterior _predictive_ distribution implied by a
given time series model, then it is straightforward to generate predictions as
far into the future as we want. When working in discrete time we will refer to
the task of predicting a sequence of $M$ future observations as $M$-step-ahead
prediction ($M$-SAP).
It is easy to evaluate the $M$-SAP performance of a time series
model by comparing the predictions to the observed sequence of $M$ future data
points once they become available. However, we would often like to estimate
the future predictive performance of a model _before_ we are able to collect
additional observations. If there are many competing models we may also need to
first decide which model (or which combination of the models) to
rely on for prediction [@geisser1979; @hoeting1999; @vehtari2002; @ando2010;
@vehtari2012].
In the absence of new data, one general approach for evaluating a model's
predictive accuracy is cross-validation. The data are first split into two
subsets; we then fit the statistical model to the first subset and evaluate
predictive performance on the second subset. We may do this once or many
times, each time leaving out a different subset.
If the data points are not ordered in time, or if the goal is to assess the
non-time-dependent part of the model, then we can use leave-one-out
cross-validation (LOO-CV). For a data set with $N$ observations, we refit the
model $N$ times, each time leaving out one of the $N$ observations and assessing
how well the model predicts the left-out observation. Due to the number of
required refits, exact LOO-CV is computationally expensive, in particular when
performing full Bayesian inference and refitting the model means estimating a
new posterior distribution rather than a point estimate. But it is possible to
approximate exact LOO-CV using Pareto smoothed importance sampling
[PSIS; @vehtari2017loo; @vehtari2019psis].
PSIS-LOO-CV only requires a single fit of the full model and has sensitive
diagnostics for assessing the validity of the approximation.
However, using LOO-CV with times series models is problematic if the goal is to
estimate the predictive performance for future time points. Leaving out only one
observation at a time will allow information from the future to influence
predictions of the past (i.e., data from times $t+1, t+2, \ldots,$ would inform
predictions for time $t$). Instead, to apply the idea of cross-validation to the
$M$-SAP case we can use what we will refer to as leave-*future*-out
cross-validation (LFO-CV). LFO-CV does not refer to one particular prediction
task but rather to various possible cross-validation approaches that all involve
some form of prediction of future time points.
Like exact LOO-CV, exact LFO-CV requires refitting the model many times to
different subsets of the data, which is computationally expensive, in particular
for full Bayesian inference. In this paper, we extend the ideas from PSIS-LOO-CV
and present PSIS-LFO-CV, an algorithm that typically only requires refitting a
time series model a small number of times. This will make LFO-CV tractable for many
more realistic applications than previously possible, including time series
model averaging using stacking of predictive distributions [@yao2018].
The efficiency of PSIS-LFO-CV compared to exact LFO-CV relies on the
ability to compute samples from the posterior predictive distribution (required
for the importance sampling) in much less time than it takes to fully refit the
model. This assumption is most likely justified when estimating a model using
full Bayesian inference via MCMC, variational inference, or related methods,
as these procedures are computationally intensive. We do not make any
assumptions about *how* samples from the posterior or the posterior predictive
density at a given point in time have been obtained.
Our proposed algorithm was motivated by the practical need for efficient
cross-validation tools for Bayesian time series models fit using generic
probabilistic programming frameworks, such as Stan [@carpenter2017], JAGS
[@jags], PyMC3 [@pymc3], and Pyro [@pyro]. In recent years, these frameworks
have also become very popular for the analysis of time series models. For many
models, inference could also be performed using sequential Monte Carlo (SMC)
[e.g., @doucet2000; @andrieu2010] using, for example, the SMC-specific framework
Birch [@birch]. The implementation details of LFO-CV for SMC algorithms are
beyond the scope of this paper.\footnote{Most SMC algorithms use importance
sampling and LFO-CV could be obtained as a by-product, with computation
resembling the approach we present here. The proposal distribution at each step
and the applied "refit" approach (when the importance sampling weights become
degenerate) are specific to each SMC algorithm.}
The structure of the paper is as follows. In Section \@ref(m-sap), we introduce
the idea and various forms of $M$-step-ahead prediction and how to approximate
them using PSIS. In Section \@ref(simulations), we evaluate the accuracy of the
approximation using extensive simulations. Then, in Section \@ref(case-studies),
we provide two case studies demonstrating the application of PSIS-LFO-CV methods
to real data sets. In the first we model changes in the water level of Lake
Huron and in the second the date of the yearly cherry blossom in Kyoto. We end
in Section \@ref(discussion) with a discussion of the usefulness and limitations
of our approach.
# $M$-step-ahead predictions {#m-sap}
Assume we have a time series of observations $y = (y_1, y_2, \ldots, y_N)$
and let $L$ be the _minimum_ number of observations from the series that
we will require before making predictions for future data. Depending on the
application and how informative the data are, it may not be possible to make
reasonable predictions for $y_{i+1}$ based on $(y_1, \dots, y_{i})$ until $i$ is
large enough so that we can learn enough about the time series to predict future
observations. Setting $L=10$, for example, means that we will only assess
predictive performance starting with observation $y_{11}$, so that we
always have at least 10 previous observations to condition on.
In order to assess $M$-SAP performance we would like to compute the
predictive densities
\begin{equation}
p(y_{i+1:M} \,|\, y_{1:i}) =
p(y_{i+1}, \ldots, y_{i + M} \,|\, y_{1},...,y_{i})
\end{equation}
for each $i \in \{L, \ldots, N - M\}$, where we use $y_{1:i} = (y_{1}, \ldots, y_{i})$
and $y_{i+1:M} = y_{(i+1):(i+M)} = (y_{i+1}, \ldots, y_{i + M})$
to shorten the notation[^operator]. As a global measure of predictive accuracy, we
can use the expected log predictive density [ELPD; @vehtari2017loo], which,
for M-SAP, can be defined as
[^operator]: Note that the "$:$" operator used here takes precedence over the
"$+$" operator, as defined in the R programming language.
\begin{equation}
\label{ELPD}
{\rm ELPD} = \sum_{i=L}^{N - M}
\int p_t(\tilde{y}_{i+1:M}) \log p(\tilde{y}_{i+1:M} \,|\, y_{1:i})
\, {\rm d} \, \tilde{y}_{i+1:M}.
\end{equation}
The distribution $p_t(\tilde{y}_{i+1:M})$ describes the true data generating
process for new data $\tilde{y}_{i+1:M}$. As these true data generating
processes are unknown, we approximate the ELPD using LFO-CV of the observed
responses $y_{i+1:M}$, which constitute a particular realization of
$\tilde{y}_{i+1:M}$. This approach of approximating the true data generating
process with observed data is fundamental to all cross-validation procedures.
As we have no direct access to the underlying truth, the error implied by
this approximation is hard to quantify but also unavoidable [cf. @bernardo1994].
Plugging in the realization $y_{i+1:M}$ for $\tilde{y}_{i+1:M}$ leads
to [cf. @bernardo1994; @vehtari2012]:
\begin{equation}
{\rm ELPD}_{\rm LFO} = \sum_{i=L}^{N - M} \log p(y_{i+1:M} \,|\, y_{1:i}).
\end{equation}
The quantities $p(y_{i+1:M} \,|\, y_{1:i})$ can be computed with the help of the
posterior distribution $p(\theta \,|\, y_{1:i})$ of the parameters $\theta$
conditional on only the first $i$ observations of the time series:
\begin{equation}
\label{Lpred}
p(y_{i+1:M} \,| \, y_{1:i}) =
\int p(y_{i+1:M} \,| \, y_{1:i}, \theta) \,
p(\theta\,|\,y_{1:i}) \, {\rm d} \theta.
\end{equation}
In order to evaluate predictive performance of future data, it is important to
predict $y_{i+1:M}$ only conditioning on the past data $y_{1:i}$ and not on the
present data $y_{i+1:M}$. Including the present data in the posterior
estimation, that is, using the posterior $p(\theta\,|\,y_{1:(i+M)})$ in
\eqref{Lpred}, would result in evaluating in-sample fit. This corresponds to
what @vehtari2017loo call *log-predictive density* (LPD), which overestimates
predictive performance for future data [@vehtari2017loo].
Most time series models do not have conditionally independent observations, that
is, $y_{i+1:M}$ depend on $y_{1:i}$ even after conditioning on $\theta$. As
such, we cannot simplify the integrand in (\ref{Lpred}) and always need to take
$y_{1:i}$ into account when computing the predictive density of $y_{i+1:M}$. The
concept of conditional independence of observations is closely related to the
concept of factorizability of likelihoods. For the purpose of LFO-CV, we can
safely use the time-ordering naturally present in time-series data to obtain a
factorized likelihood even if observations are not conditionally independent.
See @buerkner:non-factorizable for discussion on computing predictive densities
of non-factorized models and factorizability in general.
In practice, we will not be able to directly solve the integral in
\eqref{Lpred}, but instead have to use Monte Carlo methods to approximate it.
Having obtained $S$ random draws $(\theta_{1:i}^{(1)}, \ldots, \theta_{1:i}^{(S)})$
from the posterior distribution $p(\theta\,|\,y_{1:i})$, we can estimate
$p(y_{i+1:M} | y_{1:i})$ as
\begin{equation}
p(y_{i+1:M} \,|\, y_{1:i}) \approx \frac{1}{S}
\sum_{s=1}^S p(y_{i+1:M} \,|\, y_{1:i}, \theta_{1:i}^{(s)}).
\end{equation}
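As a concrete sketch, this Monte Carlo estimate can be computed from a matrix of pointwise log-densities. The names below (`log_lik_draws`, `log_pred_density`) are illustrative, not part of any package; `log_lik_draws` stands for a hypothetical $S \times M$ matrix of values $\log p(y_{i+j} \,|\, y_{1:i}, \theta^{(s)})$:

```r
# Hypothetical helper: Monte Carlo estimate of log p(y_{i+1:M} | y_{1:i})
# from an S x M matrix of pointwise log-densities
# log p(y_{i+j} | y_{1:i}, theta^(s)).
log_mean_exp <- function(x) {
  # numerically stable log of the mean of exp(x)
  max_x <- max(x)
  max_x + log(mean(exp(x - max_x)))
}
log_pred_density <- function(log_lik_draws) {
  # joint log-density of the M future observations per draw,
  # then average over draws on the log scale
  log_mean_exp(rowSums(log_lik_draws))
}

# toy check: with identical draws the estimate is the joint log-density
ll <- matrix(log(0.5), nrow = 4, ncol = 2)
log_pred_density(ll)  # equals 2 * log(0.5)
```

Working on the log scale avoids the numerical underflow that multiplying many small densities would otherwise cause.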
In this paper we use ELPD as a measure of predictive accuracy, but $M$-SAP (and
the approximations we introduce below) may also be based on other global
measures of accuracy such as the root mean squared error (RMSE) or the median
absolute deviation (MAD). The reason for our focus on ELPD is that it evaluates
a distribution rather than a point estimate (like the mean or median) to provide
a measure of out-of-sample predictive performance, which we see as favorable
from a Bayesian perspective [@vehtari2012]. The code we provide on GitHub
(https://github.com/paul-buerkner/LFO-CV-paper) is modularized to support
arbitrary measures of accuracy as long as they can be represented in a pointwise
manner, that is, as increments per observation. In Appendix C we also provide
additional simulation results using RMSE instead of ELPD.
## Approximate $M$-step-ahead predictions {#approximate-MSAP}
The equations above make use of posterior distributions from many
different fits of the model to different subsets of the data. To obtain
the predictive density $p(y_{i+1:M} \,|\, y_{1:i})$, a model is fit to
only the first $i$ data points, and we need to do this for every value of
$i$ under consideration (all $i \in \{L, \ldots, N - M\}$).
Below, we present a new algorithm to reduce the number of models that need
to be fit for the purpose of obtaining each of the densities
$p(y_{i+1:M} \,|\, y_{1:i})$. This algorithm relies centrally on
Pareto smoothed importance sampling [@vehtari2017loo; @vehtari2019psis], which
we briefly review next.
### Pareto smoothed importance sampling {#psis}
Importance sampling is a technique for computing expectations with
respect to some target distribution using an approximating proposal distribution
that is easier to draw samples from than the actual target. If $f(\theta)$ is
the target and $g(\theta)$ is the proposal distribution, we can write any
expectation of some function $h(\theta)$ with respect to $f$ as
\begin{equation}
\mathbb{E}_f[h(\theta)] = \int h(\theta) f(\theta) \,d\, \theta
= \frac{\int [h(\theta) f(\theta) / g(\theta)] g(\theta) \,d\, \theta}
{\int [f(\theta) / g(\theta)] g(\theta) \,d\, \theta}
= \frac{\int h(\theta) r(\theta) g(\theta) \,d\, \theta}
{\int r(\theta) g(\theta) \,d\, \theta}
\end{equation}
with importance ratios
\begin{equation}
r(\theta) = \frac{f(\theta)}{g(\theta)}.
\end{equation}
Accordingly, if $\theta^{(s)}$ are $S$ random draws from $g(\theta)$, we can
approximate
\begin{equation}
\mathbb{E}_f[h(\theta)] \approx
\frac{\sum_{s=1}^S h(\theta^{(s)}) r(\theta^{(s)})}{\sum_{s=1}^S r(\theta^{(s)})},
\end{equation}
provided that we can compute the raw importance ratios $r(\theta^{(s)})$ up to
some multiplicative constant. The raw importance ratios serve as weights on the
corresponding random draws in the approximation of the quantity of interest.
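A minimal numerical illustration of this self-normalized estimator, with a standard normal target $f$, a wider normal proposal $g$, and $h(\theta) = \theta^2$, so the true value is $\mathbb{E}_f[\theta^2] = 1$; the ratios are deliberately computed only up to a multiplicative constant:

```r
# Self-normalized importance sampling sketch:
# target f = N(0, 1), proposal g = N(0, 2), h(theta) = theta^2.
set.seed(1)
S <- 1e5
theta <- rnorm(S, mean = 0, sd = 2)  # draws from the proposal g
log_r <- dnorm(theta, 0, 1, log = TRUE) - dnorm(theta, 0, 2, log = TRUE)
r <- exp(log_r - max(log_r))  # ratios known only up to a constant
est <- sum(theta^2 * r) / sum(r)
est  # close to the true value E_f[theta^2] = 1
```

Because the estimator divides by the sum of the ratios, the unknown normalizing constant cancels, which is exactly the property exploited later for LFO-CV.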
The main problem with this approach is that the raw importance ratios tend to
have large or infinite variance and results can be highly unstable. In order to
stabilize the computations, we can perform the additional step of regularizing
the largest raw importance ratios using the corresponding quantiles of the
generalized Pareto distribution fitted to the largest raw importance ratios.
This procedure is called Pareto smoothed importance sampling
[PSIS; @vehtari2017loo; @vehtari2019psis; @loo2019] and has been demonstrated to
have a lower error and faster convergence rate than other commonly used
regularization techniques [@vehtari2019psis].
In addition, PSIS comes with a useful diagnostic to evaluate
the quality of the importance sampling approximation. The shape parameter $k$
of the generalized Pareto distribution fit to the largest importance ratios provides
information about the number of existing moments of the distribution of the
weights (smoothed ratios) and the actual importance sampling estimate.
When $k<0.5$, the weight distribution has finite variance and the central limit
theorem ensures fast convergence of the importance sampling estimate with
increasing number of draws. This implies that approximate LOO-CV via PSIS is
highly accurate for $k<0.5$ [@vehtari2019psis]. For $0.5 \leq k < 1$, a
generalized central limit theorem holds, but the convergence rate drops quickly
as $k$ increases [@vehtari2019psis]. Using both mathematical analysis
and numerical experiments, PSIS has been shown to be
relatively robust for $k < 0.7$ [@vehtari2017loo; @vehtari2019psis]. As such,
the default threshold is set to $0.7$ when performing
PSIS LOO-CV [@vehtari2017loo;@loo2019].
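In R, the smoothing and the $k$ diagnostic are available through the `psis()` function of the **loo** package [@loo2019]; the log-ratios below are simulated only to make the snippet self-contained:

```r
# Pareto smoothing of raw importance ratios with the loo package
library(loo)
set.seed(1)
log_r <- rnorm(1000)  # placeholder raw log importance ratios
psis_fit <- psis(log_ratios = log_r, r_eff = NA)
k <- pareto_k_values(psis_fit)       # estimated Pareto shape parameter
lw <- weights(psis_fit, log = TRUE)  # smoothed, normalized log weights
k  # values below roughly 0.7 indicate a reliable approximation
```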
### PSIS applied to $M$-step-ahead predictions {#psis-MSAP}
We now turn back to the task of estimating $M$-step-ahead performance for
time series models. First, we fit the model using the first $L$ observations
of the time series and then perform a single exact $M$-step-ahead prediction
step for $p(y_{L+1:M} \,|\, y_{1:L})$ as per \eqref{Lpred}.
Recall that $L$ is the minimum number of observations we have deemed
acceptable for making predictions (setting $L=0$ means the first data point will
be predicted only based on the prior). We define $i^\star = L$ as the current
point of refit. Next, starting with $i = i^\star + 1$, we
approximate each $p(y_{i+1:M} \,|\, y_{1:i})$ via
\begin{equation}
p(y_{i+1:M} \,|\, y_{1:i}) \approx
\frac{ \sum_{s=1}^S w_i^{(s)}\, p(y_{i+1:M} \,|\, y_{1:i}, \theta^{(s)})}
{ \sum_{s=1}^S w_i^{(s)}},
\end{equation}
where $\theta^{(s)} = \theta^{(s)}_{1:i^\star}$ are draws from the
posterior distribution based on the first $i^\star$ observations
and $w_i^{(s)}$ are the PSIS weights obtained in two steps.
First, we compute the raw importance ratios
\begin{equation}
r_i^{(s)} = r_i(\theta^{(s)}) =
\frac{f_{1:i}(\theta^{(s)})}{f_{1:i^\star}(\theta^{(s)})}
\propto \frac{
p(\theta^{(s)}) \prod_{j \in 1:i} p(y_j \,|\, y_{1:(j-1)}, \theta^{(s)})
}{
p(\theta^{(s)}) \prod_{j \in 1:i^\star} p(y_j \,|\, y_{1:(j-1)}, \theta^{(s)})
}
= \prod_{j \in (i^\star + 1):i} p(y_j \,|\, y_{1:(j-1)}, \theta^{(s)}),
\end{equation}
and then stabilize them using PSIS as described in Section \@ref(psis). The
function $f_{1:i}$ denotes the posterior distribution based on the first $i$
observations, that is, $f_{1:i} = p(\theta \,|\, y_{1:i})$, with $f_{1:i^\star}$
defined analogously. The index set $(i^\star + 1):i$ indicates all observations
which are part of the data for the model $f_{1:i}$ whose predictive performance
we are trying to approximate but not for the actually fitted model
$f_{1:i^\star}$. The proportional statement arises from the fact that
we ignore the normalizing constants $p(y_{1:i})$ and $p(y_{1:i^\star})$
of the compared posteriors, which leads to a self-normalized variant of
PSIS [cf. @vehtari2017loo].
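Given pointwise conditional log-likelihood values from the model last fit at $i^\star$, the log of the raw ratios is just a partial sum. A sketch, where `loglik` is a hypothetical $S \times N$ matrix holding $\log p(y_j \,|\, y_{1:(j-1)}, \theta^{(s)})$:

```r
# Raw log importance ratios for forward PSIS-LFO-CV: sum the pointwise
# conditional log-densities of the observations added since the last
# refit at i_star. 'loglik' is a hypothetical S x N matrix.
lfo_log_ratios <- function(loglik, i_star, i) {
  stopifnot(i > i_star)
  rowSums(loglik[, (i_star + 1):i, drop = FALSE])
}
```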
<!--
The index sets $J_i$
contain the indices of all observations up to and including observation $i$,
that is, $J_i = \{1, \ldots, i\}$ and $J_L = \{1, \ldots, L\}$. The index set
$J_i^\star$ contains the indices of all observations
which are part of the data for the model whose predictive performance we are
trying to approximate but not for the actually fitted model. That is, until a refit
becomes necessary (see below), we will have
$J_i^\star = J_i \backslash J_L = \{L + 1, \ldots, i\}$.
-->
Continuing with the next observation, we gradually increase $i$ by $1$ (we move
forward in time) and repeat the process. At some observation $i$, the
variability of the importance ratios $r_i^{(s)}$ will become too large and
importance sampling will fail. We will refer to this particular value of $i$ as
$i^\star_1$. To identify $i^\star_1$, we check at which value of
$i$ the estimated shape parameter $k$ of the generalized Pareto
distribution first crosses a certain threshold $\tau$ [@vehtari2019psis]. Only
then do we refit the model using the observations up to $i^\star_1$ and restart
the process from there by setting $\theta^{(s)} = \theta^{(s)}_{1:i^\star_1}$
and $i^\star = i^\star_1$ until the next refit.
An illustration of this procedure is shown in Figure
\@ref(fig:vis-msap).
<!--
Until the next refit, we have $J_i^\star = J_i \backslash
J_{i^\star_1} = \{i^\star_1+1, \ldots, i \}$ for $i^\star_1 < i$, as the
refitted model only contains the observations up to and including index
$i^\star_1$.
-->
This bears some resemblance to Sequential Monte Carlo, also known as
particle or Monte Carlo filtering [e.g., @gordon1993; @kitagawa1996;
@doucet2000; @andrieu2010], in that applying SMC to state space models also
entails moving forward in time and using importance sampling to approximate the
next state from the information in the previous states [@kitagawa1996;
@andrieu2010]. However, in our case we are assuming we can sample from the
posterior distribution and are instead concerned with estimating the model's
predictive performance. Unlike SMC, PSIS-LFO-CV also entails a full recomputation
of the model via Markov chain Monte Carlo (MCMC) once importance sampling fails.
In some cases we may only need to refit once and in other cases we will find a
value $i^\star_2$ that requires a second refitting, maybe an $i^\star_3$ that
requires a third refitting, and so on. We refit as many times as is required
(only when $k > \tau$) until we arrive at observation $i = N - M$. A detailed
description of the algorithm in the form of pseudo code is provided in Appendix
A. If the data consist of multiple *independent* time series, the algorithm can
be applied to each time series separately and the resulting ELPD values
can be summed afterwards. If the data consist of multiple *dependent*
time series, the algorithm should be applied to the joint likelihood across all
time series for each observation $i$ in order to take the dependency into
account.
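The full forward pass can be sketched end to end on a deliberately simple toy model. To keep the example self-contained we use a conjugate normal model with known variance, so each "refit" reduces to drawing from an analytic posterior rather than running MCMC; all names (`refit`, `pointwise_loglik`, `elpds`) are illustrative and a production implementation would add error handling:

```r
# Toy forward PSIS-LFO-CV for 1-step-ahead predictions (M = 1):
# y_j ~ N(mu, 1) with prior mu ~ N(0, 10^2), so "refitting" reduces
# to drawing from an analytic posterior instead of running MCMC.
library(loo)

log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))
log_sum_exp  <- function(x) max(x) + log(sum(exp(x - max(x))))

set.seed(123)
N <- 100; L <- 25; S <- 4000; tau <- 0.7
y <- rnorm(N, mean = 0.5, sd = 1)

refit <- function(i) {
  # analytic posterior of mu given y_{1:i}
  s2 <- 1 / (1 / 100 + i)
  rnorm(S, mean = s2 * sum(y[1:i]), sd = sqrt(s2))
}
pointwise_loglik <- function(mu) {
  # S x N matrix of log p(y_j | theta^(s)); observations iid given mu
  sapply(y, function(yj) dnorm(yj, mean = mu, sd = 1, log = TRUE))
}

i_star <- L
loglik <- pointwise_loglik(refit(i_star))
elpds <- numeric(0)

for (i in L:(N - 1)) {
  if (i == i_star) {
    # exact predictive density right after a (re)fit
    elpds <- c(elpds, log_mean_exp(loglik[, i + 1]))
    next
  }
  log_r <- rowSums(loglik[, (i_star + 1):i, drop = FALSE])
  psis_fit <- suppressWarnings(psis(log_r, r_eff = NA))
  if (pareto_k_values(psis_fit) > tau) {
    # importance sampling failed: refit at the current observation
    i_star <- i
    loglik <- pointwise_loglik(refit(i_star))
    elpds <- c(elpds, log_mean_exp(loglik[, i + 1]))
  } else {
    lw <- weights(psis_fit, log = TRUE)  # smoothed log weights
    elpds <- c(elpds, log_sum_exp(lw + loglik[, i + 1]))
  }
}
elpd_lfo <- sum(elpds)  # ELPD_LFO estimate for this toy series
```

In a real application the analytic `refit()` would be replaced by a full model refit (e.g., via MCMC), which is exactly the cost the algorithm minimizes.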
```{r vis-msap, fig.width=8, fig.height=3, fig.cap="Visualisation of PSIS approximated one-step-ahead predictions. Predicted observations are indicated by **X**. In the shown example, the model was last refit at the $i^\\star = 4$th observation."}
status_levels <- c("included", "included (PSIS)", "left out")
df <- data.frame(
  obs = rep(1:9, 3),
  i = factor(rep(3:5, each = 9)),
  Status = c(
    rep("included", 4), rep("left out", 5),
    rep("included", 4), rep("included (PSIS)", 1), rep("left out", 4),
    rep("included", 4), rep("included (PSIS)", 2), rep("left out", 3)
  )
) %>%
  mutate(Status = factor(Status, levels = status_levels))
msap_colors <- c(
  bayesplot::color_scheme_get("viridis")$light,
  bayesplot::color_scheme_get("viridis")$light_highlight,
  bayesplot::color_scheme_get("viridis")$dark
)
ggplot(df, aes(obs, i, fill = Status)) +
  geom_tile(height = 0.9, width = 1, col = "black") +
  annotate(
    'text', x = 5:7, y = c(1, 2, 3),
    label = "X", parse = TRUE,
    size = 10, color = "white"
  ) +
  labs(x = "Observation", y = "Predicted observation") +
  scale_x_continuous(breaks = 1:9) +
  scale_fill_manual(values = msap_colors) +
  bayesplot::theme_default() +
  NULL
```
Instead of moving forward in time, it is also possible to do
PSIS-LFO-CV moving backwards in time. However, in that case the target posterior
is approximated by a distribution based on more observations and, as such,
the proposal distribution is narrower than the target. This can result in
highly influential importance weights more often than when
the proposal is wider than the target, which is the case
for the forward PSIS-LFO-CV described above. In Appendix B, we show
that moving backwards indeed requires more refits than moving forward, and
without any increase in accuracy. When we refer to the PSIS-LFO-CV algorithm
in the main text we are referring to the forward version.
The threshold $\tau$ is crucial to the accuracy and speed of the PSIS-LFO-CV
algorithm. If $\tau$ is too large then we need fewer refits but accuracy will
suffer. If $\tau$ is too small then accuracy will be higher but many refits will
be required and the computation time may be impractical. Limiting the number of
refits without sacrificing too much accuracy is essential since almost all of
the computation time for exact cross-validation of Bayesian models is spent
fitting the models (not calculating the predictions). Therefore, any reduction
we can achieve in the number of refits essentially implies a proportional
reduction in the overall time required for cross-validation of Bayesian models.
We will come back to the issue of setting appropriate thresholds in Section
\@ref(simulations).
# Simulations {#simulations}
To evaluate the quality of the PSIS-LFO-CV approximation, we performed a
simulation study in which the following conditions were systematically varied:
* The number $M$ of future observations to be predicted took on values of $M =
1$ and $M = 4$.
* The threshold $\tau$ for the Pareto $k$ estimates was varied from $\tau =
  0.5$ to $\tau = 0.7$ in steps of $0.1$.
* Six different data generating models were evaluated, with linear and/or
quadratic terms and/or autoregressive terms of order 2 (see Figure
\@ref(fig:simmodels) for an illustration).
In all cases the time series consisted of $N = 200$ observations and the minimal
number of observations required before making predictions was set to $L = 25$.
We ran $100$ simulation trials per condition.
```{r, include = FALSE}
seed <- 1234
set.seed(seed)
N <- 200
time <- seq_len(N)
stime <- scale_unit_interval(time)
models <- c(
  "constant", "linear", "quadratic",
  "AR2-only", "AR2-linear", "AR2-quadratic"
)
fits <- preds <- setNames(vector("list", length(models)), models)
for (m in names(fits)) {
  file <- paste0("models/fit_", m)
  fits[[m]] <- fit_model(model = m, N = N, seed = seed, file = file)
  pred <- posterior_predict(fits[[m]])
  preds[[m]] <- fits[[m]]$data %>%
    mutate(
      time = time,
      stime = stime,
      Estimate = colMeans(pred),
      Q5 = apply(pred, 2, quantile, probs = 0.05),
      Q95 = apply(pred, 2, quantile, probs = 0.95),
      model = m
    )
}
preds <- as_tibble(bind_rows(preds)) %>%
  mutate(model = factor(model, levels = models)) %>%
  select(-`I(stime^2)`)
```
```{r simmodels, fig.height=4, fig.cap="Illustration of the models used in the simulations. Black points are observed data. The blue line represents posterior predictions of the model resembling the true data-generating process with 90% prediction intervals shown in gray. More details are provided in the text."}
ggplot(preds, aes(x = time, y = Estimate)) +
  facet_wrap(~model) +
  geom_smooth(aes(ymin = Q5, ymax = Q95), stat = "identity", size = 0.5) +
  geom_point(aes(y = y), size = 0.5) +
  labs(y = "y")
```
Autoregressive (AR) models are some of the most commonly used time series models.
An AR($p$) model -- an autoregressive model of order $p$ -- can be defined as
\begin{equation}
y_i = \eta_i + \varepsilon_i \quad \text{with} \quad
\varepsilon_i = \sum_{k = 1}^p \varphi_k \varepsilon_{i - k} + e_i,
\end{equation}
where $\eta_i$ is the linear predictor for the $i$th observation, $\varphi_k$
are the autoregressive parameters on the residuals $\varepsilon_i$, and $e_i$ are
pairwise independent errors, which are usually assumed to be normally
distributed with equal variance $\sigma^2$. The model implies a recursive
formula that allows for computing the right-hand side of the equation for
observation $i$ based on the values of the equations computed for previous
observations. Observations from an AR process are therefore not conditionally
independent by definition, but the likelihood still factorizes because we
can write down a separate contribution for each observation [see
@buerkner:non-factorizable for more discussion on factorizability of statistical
models].
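As a concrete illustration of this factorization, the pointwise log-likelihood
contributions of an AR(2) model can be computed recursively in a few lines of
base R. This is a minimal sketch: the value of `sigma` and the
zero-initialization of the pre-series residuals are illustrative assumptions,
not choices taken from our simulations.

```r
# Pointwise log-likelihood of an AR(2) model, computed recursively:
# eps_i = y_i - eta_i  and  e_i = eps_i - phi1 * eps_{i-1} - phi2 * eps_{i-2}
ar2_pointwise_loglik <- function(y, eta, phi1, phi2, sigma) {
  eps <- y - eta
  n <- length(y)
  ll <- numeric(n)
  for (i in seq_len(n)) {
    # residuals before the start of the series are set to zero (illustrative)
    eps_lag1 <- if (i > 1) eps[i - 1] else 0
    eps_lag2 <- if (i > 2) eps[i - 2] else 0
    e_i <- eps[i] - phi1 * eps_lag1 - phi2 * eps_lag2
    ll[i] <- dnorm(e_i, mean = 0, sd = sigma, log = TRUE)
  }
  ll
}

# the likelihood factorizes: each observation contributes one term,
# and the total log-likelihood is simply their sum
y <- c(0.3, 0.5, 0.1, -0.2, 0.4)
ll <- ar2_pointwise_loglik(y, eta = rep(0, 5), phi1 = 0.5, phi2 = 0.3, sigma = 1)
total_ll <- sum(ll)
```

The loop makes the recursive structure explicit: each $e_i$ depends only on
residuals of earlier observations, yet every observation still has its own
log-likelihood term.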
In the quadratic model with AR(2) effects (the most complex model in our
simulations), the true data generating process was defined as
\begin{equation}
y_i = b_0 + b_1 t + b_2 t^2 + \varepsilon_i \quad \text{with} \quad
\varepsilon_i = \varphi_1 \varepsilon_{i - 1} + \varphi_2 \varepsilon_{i - 2} + e_i,
\end{equation}
where $t$ is the time variable scaled to the unit interval, that is, $t = 0$ for
the smallest time point ($1$ in our simulations) and $t = 1$ for the largest
time point ($200$ in our simulations). Specifically, we set the true regression
coefficients to the values of $b_0 = 0$, $b_1 = 17$, $b_2 = 25$, and the true
autocorrelation parameters to $\varphi_1 = 0.5$, and $\varphi_2 = 0.3$ (see
Figure \@ref(fig:simmodels) for an illustration). The regression coefficients
were chosen so that neither the linear nor the quadratic term dominates the
other within the specified time frame. The values of the autocorrelation
parameters were set to represent typical positively autocorrelated data.
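This data-generating process can be sketched directly in base R. The residual
standard deviation `sigma` below is an illustrative assumption, as are the seed
and the initialization of the first two residuals:

```r
set.seed(1234)
N <- 200
# time scaled to the unit interval: t = 0 at time point 1, t = 1 at time point 200
t <- (seq_len(N) - 1) / (N - 1)
b0 <- 0; b1 <- 17; b2 <- 25   # true regression coefficients
phi1 <- 0.5; phi2 <- 0.3      # true autocorrelation parameters
sigma <- 1                    # residual SD (illustrative assumption)

# build up the AR(2) residuals recursively
e <- rnorm(N, mean = 0, sd = sigma)
eps <- numeric(N)
eps[1] <- e[1]
eps[2] <- phi1 * eps[1] + e[2]
for (i in 3:N) {
  eps[i] <- phi1 * eps[i - 1] + phi2 * eps[i - 2] + e[i]
}

# combine trend and autocorrelated residuals
y <- b0 + b1 * t + b2 * t^2 + eps
```

Setting `b2 <- 0` (and, analogously, `phi1 <- phi2 <- 0` or `b1 <- 0`) recovers
the simpler simulation conditions.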
In the simulation conditions without linear and/or quadratic and/or AR(2) terms,
the corresponding true parameters were simply fixed to zero. We always fit the
true data generating model to the data. This is neither required for the
validity of LFO-CV in general nor for the validity of the comparison between
exact and approximate versions but simply a choice of convenience. For example,
a linear model without autocorrelation is used when all but $b_0$ and $b_1$ were
set to zero in the simulations.
In addition to exact and approximate LFO-CV, we also computed approximate LOO-CV
for comparison. This is not because we think LOO-CV is a generally appropriate
approach for time series models, but because, in the absence of any approximate
LFO-CV method, researchers may have used approximate LOO-CV for time series
models in the past, as it was the only available option.
Demonstrating that LOO-CV is a biased estimate of LFO-CV underscores
the importance of developing methods better suited for the task.
All simulations were done in R [@R2018] using the brms package [@brms1;
@brms2] together with the probabilistic programming language Stan
[@carpenter2017; @rstan2019] for model fitting, the loo R package
[@loo2019] for the PSIS computations, and several tidyverse R packages
[@tidyverse] for data processing. The full code and all results are available on
Github (https://github.com/paul-buerkner/LFO-CV-paper).
## Results {#sim_results}
```{r}
# extractor helper functions
get_elpd <- function(x) {
summarize_elpds(x)[1]
}
get_refits <- function(x) {
sum(attr(x, "refits"), na.rm = TRUE)
}
# pretty factor labels
mlevels <- c(
"constant", "linear", "quadratic",
"AR2-only", "AR2-linear", "AR2-quadratic"
)
tau_levels <- TeX(paste0("$\\tau$ = ", c(0.5, 0.6, 0.7)))
elpd_colors <- c(
loo = bayesplot::color_scheme_get("viridis")$mid,
lfo_bw = bayesplot::color_scheme_get("viridis")$light,
lfo_fw = bayesplot::color_scheme_get("viridis")$dark,
lfo_cb = bayesplot::color_scheme_get("viridis")$mid
)
# data preparation
lfo_sims <- read_rds("results/lfo_sims_1sap.rds") %>%
# stored in different files due to Github's file size limit
bind_rows(read_rds("results/lfo_sims_4sap.rds")) %>%
as_tibble() %>%
filter(lengths(res) > 0) %>%
mutate(
model = factor(model, levels = mlevels),
tau = factor(k_thres, labels = tau_levels),
# compute sum of pointwise ELPD values
elpd_loo = map_dbl(res, ~ .$loo_cv$estimates["elpd_loo", 1]),
elpd_exact_lfo = map_dbl(res, ~ get_elpd(.$lfo_exact_elpds)),
elpd_approx_fw_lfo = map_dbl(res, ~ get_elpd(.$lfo_approx_fw_elpds)),
elpd_approx_bw_lfo = map_dbl(res, ~ get_elpd(.$lfo_approx_bw_elpds)),
# compute differences to the exact values
elpd_diff_bw_lfo = elpd_approx_bw_lfo - elpd_exact_lfo,
elpd_diff_fw_lfo = elpd_approx_fw_lfo - elpd_exact_lfo,
elpd_diff_loo = elpd_loo - elpd_exact_lfo,
nrefits_bw = map_dbl(res, ~ get_refits(.$lfo_approx_bw_elpds)),
nrefits_fw = map_dbl(res, ~ get_refits(.$lfo_approx_fw_elpds)),
rel_nrefits_bw = nrefits_bw / (N - L),
rel_nrefits_fw = nrefits_fw / (N - L)
# the combined mode is not yet fully implemented
# elpd_approx_cb_lfo = map_dbl(res, ~ get_elpd(.$lfo_approx_cb_elpds)),
# elpd_diff_cb_lfo = elpd_approx_cb_lfo - elpd_exact_lfo,
# nrefits_cb = map_dbl(res, ~ get_refits(.$lfo_approx_cb_elpds)),
# rel_nrefits_cb = nrefits_cb / (N - L)
) %>%
select(-res)
```
In this section we focus on the ELPD as a measure of out-of-sample predictive
performance for reasons outlined in Section \@ref(m-sap). In Appendix C,
we provide additional simulation results for the RMSE.
Results of the 1-SAP simulations are visualized in Figure \@ref(fig:1sap).
Comparing the columns of Figure \@ref(fig:1sap), it is clearly visible that the
accuracy of the PSIS approximation is independent of the threshold $\tau$
when $\tau$ is within the interval $[0.5,0.7]$ motivated in Section \@ref(psis) [this
would not be the case if $\tau$ was allowed to be larger; @vehtari2019psis]. For
all conditions, the PSIS-LFO-CV approximation is highly accurate, that is, both
unbiased and low in variance around the corresponding exact LFO-CV value
(represented by the dashed line in Figure \@ref(fig:1sap)). The proportion of
observations at which refitting the model was required did not exceed $3\%$
under any of the conditions and only increased minimally when decreasing $\tau$
(see Table \@ref(tab:refits)). At least for the models investigated in our
simulations, using $\tau = 0.7$ seems to be sufficient for achieving high
accuracy and as such there is no need to lower the threshold below that value.
As expected, LOO-CV (the lighter histograms in Figure \@ref(fig:1sap)) is a
biased estimate of the 1-SAP performance for all non-constant models, in
particular for models with a trend in the time series. More precisely, LOO-CV is
positively biased, which implies that it systematically overestimates $M$-SAP
performance of time series models.
```{r 1sap, fig.height=8, fig.cap="Simulation results of 1-step-ahead predictions. Histograms are based on 100 simulation trials of time series with $N = 200$ observations requiring at least $L = 25$ observations to make predictions. The black dashed lines indicate the exact LFO-CV results."}
lfo_sims %>%
filter(is.na(B), M == 1) %>%
select(elpd_diff_fw_lfo, elpd_diff_loo, model, tau) %>%
gather("Mode", "elpd_diff", starts_with("elpd_diff")) %>%
ggplot(aes(x = elpd_diff, y = ..density.., fill = Mode)) +
facet_grid(
model ~ tau, scales = "free_y",
labeller = label_parsed
) +
geom_histogram(alpha = 0.7) +
scale_fill_manual(
values = unname(elpd_colors[c("lfo_fw", "loo")]),
labels = c("PSIS-LFO-CV", "PSIS-LOO-CV")
) +
labs(x = "ELPD difference of approximate and exact LFO-CV", y = "") +
geom_vline(xintercept = 0, linetype = 2) +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "bottom"
) +
NULL
```
```{r refits, cache=FALSE}
lfo_sims %>%
filter(is.na(B)) %>%
select(rel_nrefits_fw, model, M, k_thres) %>%
group_by(model, M, k_thres) %>%
summarise(rel_nrefits = round(mean(rel_nrefits_fw), 2)) %>%
ungroup() %>%
spread("model", "rel_nrefits") %>%
mutate(M = ifelse(duplicated(M), "", M)) %>%
rename(`$\\tau$` = "k_thres") %>%
kable(
caption = "Mean proportions of required refits for PSIS-LFO-CV.",
booktabs = TRUE,
escape = FALSE
) %>%
footnote(
general = "Note: Results are based on 100 simulation trials of time series with $N = 200$ observations requiring at least $L = 25$ observations to make predictions. Abbreviations: $\\\\tau$ = threshold of the Pareto $k$ estimates; $M$ = number of predicted future observations.",
general_title = "",
threeparttable = TRUE,
escape = FALSE
)
```
Results of the 4-SAP simulations are visualized in Figure \@ref(fig:4sap).
Comparing the columns of Figure \@ref(fig:4sap), it is again clearly visible
that the accuracy of the PSIS approximation is independent of the threshold
$\tau$. The proportion of observations at which refitting the model was required
did not exceed $3\%$ under any condition and only increased minimally when
decreasing $\tau$ (see Table \@ref(tab:refits)). In light of the corresponding
1-SAP results presented above, this is not surprising because the procedure for
determining the necessity of a refit is independent of $M$ (see Section
\@ref(approximate-MSAP)). PSIS-LOO-CV is not displayed in Figure \@ref(fig:4sap) as
the number of observations predicted at each step (4 vs. 1) makes 4-SAP LFO-CV
and LOO-CV incomparable.
```{r 4sap, fig.height=8, fig.cap="Simulation results of 4-step-ahead predictions. Histograms are based on 100 simulation trials of time series with $N = 200$ observations requiring at least $L = 25$ observations to make predictions. The black dashed lines indicate the exact LFO-CV results."}
lfo_sims %>%
filter(is.na(B), M == 4) %>%
select(elpd_diff_fw_lfo, model, tau) %>%
ggplot(aes(x = elpd_diff_fw_lfo, y = ..density..)) +
facet_grid(
model ~ tau, scales = "free_y",
labeller = label_parsed
) +
geom_histogram(alpha = 0.7, fill = unname(elpd_colors["lfo_fw"])) +
labs(x = "ELPD difference of approximate and exact LFO-CV", y = "") +
geom_vline(xintercept = 0, linetype = 2) +
xlim(c(-10, 10)) +
theme_bw() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "bottom"
) +
NULL
```
# Case Studies {#case-studies}
## Annual measurements of the level of Lake Huron {#case-LH}
To illustrate the application of PSIS-LFO-CV for estimating expected $M$-SAP
performance, we will fit a model for 98 annual measurements of the water level
(in feet) of [Lake Huron](https://en.wikipedia.org/wiki/Lake_Huron) from the
years 1875--1972. This data set is found in the *datasets* R package, which is
installed automatically with R [@R2018]. The time series shows rather strong
autocorrelation and a downward trend towards lower water levels at later points
in time. Figure \@ref(fig:lake-huron) shows the observed time series of water
levels as well as predictions from a fitted AR(4) model.
```{r}
data("LakeHuron")
N <- length(LakeHuron)
df <- data.frame(
y = as.numeric(LakeHuron),
year = as.numeric(time(LakeHuron)),
time = 1:N
)
```
```{r fit_lh, results = "hide"}
fit_lh <- brm(
y | mi() ~ 1,
data = df,
autocor = cor_ar(~time, p = 4),
prior = prior(normal(0, 0.5), class = "ar"),
chains = 2, warmup = 1000, iter = 5000,
control = list(adapt_delta = 0.99),
seed = 5838296, file = "models/fit_lh"
)
```
```{r lake-huron, fig.cap="Water Level in Lake Huron (1875-1972). Black points are observed data. The blue line represents mean predictions of an AR(4) model with 90% prediction intervals shown in gray.", fig.height=3}
preds <- posterior_predict(fit_lh)
preds <- cbind(
Estimate = colMeans(preds),
Q5 = apply(preds, 2, quantile, probs = 0.05),
Q95 = apply(preds, 2, quantile, probs = 0.95)
)
ggplot(cbind(df, preds), aes(x = year, y = Estimate)) +
geom_smooth(aes(ymin = Q5, ymax = Q95), stat = "identity", size = 0.5) +
geom_point(aes(y = y)) +
labs(y = "Water Level (ft)", x = "Year")
```
```{r}
L <- 20
k_thres <- 0.7
loo_lh <- loo(log_lik(fit_lh)[, (L + 1):N])
```
Based on this data and model, we will illustrate the use of PSIS-LFO-CV to
provide estimates of $1$-SAP and $4$-SAP when leaving out all future values. To
allow for reasonable predictions, we will require at least $L = 20$ historical
observations (20 years) to make predictions. Further, we set a threshold of
$\tau =$ `r k_thres` for the Pareto $k$ estimates that indicate when refitting
becomes necessary. Our fully reproducible analysis of this case study can be
found on GitHub (https://github.com/paul-buerkner/LFO-CV-paper).
```{r}
lh_elpd_1sap_exact <- compute_lfo(
fit_lh, type = "exact", M = 1, L = L,
file = "results/lh_elpd_1sap_exact.rds"
)
lh_elpd_1sap_approx <- compute_lfo(
fit_lh, type = "approx", M = 1, L = L,
k_thres = k_thres, mode = "forward",
file = "results/lh_elpd_1sap_approx_fw.rds"
)
lh_nrefits <- get_refits(lh_elpd_1sap_approx)
```
```{r}
lh_elpd_4sap_exact <- compute_lfo(
fit_lh, type = "exact", M = 4, L = L,
file = "results/lh_elpd_4sap_exact.rds"
)
lh_elpd_4sap_approx <- compute_lfo(
fit_lh, type = "approx", M = 4, L = L,
k_thres = k_thres, mode = "forward",
file = "results/lh_elpd_4sap_approx_fw.rds"
)
```
We start by computing exact and PSIS-approximated LFO-CV of 1-SAP. The computed
ELPD values are
${\rm ELPD}_{\rm exact} =$ `r fmt(get_elpd(lh_elpd_1sap_exact), 2)` and
${\rm ELPD}_{\rm approx} =$ `r fmt(get_elpd(lh_elpd_1sap_approx), 2)`, which are
almost identical. Not only is the overall ELPD estimated accurately but so are
all of the pointwise ELPD contributions (see the left panel of Figure
\@ref(fig:lh-pw-elpd)). In comparison, PSIS-LOO-CV returns
${\rm ELPD}_{\rm loo} =$ `r fmt(loo_lh$estimates[1, 1], 1)`,
overestimating the predictive performance, as suggested by our simulation
results for stationary autoregressive models (see fourth row of Figure
\@ref(fig:1sap)). Plotting the Pareto $k$ estimates reveals that the model had
to be refit `r lh_nrefits` times, out of a total of $N - L =$ `r N - L` predicted
observations (see Figure \@ref(fig:lh-pareto-k)). On average, this means one
refit every `r fmt((N - L) / lh_nrefits, 1)` observations, which implies a drastic
speed increase compared to exact LFO-CV.
Performing LFO-CV for 4-SAP, we obtained ${\rm ELPD}_{\rm exact} =$
`r fmt(get_elpd(lh_elpd_4sap_exact), 2)` and ${\rm ELPD}_{\rm approx} =$
`r fmt(get_elpd(lh_elpd_4sap_approx), 2)`, which are again very similar.
In general, as $M$ increases, the approximation will tend to become more variable
around the true value in absolute ELPD units because the ELPD increment of each
observation will be based on more and more observations (see also Section
\@ref(simulations)). For this example, we see some considerable differences in
the pointwise ELPD contributions of specific observations which were hard to
predict accurately by the model (see the right panel of Figure
\@ref(fig:lh-pw-elpd)). This is to be expected because predicting $M$ steps ahead
using an AR model will yield highly uncertain predictions if most of the
autocorrelation happens at lags smaller than $M$ (see also the bottom rows in
Figure \@ref(fig:4sap)). For such a model, it may be ill-advised to evaluate
predictions too far into the future, at least when using the approximate methods
presented in this paper. Since, for a constant threshold $\tau$, the importance
weights are the same independent of $M$, the Pareto $k$ estimates are the same
for $4$-SAP and $1$-SAP.
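The refit criterion underlying this observation can be sketched with the loo
package used throughout this paper: the log importance ratios accumulated since
the last refit are passed to PSIS, and a refit is triggered once the resulting
Pareto $k$ exceeds $\tau$. The inputs below are simulated stand-ins, not values
from our analyses:

```r
library(loo)

# log_ratios[s] = sum of pointwise log-likelihoods of the observations
# added since the last refit, evaluated for posterior draw s. These
# ratios do not involve M, so the Pareto k diagnostic is identical
# for 1-SAP and 4-SAP.
set.seed(1)
S <- 4000                     # number of posterior draws (assumption)
log_ratios <- rnorm(S, 0, 1)  # stand-in for actual log importance ratios

# Pareto smoothed importance sampling on the (single) set of ratios
psis_obj <- psis(matrix(log_ratios, ncol = 1), r_eff = NA)
k <- psis_obj$diagnostics$pareto_k

tau <- 0.7
refit_needed <- k > tau
```

If `refit_needed` is `TRUE`, the model is refit on all observations up to the
current time point and the accumulated log ratios are reset.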
```{r lh-pw-elpd, warning=FALSE, fig.height=3, fig.cap="Pointwise exact vs. PSIS-approximated ELPD contributions for 1-SAP (left) and 4-SAP (right) for the Lake Huron model. A threshold of $\\tau = 0.7$ was used for the Pareto $k$ estimates. $M$ is the number of predicted future observations."}
lh_pw_elpd <- tibble(
elpd_exact = na.omit(lh_elpd_1sap_exact),
elpd_approx = na.omit(lh_elpd_1sap_approx),
k = na.omit(attributes(lh_elpd_1sap_approx)$ks),
M = "M = 1"
) %>% bind_rows(
tibble(
elpd_exact = na.omit(lh_elpd_4sap_exact),
elpd_approx = na.omit(lh_elpd_4sap_approx),
k = na.omit(attributes(lh_elpd_4sap_approx)$ks),
M = "M = 4"
)
)
ggplot(lh_pw_elpd, aes(elpd_exact, elpd_approx)) +
facet_wrap(facets = "M", nrow = 1, ncol = 2, scales = "free") +
geom_abline(slope = 1) +
geom_point() +
labs(y = "Approximate ELPD", x = "Exact ELPD") +
theme_bw()
```
```{r lh-pareto-k, fig.height=2.5, fig.cap="Pareto $k$ estimates for PSIS-LFO-CV of the Lake Huron model. The dotted red line indicates the threshold above which refitting became necessary."}
ks <- na.omit(attributes(lh_elpd_1sap_approx)$ks)
ids <- (L + 1):N
plot_ks(ks, ids, k_thres = k_thres)
```
## Annual date of the cherry blossoms in Japan {#case-CB}
```{r}
cherry <- read.csv("data/cherry_blossoms.csv")
cherry_temp <- cherry[!is.na(cherry$temp), ]
cherry_doy <- cherry[!is.na(cherry$doy), ]
```
The cherry blossom in Japan is a famous natural phenomenon occurring once every
year during spring. As the climate changes, so does the annual date of the cherry
blossom [@aono2008; @aono2010]. The most complete reconstruction available to
date contains data between `r min(cherry$year)` AD and `r max(cherry$year)` AD
[@aono2008; @aono2010] and is available online
(https://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/).
In this case study, we will predict the annual date of the cherry
blossom using an approximate Gaussian process model [@solin2014; @RiutortMayol2019] to
provide flexible non-linear smoothing of the time series. A visualisation of
both the data and the fitted model is provided in Figure
\@ref(fig:cherry-blossom). While the time series appears rather stable across
earlier centuries, with substantial variation across consecutive years, there
are some clearly visible trends in the data. Particularly in more recent years,
the cherry blossom has tended to happen much earlier than before, which may
be a consequence of changes in the climate [@aono2008; @aono2010].
Based on this data and model, we will illustrate the use of PSIS-LFO-CV to
provide estimates of $1$-SAP and $4$-SAP leaving out all future values. To allow
for reasonable predictions of future values, we will require at least $L = 100$
historical observations (100 years) to make predictions. Further, we set a
threshold of $\tau =$ `r k_thres` for the Pareto $k$ estimates to determine when
refitting becomes necessary. Our fully reproducible analysis of this case study
can be found on GitHub (https://github.com/paul-buerkner/LFO-CV-paper).
```{r fit_cb}
fit_cb <- brm(
formula = bf(doy ~ gp(year, k = 20, c = 5/4)),
data = cherry_doy,
prior = prior(normal(0, 0.1), class = lscale, coef = gpyear),
chain = 2, warmup = 4000, iter = 7000, inits = 0,
control = list(adapt_delta = 0.99, max_treedepth = 15),
seed = 5838234, file = "models/fit_cb"
)
```
```{r cherry-blossom, fig.height=3, fig.cap="Day of the cherry blossom in Japan (812-2015). Black points are observed data. The blue line represents mean predictions of an approximate Gaussian process model with 90% regression intervals shown in gray."}
me_cb <- marginal_effects(fit_cb, probs = c(0.05, 0.95))
plot(me_cb, points = TRUE, plot = FALSE)[[1]] +
labs(x = "Year", y = "Day of cherry blossom")
```
```{r}
N <- NROW(cherry_doy)
L <- 100
k_thres <- 0.7
cb_elpd_1sap_exact <- compute_lfo(
fit_cb, type = "exact", M = 1, L = L,
factorizes = TRUE,
file = "results/cb_elpd_1sap_exact.rds"
)
cb_elpd_1sap_approx <- compute_lfo(
fit_cb, type = "approx", M = 1, L = L,
k_thres = k_thres, mode = "forward", factorizes = TRUE,
file = "results/cb_elpd_1sap_approx_fw.rds"
)