# hdcd

`hdcd` is an R package for multivariate and high-dimensional change point detection. It implements change point detection for (high-dimensional) Gaussian Graphical Models (GGMs) with possibly missing values, as described in [1]: Londschien, Kovács and Bühlmann, "Change point detection for graphical models in the presence of missing values" (a preprint is available on arXiv). The package also implements some ongoing work on multivariate nonparametric change point detection [2].
## Installation

The package can be installed directly from GitHub:

```r
# install.packages("devtools")
library(devtools)
devtools::install_github("mlondschien/hdcd")
```
The folders `simulations/main_simulation` and `simulation/plots` contain R scripts and READMEs to reproduce Table 1 (the main simulation) and the figures showing gain curves (Figures 2, 3 and 6). Figure 4 can be reproduced using the R script in `simulations/histograms` and the `data_histograms` stored in the `data` folder. Other raw data generated by the scripts (e.g. for the main simulation of Table 1) is also included in the `data` folder. We do not include the raw data for the groundwater example, as its owners only permitted its use as an illustrative example.
## Example

The following code snippet generates a `p = 100` dimensional time series of `n = 500` observations with three change points at locations `alpha = c(120, 240, 310)`. The in-segment distributions arise from Gaussian Graphical Models (GGMs), i.e. multivariate Gaussians with mean zero and precision matrices generated as chain networks. Note that this is a truly high-dimensional setting, as the number of variables (100) is higher than the length of the smallest segment (70).
```r
set.seed(0)
model <- hdcd::create_model(n = 500, p = 100, c(120, 240, 310), hdcd::ChainNetwork)
x <- hdcd::simulate_from_model(model)
```
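For intuition, a chain network corresponds to a tridiagonal precision matrix, so each variable is conditionally dependent only on its neighbours along a chain. A minimal base-R sketch (the off-diagonal value `rho = 0.4` is an illustrative choice, not hdcd's exact construction):

```r
# Sketch of a chain-network precision matrix: tridiagonal, so variable i
# is conditionally dependent only on variables i - 1 and i + 1.
# rho = 0.4 is an illustrative value, not hdcd's exact construction.
chain_precision <- function(p, rho = 0.4) {
  omega <- diag(1, p)
  for (i in seq_len(p - 1)) {
    omega[i, i + 1] <- rho
    omega[i + 1, i] <- rho
  }
  omega
}

omega <- chain_precision(100)

# One draw from N(0, solve(omega)): if omega = R'R (Cholesky factor R),
# then backsolve(R, z) with z ~ N(0, I) has covariance solve(omega).
z <- rnorm(100)
x_row <- backsolve(chol(omega), z)
```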
We then use the main algorithm from [1] to find change points in the GGM structure. We specify `method = "glasso"` to use the graphical lasso loss [3] and to estimate within-segment precision matrices with the glasso package of [3]. We use Optimistic Binary Segmentation (OBS) here: the `"section_search"` optimizer (i.e. naive Optimistic Search) is selected to find the best split point candidate in each step, and a Binary Segmentation style algorithm is the default. This reduces computation time drastically. To perform the computationally more expensive full grid search, set the optimizer to `"line_search"`; this corresponds to traditional Binary Segmentation based change point detection. We use the default value of `delta = 0.1` for the minimal relative segment length and let `hdcd` find a suitable initial regularization parameter `lambda`. Note that running `hdcd` might take a few seconds due to the high dimensionality of the simulated dataset.
```r
tree <- hdcd::hdcd(x, method = "glasso", optimizer = "section_search")
tree
# levelName split_point max_gain cv_loss cv_improvement lambda
# 1 (0 500] 310 9.201834 14003.254 366.877394 0.05592379
# 2 ¦--(0 310] 120 6.443479 8630.149 248.956480 0.07040388
# 3 ¦ ¦--(0 120] 65 2.649532 3125.832 -113.364558 0.11158263
# 4 ¦ ¦ ¦--(0 65] NA NA 1719.190 NA 0.22263662
# 5 ¦ ¦ °--(65 120] NA NA 1520.007 NA 0.22263662
# 6 ¦ °--(120 310] 240 4.946971 5255.360 179.297754 0.11158263
# 7 ¦ ¦--(120 240] 170 2.262078 3146.867 -82.117642 0.14047421
# 8 ¦ ¦ ¦--(120 170] NA NA 1386.386 NA 0.28028290
# 9 ¦ ¦ °--(170 240] NA NA 1842.599 NA 0.22263662
# 10 ¦ °--(240 310] NA NA 1929.196 NA 0.22263662
# 11 °--(310 500] 397 2.859732 5006.227 -6.237342 0.08863323
# 12 ¦--(310 397] NA NA 2304.448 NA 0.17684655
# 13 °--(397 500] NA NA 2708.017 NA 0.14047421
```
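To illustrate why optimistic search is fast, here is a self-contained toy comparison (a sketch with a made-up unimodal gain function, not hdcd's internals): full grid search evaluates the gain at every candidate split, while a naive optimistic search narrows the interval golden-section style under an approximate unimodality assumption, using only O(log n) gain evaluations.

```r
# Toy unimodal gain curve with its maximum at split s = 310.
# Real gain curves are noisier; this only illustrates the search strategy.
gain <- function(s) -(s - 310)^2

# Full grid search: evaluates the gain at every candidate split.
line_search <- function(lo, hi) {
  candidates <- lo:hi
  candidates[which.max(sapply(candidates, gain))]
}

# Naive optimistic search: golden-section-style interval narrowing,
# assuming the gain curve is (approximately) unimodal.
section_search <- function(lo, hi) {
  phi <- (sqrt(5) - 1) / 2
  while (hi - lo > 2) {
    m1 <- round(hi - phi * (hi - lo))
    m2 <- round(lo + phi * (hi - lo))
    if (gain(m1) < gain(m2)) lo <- m1 else hi <- m2
  }
  (lo:hi)[which.max(sapply(lo:hi, gain))]
}

line_search(1, 499)     # 310, after evaluating the gain 499 times
section_search(1, 499)  # 310, after O(log n) gain evaluations
```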
The returned object is a `binary_segmentation_tree`, which inherits from `data.tree::Node`. Each node in the tree corresponds to one segment, with attributes `start`, `split_point`, `end`, `gain` and `max_gain`. When using the glasso method, each node also has an attribute `cv_improvement`, calculated as the cross-validated increase in likelihood when splitting the segment `(start, end]` at `split_point`. The attribute `lambda` is the optimal regularization parameter (as determined by the cross-validation procedure) used for splitting the given segment. `hdcd` stops splitting when `cv_improvement <= 0`. Note that the `split_point`s with positive `cv_improvement` are 120, 240 and 310, which are exactly the true underlying change points. These can be extracted from the saved `tree` with the method `hdcd::get_change_points_from_tree`.
```r
hdcd::get_change_points_from_tree(tree)
# [1] 120 240 310
```
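Conceptually, this extraction walks the tree and collects the `split_point` of every node whose `cv_improvement` is positive. A minimal sketch of that traversal, using plain nested lists standing in for `data.tree::Node` objects (attribute names follow the printout above; this is not hdcd's implementation):

```r
# Collect every split_point with positive cv_improvement, recursively.
collect_splits <- function(node) {
  if (is.null(node$children)) return(integer(0))  # leaf: nothing to collect
  splits <- if (node$cv_improvement > 0) node$split_point else integer(0)
  sort(c(splits, unlist(lapply(node$children, collect_splits))))
}

# Hand-built stand-in for the fitted tree printed above.
tree_sketch <- list(
  split_point = 310, cv_improvement = 366.9, children = list(
    list(split_point = 120, cv_improvement = 249.0, children = list(
      list(split_point = 65, cv_improvement = -113.4, children = list(
        list(children = NULL), list(children = NULL))),
      list(split_point = 240, cv_improvement = 179.3, children = list(
        list(split_point = 170, cv_improvement = -82.1, children = list(
          list(children = NULL), list(children = NULL))),
        list(children = NULL))))),
    list(split_point = 397, cv_improvement = -6.2, children = list(
      list(children = NULL), list(children = NULL)))))

collect_splits(tree_sketch)  # 120 240 310
```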
`hdcd` can also handle missing values, using the methodology described in [1]. We delete 40% of the entries of the matrix `x` completely at random and run our procedure again, with the default imputation method, the Loh-Wainwright bias correction approach. The algorithm still approximately recovers the original change points.
```r
x_deleted <- hdcd::delete_values(x, 0.4, "mcar")
tree_deleted <- hdcd::hdcd(x_deleted, method = "glasso", optimizer = "section_search")
tree_deleted
# levelName split_point max_gain cv_loss cv_improvement lambda
# 1 (0 500] 310 4.756226 8586.2079 42.32761 0.1030157
# 2 ¦--(0 310] 118 1.901519 5318.4926 183.65402 0.1296891
# 3 ¦ ¦--(0 118] 65 1.049296 1952.1950 -53.64900 0.1296891
# 4 ¦ ¦ ¦--(0 65] NA NA 1104.9134 NA 0.1632689
# 5 ¦ ¦ °--(65 118] NA NA 900.9306 NA 0.2055434
# 6 ¦ °--(118 310] 226 1.627940 3182.6435 40.72925 0.1296891
# 7 ¦ ¦--(118 226] 174 1.005050 1778.5837 -52.08288 0.1296891
# 8 ¦ ¦ ¦--(118 174] NA NA 968.1591 NA 0.1296891
# 9 ¦ ¦ °--(174 226] NA NA 862.5074 NA 0.2055434
# 10 ¦ °--(226 310] NA NA 1363.3306 NA 0.1632689
# 11 °--(310 500] 390 1.048517 3225.3877 -78.84160 0.1296891
# 12 ¦--(310 390] NA NA 1363.3497 NA 0.1632689
# 13 °--(390 500] NA NA 1940.8796 NA 0.1296891

hdcd::get_change_points_from_tree(tree_deleted)
# [1] 118 226 310
```
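For intuition, the Loh-Wainwright correction zero-imputes the missing entries and then rescales the Gram matrix so that it is again an unbiased estimate of the covariance under MCAR. A minimal sketch assuming a known, uniform observation probability `q` and mean-zero data (hdcd's internal handling is more general):

```r
# Loh-Wainwright-style bias correction, sketched for mean-zero MCAR data
# with a known, uniform observation probability q.
lw_covariance <- function(x, q) {
  z <- x
  z[is.na(z)] <- 0               # zero-impute missing entries
  m <- crossprod(z) / nrow(z)    # Gram matrix of the imputed, mean-zero data
  sigma <- m / q^2               # a pair (i, j), i != j, is observed with prob q^2
  diag(sigma) <- diag(m) / q     # a diagonal entry is observed with prob q
  sigma
}

set.seed(1)
x_full <- matrix(rnorm(5000 * 3), 5000, 3)   # true covariance: identity
x_mcar <- x_full
x_mcar[runif(length(x_mcar)) < 0.4] <- NA    # delete 40% completely at random

round(lw_covariance(x_mcar, q = 0.6), 1)     # close to the 3 x 3 identity
```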
## References

[1] M. Londschien, S. Kovács and P. Bühlmann (2019). "Change point detection for graphical models in the presence of missing values". arXiv:1907.05409.

[2] S. Kovács, M. Londschien and P. Bühlmann (2020). "Random Forests and other nonparametric classifiers for multivariate change point detection". Working paper.

[3] J. Friedman, T. Hastie and R. Tibshirani (2008). "Sparse inverse covariance estimation with the graphical lasso". Biostatistics, 9, 432–441.