High-dimensional change point detection

hdcd is an R package for multivariate and, in particular, high-dimensional change point detection. It implements change point detection for (high-dimensional) Gaussian Graphical Models (GGMs) with possibly missing values, as described in [1]: Londschien, Kovács and Bühlmann, "Change point detection for graphical models in the presence of missing values" (preprint available as arXiv:1907.05409). Additionally, the hdcd package implements some ongoing work on multivariate nonparametric change point detection from [2].

Installation of the hdcd package

The package can be installed directly from GitHub:

# install.packages("devtools")
devtools::install_github("mlondschien/hdcd")

Simulations and plots for the paper [1]

The folders simulations/main_simulation and simulations/plots contain R scripts and READMEs to reproduce Table 1 (the main simulation) and the figures showing gain curves (Figures 2, 3 and 6). Figure 4 can be reproduced with the R script in simulations/histograms and the data_histograms stored in the data folder. Other raw data generated by the scripts (e.g. for the main simulation of Table 1) is also included in the data folder. We do not include the raw data for the groundwater example, as its owners only permitted its use as an illustrative example.

Example: detecting changes in a GGM

The following code snippet generates a p = 100 dimensional time series of n = 500 observations with three change points at locations alpha = c(120, 240, 310), where the in-segment distributions arise from Gaussian Graphical Models (GGMs), i.e. multivariate Gaussians with mean zero and precision matrices generated as chain networks. Note that this is a truly high-dimensional setting: the number of variables (100) exceeds the length of the smallest segment (70).

set.seed(0)
model <- hdcd::create_model(n = 500, p = 100, c(120, 240, 310), hdcd::ChainNetwork)
x <- hdcd::simulate_from_model(model)
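For intuition, the following base-R sketch shows what simulating one segment from a chain-network GGM amounts to. This is an illustration only, not hdcd's ChainNetwork implementation; the edge weight -0.4 and the segment length 50 are arbitrary choices.

```r
# Illustration only: one segment from a chain-network GGM, i.e. a Gaussian
# with a tridiagonal precision matrix (nonzero entries link variable i to i + 1).
p <- 10
Omega <- diag(p)
Omega[cbind(1:(p - 1), 2:p)] <- -0.4   # chain edges i -- (i + 1)
Omega[cbind(2:p, 1:(p - 1))] <- -0.4
Sigma <- solve(Omega)                  # covariance is the inverse precision matrix
set.seed(1)
x_segment <- matrix(rnorm(50 * p), 50, p) %*% chol(Sigma)
dim(x_segment)                         # 50 observations of p = 10 variables
```

Since the precision matrix is tridiagonal, each variable is conditionally dependent only on its two neighbours in the chain.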

We then use the main algorithm from [1] to find change points in the GGM structure. Specifying method = "glasso" selects the graphical lasso loss [3] and estimates within-segment precision matrices using the glasso package of [3]. Setting optimizer = "section_search" selects naive Optimistic Search for finding the best split point candidate in each step; combined with the default Binary Segmentation style recursion, this yields Optimistic Binary Segmentation (OBS) and reduces computation time drastically. To perform the computationally more expensive full grid search, corresponding to traditional Binary Segmentation, set the optimizer to "line_search". We use the default value delta = 0.1 for the minimal relative segment length and let hdcd find a suitable initial regularization parameter lambda. Note that running hdcd may take a few seconds due to the high dimensionality of the simulated dataset.

tree <- hdcd::hdcd(x, method = "glasso", optimizer = "section_search")
tree
#                    levelName split_point max_gain   cv_loss cv_improvement     lambda
# 1  (0 500]                           310 9.201834 14003.254     366.877394 0.05592379
# 2   ¦--(0 310]                       120 6.443479  8630.149     248.956480 0.07040388
# 3   ¦   ¦--(0 120]                    65 2.649532  3125.832    -113.364558 0.11158263
# 4   ¦   ¦   ¦--(0 65]                 NA       NA  1719.190             NA 0.22263662
# 5   ¦   ¦   °--(65 120]               NA       NA  1520.007             NA 0.22263662
# 6   ¦   °--(120 310]                 240 4.946971  5255.360     179.297754 0.11158263
# 7   ¦       ¦--(120 240]             170 2.262078  3146.867     -82.117642 0.14047421
# 8   ¦       ¦   ¦--(120 170]          NA       NA  1386.386             NA 0.28028290
# 9   ¦       ¦   °--(170 240]          NA       NA  1842.599             NA 0.22263662
# 10  ¦       °--(240 310]              NA       NA  1929.196             NA 0.22263662
# 11  °--(310 500]                     397 2.859732  5006.227      -6.237342 0.08863323
# 12      ¦--(310 397]                  NA       NA  2304.448             NA 0.17684655
# 13      °--(397 500]                  NA       NA  2708.017             NA 0.14047421
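The speed-up of "section_search" over "line_search" comes from not evaluating the gain at every candidate split. As a rough base-R sketch of the idea (a generic ternary-style narrowing on a toy unimodal gain curve, not hdcd's actual optimistic search):

```r
# Sketch of the optimistic search idea: instead of evaluating the gain at all
# candidate splits (full grid search), repeatedly discard a third of the search
# interval, so only O(log n) gain evaluations are needed.
optimistic_search <- function(gain, lo, hi) {
  evals <- 0
  g <- function(s) { evals <<- evals + 1; gain(s) }
  while (hi - lo > 2) {
    m1 <- lo + floor((hi - lo) / 3)
    m2 <- hi - floor((hi - lo) / 3)
    if (g(m1) < g(m2)) lo <- m1 else hi <- m2
  }
  cand <- lo:hi
  list(split = cand[which.max(vapply(cand, g, numeric(1)))], evaluations = evals)
}

# A toy unimodal gain curve peaking at split 310, on candidates 1..499:
gain <- function(s) -abs(s - 310)
optimistic_search(gain, 1, 499)$split  # returns 310
```

A full grid search would need 499 gain evaluations here; the narrowing search needs only a few dozen, which is why OBS is so much faster when each gain evaluation involves fitting a graphical lasso.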

The returned object is a binary_segmentation_tree, which inherits from data.tree::Node. Each node in the tree corresponds to one segment and carries the attributes start, split_point, end, gain and max_gain. With the glasso method, each node also has an attribute cv_improvement, the cross-validated increase in likelihood obtained by splitting the segment (start, end] at split_point, and an attribute lambda, the regularization parameter used for splitting that segment (as determined by the cross-validation procedure). hdcd stops splitting a segment once cv_improvement <= 0. Note that the split points with positive cv_improvement are 120, 240 and 310, which are exactly the true underlying change points. They can be extracted from the tree with hdcd::get_change_points_from_tree.

hdcd::get_change_points_from_tree(tree)
# [1] 120 240 310
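The extraction and stopping rule can be sketched on a hypothetical mock-up of such a tree, here plain nested lists mirroring the output above (children with NA split points are omitted), not hdcd's actual binary_segmentation_tree class:

```r
# Hypothetical mock of the fitted tree: each node has a split_point,
# a cv_improvement and optional left/right children.
node <- function(split_point, cv_improvement, left = NULL, right = NULL)
  list(split_point = split_point, cv_improvement = cv_improvement,
       left = left, right = right)

tree_mock <- node(310, 366.9,
  left  = node(120, 249.0,
    left  = node(65, -113.4),               # cv_improvement <= 0: split rejected
    right = node(240, 179.3,
      left = node(170, -82.1))),
  right = node(397, -6.2))

# Keep the split points of all nodes whose split improved the cross-validated loss.
change_points <- function(nd) {
  if (is.null(nd) || nd$cv_improvement <= 0) return(integer(0))
  sort(c(nd$split_point, change_points(nd$left), change_points(nd$right)))
}
change_points(tree_mock)  # returns 120 240 310
```

Only splits that improved the cross-validated loss survive, so the rejected candidates 65, 170 and 397 never appear in the result.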

Example: detecting changes in a GGM with missing values

hdcd can also handle missing values, following the methodology described in [1]. We delete 40% of the entries of the matrix x completely at random and run the procedure again, using the default imputation method, the Loh-Wainwright bias correction approach. The algorithm still approximately recovers the original change points.

x_deleted    <- hdcd::delete_values(x, 0.4, "mcar")
tree_deleted <- hdcd::hdcd(x_deleted, method = "glasso", optimizer = "section_search")
tree_deleted
#                    levelName split_point max_gain   cv_loss cv_improvement    lambda
# 1  (0 500]                           310 4.756226 8586.2079       42.32761 0.1030157
# 2   ¦--(0 310]                       118 1.901519 5318.4926      183.65402 0.1296891
# 3   ¦   ¦--(0 118]                    65 1.049296 1952.1950      -53.64900 0.1296891
# 4   ¦   ¦   ¦--(0 65]                 NA       NA 1104.9134             NA 0.1632689
# 5   ¦   ¦   °--(65 118]               NA       NA  900.9306             NA 0.2055434
# 6   ¦   °--(118 310]                 226 1.627940 3182.6435       40.72925 0.1296891
# 7   ¦       ¦--(118 226]             174 1.005050 1778.5837      -52.08288 0.1296891
# 8   ¦       ¦   ¦--(118 174]          NA       NA  968.1591             NA 0.1296891
# 9   ¦       ¦   °--(174 226]          NA       NA  862.5074             NA 0.2055434
# 10  ¦       °--(226 310]              NA       NA 1363.3306             NA 0.1632689
# 11  °--(310 500]                     390 1.048517 3225.3877      -78.84160 0.1296891
# 12      ¦--(310 390]                  NA       NA 1363.3497             NA 0.1632689
# 13      °--(390 500]                  NA       NA 1940.8796             NA 0.1296891
hdcd::get_change_points_from_tree(tree_deleted)
# [1] 118 226 310
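The "mcar" deletion pattern means each entry is removed independently of its value and position. Conceptually (a base-R sketch, not hdcd's delete_values implementation):

```r
# MCAR deletion sketch: every entry becomes NA independently with
# probability 0.4, regardless of its value or position in the matrix.
set.seed(2)
x_small <- matrix(rnorm(5 * 4), 5, 4)
mask <- matrix(runif(length(x_small)) < 0.4, nrow(x_small), ncol(x_small))
x_small[mask] <- NA
mean(is.na(x_small))  # observed proportion of missing entries, roughly 0.4
```

Missingness that depends on the underlying values (e.g. censoring large observations) would violate the MCAR assumption and is a harder setting than the one simulated here.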

References

[1] M. Londschien, S. Kovács and P. Bühlmann (2019), "Change point detection for graphical models in the presence of missing values", arXiv:1907.05409.

[2] S. Kovács, M. Londschien and P. Bühlmann (2020), "Random Forests and other nonparametric classifiers for multivariate change point detection", working paper.

[3] J. Friedman, T. Hastie and R. Tibshirani (2008), "Sparse inverse covariance estimation with the graphical lasso", Biostatistics, 9, 432–441.
