GraphGrad: Efficient Estimation of Sparse Polynomial Representations for General State-Space Models

Benjamin Cox, , Émilie Chouzenoux, , and
Víctor Elvira
Abstract

State-space models (SSMs) are a powerful statistical tool for modelling time-varying systems via a latent state. In these models, the latent state is never directly observed. Instead, a sequence of observations related to the state is available. The state-space model is defined by the state dynamics and the observation model, both of which are described by parametric distributions. Estimation of parameters of these distributions is a very challenging, but essential, task for performing inference and prediction. Furthermore, it is typical that not all states of the system interact. We can therefore encode the interaction of the states via a graph, usually not fully connected. However, most parameter estimation methods do not take advantage of this feature. In this work, we propose GraphGrad, a fully automatic approach for obtaining sparse estimates of the state interactions of a non-linear state-space model via a polynomial approximation. This novel methodology unveils the latent structure of the data-generating process, allowing us to infer both the structure and value of a rich and efficient parameterisation of a general state-space model. Our method utilises a differentiable particle filter to optimise a Monte Carlo likelihood estimator. It also promotes sparsity in the estimated system through the use of suitable proximity updates, known to be more efficient and stable than subgradient methods. As shown in our paper, a number of well-known dynamical systems can be accurately represented and recovered by our method, providing basis for application to real-world scenarios.

00footnotetext: B.C. acknowledges support from the Natural Environment Research Council of the UK through a SENSE CDT studentship (NE/T00939X/1). The work of V. E. is supported by the Agence Nationale de la Recherche of France under PISCES (ANR-17-CE40-0031-01), the Leverhulme Research Fellowship (RF-2021-593), and by ARL/ARO under grant W911NF-22-1-0235. É.C. acknowledges support from the European Research Council Starting Grant MAJORIS ERC-2019-STG850925.

I Introduction

State-space models (SSMs) are a powerful tool for describing systems that evolve in time. SSMs are utilised in many fields, such as target tracking, [1], epidemiology [2], ecology [3], finance [4], and meteorology [5]. State-space models represent a dynamical system via an unobserved hidden state and a series of related observations. The objective is then to estimate the hidden state, conditional on the series of observations. Estimation of the state using only observations available at the time the state occurs is called the filtering problem, while estimation of the state using the entire observation series, is known as the smoothing problem. In order to solve either of these problems, one must know both the conditional densities of the hidden state and of the observation.

When the state transition and observation model are linear and admit Gaussian densities, a linear-Gaussian state-space model is obtained, and one can solve the filtering and smoothing problems exactly using the Kalman filter [6] and RTS smoother [7] respectively. If the state dynamics or observation model are not linear, then we have a non-linear state-space model (NLSSM), and the Kalman filter cannot be applied without simplifying the model. In this case we can use Gaussian approximations, such as the Unscented Kalman filter [8] or the extended Kalman filter, although in many cases a Gaussian approximation is not sufficiently rich to capture the behaviour of the system. In such cases, more accurate results might be obtained by a particle filter, which approximates the posterior density of the state via a series of importance weighted Monte Carlo samples [9]. However, all of these techniques require knowing the form of the transition and observations models, and for all parameters therein to be known. It is common for the parameters to be unknown, and therefore they must be estimated. Furthermore, in many cases, the form of the model is not known, and therefore we must impose a form before we can estimate anything.

Estimating the parameters of a non-linear state-space model is a difficult task, even when the model is known. One challenge is that the exact likelihood is usually intractable for general non-linear models, and we must use a stochastic estimate thereof. A further problem is that we cannot efficiently compute gradients of the likelihood, and hence we cannot easily compute the maximum likelihood estimator, nor utilise standard modern sampling schemes such as Hamiltonian Monte Carlo. Thanks to the recent advent of differentiable particle filters [10, 11, 12], it recently became possible to obtain the gradient of the estimated likelihood of model parameters, and hence became much simpler to fit them using, e.g., maximum likelihood, assuming the form of the model is known.

In dynamical systems, many phenomena occur with rates proportional to a polynomial function of the state dimensions. For example, in population modelling, birth-immigration-death-emigration models are often used, within which the dynamics of the rates are dependent on 2nd degree polynomials of the state-space. Furthermore, the well known Lorenz 63 [13] and Lorenz 96 [14] models both have rates of change described by 2nd degree polynomials, and are known to be highly chaotic systems of equations. We can therefore see that polynomial systems can describe complex phenomena, and hence a crucial problem is to learn the coefficients of a polynomial approximation to the transition function of a NLSSM.

Contribution. In this paper, we propose GraphGrad, a method to model and learn the transition distribution of a non-linear state-space model via a polynomial approximation of the state.

  • The proposed approximation can represent a rich class of systems, as many dynamical systems are described by a series of differential equations that are polynomial functions of the states. The resulting systems are interpretable, with the interactions between variables having interpretation similar to the rate terms in a dynamical system.

  • Our method uses a differentiable particle filter, allowing us to use a first order optimisation scheme to perform parameter estimation, improving robustness whilst decreasing computation time. Furthermore, we promote sparse systems by the use of a proximity update, thereby increasing interpretability, and improving stability compared to naive subgradient updates of a penalised loss.

  • GraphGrad is unsupervised, as we optimise the (penalised) parameter log-likelihood, and therefore do not require knowledge of the underlying hidden state to be trained.

  • GraphGrad is fast both to fit and to evaluate, and provides excellent performance in challenging scenarios, even when the underlying system is not of polynomial form, i.e., under model mismatch.

Structure. In Section II, we present the underlying particle filtering methodology that we build upon, and introduce the differentiable particle filter, as well as the notation used throughout this paper. We present the method in Section III, and discussion in Section IV. We present several numerical experiments in Section V, and conclude in Section VI.

II Background

II-A Particle filtering

The general state-space model can be described by

𝐱tp(𝐱t|𝐱t1;𝜽),𝐲tp(𝐲t|𝐱t;𝜽),formulae-sequencesimilar-tosubscript𝐱𝑡𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝜽similar-tosubscript𝐲𝑡𝑝conditionalsubscript𝐲𝑡subscript𝐱𝑡𝜽\displaystyle\begin{split}{\mathbf{x}}_{t}&\sim p({\mathbf{x}}_{t}|{\mathbf{x}% }_{t-1};\bm{\theta}),\\ {\mathbf{y}}_{t}&\sim p({\mathbf{y}}_{t}|{\mathbf{x}}_{t};\bm{\theta}),\end{split}start_ROW start_CELL bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL start_CELL ∼ italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_θ ) , end_CELL end_ROW start_ROW start_CELL bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL start_CELL ∼ italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) , end_CELL end_ROW (1)

where t=1,,T𝑡1𝑇t=1,\dots,Titalic_t = 1 , … , italic_T denotes discrete time, 𝐱tNxsubscript𝐱𝑡superscriptsubscript𝑁𝑥{\mathbf{x}}_{t}\in\mathbb{R}^{N_{x}}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the state of the system at time t𝑡titalic_t, 𝐲tNysubscript𝐲𝑡superscriptsubscript𝑁𝑦{\mathbf{y}}_{t}\in\mathbb{R}^{N_{y}}bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is the observation at time t𝑡titalic_t, p(𝐱t|𝐱t1;𝜽)𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝜽p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};\bm{\theta})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_θ ) is the density of the hidden state 𝐱tsubscript𝐱𝑡{\mathbf{x}}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, given the previous state 𝐱t1subscript𝐱𝑡1{\mathbf{x}}_{t-1}bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, p(𝐲t|𝐱t;𝜽)𝑝conditionalsubscript𝐲𝑡subscript𝐱𝑡𝜽p({\mathbf{y}}_{t}|{\mathbf{x}}_{t};\bm{\theta})italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) is the density of the observation 𝐲tsubscript𝐲𝑡{\mathbf{y}}_{t}bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT given the hidden state 𝐱tsubscript𝐱𝑡{\mathbf{x}}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and 𝜽𝜽\bm{\theta}bold_italic_θ is a set of model parameters. The initial value of the hidden state is distributed 𝐱0p(𝐱0|𝜽)similar-tosubscript𝐱0𝑝conditionalsubscript𝐱0𝜽{\mathbf{x}}_{0}\sim p({\mathbf{x}}_{0}|\bm{\theta})bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_θ ). The state sequence 𝐱0:Tsubscript𝐱:0𝑇{\mathbf{x}}_{0:T}bold_x start_POSTSUBSCRIPT 0 : italic_T end_POSTSUBSCRIPT is typically hidden, whilst the related sequence of observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT is known.

Filtering methods aim at estimating the hidden state at time t𝑡titalic_t, denoted 𝐱tsubscript𝐱𝑡{\mathbf{x}}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, typically utilising the posterior density function of the state conditional on the observations up to time t𝑡titalic_t, denoted 𝐲1:tsubscript𝐲:1𝑡{\mathbf{y}}_{1:t}bold_y start_POSTSUBSCRIPT 1 : italic_t end_POSTSUBSCRIPT. Particle filters approximate this density, p(𝐱t|𝐲1:t;𝜽)𝑝conditionalsubscript𝐱𝑡subscript𝐲:1𝑡𝜽p({\mathbf{x}}_{t}|{\mathbf{y}}_{1:t};\bm{\theta})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_t end_POSTSUBSCRIPT ; bold_italic_θ ), using a set of K𝐾Kitalic_K Monte Carlo samples (particles) and their associated weights, {𝐱1:T(k),w~1:T(k)}k=1Ksuperscriptsubscriptsuperscriptsubscript𝐱:1𝑇𝑘superscriptsubscript~𝑤:1𝑇𝑘𝑘1𝐾\{{\mathbf{x}}_{1:T}^{(k)},\widetilde{w}_{1:T}^{(k)}\}_{k=1}^{K}{ bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT. The posterior density can then be approximated by

p(𝐱t|𝐲1:t;𝜽)k=1Kw~t(k)δ𝐱t(k).𝑝conditionalsubscript𝐱𝑡subscript𝐲:1𝑡𝜽superscriptsubscript𝑘1𝐾superscriptsubscript~𝑤𝑡𝑘subscript𝛿superscriptsubscript𝐱𝑡𝑘p({\mathbf{x}}_{t}|{\mathbf{y}}_{1:t};\bm{\theta})\approx\sum_{k=1}^{K}% \widetilde{w}_{t}^{(k)}\delta_{{\mathbf{x}}_{t}^{(k)}}.italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) ≈ ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT . (2)

A commonly used particle filtering method is the sequential importance resampling algorithm, given in Alg. 1. At every time-step t𝑡titalic_t, the K𝐾Kitalic_K particles and normalised weights, {𝐱1:T(k),w¯1:T(k)}k=1Ksuperscriptsubscriptsuperscriptsubscript𝐱:1𝑇𝑘superscriptsubscript¯𝑤:1𝑇𝑘𝑘1𝐾\{{\mathbf{x}}_{1:T}^{(k)},\overline{w}_{1:T}^{(k)}\}_{k=1}^{K}{ bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, are calculated. First, we perform the resampling step (line 6), which generates K𝐾Kitalic_K samples, sampling 𝐱t1(k),k=1,,Kformulae-sequencesuperscriptsubscript𝐱𝑡1𝑘𝑘1𝐾{\mathbf{x}}_{t-1}^{(k)},\ k=1,\dots,Kbold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , italic_k = 1 , … , italic_K, with probability w¯t1(k)superscriptsubscript¯𝑤𝑡1𝑘\overline{w}_{t-1}^{(k)}over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT. The resampling step is vital to avoid the degeneracy of the filter, i.e., to ensure diversity in the particle set and obtain more accurate approximations of the posterior distribution, p(𝐱t|𝐲1:t;𝜽)𝑝conditionalsubscript𝐱𝑡subscript𝐲:1𝑡𝜽p({\mathbf{x}}_{t}|{\mathbf{y}}_{1:t};\bm{\theta})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_t end_POSTSUBSCRIPT ; bold_italic_θ ). Next, K𝐾Kitalic_K particles 𝐱t(k)superscriptsubscript𝐱𝑡𝑘{\mathbf{x}}_{t}^{(k)}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT, k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K, are drawn from the proposal distribution π(𝐱t|𝐱t1,𝐲t;𝜽)𝜋conditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐲𝑡𝜽\pi({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1},{\mathbf{y}}_{t};\bm{\theta})italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) (line 8). Finally, we incorporate the observation 𝐲tsubscript𝐲𝑡{\mathbf{y}}_{t}bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, which is done via the particle weights, given by wt(k)superscriptsubscript𝑤𝑡𝑘{w}_{t}^{(k)}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT, k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K, in line 9, and the normalised weights (line 10).

Algorithm 1 Sequential importance resampling (SIR) particle filter
1:Input: Observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, parameters 𝜽𝜽\bm{\theta}bold_italic_θ.
2:Output: Hidden state estimates 𝐱1:Tsubscript𝐱:1𝑇{\mathbf{x}}_{1:T}bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, particle weights w1:Tsubscript𝑤:1𝑇w_{1:T}italic_w start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT.
3:Draw 𝐱0(k)p(𝐱0|𝜽)similar-tosuperscriptsubscript𝐱0𝑘𝑝conditionalsubscript𝐱0𝜽{\mathbf{x}}_{0}^{(k)}\sim p({\mathbf{x}}_{0}|\bm{\theta})bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_θ ), for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K.
4:Set w~0(k)=w¯0(k)=1/Ksuperscriptsubscript~𝑤0𝑘superscriptsubscript¯𝑤0𝑘1𝐾\widetilde{w}_{0}^{(k)}=\overline{w}_{0}^{(k)}=1/Kover~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = 1 / italic_K, for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K.
5:for t=1,,T𝑡1𝑇t=1,\dots,Titalic_t = 1 , … , italic_T and k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K do
6:    Sample at(k)Categorical(w¯t1)similar-tosuperscriptsubscript𝑎𝑡𝑘Categoricalsubscript¯𝑤𝑡1a_{t}^{(k)}\sim\mathrm{Categorical}(\overline{w}_{t-1})italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ roman_Categorical ( over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ).
7:    Set w~t1(k)=1/Ksuperscriptsubscript~𝑤𝑡1𝑘1𝐾\widetilde{w}_{t-1}^{(k)}=1/Kover~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = 1 / italic_K.
8:    Sample 𝐱t(k)π(𝐱t|𝐱t1(at(k)),𝐲t;𝜽)similar-tosuperscriptsubscript𝐱𝑡𝑘𝜋conditionalsubscript𝐱𝑡superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘subscript𝐲𝑡𝜽{\mathbf{x}}_{t}^{(k)}\sim\pi({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}^{(a_{t}^{(k)% })},{\mathbf{y}}_{t};\bm{\theta})bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ).
9:    Compute wt(k)=p(𝐲t|𝐱t(k);𝜽)p(𝐱t(k)|𝐱t1(at(k));𝜽)π(𝐱t|𝐱t1(at(k)),𝐲t;𝜽)superscriptsubscript𝑤𝑡𝑘𝑝conditionalsubscript𝐲𝑡superscriptsubscript𝐱𝑡𝑘𝜽𝑝conditionalsuperscriptsubscript𝐱𝑡𝑘superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘𝜽𝜋conditionalsubscript𝐱𝑡superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘subscript𝐲𝑡𝜽w_{t}^{(k)}=\frac{p({\mathbf{y}}_{t}|{\mathbf{x}}_{t}^{(k)};\bm{\theta})p({% \mathbf{x}}_{t}^{(k)}|{\mathbf{x}}_{t-1}^{(a_{t}^{(k)})};\bm{\theta})}{\pi({% \mathbf{x}}_{t}|{\mathbf{x}}_{t-1}^{(a_{t}^{(k)})},{\mathbf{y}}_{t};\bm{\theta% })}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = divide start_ARG italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ; bold_italic_θ ) italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT ; bold_italic_θ ) end_ARG start_ARG italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) end_ARG.
10:    Compute w¯t(k)=w~t1(k)wt(i)/k=1Kw~t1(k)wt(k)superscriptsubscript¯𝑤𝑡𝑘superscriptsubscript~𝑤𝑡1𝑘superscriptsubscript𝑤𝑡𝑖superscriptsubscript𝑘1𝐾superscriptsubscript~𝑤𝑡1𝑘superscriptsubscript𝑤𝑡𝑘\overline{w}_{t}^{(k)}=\widetilde{w}_{t-1}^{(k)}w_{t}^{(i)}/\sum_{k=1}^{K}% \widetilde{w}_{t-1}^{(k)}w_{t}^{(k)}over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT / ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT.
11:end for

II-B Parameter estimation in state-space models

In many problems of interest, the parameter 𝜽𝜽\bm{\theta}bold_italic_θ is not known, and must be estimated. The posterior distribution of 𝜽𝜽\bm{\theta}bold_italic_θ in the state-space model can be factorised as p(𝜽|𝐲1:T)p(𝜽)p(𝐲1:T|𝜽)proportional-to𝑝conditional𝜽subscript𝐲:1𝑇𝑝𝜽𝑝conditionalsubscript𝐲:1𝑇𝜽p(\bm{\theta}|\mathbf{y}_{1:T})\propto p(\bm{\theta})p(\mathbf{y}_{1:T}|\bm{% \theta})italic_p ( bold_italic_θ | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ∝ italic_p ( bold_italic_θ ) italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_italic_θ ), where 𝜽𝜽\bm{\theta}bold_italic_θ is the parameter of interest, p(𝜽)𝑝𝜽p(\bm{\theta})italic_p ( bold_italic_θ ) is the prior distribution of the parameter 𝜽𝜽\bm{\theta}bold_italic_θ, and p(𝐲1:T|𝜽)𝑝conditionalsubscript𝐲:1𝑇𝜽p(\mathbf{y}_{1:T}|\bm{\theta})italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_italic_θ ) is given by

p(𝐲1:T|𝜽)=t=1Tp(𝐲t|𝐲1:t1;𝜽),𝑝conditionalsubscript𝐲:1𝑇𝜽superscriptsubscriptproduct𝑡1𝑇𝑝conditionalsubscript𝐲𝑡subscript𝐲:1𝑡1𝜽p(\mathbf{y}_{1:T}|\bm{\theta})=\prod_{t=1}^{T}p(\mathbf{y}_{t}|\mathbf{y}_{1:% t-1};\bm{\theta}),italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_italic_θ ) = ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_θ ) , (3)

where p(𝐲1|𝐲1:0;𝜽):=p(𝐲1|𝜽)assign𝑝conditionalsubscript𝐲1subscript𝐲:10𝜽𝑝conditionalsubscript𝐲1𝜽p(\mathbf{y}_{1}|\mathbf{y}_{1:0};\bm{\theta}):=p(\mathbf{y}_{1}|\bm{\theta})italic_p ( bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : 0 end_POSTSUBSCRIPT ; bold_italic_θ ) := italic_p ( bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_italic_θ ) [15]. The prior distribution p(𝜽)𝑝𝜽p(\bm{\theta})italic_p ( bold_italic_θ ) encodes our pre-existing beliefs as to the value and structure of the parameter 𝜽𝜽\bm{\theta}bold_italic_θ, and can be used for regularisation, e.g., by the Lasso [16].

There are many methods to estimate parameters 𝜽𝜽\bm{\theta}bold_italic_θ given its posterior density function p(𝜽|𝐲1:T)𝑝conditional𝜽subscript𝐲:1𝑇p(\bm{\theta}|\mathbf{y}_{1:T})italic_p ( bold_italic_θ | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ). We can broadly classify these parameter estimation methods as point estimation methods and distributional methods. Point estimation methods provide a single estimate that is, in some defined way, the optimal value. An example of a point estimation method is the maximum-a-posteriori (MAP) estimator, that defines the optimal value of 𝜽𝜽\bm{\theta}bold_italic_θ as the one that maximises the posterior density p(𝜽|𝐲1:T)𝑝conditional𝜽subscript𝐲:1𝑇p(\bm{\theta}|\mathbf{y}_{1:T})italic_p ( bold_italic_θ | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ). The method proposed in this work yields a MAP estimator.

In the case of a linear-Gaussian state-space model, we have closed form methods for the MAP estimator assuming a diffuse prior [15], and efficient methods for obtaining the MAP estimator for sparsity promoting priors [17]. Distributional methods estimate the posterior distribution of the parameter, with common methods being importance sampling [18], Markov chain Monte Carlo [19], and variational inference [20]. For the linear-Gaussian state-space model, we can utilise reversible jump Markov chain Monte Carlo to obtain an estimate of the distribution of sparsity [21], however no similar method exists for the general state-space model. Our proposed method in this work is a point estimation method.

For general SSMs, the likelihood of the parameter 𝜽𝜽\bm{\theta}bold_italic_θ, necessary to compute the posterior and hence to design a procedure to maximise it, cannot be obtained in closed form. We can estimate it using the particle filter, by the Monte Carlo estimate

p(𝜽|𝐲1:T)p(𝜽)p(𝐲1:T|𝜽)p(𝜽)t=1T(k=1Kwt(k)w~t1(k)),proportional-to𝑝conditional𝜽subscript𝐲:1𝑇𝑝𝜽𝑝conditionalsubscript𝐲:1𝑇𝜽𝑝𝜽superscriptsubscriptproduct𝑡1𝑇superscriptsubscript𝑘1𝐾superscriptsubscript𝑤𝑡𝑘superscriptsubscript~𝑤𝑡1𝑘p(\bm{\theta}|{\mathbf{y}}_{1:T})\propto p(\bm{\theta})p(\mathbf{y}_{1:T}|\bm{% \theta})\approx p(\bm{\theta})\prod_{t=1}^{T}\left(\sum_{k=1}^{K}w_{t}^{(k)}% \widetilde{w}_{t-1}^{(k)}\right),italic_p ( bold_italic_θ | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ∝ italic_p ( bold_italic_θ ) italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_italic_θ ) ≈ italic_p ( bold_italic_θ ) ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) , (4)

where wt(k)superscriptsubscript𝑤𝑡𝑘w_{t}^{(k)}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT and w~t1(k)superscriptsubscript~𝑤𝑡1𝑘\widetilde{w}_{t-1}^{(k)}over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT are the weights of the particle filter in Alg. 1 [15, Chapter 12]. Note that the weights are dependent on the parameter 𝜽𝜽\bm{\theta}bold_italic_θ through their computation in Alg. 1. Using these weights, we construct an estimate of the log-likelihood by

log(p(𝜽|𝐲1:T))=log(p(𝜽))+log(p(𝐲1:T|𝜽))+c,log(p(𝜽))+t=1Tlog(k=1Kwt(k)w~t1(k))+c,\displaystyle\begin{split}\log(p(\bm{\theta}|{\mathbf{y}}_{1:T}))&=\log(p(\bm{% \theta}))+\log(p(\mathbf{y}_{1:T}|\bm{\theta}))+c,\\ &\approx\log(p(\bm{\theta}))+\sum_{t=1}^{T}\log\left(\sum_{k=1}^{K}w_{t}^{(k)}% \widetilde{w}_{t-1}^{(k)}\right)+c,\end{split}start_ROW start_CELL roman_log ( italic_p ( bold_italic_θ | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ) end_CELL start_CELL = roman_log ( italic_p ( bold_italic_θ ) ) + roman_log ( italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_italic_θ ) ) + italic_c , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≈ roman_log ( italic_p ( bold_italic_θ ) ) + ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT roman_log ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) + italic_c , end_CELL end_ROW (5)

where c𝑐citalic_c results from the proportionality in Eq. (4), and we note that if resampling occurred at time t1𝑡1t-1italic_t - 1, which is always the case if following Alg. 1, we have w~t1=1/Ksubscript~𝑤𝑡11𝐾\widetilde{w}_{t-1}=1/Kover~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT = 1 / italic_K. Note that, in practice, log-weights are used for numerical stability, and we compute k=1Kwt(k)w~t1(k)superscriptsubscript𝑘1𝐾superscriptsubscript𝑤𝑡𝑘superscriptsubscript~𝑤𝑡1𝑘\sum_{k=1}^{K}w_{t}^{(k)}\widetilde{w}_{t-1}^{(k)}∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT using a logsumexp reparameterisation.

II-C Differentiable particle filter

Algorithm 2 Stop-gradient differentiable particle filter (DPF) [12]
1:Input: Observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, parameters 𝜽𝜽\bm{\theta}bold_italic_θ.
2:Output: Hidden state estimates 𝐱1:Tsubscript𝐱:1𝑇{\mathbf{x}}_{1:T}bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, particle weights w1:Tsubscript𝑤:1𝑇w_{1:T}italic_w start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT.
3:Draw 𝐱0(k)p(𝐱0|𝜽)similar-tosuperscriptsubscript𝐱0𝑘𝑝conditionalsubscript𝐱0𝜽{\mathbf{x}}_{0}^{(k)}\sim p({\mathbf{x}}_{0}|\bm{\theta})bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_θ ), for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K.
4:Set w~0(k)=w¯0(k)=1/Ksuperscriptsubscript~𝑤0𝑘superscriptsubscript¯𝑤0𝑘1𝐾\widetilde{w}_{0}^{(k)}=\overline{w}_{0}^{(k)}=1/Kover~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = 1 / italic_K, for k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K.
5:for t=1,,T𝑡1𝑇t=1,\dots,Titalic_t = 1 , … , italic_T and k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K do
6:    Sample at(k)Categorical((w¯t1))similar-tosuperscriptsubscript𝑎𝑡𝑘Categoricalbottomsubscript¯𝑤𝑡1a_{t}^{(k)}\sim\mathrm{Categorical}(\bot(\overline{w}_{t-1}))italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ roman_Categorical ( ⊥ ( over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) ).
7:    Set w~t(k)=1Kw¯t1at(k)/(w¯t1at(k))\widetilde{w}_{t}^{(k)}=\frac{1}{K}\ \overline{w}_{t-1}^{a_{t}^{(k)}}/\bot(% \overline{w}_{t-1}^{a_{t}^{(k)}})over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT / ⊥ ( over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ).
8:    Sample 𝐱t(k)π(𝐱t|𝐱t1at(k),𝐲t;𝜽)similar-tosuperscriptsubscript𝐱𝑡𝑘𝜋conditionalsubscript𝐱𝑡superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘subscript𝐲𝑡𝜽{\mathbf{x}}_{t}^{(k)}\sim\pi({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}^{a_{t}^{(k)}% },{\mathbf{y}}_{t};\bm{\theta})bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ∼ italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ).
9:    Compute wt(k)=p(𝐲t|𝐱t(k);𝜽)p(𝐱t(k)|𝐱t1at(k);𝜽)π(𝐱t|𝐱t1at(k),𝐲t;𝜽)superscriptsubscript𝑤𝑡𝑘𝑝conditionalsubscript𝐲𝑡superscriptsubscript𝐱𝑡𝑘𝜽𝑝conditionalsuperscriptsubscript𝐱𝑡𝑘superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘𝜽𝜋conditionalsubscript𝐱𝑡superscriptsubscript𝐱𝑡1superscriptsubscript𝑎𝑡𝑘subscript𝐲𝑡𝜽w_{t}^{(k)}=\frac{p({\mathbf{y}}_{t}|{\mathbf{x}}_{t}^{(k)};\bm{\theta})p({% \mathbf{x}}_{t}^{(k)}|{\mathbf{x}}_{t-1}^{a_{t}^{(k)}};\bm{\theta})}{\pi({% \mathbf{x}}_{t}|{\mathbf{x}}_{t-1}^{a_{t}^{(k)}},{\mathbf{y}}_{t};\bm{\theta})}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = divide start_ARG italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ; bold_italic_θ ) italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ; bold_italic_θ ) end_ARG start_ARG italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) end_ARG.
10:    Compute w¯t(k)=w~t1(k)wt(i)/k=1Kw~t1(k)wt(k)superscriptsubscript¯𝑤𝑡𝑘superscriptsubscript~𝑤𝑡1𝑘superscriptsubscript𝑤𝑡𝑖superscriptsubscript𝑘1𝐾superscriptsubscript~𝑤𝑡1𝑘superscriptsubscript𝑤𝑡𝑘\overline{w}_{t}^{(k)}=\widetilde{w}_{t-1}^{(k)}w_{t}^{(i)}/\sum_{k=1}^{K}% \widetilde{w}_{t-1}^{(k)}w_{t}^{(k)}over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT / ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over~ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT.
11:end for

The outputs of the particle filter, as given in Alg. 1, namely {𝐱1:T(k),w1:T(k)}k=1Ksuperscriptsubscriptsubscriptsuperscript𝐱𝑘:1𝑇subscriptsuperscript𝑤𝑘:1𝑇𝑘1𝐾\{{\mathbf{x}}^{(k)}_{1:T},w^{(k)}_{1:T}\}_{k=1}^{K}{ bold_x start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT , italic_w start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, are not differentiable with respect to 𝜽𝜽\bm{\theta}bold_italic_θ [12], because of the resampling step on line 6.  The resampling step requires sampling a multinomial distribution. Sampling a multinomial distribution is not differentiable, as an infinitesimal change in the input probabilities can lead to a discrete change in the output sample value [12].

Differentiable particle filters (DPFs) are recently introduced tools that modify the resampling step of the SIR particle filter to be differentiable, and therefore lead to a particle filtering algorithm that is differentiable with respect to 𝜽𝜽\bm{\theta}bold_italic_θ [11, 22, 12, 23]. There exist a number of DPF methods, based on various techniques for making the resampling step differentiable. These range from weight retention in soft resampling [10], to optimal transport [11]. An overview of the DPF and its interplay with deep learning methods can be found in [22], which motivates our choice of differentiable particle filter, and our usage of stochastic gradient descent methods.

In this work, we built upon the DPF approach from [12], which utilises a stop gradient operator to make the resampling step differentiable. We summarise this method in Alg. 2. This algorithm yields unbiased gradient estimates with minimal computational overhead, and does not modify the behaviour of the forward pass of the particle filter.

In Alg. 2, at each time-step t𝑡titalic_t, we first sample the previous particle set, sampling at(k)superscriptsubscript𝑎𝑡𝑘a_{t}^{(k)}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT, k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K, with probability w¯t1subscript¯𝑤𝑡1\overline{w}_{t-1}over¯ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT (line 6). The value of at(k)superscriptsubscript𝑎𝑡𝑘a_{t}^{(k)}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT determines the ancestry of the k𝑘kitalic_k-th particle at time t𝑡titalic_t. Note that we apply a stop gradient operator to the weights in the sampling method, and therefore do not attempt to propagate gradients through sampling a discrete distribution. We then set the weights of all K𝐾Kitalic_K particles to 1/K1𝐾1/K1 / italic_K whilst preserving gradient information in the weights by dividing the weights by themselves, applying a stop-gradient operation to the divisor, and then multiplying by the constant 1/K1𝐾1/K1 / italic_K (line 7). The rest of the DPF proceeds as Alg. 1, described in Sec. II-A, with the particle sampling, weighting, and normalisation steps unmodified.

The DPF is particularly useful when estimating parameters, as we can compute the gradients of functions of the likelihood and of the particle trajectories, and therefore compute parameters optimising a chosen loss function. In particular, we typically use likelihood-based losses when performing inference where the hidden state is unknown in the training data, and trajectory-based losses when the sequence of hidden states is known for the training data. Examples of likelihood-based losses are the negative log-likelihood and the ELBO. The mean-square error of the inferred state is a typical trajectory-based loss. Note that the trajectory of the hidden states depends on the weights, and therefore the weights must be differentiable even if the loss function is based only on the trajectory.

Given a loss function based on the weights and/or particles of a differentiable particle filter, we can compute the minimiser using a gradient-based optimisation scheme. This is efficient, especially compared to the gradient-free methods previously required, and is more robust to the stochasticity of the likelihood and state estimates, as many gradient schemes are designed with noisy gradients in mind [24, 25].

II-D Notation

We here present our notation, used throughout the paper. We denote by 𝐀,isubscript𝐀𝑖{\mathbf{A}}_{\cdot,i}bold_A start_POSTSUBSCRIPT ⋅ , italic_i end_POSTSUBSCRIPT the i𝑖iitalic_i-th column vector of matrix 𝐀𝐀{\mathbf{A}}bold_A, and by 𝐀i,subscript𝐀𝑖{\mathbf{A}}_{i,\cdot}bold_A start_POSTSUBSCRIPT italic_i , ⋅ end_POSTSUBSCRIPT the i𝑖iitalic_i-th row vector of matrix 𝐀𝐀{\mathbf{A}}bold_A. We denote by 𝐈𝐝nsubscript𝐈𝐝𝑛\mathbf{Id}_{n}bold_Id start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT the n×n𝑛𝑛n\times nitalic_n × italic_n identity matrix.

We denote by 𝟙cond(x)subscript1cond𝑥\mathds{1}_{\mathrm{cond}}(x)blackboard_1 start_POSTSUBSCRIPT roman_cond end_POSTSUBSCRIPT ( italic_x ) the binary indicator function, which returns 1111 when x𝑥xitalic_x satisfies condcond\mathrm{cond}roman_cond, and 00 otherwise. We denote by sgn(x)=𝟙0(x)𝟙0(x)sgn𝑥subscript1absent0𝑥subscript1absent0𝑥\mathrm{sgn}(x)=\mathds{1}_{\geq 0}(x)-\mathds{1}_{\leq 0}(x)roman_sgn ( italic_x ) = blackboard_1 start_POSTSUBSCRIPT ≥ 0 end_POSTSUBSCRIPT ( italic_x ) - blackboard_1 start_POSTSUBSCRIPT ≤ 0 end_POSTSUBSCRIPT ( italic_x ) the sign function. We denote by abs(x)abs𝑥\mathrm{abs}(x)roman_abs ( italic_x ) the absolute value function. All three functions can be applied element-wise to a matrix or vector.

We denote by 0subscript0\mathbb{N}_{0}blackboard_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT the natural numbers including 00, and by \mathbb{N}blackboard_N the natural numbers excluding 00.

We denote by count(a,b)count𝑎𝑏\mathrm{count}(a,b)roman_count ( italic_a , italic_b ) the number of times the item a𝑎aitalic_a occurs in the list b𝑏bitalic_b, and by cwr(S,r)cwr𝑆𝑟\mathrm{cwr}(S,r)roman_cwr ( italic_S , italic_r ) the set of all length r𝑟ritalic_r combinations of the elements of the set S𝑆Sitalic_S with replacement, up to reordering. For example, we have count(1,[1,2,2,3,1])=2count1122312\mathrm{count}(1,[1,2,2,3,1])=2roman_count ( 1 , [ 1 , 2 , 2 , 3 , 1 ] ) = 2 and cwr({a,b,c},2)={[a,a],[a,b],[a,c],[b,b],[b,c],[c,c]}cwr𝑎𝑏𝑐2𝑎𝑎𝑎𝑏𝑎𝑐𝑏𝑏𝑏𝑐𝑐𝑐\mathrm{cwr}(\{a,b,c\},2)=\{[a,a],[a,b],[a,c],[b,b],[b,c],[c,c]\}roman_cwr ( { italic_a , italic_b , italic_c } , 2 ) = { [ italic_a , italic_a ] , [ italic_a , italic_b ] , [ italic_a , italic_c ] , [ italic_b , italic_b ] , [ italic_b , italic_c ] , [ italic_c , italic_c ] }.

We denote by (x)bottom𝑥\bot(x)⊥ ( italic_x ) the stop-gradient operator applied to x𝑥xitalic_x, with properties as defined in [12].

III Sparse estimation of non-linear state-space models

This Section presents our main contribution. We propose our approach to model and learn the state transition distribution of a general state-space model. Our approach builds upon a polynomial approximation to the transition distribution parameterised by 𝐂Nx×M𝐂superscriptsubscript𝑁𝑥𝑀{\mathbf{C}}\in\mathbb{R}^{N_{x}\times M}bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT, a matrix of real numbers encoding polynomail coefficients, to be learnt, and 𝐃0Nx×M𝐃superscriptsubscript0subscript𝑁𝑥𝑀{\mathbf{D}}\in\mathbb{N}_{0}^{N_{x}\times M}bold_D ∈ blackboard_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT a fixed (i.e., known) integer matrix of monomial degrees associated with 𝐂𝐂{\mathbf{C}}bold_C.

The positive integer d𝑑ditalic_d denotes the fixed maximum degree of our polynomial approximation. The number of monomials of degree d𝑑ditalic_d in Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT variables is M=n=0d(n+Nx1Nx1)𝑀superscriptsubscript𝑛0𝑑binomial𝑛subscript𝑁𝑥1subscript𝑁𝑥1M=\sum_{n=0}^{d}\binom{n+N_{x}-1}{N_{x}-1}italic_M = ∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( FRACOP start_ARG italic_n + italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG ). No other model parameters are assumed unknown, hence 𝜽𝜽\bm{\theta}bold_italic_θ is equal in our case to 𝐂𝐂{\mathbf{C}}bold_C, and, in particular, p(𝐱t|𝐱t1;𝜽)=p(𝐱t|𝐱t1;𝐂)𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝜽𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝐂p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};\bm{\theta})=p({\mathbf{x}}_{t}|{\mathbf% {x}}_{t-1};{\mathbf{C}})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_θ ) = italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_C ). For example, in the additive zero-mean Gaussian noise case, we have

𝐱tp(𝐱t|𝐱t1,𝐂):=𝒩(f(𝐱t1,𝐂;𝐃),𝚺v),similar-tosubscript𝐱𝑡𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝐂assign𝒩𝑓subscript𝐱𝑡1𝐂𝐃subscript𝚺𝑣{\mathbf{x}}_{t}\sim p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1},{\mathbf{C}}):=% \mathcal{N}(f({\mathbf{x}}_{t-1},{\mathbf{C}};{\mathbf{D}}),\bm{\Sigma}_{v}),bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_C ) := caligraphic_N ( italic_f ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_C ; bold_D ) , bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) , (6)

where 𝚺vsubscript𝚺𝑣\bm{\Sigma}_{v}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT is the covariance of the state noise distribution, and, for every 𝐱=[x1,x2,,xNx]Nx𝐱superscriptsubscript𝑥1subscript𝑥2subscript𝑥subscript𝑁𝑥topsuperscriptsubscript𝑁𝑥{\mathbf{x}}=[x_{1},x_{2},\dots,x_{N_{x}}]^{\top}\in\mathbb{R}^{N_{x}}bold_x = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT,

f(𝐱,𝐂;𝐃)=j=1M(𝐂,ji=1NxxiDi,j).𝑓𝐱𝐂𝐃superscriptsubscript𝑗1𝑀subscript𝐂𝑗superscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗f({\mathbf{x}},{\mathbf{C}};{\mathbf{D}})=\sum_{j=1}^{M}\left({\mathbf{C}}_{% \cdot,j}\cdot\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}\right).italic_f ( bold_x , bold_C ; bold_D ) = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ( bold_C start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT ⋅ ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) . (7)

is our considered polynomial approximation. The degree matrix 𝐃𝐃{\mathbf{D}}bold_D is constructed from the number of hidden states Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and the maximum degree d𝑑ditalic_d, and is a static parameter.

The remainder of this section is structured as follows. Our polynomial model is thoroughly described in Section III-A. We then design an approach to learn the coefficient matrix 𝐂𝐂{\mathbf{C}}bold_C using a maximum-a-posteriori estimator under a sparsity inducing penalty. We next discuss in Section III-B the graphical interpretation of 𝐂𝐂{\mathbf{C}}bold_C and the key role of sparsity. We then use the Lorenz 63 dynamical system as a pedagogical example to better illustrate the usefulness of our model, in Section  III-C. We move to the presentation of our approach to estimate 𝐂𝐂{\mathbf{C}}bold_C, starting from the problem definition in Section III-D, a single batch algorithm S-GraphGrad in Section III-E, and a practical mini-batch implementation, leading to our final algorithm B-GraphGrad, in Section III-F.

III-A Constructing a polynomial approximant

In order to learn a polynomial approximation to the state transition distribution, let us first define the learnt and static parameters of our polynomials. A polynomial is the result of summing a number of monomials, and hence can be constructed as a sum of estimated monomial terms. The degree of a monomial is given by the sum of the powers of its constituent terms, with a polynomial having degree equal to the maximum degree of its constituent monomials. In our model, we assume a fixed maximum degree d𝑑ditalic_d for our polynomial approximation.

Once d𝑑ditalic_d is set, we construct all length Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT sequences of positive integers that sum to nd,n0formulae-sequence𝑛𝑑𝑛subscript0n\leq d,\,n\in\mathbb{N}_{0}italic_n ≤ italic_d , italic_n ∈ blackboard_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, resulting in

M=n=0d(n+Nx1Nx1)𝑀superscriptsubscript𝑛0𝑑binomial𝑛subscript𝑁𝑥1subscript𝑁𝑥1M=\sum_{n=0}^{d}\binom{n+N_{x}-1}{N_{x}-1}italic_M = ∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( FRACOP start_ARG italic_n + italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG ) (8)

unique sequences. This simple procedure allows us to generate the powers of all monomial terms in a polynomial of degree d𝑑ditalic_d, that we store in an Nx×Msubscript𝑁𝑥𝑀N_{x}\times Mitalic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M matrix, denoted 𝐃𝐃{\mathbf{D}}bold_D, with the term Di,jsubscript𝐷𝑖𝑗D_{i,j}italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT corresponding to the power of state dimension i𝑖iitalic_i in the j𝑗jitalic_j-th monomial term. The polynomial expression is then defined by matrix 𝐂𝐂{\mathbf{C}}bold_C, following Eq. (7). The ordering of the monomials is arbitrary but must be consistent, as it implies the order of the columns of 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D.

III-A1 Generating the degree matrix

We will now specify the construction of 𝐃𝐃{\mathbf{D}}bold_D, the degree matrix. As before, 𝐃𝐃{\mathbf{D}}bold_D is such that Di,jsubscript𝐷𝑖𝑗D_{i,j}italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT is the power of state dimension i𝑖iitalic_i in the j𝑗jitalic_j-th monomial term. For example, if Nx=3subscript𝑁𝑥3N_{x}=3italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 3, and, for some j{1,,M}𝑗1𝑀j\in\{1,\ldots,M\}italic_j ∈ { 1 , … , italic_M }, 𝐃,j=[0,1,2]subscript𝐃𝑗012{\mathbf{D}}_{\cdot,j}=[0,1,2]bold_D start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT = [ 0 , 1 , 2 ], then the value of the j𝑗jitalic_j-th monomial term when evaluating the transition of the i𝑖iitalic_i-th state dimension is Ci,jx10x21x32subscript𝐶𝑖𝑗superscriptsubscript𝑥10superscriptsubscript𝑥21superscriptsubscript𝑥32C_{i,j}x_{1}^{0}x_{2}^{1}x_{3}^{2}italic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, where Ci,jsubscript𝐶𝑖𝑗C_{i,j}italic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT is a learnt coefficient. The degree matrix 𝐃𝐃{\mathbf{D}}bold_D is static, and hence the same for every state, meaning that all states fit a polynomial of the same maximum degree.

Our method for construction the degree matrix 𝐃𝐃{\mathbf{D}}bold_D is given in Alg. 3, where ‘countcount\mathrm{count}roman_count’ and ‘cwrcwr\mathrm{cwr}roman_cwr’ are defined in Sec. II-D. Alg. 3 takes the union of all possible combinations with replacement of the set {1,2,,Nx}12subscript𝑁𝑥\{1,2,\dots,N_{x}\}{ 1 , 2 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } of length less than or equal to d𝑑ditalic_d, denoting by 𝒬𝒬\mathcal{Q}caligraphic_Q the resulting set. We then construct 𝐃𝐃{\mathbf{D}}bold_D by setting each entry Di,jsubscript𝐷𝑖𝑗D_{i,j}italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT equal to the number of times i𝑖iitalic_i occurs in the j𝑗jitalic_j-th element of 𝒬𝒬\mathcal{Q}caligraphic_Q, for i{1,,Nx}𝑖1subscript𝑁𝑥i\in\{1,\dots,N_{x}\}italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } and j{1,,M}𝑗1𝑀j\in\{1,\dots,M\}italic_j ∈ { 1 , … , italic_M }.

Algorithm 3 Generating the degree matrix 𝐃𝐃{\mathbf{D}}bold_D
1:Input: State size Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, maximal degree d𝑑ditalic_d.
2:Output: Matrix 𝐃Nx×M𝐃superscriptsubscript𝑁𝑥𝑀{\mathbf{D}}\in\mathbb{R}^{N_{x}\times M}bold_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT of monomial degrees.
3:Compute 𝒬=δ=0dcwr([1,2,,Nx],δ)𝒬superscriptsubscript𝛿0𝑑cwr12subscript𝑁𝑥𝛿\mathcal{Q}=\bigcup\limits_{\delta=0}^{d}\mathrm{cwr}([1,2,\dots,N_{x}],\delta)caligraphic_Q = ⋃ start_POSTSUBSCRIPT italic_δ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT roman_cwr ( [ 1 , 2 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ] , italic_δ )
4:Di,j=count(i,𝒬j)subscript𝐷𝑖𝑗count𝑖subscript𝒬𝑗D_{i,j}=\mathrm{count}(i,\mathcal{Q}_{j})italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = roman_count ( italic_i , caligraphic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) i{1,,Nx},j{1,,M}formulae-sequencefor-all𝑖1subscript𝑁𝑥𝑗1𝑀\forall\ i\in\{1,\dots,N_{x}\},\ j\in\{1,\dots,M\}∀ italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } , italic_j ∈ { 1 , … , italic_M }.

We observe that |𝒬|=M𝒬𝑀|\mathcal{Q}|=M| caligraphic_Q | = italic_M. Note that the set 𝒬𝒬\mathcal{Q}caligraphic_Q has no inherent ordering, but we access it by index. We must therefore impose an ordering on the set 𝒬𝒬\mathcal{Q}caligraphic_Q. One such ordering is lexicographical ordering. To apply this ordering, we first count how many times each number appears in an element of 𝒬𝒬\mathcal{Q}caligraphic_Q. We then order these elements by the number of times 1111 appears, and in case of equality comparing the number of times 2222 appears, and so on until Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. Note that the ordering of the degree matrix does not change the properties of the algorithm. In this work, for interpretability, we sort monomials in ascending order by degree, then sorting by lexicographical order as a tiebreaker within degree, as this aligns closely with how the degree matrix is generated in Alg. 3, by iterating over degrees.

III-A2 Initialising the coefficient matrix and evaluating the polynomial

Now that 𝐃𝐃{\mathbf{D}}bold_D is constructed, to evaluate the resulting polynomial for each state dimension, we require 𝐂𝐂{\mathbf{C}}bold_C, encoding the coefficients of each monomial term in every state dimension. These coefficients are our object of inference, and therefore should be stored in a manner admitting efficient evaluation of the polynomial.

We have M𝑀Mitalic_M monomials for each of the Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT state dimensions, and therefore we propose to store the monomial coefficients in an Nx×Msubscript𝑁𝑥𝑀N_{x}\times Mitalic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M matrix 𝐂𝐂{\mathbf{C}}bold_C, where Ci,jsubscript𝐶𝑖𝑗C_{i,j}italic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT corresponds to the coefficient of the j𝑗jitalic_j-th monomial when computing state i𝑖iitalic_i. This gives us a total of NxM=Nxn=0d(n+Nx1Nx1)subscript𝑁𝑥𝑀subscript𝑁𝑥superscriptsubscript𝑛0𝑑binomial𝑛subscript𝑁𝑥1subscript𝑁𝑥1N_{x}\cdot M=N_{x}\sum_{n=0}^{d}\binom{n+N_{x}-1}{N_{x}-1}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ italic_M = italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( FRACOP start_ARG italic_n + italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_ARG ) parameters to estimate.

Following usual broadcasting rules, given 𝐱𝐱{\mathbf{x}}bold_x, 𝐃𝐃{\mathbf{D}}bold_D, and 𝐂𝐂{\mathbf{C}}bold_C, we can now evaluate the value of our polynomial at any 𝐱Nx𝐱superscriptsubscript𝑁𝑥{\mathbf{x}}\in\mathbb{R}^{N_{x}}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT by Eq. (7). Note that i=1NxxiDi,jsuperscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is scalar, with 𝐂,ji=1NxxiDi,jsubscript𝐂𝑗superscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗{\mathbf{C}}_{\cdot,j}\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}bold_C start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT being the vector 𝐂,jsubscript𝐂𝑗{\mathbf{C}}_{\cdot,j}bold_C start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT multiplied by the scalar i=1NxxiDi,jsuperscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. In effect, i=1NxxiDi,jsuperscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT evaluates the j𝑗jitalic_j-th monomial term with coefficient 1111, and this calculation is reused for every state dimension, with 𝐂,ji=1NxxiDi,jsubscript𝐂𝑗superscriptsubscriptproduct𝑖1subscript𝑁𝑥superscriptsubscript𝑥𝑖subscript𝐷𝑖𝑗{\mathbf{C}}_{\cdot,j}\prod_{i=1}^{N_{x}}x_{i}^{D_{i,j}}bold_C start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT applying the coefficients, which are unique to each state dimension. Once the model is initialised, our goal is to learn the coefficient matrix 𝐂𝐂{\mathbf{C}}bold_C, since this matrix, in conjunction with the known fixed degree matrix 𝐃𝐃{\mathbf{D}}bold_D, defines the transition density p(𝐱t|𝐱t1;𝜽)=p(𝐱t|𝐱t1;𝐂)𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝜽𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝐂p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};\bm{\theta})=p({\mathbf{x}}_{t}|{\mathbf% {x}}_{t-1};{\mathbf{C}})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_θ ) = italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_C ) in Eq. (1), where 𝜽𝜽\bm{\theta}bold_italic_θ is the set of learnt parameters, in our case 𝐂𝐂{\mathbf{C}}bold_C.

III-B A graphical interpretation of matrices 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D

Within time series modelling, we can often interpret the driving parameters of a sparse model as a graph encoding the connectivity of the system [17, 21, 26, 27, 28, 29]. This is the case here as well, where we observe that, for (a,b){1,,Nx}2𝑎𝑏superscript1subscript𝑁𝑥2(a,b)\in\{1,\dots,N_{x}\}^{2}( italic_a , italic_b ) ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, state dimension b𝑏bitalic_b affects state dimension a𝑎aitalic_a in our estimated dynamics if (abs(𝐂)𝐃)a,b0subscriptabs𝐂superscript𝐃top𝑎𝑏0(\mathrm{abs}({\mathbf{C}}){\mathbf{D}}^{\top})_{a,b}\neq 0( roman_abs ( bold_C ) bold_D start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT ≠ 0, where we note that abs(𝐂)𝐃abs𝐂superscript𝐃top\mathrm{abs}({\mathbf{C}}){\mathbf{D}}^{\top}roman_abs ( bold_C ) bold_D start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT is an Nx×Nxsubscript𝑁𝑥subscript𝑁𝑥N_{x}\times N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT matrix. We can interpret this in terms of Granger causality, where we see that including information from state b𝑏bitalic_b improves the knowledge on state a𝑎aitalic_a, and therefore we say that state b𝑏bitalic_b Granger-causes state a𝑎aitalic_a if (abs(𝐂)𝐃)a,b0subscriptabs𝐂superscript𝐃top𝑎𝑏0(\mathrm{abs}({\mathbf{C}}){\mathbf{D}}^{\top})_{a,b}\neq 0( roman_abs ( bold_C ) bold_D start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT ≠ 0.

We are therefore able to construct a directed graph of our estimated system from the matrices 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D. This graph has adjacency matrix 𝐀𝐀{\mathbf{A}}bold_A, defined by 𝐀=𝟙0(abs(𝐂)𝐃){0,1}Nx×Nx𝐀subscript1absent0abs𝐂superscript𝐃topsuperscript01subscript𝑁𝑥subscript𝑁𝑥{\mathbf{A}}=\mathds{1}_{\neq 0}(\mathrm{abs}({\mathbf{C}}){\mathbf{D}}^{\top}% )\in\{0,1\}^{N_{x}\times N_{x}}bold_A = blackboard_1 start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT ( roman_abs ( bold_C ) bold_D start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where we have an edge from node b𝑏bitalic_b to node a𝑎aitalic_a if Aa,b=1subscript𝐴𝑎𝑏1A_{a,b}=1italic_A start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT = 1, and no edge if Aa,b=0subscript𝐴𝑎𝑏0A_{a,b}=0italic_A start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT = 0. If this graph is not fully connected, or equivalently if there exist (a,b){1,,Nx}2𝑎𝑏superscript1subscript𝑁𝑥2(a,b)\in\{1,\dots,N_{x}\}^{2}( italic_a , italic_b ) ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT such that Aa,b=Ab,a=0subscript𝐴𝑎𝑏subscript𝐴𝑏𝑎0A_{a,b}=A_{b,a}=0italic_A start_POSTSUBSCRIPT italic_a , italic_b end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_b , italic_a end_POSTSUBSCRIPT = 0, then some state dimensions do not directly interact in our estimated system.

We can also interpret our system as a collection of M𝑀Mitalic_M graphs, {𝒢j}j=1Msuperscriptsubscriptsubscript𝒢𝑗𝑗1𝑀\{\mathcal{G}_{j}\}_{j=1}^{M}{ caligraphic_G start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT, where the j𝑗jitalic_j-th graph 𝒢jsubscript𝒢𝑗\mathcal{G}_{j}caligraphic_G start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT has adjacency matrix abs(𝐂,j)𝐃,jabssubscript𝐂𝑗superscriptsubscript𝐃𝑗top\mathrm{abs}({\mathbf{C}}_{\cdot,j}){\mathbf{D}}_{\cdot,j}^{\top}roman_abs ( bold_C start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT ) bold_D start_POSTSUBSCRIPT ⋅ , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. Each 𝒢jsubscript𝒢𝑗\mathcal{G}_{j}caligraphic_G start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT can be interpreted as encoding the connectivity resulting from the j𝑗jitalic_j-th monomial. A sparsity promoting prior on 𝐂𝐂{\mathbf{C}}bold_C also promotes sparsity in these graphs. We can therefore interpret our method of estimating 𝐂𝐂{\mathbf{C}}bold_C as estimating multiple interacting sparse graphical models, one for each of the underlying monomials.

III-C An example: Lorenz 63

We present here a worked example utilising the Lorenz 63 model. The Lorenz 63 model [13] is a popular chaotic oscillator model, with a discrete time variant given by,

x1,t=x1,t1+Δt(σ(x2,t1x1,t1))+v1,t,x2,t=x2,t1+Δt(x1,t1(ρx3,t1)x2,t1)+v2,t,x3,t=x3,t1+Δt(x1,t1x2,t1βx3,t1)+v3,t,𝐲t=𝐱t+𝐫t,formulae-sequencesubscript𝑥1𝑡subscript𝑥1𝑡1Δ𝑡𝜎subscript𝑥2𝑡1subscript𝑥1𝑡1subscript𝑣1𝑡formulae-sequencesubscript𝑥2𝑡subscript𝑥2𝑡1Δ𝑡subscript𝑥1𝑡1𝜌subscript𝑥3𝑡1subscript𝑥2𝑡1subscript𝑣2𝑡formulae-sequencesubscript𝑥3𝑡subscript𝑥3𝑡1Δ𝑡subscript𝑥1𝑡1subscript𝑥2𝑡1𝛽subscript𝑥3𝑡1subscript𝑣3𝑡subscript𝐲𝑡subscript𝐱𝑡subscript𝐫𝑡\displaystyle\begin{split}x_{1,t}&=x_{1,t-1}+\Delta t(\sigma(x_{2,t-1}-x_{1,t-% 1}))+v_{1,t},\\ x_{2,t}&=x_{2,t-1}+\Delta t(x_{1,t-1}(\rho-x_{3,t-1})-x_{2,t-1})+v_{2,t},\\ x_{3,t}&=x_{3,t-1}+\Delta t(x_{1,t-1}x_{2,t-1}-\beta x_{3,t-1})+v_{3,t},\\ {\mathbf{y}}_{t}&={\mathbf{x}}_{t}+{\mathbf{r}}_{t},\end{split}start_ROW start_CELL italic_x start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT end_CELL start_CELL = italic_x start_POSTSUBSCRIPT 1 , italic_t - 1 end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_σ ( italic_x start_POSTSUBSCRIPT 2 , italic_t - 1 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 1 , italic_t - 1 end_POSTSUBSCRIPT ) ) + italic_v start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT end_CELL start_CELL = italic_x start_POSTSUBSCRIPT 2 , italic_t - 1 end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_x start_POSTSUBSCRIPT 1 , italic_t - 1 end_POSTSUBSCRIPT ( italic_ρ - italic_x start_POSTSUBSCRIPT 3 , italic_t - 1 end_POSTSUBSCRIPT ) - italic_x start_POSTSUBSCRIPT 2 , italic_t - 1 end_POSTSUBSCRIPT ) + italic_v start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 3 , italic_t end_POSTSUBSCRIPT end_CELL start_CELL = italic_x start_POSTSUBSCRIPT 3 , italic_t - 1 end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_x start_POSTSUBSCRIPT 1 , italic_t - 1 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 2 , italic_t - 1 end_POSTSUBSCRIPT - italic_β italic_x start_POSTSUBSCRIPT 3 , italic_t - 1 end_POSTSUBSCRIPT ) + italic_v start_POSTSUBSCRIPT 3 , italic_t end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL start_CELL = bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , end_CELL end_ROW (9)

where ΔtΔ𝑡\Delta troman_Δ italic_t is the time elapsed between observations, 𝐯t𝒩(𝟎,Σv)similar-tosubscript𝐯𝑡𝒩0subscriptΣ𝑣{\mathbf{v}}_{t}\sim\mathcal{N}(\bm{0},\Sigma_{v})bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , roman_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) the state noise term, and 𝐫t𝒩(𝟎,Σr)similar-tosubscript𝐫𝑡𝒩0subscriptΣ𝑟{\mathbf{r}}_{t}\sim\mathcal{N}(\bm{0},\Sigma_{r})bold_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) the observation noise term, and β,ρ,σ𝛽𝜌𝜎\beta,\,\rho,\,\sigmaitalic_β , italic_ρ , italic_σ are real scalar parameters, often taken as β=8/3,σ=10,ρ=28formulae-sequence𝛽83formulae-sequence𝜎10𝜌28\beta=8/3,\ \sigma=10,\ \rho=28italic_β = 8 / 3 , italic_σ = 10 , italic_ρ = 28. The initial condition 𝐱0subscript𝐱0{\mathbf{x}}_{0}bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is arbitrary, so long as it is non-zero, and is often taken to be [1,0,0]100[1,0,0][ 1 , 0 , 0 ].

We can see that the transition state system in Eq. (9) is described by a degree d=2𝑑2d=2italic_d = 2 polynomial in Nx=3subscript𝑁𝑥3N_{x}=3italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 3 variables. Therefore, using our notations, we have

Q={[1,1],[1,2],[1,3],[1],[2,2],[2,3],[2],[3,3],[3],[]}𝑄111213delimited-[]12223delimited-[]233delimited-[]3Q=\{[1,1],[1,2],[1,3],[1],[2,2],[2,3],[2],[3,3],[3],[]\}italic_Q = { [ 1 , 1 ] , [ 1 , 2 ] , [ 1 , 3 ] , [ 1 ] , [ 2 , 2 ] , [ 2 , 3 ] , [ 2 ] , [ 3 , 3 ] , [ 3 ] , [ ] }

under lexicographical ordering, and

Q={[],[1],[2],[3],[1,1],[1,2],[1,3],[2,2],[2,3],[3,3]}𝑄delimited-[]1delimited-[]2delimited-[]3111213222333Q=\{[],[1],[2],[3],[1,1],[1,2],[1,3],[2,2],[2,3],[3,3]\}italic_Q = { [ ] , [ 1 ] , [ 2 ] , [ 3 ] , [ 1 , 1 ] , [ 1 , 2 ] , [ 1 , 3 ] , [ 2 , 2 ] , [ 2 , 3 ] , [ 3 , 3 ] }

under lexicographical-in-degree ordering. From the lexicographical-in-degree ordering, the resulting degree matrix 𝐃𝐃{\mathbf{D}}bold_D is

𝐃=(010021100000100102100001001012).𝐃matrix010021100000100102100001001012{\mathbf{D}}=\begin{pmatrix}0&1&0&0&2&1&1&0&0&0\\ 0&0&1&0&0&1&0&2&1&0\\ 0&0&0&1&0&0&1&0&1&2\\ \end{pmatrix}.bold_D = ( start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL end_ROW end_ARG ) .

Given the above degree matrix 𝐃𝐃{\mathbf{D}}bold_D, we can extract from Eq. (9) the true coefficient matrix 𝐂𝐂{\mathbf{C}}bold_C,

𝐂=(01σΔtσΔt00000000ρΔt1Δt000Δt0000001βΔt0Δt0000),𝐂matrix01𝜎Δ𝑡𝜎Δ𝑡00000000𝜌Δ𝑡1Δ𝑡000Δ𝑡0000001𝛽Δ𝑡0Δ𝑡0000{\mathbf{C}}=\begin{pmatrix}0&1-\sigma\Delta t&\sigma\Delta t&0&0&0&0&0&0&0\\ 0&\rho\Delta t&1-\Delta t&0&0&0&-\Delta t&0&0&0\\ 0&0&0&1-\beta\Delta t&0&\Delta t&0&0&0&0\\ \end{pmatrix},bold_C = ( start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 - italic_σ roman_Δ italic_t end_CELL start_CELL italic_σ roman_Δ italic_t end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_ρ roman_Δ italic_t end_CELL start_CELL 1 - roman_Δ italic_t end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - roman_Δ italic_t end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 - italic_β roman_Δ italic_t end_CELL start_CELL 0 end_CELL start_CELL roman_Δ italic_t end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ) ,

which, notably, is sparse with 23232323 elements out of 30303030 equal to zero. We can verify that inputting the above 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D into Eq. (6) and Eq. (7) yields the system given in Eq. (9). Furthermore, we construct the adjacency matrix 𝐀=𝟙0(abs(𝐂)𝐃)𝐀subscript1absent0abs𝐂superscript𝐃top{\mathbf{A}}=\mathds{1}_{\neq 0}(\mathrm{abs}({\mathbf{C}}){\mathbf{D}}^{\top})bold_A = blackboard_1 start_POSTSUBSCRIPT ≠ 0 end_POSTSUBSCRIPT ( roman_abs ( bold_C ) bold_D start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) from the above 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D matrices, yielding the adjacency matrix and associated directed graph given in Fig. 1.

𝐀=(110111111),𝐀matrix110111111{\mathbf{A}}=\begin{pmatrix}1&1&0\\ 1&1&1\\ 1&1&1\\ \end{pmatrix},bold_A = ( start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ) ,
Refer to caption
Figure 1: Graph and adjacency matrix encoding the connectivity of the Lorenz 63 system.

In this example, the coefficient matrix 𝐂𝐂{\mathbf{C}}bold_C is sparse, while the adjacency matrix 𝐀𝐀{\mathbf{A}}bold_A gives a highly connected graph. Imposing a sparse 𝐂𝐂{\mathbf{C}}bold_C at the estimation stage therefore gives more information than the adjacency matrix 𝐀𝐀{\mathbf{A}}bold_A, as we also obtain estimates of the type of interactions that occur between the hidden states.

III-D Parameter estimation

We now move to the learning procedure, for the unknown parameter 𝜽:=𝐂Nx×Massign𝜽𝐂superscriptsubscript𝑁𝑥𝑀\bm{\theta}:={\mathbf{C}}\in\mathbb{R}^{N_{x}\times M}bold_italic_θ := bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT. This is done via a maximum-a-posteriori approach, that is by minimising the penalised negative log-likelihood Rsubscript𝑅\ell_{R}roman_ℓ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT, given by

R(𝐂|𝐲1:T,λ)=(𝐂;𝐲1:T)+λR(𝐂),subscript𝑅conditional𝐂subscript𝐲:1𝑇𝜆𝐂subscript𝐲:1𝑇𝜆𝑅𝐂\ell_{R}({\mathbf{C}}|{\mathbf{y}}_{1:T},\lambda)=\ell({\mathbf{C}};{\mathbf{y% }}_{1:T})+\lambda R({\mathbf{C}}),roman_ℓ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( bold_C | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT , italic_λ ) = roman_ℓ ( bold_C ; bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) + italic_λ italic_R ( bold_C ) , (10)

where

(𝐂|𝐲1:T)=log(p(𝐲1:T|𝐂)),conditional𝐂subscript𝐲:1𝑇𝑝conditionalsubscript𝐲:1𝑇𝐂\ell({\mathbf{C}}|{\mathbf{y}}_{1:T})=-\log(p({\mathbf{y}}_{1:T}|{\mathbf{C}})),roman_ℓ ( bold_C | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) = - roman_log ( italic_p ( bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_C ) ) , (11)

and λR(𝐂)𝜆𝑅𝐂\lambda R({\mathbf{C}})italic_λ italic_R ( bold_C ) is a sparsity promoting penalty term acting on 𝐂𝐂{\mathbf{C}}bold_C, with penalty weight λ>0𝜆0\lambda>0italic_λ > 0. Hence, we aim to compute

𝐂^=argmin𝐂Nx×MR(𝐂|𝐲1:T,λ).^𝐂subscriptargmin𝐂superscriptsubscript𝑁𝑥𝑀subscript𝑅conditional𝐂subscript𝐲:1𝑇𝜆\widehat{{\mathbf{C}}}=\operatorname*{argmin}_{{\mathbf{C}}\in\mathbb{R}^{N_{x% }\times M}}\ell_{R}({\mathbf{C}}|{\mathbf{y}}_{1:T},\lambda).over^ start_ARG bold_C end_ARG = roman_argmin start_POSTSUBSCRIPT bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ( bold_C | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT , italic_λ ) . (12)

We propose to adopt the L1𝐿1L1italic_L 1 penalty to promote sparsity, given by

(𝐂Nx×M)R(𝐂)=𝐂1:=j=1Mi=1Nx|Ci,j|.for-all𝐂superscriptsubscript𝑁𝑥𝑀𝑅𝐂subscriptnorm𝐂1assignsuperscriptsubscript𝑗1𝑀superscriptsubscript𝑖1subscript𝑁𝑥subscript𝐶𝑖𝑗(\forall{\mathbf{C}}\in\mathbb{R}^{N_{x}\times M})\quad R({\mathbf{C}})=||{% \mathbf{C}}||_{1}:=\sum_{j=1}^{M}\sum_{i=1}^{N_{x}}|C_{i,j}|.( ∀ bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT ) italic_R ( bold_C ) = | | bold_C | | start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT := ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | italic_C start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT | . (13)

We propose to solve Eq. (12), using a first-order optimisation scheme, combining a gradient step on the log-likelihood, and a proximity step over λR𝜆𝑅\lambda Ritalic_λ italic_R. We will now describe the gradient step, and then describe the proximity step.

III-D1 Estimating the parameter likelihood and its gradient

In order to resolve Eq. (12) using a first order optimisation scheme, we require to evaluate the negative log-likelihood and its first derivative with respect to 𝐂𝐂{\mathbf{C}}bold_C, given the observation series and the state-space model. We propose to estimate the likelihood through Eq. (4), where we have 𝜽:=𝐂assign𝜽𝐂\bm{\theta}:={\mathbf{C}}bold_italic_θ := bold_C, our parameter of interest. We then transform this quantity to the negative log-likelihood through negating the logarithm of the resulting estimate. For stability reasons, in practice, log-weights are used, and the log-likelihood is thus computed directly. The obtained estimate of the log-likelihood depends on the particle weights, which, in the standard SIR particle filter of Alg. 1, are subject to a non-differentiable resampling step.

Thankfully, we can utilise the DPF approach discussed in Sec. II-C, to obtain an estimate of the negative log-likelihood with respect to our parameter 𝐂𝐂{\mathbf{C}}bold_C, and the gradient of the negative log-likelihood with respect to 𝐂𝐂{\mathbf{C}}bold_C. The DPF yields a stochastic estimate of the gradient, which we can use to perform gradient based minimisation of the negative log-likelihood. In order to be robust to gradient noise, we propose to rely on first-order updates from the deep learning literature, where stochastic gradients are common due to stochastic input batches. In our experiments, we utilise the Novograd optimiser [25, 30] to compute our gradient updates for the negative log-likelihood, as it is robust to gradient outliers [30].

We denote the result of the gradient update utilising the negative log-likelihood at iteration s𝑠sitalic_s,

𝐂~s=updateη(𝐂s1,(𝐂s1|𝐲1:T)),subscript~𝐂𝑠subscriptupdate𝜂subscript𝐂𝑠1conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇\widetilde{{\mathbf{C}}}_{s}=\mathrm{update}_{\eta}({\mathbf{C}}_{s-1},\nabla% \ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T})),over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = roman_update start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT , ∇ roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ) , (14)

where updateη(𝐀,(𝐁|𝐲1:T))subscriptupdate𝜂𝐀conditional𝐁subscript𝐲:1𝑇\mathrm{update}_{\eta}({\mathbf{A}},\nabla\ell({\mathbf{B}}|{\mathbf{y}}_{1:T}))roman_update start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_A , ∇ roman_ℓ ( bold_B | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ) is a step of a given minimisation scheme with learning rate η𝜂\etaitalic_η applied to 𝐀𝐀{\mathbf{A}}bold_A with gradient of the negative log-likelihood obtained from running the particle filter with parameter 𝐁𝐁{\mathbf{B}}bold_B and observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT.

III-D2 Proximal update on the penalty term

Given that we can estimate the negative log-likelihood (𝐂;𝐲1:T)𝐂subscript𝐲:1𝑇\ell({\mathbf{C}};{\mathbf{y}}_{1:T})roman_ℓ ( bold_C ; bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) and its gradient (𝐂;𝐲1:T)𝐂subscript𝐲:1𝑇\nabla\ell({\mathbf{C}};{\mathbf{y}}_{1:T})∇ roman_ℓ ( bold_C ; bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ), we now need to account for the penalty term λR(𝐂)𝜆𝑅𝐂\lambda R({\mathbf{C}})italic_λ italic_R ( bold_C ). Note that the L1𝐿1L1italic_L 1 penalty is not differentiable when a coordinate is 00, hence it cannot be optimised using standard gradient descent if we aim to recover sparse estimates. It is however convex, and particularly well suited to the use of a proximity operator update [31], that can be understood as an implicit subgradient step.

The proximity operator update, for the L1𝐿1L1italic_L 1 penalty, reads as a simple soft thresholding, so that our resulting scheme; combining both steps, is

𝐂~s=updateη(𝐂s1,(𝐂s1|𝐲1:T)),𝐂s=𝒯ηλ(𝐂~s),formulae-sequencesubscript~𝐂𝑠subscriptupdate𝜂subscript𝐂𝑠1conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇subscript𝐂𝑠subscript𝒯𝜂𝜆subscript~𝐂𝑠\displaystyle\begin{split}\widetilde{{\mathbf{C}}}_{s}&=\mathrm{update}_{\eta}% ({\mathbf{C}}_{s-1},\nabla\ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T})),\\ {{\mathbf{C}}}_{s}&=\mathcal{T}_{\eta\cdot\lambda}(\widetilde{{\mathbf{C}}}_{s% }),\end{split}start_ROW start_CELL over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_CELL start_CELL = roman_update start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT , ∇ roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ) , end_CELL end_ROW start_ROW start_CELL bold_C start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_CELL start_CELL = caligraphic_T start_POSTSUBSCRIPT italic_η ⋅ italic_λ end_POSTSUBSCRIPT ( over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) , end_CELL end_ROW (15)

where the soft-thresholding operator,

𝒯α(x)=max(|x|α,0)sgn(x),subscript𝒯𝛼𝑥𝑥𝛼0sgn𝑥\mathcal{T}_{\alpha}(x)=\max(|x|-\alpha,0)\cdot\mathrm{sgn}(x),caligraphic_T start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ( italic_x ) = roman_max ( | italic_x | - italic_α , 0 ) ⋅ roman_sgn ( italic_x ) ,

is applied element-wise. The above algorithm belongs to the class of stochastic proximal gradient methods, the convergence of which is well studied, for instance in [32, 33].

Note that automatic differentiation can also be used when applying the L1𝐿1L1italic_L 1 penalty, as the L1𝐿1L1italic_L 1 penalty admits a subgradient [34, 35]. However, the proximal operator update is more computationally efficient, and more stable. The proximal operator is also more versatile, as it allows using other penalties such as low rank, or L0𝐿0L0italic_L 0 penalty, which cannot be applied by the subgradient method [36].

III-E S-GraphGrad algorithm

We have now all the elements for solving (12). We first start with a full batch implementation, in Alg. 4, which we call S-GraphGrad, and which operates on the entire series of observations at once.

Algorithm 4 Series GraphGrad algorithm (S-GraphGrad)    
1:Input: Series of observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, number of steps S𝑆Sitalic_S, penalty parameter λ𝜆\lambdaitalic_λ, Nx×Msuperscriptsubscript𝑁𝑥𝑀\mathbb{N}^{N_{x}\times M}blackboard_N start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT degree matrix 𝐃𝐃{\mathbf{D}}bold_D, initial coefficient value 𝐂0subscript𝐂0{\mathbf{C}}_{0}bold_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, learning rate η𝜂\etaitalic_η.
2:Output: Sparse Nx×Msuperscriptsubscript𝑁𝑥𝑀\mathbb{R}^{N_{x}\times M}blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT matrix 𝐂𝐂{\mathbf{C}}bold_C of polynomial coefficients.
3:for s=1,,S𝑠1𝑆s=1,\dots,Sitalic_s = 1 , … , italic_S do
4:    Run Alg. 2 with p(𝐱t|𝐱t1;𝐂s1,𝐃)𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐂𝑠1𝐃p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};{\mathbf{C}}_{s-1},{\mathbf{D}})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT , bold_D ) and observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT.
5:    Estimate (𝐂s1|𝐲1:T)conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇\ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T})roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) and (𝐂s1|𝐲1:T)conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇\nabla\ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T})∇ roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) via Eq. (4) and backpropagation.
6:    Set 𝐂~s=updateη(𝐂s1,(𝐂s1|𝐲1:T))subscript~𝐂𝑠subscriptupdate𝜂subscript𝐂𝑠1conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇\widetilde{{\mathbf{C}}}_{s}=\mathrm{update}_{\eta}({\mathbf{C}}_{s-1},\nabla% \ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T}))over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = roman_update start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT , ∇ roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) ).
7:    Set 𝐂s=𝒯ηλ(𝐂~s)subscript𝐂𝑠subscript𝒯𝜂𝜆subscript~𝐂𝑠{{\mathbf{C}}}_{s}=\mathcal{T}_{\eta\cdot\lambda}(\widetilde{{\mathbf{C}}}_{s})bold_C start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = caligraphic_T start_POSTSUBSCRIPT italic_η ⋅ italic_λ end_POSTSUBSCRIPT ( over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ).
8:end for
9:Output 𝐂=𝐂S𝐂subscript𝐂𝑆{\mathbf{C}}={\mathbf{C}}_{S}bold_C = bold_C start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT.

III-E1 S-GraphGrad description

S-GraphGrad takes as input the series of observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, the number of steps S𝑆Sitalic_S, the penalty parameter λ𝜆\lambdaitalic_λ, the 0Nx×Msuperscriptsubscript0subscript𝑁𝑥𝑀\mathbb{N}_{0}^{N_{x}\times M}blackboard_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT degree matrix 𝐃𝐃{\mathbf{D}}bold_D, the initial coefficient value 𝐂0subscript𝐂0{\mathbf{C}}_{0}bold_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and the learning rate η𝜂\etaitalic_η, producing as output a sparse Nx×Msuperscriptsubscript𝑁𝑥𝑀\mathbb{R}^{N_{x}\times M}blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT matrix 𝐂𝐂{\mathbf{C}}bold_C of polynomial coefficients. S-GraphGrad iterates for S𝑆Sitalic_S steps, with the s𝑠sitalic_s-th step proceeding as follows.

First, we run a differentiable particle filter with the estimate of the coefficients from the previous step 𝐂s1subscript𝐂𝑠1{\mathbf{C}}_{s-1}bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT with observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT (line 4). When running the filter, we assume that we either know or have suitable estimates of the observation model p(𝐲t|𝐱t)𝑝conditionalsubscript𝐲𝑡subscript𝐱𝑡p({\mathbf{y}}_{t}|{\mathbf{x}}_{t})italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and the proposal distribution π(𝐱t|𝐱t1,𝐲t)𝜋conditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐲𝑡\pi({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1},{\mathbf{y}}_{t})italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), as well as the state noise. For example, if the noise is additive and Gaussian, we have Eq. (6), and would require a suitable estimate of 𝚺vsubscript𝚺𝑣\bm{\Sigma}_{v}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT. Other noises can be accounted for, such as multiplicative Gaussian, assuming a definition of how the estimated polynomial model and the noise interact to compute p(𝐱t|𝐱t1;𝐂,𝐃)𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝐂𝐃p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};{\mathbf{C}},{\mathbf{D}})italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_C , bold_D ).

While running the filter, we process the weights to obtain an estimate of the likelihood of the parameter 𝐂s1subscript𝐂𝑠1{\mathbf{C}}_{s-1}bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT using Eq. (4), and use automatic differentiation to obtain an estimate of (𝐂s1|𝐲1:T)conditionalsuperscript𝐂𝑠1subscript𝐲:1𝑇\nabla\ell({\mathbf{C}}^{s-1}|{\mathbf{y}}_{1:T})∇ roman_ℓ ( bold_C start_POSTSUPERSCRIPT italic_s - 1 end_POSTSUPERSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) (line 5). We then apply a gradient-based update step, on the negative log-likelihood, of a given minimisation scheme updateupdate\mathrm{update}roman_update with learning rate η𝜂\etaitalic_η to 𝐂s1subscript𝐂𝑠1{\mathbf{C}}_{s-1}bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT, yielding 𝐂~ssubscript~𝐂𝑠\widetilde{{\mathbf{C}}}_{s}over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. Namely, the gradients (𝐂s1|𝐲1:T)conditionalsubscript𝐂𝑠1subscript𝐲:1𝑇\nabla\ell({\mathbf{C}}_{s-1}|{\mathbf{y}}_{1:T})∇ roman_ℓ ( bold_C start_POSTSUBSCRIPT italic_s - 1 end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT ) are those of the negative log-likelihood, as we are aiming to maximise the log-likelihood, and hence we minimise the negative log-likelihood using the minimisation scheme updateupdate\mathrm{update}roman_update (line 6).

We then apply the proximal update at 𝐂~ssubscript~𝐂𝑠\widetilde{{\mathbf{C}}}_{s}over~ start_ARG bold_C end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, yielding 𝐂ssubscript𝐂𝑠{{\mathbf{C}}}_{s}bold_C start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, the estimated coefficient matrix at step s𝑠sitalic_s (line 7). In the case of the L1𝐿1L1italic_L 1 penalty the proximity operator is the soft thresholding operator, depending on the learning rate of η𝜂\etaitalic_η and a penalty parameter of λ𝜆\lambdaitalic_λ.

III-E2 Combating likelihood degeneracy

In practice, using S-GraphGrad is hampered by likelihood degeneracy, which causes the likelihood to evaluatenumerically as zero due to the limited precision of floating point arithmetic. This could result in the gradient vanishing, and hence numerical failure of the scheme.

A typical way to combat this is by using log weights, different floating point representations, or gradient scaling, but these do not address the core problem, which is that the likelihood becomes more concentrated the longer the observations series is. We will now discuss this phenomenon, and how we mitigate it.

For observation series longer than a few elements, the likelihood function, given in Eq. (4), when evaluated with parameters that do not yield dynamics close to the true dynamics, is numerically 00. Numerical zeros are a limitation of computer arithmetic, and in this case result from the multiplication of the very low likelihoods that occur in an unadapted filter to yield a result that is numerically zero. Note that, if operating on a log scale, we instead obtain -\infty- ∞ rather than 0, but the root cause is the same. Note that even when the dynamics are perfectly known we still observe this phenomenon, and here we aim to mitigate the effect of unadapted parameters, not the fundamental issue of likelihood accumulation in particle filters.

There are several standard ways to mitigate the effects of likelihood concentration within state-space models, such as variance inflation, where we increase the magnitude of the state and observation covariances to decrease likelihood concentration, or simply running the particle filter with a larger number of particles. However, both of these methods break down as the length of the observation series increases, as they do not address the underlying issue of the parameters not being adapted to the observed data.

The likelihood degenerating causes the gradients of the log-likelihood to explode, and hence makes it impossible to perform gradient-based inference. We can address this issue by running our method multiple times, first to fit a coarse estimate, and then refining this estimate over several subsequent iterations. We fit this coarse estimate using only the first few observations, and then gradually introduce more observations, refining our estimate and mitigating likelihood degeneracy due to unadapted parameters.

III-F B-GraphGrad algorithm

We end this section by presenting B-GraphGrad, our final method for estimating the dynamics of a general non-linear state-space model via a polynomial approximation. This method builds upon that described Alg. 4 (S-GraphGrad) in Sec. III-E, proposing a sequentially-batched implementation of the method, utilising an observation batching strategy to mitigate likelihood concentration issues.

The proposed method is doubly iterative: we iterate over telescoping batches of observations, within which we iteratively estimate the coefficients of our polynomial approximation. We create B𝐵Bitalic_B batches of observations by 𝐲(b):=𝐲1:bT/Bassignsuperscript𝐲𝑏subscript𝐲:1𝑏𝑇𝐵{\mathbf{y}}^{(b)}:={\mathbf{y}}_{1:\lceil bT/B\rceil}bold_y start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT := bold_y start_POSTSUBSCRIPT 1 : ⌈ italic_b italic_T / italic_B ⌉ end_POSTSUBSCRIPT for b=1,,B𝑏1𝐵b=1,\dots,Bitalic_b = 1 , … , italic_B. Note that these batches are of increasing size, with 𝐲(b)𝐲(b+1)superscript𝐲𝑏superscript𝐲𝑏1{\mathbf{y}}^{(b)}\subseteq{\mathbf{y}}^{(b+1)}bold_y start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ⊆ bold_y start_POSTSUPERSCRIPT ( italic_b + 1 ) end_POSTSUPERSCRIPT for b=1,,B𝑏1𝐵b=1,\dots,Bitalic_b = 1 , … , italic_B. Within each batch of observations, we perform S𝑆Sitalic_S runs of Alg. 4 (S-GraphGrad). We perform batching to avoid numerical errors, as the first sampled trajectories, with parameters close the the initial random initialisation, will likely have an extremely small log-likelihood, which compounds numerical errors when computing the weights in Alg. 2 [9]. Note that, in the case that the true system can represented as a polynomial, few batches may be needed. Indeed, in the case of the Lorenz 63 oscillator, we require only a batch of 10101010 observations to initialise the coefficients, and can then proceed with estimation on series of lengths exceeding 1000100010001000. However, in general the true system is not polynomial, so we proceed with fixed-size batches for simplicity and robustness. We present the method in Alg. 5 (B-GraphGrad), and describe it below.

Algorithm 5 Batched GraphGrad algorithm (B-GraphGrad)    
1:Input: Series of observations 𝐲1:Tsubscript𝐲:1𝑇{\mathbf{y}}_{1:T}bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT, number of batches B𝐵Bitalic_B, steps per batch S𝑆Sitalic_S, penalty parameter λ𝜆\lambdaitalic_λ, learning rate η𝜂\etaitalic_η, maximal degree d𝑑ditalic_d, hidden state size Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.
2:Output: Sparse Nx×Msuperscriptsubscript𝑁𝑥𝑀\mathbb{R}^{N_{x}\times M}blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT matrix 𝐂𝐂{\mathbf{C}}bold_C of polynomial coefficients.
3:Construct 𝐃𝐃{\mathbf{D}}bold_D by running Alg. 3.
4:Randomly initialise 𝐂0Nx×Msubscript𝐂0superscriptsubscript𝑁𝑥𝑀{\mathbf{C}}_{0}\in\mathbb{R}^{N_{x}\times M}bold_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT element-wise by sampling a 𝒰(1,1)𝒰11\mathcal{U}(-1,1)caligraphic_U ( - 1 , 1 ) distribution.
5:for b=1,,B𝑏1𝐵b=1,\dots,Bitalic_b = 1 , … , italic_B do
6:    Set 𝐲(b):=𝐲1:bT/Bassignsuperscript𝐲𝑏subscript𝐲:1𝑏𝑇𝐵{\mathbf{y}}^{(b)}:={\mathbf{y}}_{1:\lceil bT/B\rceil}bold_y start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT := bold_y start_POSTSUBSCRIPT 1 : ⌈ italic_b italic_T / italic_B ⌉ end_POSTSUBSCRIPT.
7:    Run Alg. 4 (S-GraphGrad) with observations 𝐲(b)superscript𝐲𝑏{\mathbf{y}}^{(b)}bold_y start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT, number of steps S𝑆Sitalic_S, penalty parameter λ𝜆\lambdaitalic_λ, degree matrix 𝐃𝐃{\mathbf{D}}bold_D, initial coefficient value 𝐂b1subscript𝐂𝑏1{\mathbf{C}}_{b-1}bold_C start_POSTSUBSCRIPT italic_b - 1 end_POSTSUBSCRIPT, and learning rate η𝜂\etaitalic_η, setting 𝐂bsubscript𝐂𝑏{\mathbf{C}}_{b}bold_C start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT to the output.
8:end for
9:Output 𝐂:=𝐂Bassign𝐂subscript𝐂𝐵{\mathbf{C}}:={\mathbf{C}}_{B}bold_C := bold_C start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT

III-F1 B-GraphGrad description

As stated in Sec. III-F, B-GraphGrad builds upon S-GraphGrad, implementing observation batching to mitigate likelihood generation for unadapted models. In particular, B-GraphGrad runs for B𝐵Bitalic_B iterations, with the b𝑏bitalic_b-th iteration running S-GraphGrad with S𝑆Sitalic_S iterations on the observation series 𝐲(b):=𝐲1:bT/B𝐲1:Tassignsuperscript𝐲𝑏subscript𝐲:1𝑏𝑇𝐵subscript𝐲:1𝑇{\mathbf{y}}^{(b)}:={\mathbf{y}}_{1:\lceil bT/B\rceil}\subseteq{\mathbf{y}}_{1% :T}bold_y start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT := bold_y start_POSTSUBSCRIPT 1 : ⌈ italic_b italic_T / italic_B ⌉ end_POSTSUBSCRIPT ⊆ bold_y start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT. This allows us to ‘warm up’ the coefficient parameter 𝐂𝐂{\mathbf{C}}bold_C, so that when we are performing estimation on the entire series we do not encounter issues due to likelihood concentration. We avoid these issues by learning 𝐂𝐂{\mathbf{C}}bold_C first on small series, which have relatively dispersed likelihoods due to their length. By learning an initial estimate of the transition dynamics on the initial batches, we have a better model when we come to estimating 𝐂𝐂{\mathbf{C}}bold_C on the longer series that can display likelihood issues when the model is not adapted.

B-GraphGrad proceeds as follows. First, we construct the degree matrix 𝐃𝐃{\mathbf{D}}bold_D given the selected maximum degree d𝑑ditalic_d and hidden state dimension Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. 𝐃𝐃{\mathbf{D}}bold_D is constructed following Alg. 3 described in Sec. III-A1. We note that 𝐃𝐃{\mathbf{D}}bold_D is a static parameter, and is not learnt through our training. We then generate an initial value for 𝐂𝐂{\mathbf{C}}bold_C, denoted 𝐂0subscript𝐂0{\mathbf{C}}_{0}bold_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. We can choose this value randomly, as we avoid likelihood related issues through our batching procedure. In Alg. 5 (B-GraphGrad) we draw 𝐂0subscript𝐂0{\mathbf{C}}_{0}bold_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT element-wise by sampling a uniform 𝒰(1,1)𝒰11\mathcal{U}(-1,1)caligraphic_U ( - 1 , 1 ) distribution, however in principle any real valued distribution could be used, such as a standard normal distribution. After initialising 𝐂𝐂{\mathbf{C}}bold_C, we generate the B𝐵Bitalic_B observation batches. We then iterate over the B𝐵Bitalic_B batches of observations. For batch b𝑏bitalic_b, we run Alg. 4 (S-GraphGrad) with the parameter estimate from the previous batch, 𝐂b1subscript𝐂𝑏1{\mathbf{C}}_{b-1}bold_C start_POSTSUBSCRIPT italic_b - 1 end_POSTSUBSCRIPT, and label the resulting updated parameter by 𝐂bsubscript𝐂𝑏{\mathbf{C}}_{b}bold_C start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT. The B𝐵Bitalic_B-th batch is the final batch and trains using the entire series of observations, outputting the final value 𝐂Bsubscript𝐂𝐵{\mathbf{C}}_{B}bold_C start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, which is learnt so as to optimise our objective function Eq. (12).

IV Discussion

We now discuss our algorithm, B-GraphGrad, as given in Alg. 5, and provide some potential extensions and modifications to particular systems.

IV-A GraphGrad prerequisites

In order to apply GraphGrad, we must either know or have estimates of the observation model and its noise process, and must know the form of the state noise. The observation model are typically assumed known in dynamical systems, such as by the intrinsic properties of the sensors used, or are estimated from previous studies. We require that the likelihood of the observation is differentiable with respect to 𝐂𝐂{\mathbf{C}}bold_C, as otherwise we cannot apply the differentiable particle filter. If we use a proposal distribution π(𝐱t|𝐱t1,𝐲t)p(𝐱t|𝐱t1)=p(𝐱t|𝐱t1;𝐂,𝐃)𝜋conditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐲𝑡𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝑝conditionalsubscript𝐱𝑡subscript𝐱𝑡1𝐂𝐃\pi({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1},{\mathbf{y}}_{t})\neq p({\mathbf{x}}_{% t}|{\mathbf{x}}_{t-1})=p({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1};{\mathbf{C}},{% \mathbf{D}})italic_π ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≠ italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) = italic_p ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_C , bold_D ), then we must be able to sample from this distribution differentiably with respect with 𝐂𝐂{\mathbf{C}}bold_C, and it must admit a differentiable likelihood with respect with 𝐂𝐂{\mathbf{C}}bold_C.

Furthermore, we assume that the state noise is such that we can sample it in a differentiable manner. An example of such a noise is the Gaussian distribution, from which we can generate sample differentiably with regard to the distribution parameters using the reparameterisation trick [37]. Many distributions can be sampled differentiably with regard to their parameters using similar tricks, such as the multivariate t distribution (with known degrees of freedom), which could be used if heavier tails are required.

IV-B Computational cost

B-GraphGrad, as given in Alg. 5, requires SB𝑆𝐵SBitalic_S italic_B evaluations of the particle filter to obtain the negative log-likelihood and its gradient. The particle filter is of time complexity 𝒪(KT),𝒪𝐾𝑇\mathcal{O}(KT),caligraphic_O ( italic_K italic_T ) , and therefore our method is of complexity 𝒪(SBKT)𝒪𝑆𝐵𝐾𝑇\mathcal{O}(SBKT)caligraphic_O ( italic_S italic_B italic_K italic_T ). Indeed, we evaluate the particle filter SB𝑆𝐵SBitalic_S italic_B times, and obtain the gradients via reverse-mode automatic differentiation, which is of the same time complexity as the function we differentiate. Note that we neglect here the complexity cost of sampling distributions and evaluating the likelihoods in the particle filter, as these vary from model to model, and occur the same number of times across filter runs.

The cost of evaluating the polynomial in Eq. (7) is small, and of complexity 𝒪(MNx2)𝒪𝑀superscriptsubscript𝑁𝑥2\mathcal{O}(MN_{x}^{2})caligraphic_O ( italic_M italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). We note that the polynomial evaluation can be efficiently parallelised as 𝐃𝐃{\mathbf{D}}bold_D is fixed, so it is possible to evaluate the polynomial by performing an associative scan over evaluating the monomials. That is, we can evaluate the xjDi,jsuperscriptsubscript𝑥𝑗subscript𝐷𝑖𝑗x_{j}^{D_{i,j}}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT terms in parallel, and then evaluate their product and sum in parallel. Furthermore, the vector-scalar element product is typically automatically abstracted to an SIMD operation, speeding evaluation up by a factor of Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.

Finally, we note that the computation of the particle filter and its gradient can be accelerated by observing that the computations in the differentiable particle filter depend on the previous state only through the particles and the weights. Under this restriction, we can implement the particle filter as a scan-with-carry. Therefore, it is possible to construct the computational graph of the particle filter, and therefore of its gradient, from the computational graph of the scanned function, yielding a much smaller graph than from the entire filter [34].

IV-C Exploiting parallelisation to decrease runtime

B-GraphGrad, if implemented following Alg. 5, is a sequential algorithm. Sequential operation is required as the estimate at each step depends on the estimate at the previous step. However, we can use parallel computing to reduce the elapsed time of the computation by computing fewer batches.

In particular, the batching proposed in B-GraphGrad is very conservative, in that the batch size increases by the same increment in all instances. In practice, as the approximand uses few parameters, with each parameter having a distinct effect, we rapidly adapt to the system within the first few batches.

We hence propose a more efficient procedure. We initialise P𝑃Pitalic_P independent coefficient matrices, {𝐂(p)}p=1Psuperscriptsubscriptsuperscript𝐂𝑝𝑝1𝑃\{{\mathbf{C}}^{(p)}\}_{p=1}^{P}{ bold_C start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_p = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT, and learn each independently in parallel for BIsubscript𝐵𝐼B_{I}italic_B start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT batches of observations. Then, we check if there exists a subset of the P𝑃Pitalic_P coefficient matrices such that the coefficient matrices are close in value, for example by attempting to find 𝐂Nx×M𝐂superscriptsubscript𝑁𝑥𝑀{\mathbf{C}}\in\mathbb{R}^{N_{x}\times M}bold_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT such that 𝐂𝐂(p)<ϵp{1,,P}norm𝐂superscript𝐂𝑝italic-ϵfor-all𝑝1𝑃||{\mathbf{C}}-{\mathbf{C}}^{(p)}||<\epsilon\ \forall p\in\{1,\dots,P\}| | bold_C - bold_C start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT | | < italic_ϵ ∀ italic_p ∈ { 1 , … , italic_P } for some matrix norm ||||||\cdot||| | ⋅ | | and ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0. If this matrix exists, then this indicates that the coefficient matrices have adapted to the system. We then cease batching, and learn on the entire series of observations using the coefficient matrices in our subset, aggregating after finishing optimisation, e.g., via an element-wise mean. If the subset cannot be constructed, we continue learning for another batch of observations, and then recheck the above condition, stopping when either we exhaust the set of observations or the subset of matrices can be constructed.

This batching method can take advantage of parallel processing by performing optimisation on each independent parameter in parallel, with the subset construction and matrix calculations being cheap in comparison. This accelerates our inference by reducing the number of batches that need to be constructed, therefore reducing the run time of the algorithm.

V Numerical study

We here present our experimental results, to illustrate and discuss the performance of our proposed approach. We are interested in the recovery of the underlying model in dynamical systems described by polynomial ordinary differential equations (ODEs), namely the Lorenz 63 (Sec. V-B) and Lorenz 96 (Sec. V-C) oscillators, and estimating a non-polynomial system, the Kuramoto oscillator (Sec. V-D). In each case, we use an Euler discretisation to build a discretised form for the model, and we map it with our problem formulation where the goal is to recover an estimate of a ground truth matrix 𝐂𝐂{\mathbf{C}}bold_C.

We implement the proposed B-GraphGrad algorithm, and compare it against the fitting of a polynomial of the same degree using a maximum likelihood scheme, which we denote pMLE, for polynomial maximum likelihood estimator. This scheme is identical to our B-GraphGrad scheme, except that we remove the proximal steps. pMLE fits a fully dense model, so in its case we present only RMSE, as the sparsity metrics are predetermined (i.e., no sparsity is recovered).

V-A Experimental setup

For our numerical experiments, we use the following settings, unless stated otherwise. We use the Novograd optimisation scheme [25, 30], for update in Alg. 4 line 6, with a fixed learning rate of η=103𝜂superscript103\eta=10^{-3}italic_η = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. We split our observation series into B𝐵Bitalic_B batches such that B=T/10𝐵𝑇10B=\lceil T/10\rceilitalic_B = ⌈ italic_T / 10 ⌉, therefore giving a batch size increment of approximately 10101010. We use K=100𝐾100K=100italic_K = 100 particles in our particle filter in Alg. 2. T𝑇Titalic_T denotes the length of the observation series. For a given polynomial degree d𝑑ditalic_d, we construct the degree matrix 𝐃𝐃{\mathbf{D}}bold_D following Sec. III-A1.

Performance assessment is performed using several quantitative metrics, either based on the recovery of the sparse support of 𝐂𝐂{\mathbf{C}}bold_C (in terms of specificity, precision, recall, and F1 score), or of its entries (in terms of root mean square error, RMSE). We define an element of 𝐂𝐂{\mathbf{C}}bold_C as numerically zero if it is lower than 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT in absolute value, as this is approximately the precision of single precision floating point arithmetic. Perfect recovery of positive and negative values is indicated by 1.01.01.01.0 in all sparsity metrics, and 0.00.00.00.0 RMSE.

We choose the penalty parameter λ𝜆\lambdaitalic_λ for a system with state dimension Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and maximal degree d𝑑ditalic_d by tuning it on a synthetic system. This system is not used to generate the data for fitting, and is only used to tune 𝝀𝝀\bm{\lambda}bold_italic_λ. This system is such that it has the same Nxsubscript𝑁𝑥N_{x}italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT dimension state and a maximal degree of d𝑑ditalic_d as the model we are fitting. We generate the 𝐂𝐂{\mathbf{C}}bold_C matrix of this system such that it is 75%percent7575\%75 % sparse, with the dense elements drawn from U(Nx,Nx)Usubscript𝑁𝑥subscript𝑁𝑥\mathrm{U}(-N_{x},N_{x})roman_U ( - italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ), and then scaled such that the maximal singular value of 𝐂𝐂{\mathbf{C}}bold_C is 1111. We discretise this system using an Euler discretisation with a timestep ΔtΔ𝑡\Delta troman_Δ italic_t equal to the timestep of the system we aim to estimate, adding zero mean Gaussian noise terms 𝐯𝐯{\mathbf{v}}bold_v and 𝐫𝐫{\mathbf{r}}bold_r to the state and observations respectively, with 𝚺v=𝚺r=Δt𝐈𝐝Nxsubscript𝚺𝑣subscript𝚺𝑟Δ𝑡subscript𝐈𝐝subscript𝑁𝑥\bm{\Sigma}_{v}=\bm{\Sigma}_{r}=\Delta t\mathbf{Id}_{N_{x}}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = roman_Δ italic_t bold_Id start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT. We then optimise λ𝜆\lambdaitalic_λ via setting λ=10l𝜆superscript10𝑙\lambda=10^{l}italic_λ = 10 start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, with l𝑙litalic_l chosen to maximise the accuracy of the estimated 𝐂𝐂{\mathbf{C}}bold_C of this system using B-GraphGrad, via 10101010 iterations of bisection over the interval [5,2]52[-5,2][ - 5 , 2 ].

V-B Lorenz 63 model

We start our experiments, with the Lorenz 63 system [13], which we transform into an NLSSM using an Euler discretisation with a timestep of Δt=0.025Δ𝑡0.025\Delta t=0.025roman_Δ italic_t = 0.025, yielding the system

x1,t+1subscript𝑥1𝑡1\displaystyle x_{1,t+1}italic_x start_POSTSUBSCRIPT 1 , italic_t + 1 end_POSTSUBSCRIPT =x1,t+Δt(σ(x2,tx1,t))+Δtv1,t+1,absentsubscript𝑥1𝑡Δ𝑡𝜎subscript𝑥2𝑡subscript𝑥1𝑡Δ𝑡subscript𝑣1𝑡1\displaystyle=x_{1,t}+\Delta t(\sigma(x_{2,t}-x_{1,t}))+\sqrt{\Delta t}v_{1,t+% 1},= italic_x start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_σ ( italic_x start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT ) ) + square-root start_ARG roman_Δ italic_t end_ARG italic_v start_POSTSUBSCRIPT 1 , italic_t + 1 end_POSTSUBSCRIPT , (16)
x2,t+1subscript𝑥2𝑡1\displaystyle x_{2,t+1}italic_x start_POSTSUBSCRIPT 2 , italic_t + 1 end_POSTSUBSCRIPT =x2,t+Δt(x1,t(ρx3,t)x2,t)+Δtv2,t+1,absentsubscript𝑥2𝑡Δ𝑡subscript𝑥1𝑡𝜌subscript𝑥3𝑡subscript𝑥2𝑡Δ𝑡subscript𝑣2𝑡1\displaystyle=x_{2,t}+\Delta t(x_{1,t}(\rho-x_{3,t})-x_{2,t})+\sqrt{\Delta t}v% _{2,t+1},= italic_x start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_x start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT ( italic_ρ - italic_x start_POSTSUBSCRIPT 3 , italic_t end_POSTSUBSCRIPT ) - italic_x start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT ) + square-root start_ARG roman_Δ italic_t end_ARG italic_v start_POSTSUBSCRIPT 2 , italic_t + 1 end_POSTSUBSCRIPT ,
x3,t+1subscript𝑥3𝑡1\displaystyle x_{3,t+1}italic_x start_POSTSUBSCRIPT 3 , italic_t + 1 end_POSTSUBSCRIPT =x3,t+Δt(x1,tx2,tβx3,t)+Δtv3,t+1,absentsubscript𝑥3𝑡Δ𝑡subscript𝑥1𝑡subscript𝑥2𝑡𝛽subscript𝑥3𝑡Δ𝑡subscript𝑣3𝑡1\displaystyle=x_{3,t}+\Delta t(x_{1,t}x_{2,t}-\beta x_{3,t})+\sqrt{\Delta t}v_% {3,t+1},= italic_x start_POSTSUBSCRIPT 3 , italic_t end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_x start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT - italic_β italic_x start_POSTSUBSCRIPT 3 , italic_t end_POSTSUBSCRIPT ) + square-root start_ARG roman_Δ italic_t end_ARG italic_v start_POSTSUBSCRIPT 3 , italic_t + 1 end_POSTSUBSCRIPT ,

with the observation model p(𝐲t|𝐱t)=𝒩(𝐱t,𝚺r)𝑝conditionalsubscript𝐲𝑡subscript𝐱𝑡𝒩subscript𝐱𝑡subscript𝚺𝑟p({\mathbf{y}}_{t}|{\mathbf{x}}_{t})=\mathcal{N}({\mathbf{x}}_{t},\bm{\Sigma}_% {r})italic_p ( bold_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = caligraphic_N ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ). Hence, Nx=Ny=3subscript𝑁𝑥subscript𝑁𝑦3N_{x}=N_{y}=3italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT = 3. We choose 𝐱0subscript𝐱0{\mathbf{x}}_{0}bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT such that 𝐱0subscript𝐱0{\mathbf{x}}_{0}bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is equal to 1111 in the first element, and 00 elsewhere. We set 𝚺v=𝚺r=σ2𝐈𝐝3subscript𝚺𝑣subscript𝚺𝑟superscript𝜎2subscript𝐈𝐝3\bm{\Sigma}_{v}=\bm{\Sigma}_{r}=\sigma^{2}\mathbf{Id}_{3}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_Id start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, with σ=1𝜎1\sigma=1italic_σ = 1 unless stated otherwise. Note that we present the true 𝐂𝐂{\mathbf{C}}bold_C and 𝐃𝐃{\mathbf{D}}bold_D matrices for this system in Sec. III-C We use ρ=28,σ=10,β=8/3formulae-sequence𝜌28formulae-sequence𝜎10𝛽83\rho=28,\,\sigma=10,\,\beta=8/3italic_ρ = 28 , italic_σ = 10 , italic_β = 8 / 3, as these parameters are known to result in a chaotic system, and we set t0=0subscript𝑡00t_{0}=0italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0. Our particle filter is initialised with p(𝐱0)=𝒩(𝐱0,𝐈𝐝3)𝑝subscript𝐱0𝒩subscript𝐱0subscript𝐈𝐝3p({\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{0},\mathbf{Id}_{3})italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_Id start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ).

We average the results, for our method GraphGrad, and its competitor pMLE, on 150150150150 independent realisations of the specified dynamical system. Hereafter we present our results, on three scenarios, namely assuming maximum degree d=2𝑑2d=2italic_d = 2 or 3333, and varying the series length, then considering a varying noise amplitude.

V-B1 Varying series length, maximum degree d=2𝑑2d=2italic_d = 2

Table I: Lorenz 63: Average recovery metrics for variable series length for 150 independent systems. Maximum polynomial degree d=2𝑑2d=2italic_d = 2.
method T𝑇Titalic_T RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) spec. recall prec. F1
B-GraphGrad 25252525 1.6 0.90 0.92 0.96 0.94
pMLE 25252525 2.4 - - - -
B-GraphGrad 50505050 1.0 0.98 0.97 0.98 0.98
pMLE 50505050 1.6 - - - -
B-GraphGrad 100100100100 0.4 1.00 1.00 1.00 1.00
pMLE 100100100100 0.8 - - - -
B-GraphGrad 200200200200 0.2 1.00 1.00 1.00 1.00
pMLE 200200200200 0.3 - - - -
2525252550505050100100100100200200200200001111222233334444103absentsuperscript103\cdot 10^{-3}⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPTSeries lengthRMSE
B-GraphGradpMLE
Figure 2: Comparison of B-GraphGrad with pMLE over variable series length on the Lorenz 63 oscillator, with maximum polynomial degree d=2𝑑2d=2italic_d = 2. Markers denote mean performance, with the ribbons being symmetric 95% intervals.
Table II: Lorenz 63: Median runtime relative to T=25𝑇25T=25italic_T = 25 for variable series length for 150 independent systems. Maximum polynomial degree d=2𝑑2d=2italic_d = 2.
method T𝑇Titalic_T Relative runtime Absolute runtime (ms)
B-GraphGrad 25252525 1 51
B-GraphGrad 50505050 3.86 197
B-GraphGrad 100100100100 15.64 768
B-GraphGrad 200200200200 62.25 3175

Table I presents the results for learning over a range of series lengths T𝑇Titalic_T with a maximal degree of d=2𝑑2d=2italic_d = 2. In this case, estimating 𝐂𝐂{\mathbf{C}}bold_C requires learning 30303030 parameters. Here, we assumed the knowledge of the correct degree of the underlying system, so these results serve to provide a baseline for the performance of our method in the scenario where the degree is known. We note that, for many dynamical systems, a default degree of 2222 is a sensible modelling choice, as this type of interaction in the rate is very common in many systems, such as in chemical and biological networks, population modelling, and meteorological systems. We observe that the performance of our method improves as the number of observations T𝑇Titalic_T increases, indicating that our method well incorporates information from new observations. Furthermore, we see that, even for a small number of observations, our method recovers the system connectivity well, even if the RMSE is not as good for low values of T𝑇Titalic_T. This is due to changes in the system connectivity and the presence or absence of system terms having a large effect, and thus being relatively easy to infer compared the the specific value of the terms, the effects of which are obscured by the noise inherent to the system. Overall, we observe excellent performance, with good recovery of all parameters at all tested series lengths. We see that pMLE does not perform well, mostly because it recovers a fully dense system, and therefore yields matrix 𝐂𝐂{\mathbf{C}}bold_C with many nonzero parameters that are, in truth, zero. Hence, the RMSE is poor, as the value of the truly nonzero parameters is affected by the value of the falsely nonzero parameters.

From the results given in Table II, we see that our runtime seems to scale quadratically in T𝑇Titalic_T. This follows from the linear complexity of the particle filter in T𝑇Titalic_T, and the number of batches B𝐵Bitalic_B also scaling linearly in T𝑇Titalic_T as per the experimental setup. Each batch runs a particle filter of a fixed length, with each batch running a longer filter than the previous, and therefore the algorithm runtime scales quadratically with T𝑇Titalic_T. However, one can use the batching method presented in the introduction of Sec. III-F to remove the quadratic scaling by fixing the number of batches, in which case the runtime scales linearly in T𝑇Titalic_T. We note that runtimes were gathered using desktop grade hardware, and that the implementation specifics of the algorithm can affect the runtime by several orders of magnitude, with memory management and buffer reuse having large impacts on performance. The runtime in Table II were obtained using JAX 0.4.33 [34], running on Ubuntu 24.04 LTS, with code executed on an Intel i7 9700K at 4.95GHz. We stress that the runtimes provided are a guide only, and that the implementation specifics matter as much as the hardware.

V-B2 Varying series length and maximum degree d=3𝑑3d=3italic_d = 3

Table III: Lorenz 63 average recovery metrics for variable series length for 150 independent systems. Maximum allowed degree d=3𝑑3d=3italic_d = 3.
method T𝑇Titalic_T RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) spec. recall prec. F1
B-GraphGrad 25252525 2.0 0.84 0.90 0.92 0.91
pMLE 25252525 2.8 - - - -
B-GraphGrad 50505050 1.5 0.96 0.95 0.97 0.96
pMLE 50505050 2.0 - - - -
B-GraphGrad 100100100100 0.7 0.97 0.98 0.97 0.97
pMLE 100100100100 1.3 - - - -
B-GraphGrad 200200200200 0.4 1.00 1.00 1.00 1.00
pMLE 200200200200 1.0 - - - -
2525252550505050100100100100200200200200001111222233334444103absentsuperscript103\cdot 10^{-3}⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPTSeries lengthRMSE
B-GraphGradpMLE
Figure 3: Comparison of B-GraphGrad with pMLE over variable series length on the Lorenz 63 oscillator, with maximum polynomial degree d=3𝑑3d=3italic_d = 3. Markers denote mean performance, with the ribbons being symmetric 95% intervals.

Table III presents the results of our GraphGrad method over a range of series lengths T𝑇Titalic_T with maximal degree of 3333. In this case estimating 𝐂𝐂{\mathbf{C}}bold_C requires estimating 60606060 parameters. A degree of 3333 is greater than the true degree of the system, which, as we recall, is not in general known beforehand. Systems with a degree of 3333 are uncommon, however it is possible to approximate non-polynomial systems, with higher degree allowing better approximation of the dynamics. For example, the Kuramoto oscillator [38] can be well approximated via a centralised Taylor approximation, which our method can learn directly.

We observe that the performance of our method improves as the number of observations increases, indicating that our method well incorporates information from new observations, and does not over fit given the over specification of the model relative to the system we are learning. We note a decrease in performance relative to Table I where the degree is that of the underlying system, however the results improve as T𝑇Titalic_T increases.

Furthermore, we note that the change in RMSE is more severe than the changes in sparsity metrics, as a small number of degree 3 terms are fit, where in reality they should be zero. As RMSE is computed only on terms recovered as non-zero, these contribution of these deviations is large relative to that of the degree 2 terms. We see that our method performs well, with the sparsity metrics being close to those of the d=2𝑑2d=2italic_d = 2 system. The RMSE is inferior to the d=2𝑑2d=2italic_d = 2 system, but still significantly outperforms the comparable polynomial MLE.

V-B3 Variable noise magnitude

Table IV: Lorenz 63: Average recovery metrics for variable noise magnitude for 150 independent systems. Maximum polynomial degree d=2𝑑2d=2italic_d = 2. T=50𝑇50T=50italic_T = 50.
method σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) spec. recall prec. F1
B-GraphGrad 0.010.010.010.01 0.09 1.00 1.00 1.00 1.00
pMLE 0.010.010.010.01 0.3 - - - -
B-GraphGrad 0.10.10.10.1 0.3 1.00 1.00 1.00 1.00
pMLE 0.10.10.10.1 0.6 - - - -
B-GraphGrad 1111 1.0 0.98 0.97 0.98 0.98
pMLE 1111 1.6 - - - -
B-GraphGrad 5555 1.8 0.90 0.85 0.87 0.86
pMLE 5555 2.3 - - - -
0.010.010.010.010.10.10.10.111115555000.50.50.50.511111.51.51.51.522222.52.52.52.5103absentsuperscript103\cdot 10^{-3}⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPTσ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPTRMSE
B-GraphGradpMLE
Figure 4: Comparison of B-GraphGrad with pMLE over variable noise magnitude on the Lorenz 63 oscillator, with maximum polynomial degree d=2𝑑2d=2italic_d = 2. Markers denote mean performance, with the ribbons being symmetric 95% intervals.

In Table IV, we present the results of our method over a range of noise magnitudes σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, now fixing the number of observations to T=50𝑇50T=50italic_T = 50. We remind that the results from Tables I and III, were obtained using σ2=1superscript𝜎21\sigma^{2}=1italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1. We observe that our method performs well for all tested values of σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, especially in sparsity metrics. As expected, the recovery quality degrades as the signal to noise ratio decreases when we increase σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. We note that a large σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT affects both the system and the observation, so increasing it has a particularly pronounced effect.

V-B4 Likelihood degeneracy

In Sec. III-E2, we discussed the effect of likelihood concentration on parameter estimation in the particle filter. We demonstrate this phenomenon below, giving the mean number of timesteps before the likelihood becomes numerically zero for a stochastic variant of the Lorenz 63 model, given by Eq. (9).

Table V: Average number of timesteps/observations (Δt=0.025Δ𝑡0.025\Delta t=0.025roman_Δ italic_t = 0.025) before likelihood degeneracy with a random 𝐂𝐂{\mathbf{C}}bold_C matrix for the Lorenz 63 model (d=2𝑑2d=2italic_d = 2) over 200 independent systems initialised at random points.
number of particles K𝐾Kitalic_K 5 10 20 50 100 250 500 1000
timesteps before degeneracy 7.8 12.2 15.7 23.8 35.3 57.5 76.8 111.3

It is clear from Table V that increasing the number of particles does combat the likelihood degeneracy, though it does not solve it, with computational cost increasing significantly for small gains. It is for this reason that we do not display the results of GraphGrad, instead displaying only B-GraphGrad, as GraphGrad fails to converge using single precision arithmetic due to these likelihood issues.

V-B5 Prox update compared to subgradient update

B-GraphGrad uses the proximal operator of the L1𝐿1L1italic_L 1 norm to apply a sparsity promoting penalty, following Sec. III-D2. However, as noted, we can also use subgradient methods to minimise our cost function including the L1𝐿1L1italic_L 1 penalty. Using standard convex analysis definition, a subgradient of the L1𝐿1L1italic_L 1 norm at an element x𝑥xitalic_x (i.e., an element of the subdifferential of L1 function, at x𝑥xitalic_x), is a vector of same dimension than x𝑥xitalic_x, with entries equal to sgn(x)sgn𝑥\mathrm{sgn}(x)roman_sgn ( italic_x ), using the convention sgn(0)=0sgn00\text{sgn}(0)=0sgn ( 0 ) = 0. Note that, numerically, we never encountered exactly zero entries when running the subgradient descent method, and subgradient at a 0superscript00^{-}0 start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT (resp. 0+superscript00^{+}0 start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT) entry is taken as 11-1- 1 (resp. +11+1+ 1). We use the same learning rate of η=103𝜂superscript103\eta=10^{-3}italic_η = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT than in the proximal version, and the same (B,S)𝐵𝑆(B,S)( italic_B , italic_S ) parameters for the number of inner/outer iterations. The computational cost differences within each step are negligible, however the proximal method is overall much more efficient, as the proximal version appears to converge faster.

We test both the subgradient approach and the proximal approach on the same system as Sec. V-B1. We present both the average absolute value of elements where the ground truth is 00, and the RMSE of the estimate. Parameters are set per Sec. V-B1.

Table VI: Lorenz 63: Average absolute value of elements of 𝐂𝐂{\mathbf{C}}bold_C where ground truth is 00 over 150 independent systems. Maximum polynomial degree d=2𝑑2d=2italic_d = 2.
method T𝑇Titalic_T Recall RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT)
B-GraphGrad (prox) 25252525 0.920.920.920.92 1.61.61.61.6
B-GraphGrad (subgrad) 25252525 0.760.760.760.76 2.02.02.02.0
B-GraphGrad (prox) 50505050 0.970.970.970.97 1.01.01.01.0
B-GraphGrad (subgrad) 50505050 0.800.800.800.80 1.41.41.41.4
B-GraphGrad (prox) 100100100100 1.001.001.001.00 0.40.40.40.4
B-GraphGrad (subgrad) 100100100100 0.910.910.910.91 0.60.60.60.6
B-GraphGrad (prox) 200200200200 1.001.001.001.00 0.20.20.20.2
B-GraphGrad (subgrad) 200200200200 0.930.930.930.93 0.40.40.40.4

The proximal operator method is more efficient per iteration, in particular, we obtain estimates closer to the ground truth in the same number of iterations as the subgradient method. Further, as the methods have negligable difference in computational cost, this illustrates the superiority of the proximal over the subgradient method for this problem.

V-C Lorenz 96 model

The Lorenz 63 system, while well studied and challenging to estimate, is only 3 dimensional. In order to test the applicability of our method to higher dimensional chaotic systems, we test on the Lorenz 96 system [14], which we transform into a NLSSM using an Euler discretisation, yielding the system

di,t+1subscript𝑑𝑖𝑡1\displaystyle d_{i,t+1}italic_d start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =xi1,t(xi+1,txi2,t)xi,t+F,absentsubscript𝑥𝑖1𝑡subscript𝑥𝑖1𝑡subscript𝑥𝑖2𝑡subscript𝑥𝑖𝑡𝐹\displaystyle=x_{i-1,t}(x_{i+1,t}-x_{i-2,t})-x_{i,t}+F,= italic_x start_POSTSUBSCRIPT italic_i - 1 , italic_t end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i + 1 , italic_t end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_i - 2 , italic_t end_POSTSUBSCRIPT ) - italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT + italic_F , (17)
xi,t+1subscript𝑥𝑖𝑡1\displaystyle x_{i,t+1}italic_x start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =xi,t+Δtdi,t+1+Δtvi,t+1,absentsubscript𝑥𝑖𝑡Δ𝑡subscript𝑑𝑖𝑡1Δ𝑡subscript𝑣𝑖𝑡1\displaystyle=x_{i,t}+\Delta t\cdot d_{i,t+1}+\sqrt{\Delta t}\cdot v_{i,t+1},= italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT + roman_Δ italic_t ⋅ italic_d start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT + square-root start_ARG roman_Δ italic_t end_ARG ⋅ italic_v start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT ,
yi,t+1subscript𝑦𝑖𝑡1\displaystyle y_{i,t+1}italic_y start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =xi,t+1+Δtri,t+1,absentsubscript𝑥𝑖𝑡1Δ𝑡subscript𝑟𝑖𝑡1\displaystyle=x_{i,t+1}+\sqrt{\Delta t}\cdot r_{i,t+1},= italic_x start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT + square-root start_ARG roman_Δ italic_t end_ARG ⋅ italic_r start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT ,

for i={1,,Nx},𝑖1subscript𝑁𝑥i=\{1,\dots,N_{x}\},italic_i = { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } , with 𝐯t𝒩(𝟎,𝚺v),𝐫t𝒩(𝟎,𝚺r)formulae-sequencesimilar-tosubscript𝐯𝑡𝒩0subscript𝚺𝑣similar-tosubscript𝐫𝑡𝒩0subscript𝚺𝑟{\mathbf{v}}_{t}\sim\mathcal{N}(\bm{0},\bm{\Sigma}_{v}),{\mathbf{r}}_{t}\sim% \mathcal{N}(\bm{0},\bm{\Sigma}_{r})bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) , bold_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ), where x,=xNxx_{,\cdot}=x_{N_{x}}italic_x start_POSTSUBSCRIPT , ⋅ end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT, x1,=xNx1subscript𝑥1subscript𝑥subscript𝑁𝑥1x_{-1,\cdot}=x_{N_{x}-1}italic_x start_POSTSUBSCRIPT - 1 , ⋅ end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT, and x2,=xNx2subscript𝑥2subscript𝑥subscript𝑁𝑥2x_{-2,\cdot}=x_{N_{x}-2}italic_x start_POSTSUBSCRIPT - 2 , ⋅ end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 2 end_POSTSUBSCRIPT. For this experiment we choose Nx=Ny=20subscript𝑁𝑥subscript𝑁𝑦20N_{x}=N_{y}=20italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT = 20. Therefore, in the case of d=2𝑑2d=2italic_d = 2, estimating 𝐂𝐂{\mathbf{C}}bold_C requires estimating 4620462046204620 parameters, and the case of d=3𝑑3d=3italic_d = 3 estimating 𝐂𝐂{\mathbf{C}}bold_C requires estimating 35420354203542035420 parameters. We choose a forcing constant of F=8𝐹8F=8italic_F = 8, which is known known to result in a chaotic system. The connectivity graph of this system, constructed following Sec. III-B, is given in Figure 5, where we observe that state i𝑖iitalic_i is affected by states i2,i1,i,𝑖2𝑖1𝑖i-2,i-1,i,italic_i - 2 , italic_i - 1 , italic_i , and i+1𝑖1i+1italic_i + 1, with index boundaries such that state Nxk+isubscript𝑁𝑥𝑘𝑖N_{x}\cdot k+iitalic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ⋅ italic_k + italic_i is equivalent to state i𝑖iitalic_i for any k𝑘k\in\mathbb{Z}italic_k ∈ blackboard_Z.

Refer to caption
Refer to caption
Figure 5: Graph encoding the connectivity of the Lorenz 96 system for Nx=20subscript𝑁𝑥20N_{x}=20italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 20. The left plot represents the true connectivity of the system under the graphical perspective described in Section III-B. The right plot represents the connectivity retrieved by the B-GraphGrad algorithm. The links in blue correspond to connections where the monomial is also correctly identified. The links in red correspond to connections where the monomial is incorrectly identified. In this example, the node connectivity was perfectly retrieved by B-GraphGrad, however in 6 out of the 80 links, the link was found but with an incorrect monomial (e.g., the self loop of node 16 was identified as quadratic while it should be a linear term).

We set 𝚺v=𝚺r=σ2𝐈𝐝Nxsubscript𝚺𝑣subscript𝚺𝑟superscript𝜎2subscript𝐈𝐝subscript𝑁𝑥\bm{\Sigma}_{v}=\bm{\Sigma}_{r}=\sigma^{2}\mathbf{Id}_{N_{x}}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_Id start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT, with σ=1𝜎1\sigma=1italic_σ = 1 unless stated otherwise. We discretise this system with a timestep of Δt=0.025Δ𝑡0.025\Delta t=0.025roman_Δ italic_t = 0.025. We choose t0=0subscript𝑡00t_{0}=0italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0, and fix 𝐱0subscript𝐱0{\mathbf{x}}_{0}bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT such that 𝐱0subscript𝐱0{\mathbf{x}}_{0}bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is equal to 1111 in the first element, and 00 elsewhere. Our particle filter is initialised with p(𝐱0)=𝒩(𝐱0,𝐈𝐝Nx)𝑝subscript𝐱0𝒩subscript𝐱0subscript𝐈𝐝subscript𝑁𝑥p({\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{0},\mathbf{Id}_{N_{x}})italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_Id start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). We note that the system is of polynomial form, and therefore can be exactly recovered by our model, assuming d𝑑ditalic_d equal or larger than two.

Table VII: Lorenz 96: Average recovery metrics for variable series length for 150 independent systems. Maximum polynomial degree d=2𝑑2d=2italic_d = 2.
method T𝑇Titalic_T RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) spec. recall prec. F1
B-GraphGrad 25252525 1.8 0.92 0.93 0.93 0.93
pMLE 25252525 2.6 - - - -
B-GraphGrad 50505050 1.4 0.96 0.95 0.94 0.95
pMLE 50505050 2.0 - - - -
B-GraphGrad 100100100100 0.8 0.99 0.97 0.98 0.98
pMLE 100100100100 1.3 - - - -
B-GraphGrad 200200200200 0.3 1.00 1.00 1.00 1.00
pMLE 200200200200 0.8 - - - -
2525252550505050100100100100200200200200001111222233334444103absentsuperscript103\cdot 10^{-3}⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPTSeries lengthRMSE
B-GraphGradpMLE
Figure 6: Comparison of B-GraphGrad with pMLE over variable series length on the Lorenz 96 oscillator, with maximum polynomial degree d=2𝑑2d=2italic_d = 2. Markers denote mean performance, with the ribbons being symmetric 95% intervals.
Table VIII: Lorenz 96: Average recovery metrics for variable series length for 150 independent systems. Maximum polynomial degree d=3𝑑3d=3italic_d = 3.
method T𝑇Titalic_T RMSE (103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT) spec. recall prec. F1
B-GraphGrad 25252525 2.4 0.80 0.92 0.73 0.82
pMLE 25252525 0.32 - - - -
B-GraphGrad 50505050 2.0 0.89 0.96 0.86 0.91
pMLE 50505050 2.5 - - - -
B-GraphGrad 100100100100 1.1 0.95 0.98 0.96 0.97
pMLE 100100100100 2.2 - - - -
B-GraphGrad 200200200200 0.5 1.00 1.00 1.00 1.00
pMLE 200200200200 1.6 - - - -
2525252550505050100100100100200200200200001111222233334444103absentsuperscript103\cdot 10^{-3}⋅ 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPTSeries lengthRMSE
B-GraphGradpMLE
Figure 7: Comparison of B-GraphGrad with pMLE over variable series length on the Lorenz 96 oscillator, with maximum polynomial degree d=3𝑑3d=3italic_d = 3. Markers denote mean performance, with the ribbons being symmetric 95% intervals.

We see that in both cases of d=2𝑑2d=2italic_d = 2 and d=3𝑑3d=3italic_d = 3, GraphGrad performs well, outperforming pMLE by a considerable margin. We see no significant deterioration of performance in GraphGrad in this setting, noting that we are estimating 4620462046204620 parameters in the d=2𝑑2d=2italic_d = 2 case, and 35420354203542035420 parameters in the d=3𝑑3d=3italic_d = 3 case.

V-D Kuramoto oscillator

The Kuramoto oscillator [38] is a mathematical model that describes the behaviour of a system of phase-coupled oscillators. The model is described, for all i{1,,Nx}𝑖1subscript𝑁𝑥i\in\{1,\dots,N_{x}\}italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT }, by

dϕidt=ωi+Nx1o=1NxKsin(ϕiϕo),dsubscriptitalic-ϕ𝑖d𝑡subscript𝜔𝑖superscriptsubscript𝑁𝑥1superscriptsubscript𝑜1subscript𝑁𝑥𝐾subscriptitalic-ϕ𝑖subscriptitalic-ϕ𝑜\displaystyle\begin{split}\frac{\mathrm{d}\phi_{i}}{\mathrm{d}t}=\omega_{i}+N_% {x}^{-1}\sum_{o=1}^{N_{x}}K\sin(\phi_{i}-\phi_{o}),\end{split}start_ROW start_CELL divide start_ARG roman_d italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG roman_d italic_t end_ARG = italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_o = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_K roman_sin ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_ϕ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ) , end_CELL end_ROW (18)

where ϕisubscriptitalic-ϕ𝑖\phi_{i}\in\mathbb{R}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R denotes the phase of the i𝑖iitalic_i-th oscillator, and K𝐾K\in\mathbb{R}italic_K ∈ blackboard_R is the coupling constant between oscillators. However, this does not restrict ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ, which will, in general, diverge to ±plus-or-minus\pm\infty± ∞ as t𝑡t\rightarrow\inftyitalic_t → ∞. To address this, we transform Eq. (18) by introducing derived parameters R𝑅Ritalic_R and ψ𝜓\psiitalic_ψ such that

Rexp(1ψ)=Nx1o=1Nxexp(1ϕo),(i{1,,Nx})dϕidt=ωi+KRsin(ψϕi),formulae-sequence𝑅1𝜓superscriptsubscript𝑁𝑥1superscriptsubscript𝑜1subscript𝑁𝑥1subscriptitalic-ϕ𝑜for-all𝑖1subscript𝑁𝑥dsubscriptitalic-ϕ𝑖d𝑡subscript𝜔𝑖𝐾𝑅𝜓subscriptitalic-ϕ𝑖\displaystyle\begin{split}R\exp(\sqrt{-1}\psi)&=N_{x}^{-1}\sum_{o=1}^{N_{x}}% \exp(\sqrt{-1}\phi_{o}),\\ (\forall i\in\{1,\ldots,N_{x}\})\;\frac{\mathrm{d}\phi_{i}}{\mathrm{d}t}&=% \omega_{i}+KR\sin(\psi-\phi_{i}),\end{split}start_ROW start_CELL italic_R roman_exp ( square-root start_ARG - 1 end_ARG italic_ψ ) end_CELL start_CELL = italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_o = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( square-root start_ARG - 1 end_ARG italic_ϕ start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ) , end_CELL end_ROW start_ROW start_CELL ( ∀ italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } ) divide start_ARG roman_d italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG roman_d italic_t end_ARG end_CELL start_CELL = italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_K italic_R roman_sin ( italic_ψ - italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , end_CELL end_ROW (19)

which restricts ϕ[π,π]Nxbold-italic-ϕsuperscript𝜋𝜋subscript𝑁𝑥\bm{\phi}\in[-\pi,\pi]^{N_{x}}bold_italic_ϕ ∈ [ - italic_π , italic_π ] start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. This is done purely for computational reasons, and does not affect the properties of the system or interpretation of ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ in any way. We transform Eq. (19) into a NLSSM using an Euler discretisation, yielding

Rexp(1ψ)𝑅1𝜓\displaystyle R\exp(\sqrt{-1}\psi)italic_R roman_exp ( square-root start_ARG - 1 end_ARG italic_ψ ) =Nx1o=1Nxexp(1xo,t),absentsuperscriptsubscript𝑁𝑥1superscriptsubscript𝑜1subscript𝑁𝑥1subscript𝑥𝑜𝑡\displaystyle=N_{x}^{-1}\sum_{o=1}^{N_{x}}\exp(\sqrt{-1}x_{o,t}),= italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_o = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( square-root start_ARG - 1 end_ARG italic_x start_POSTSUBSCRIPT italic_o , italic_t end_POSTSUBSCRIPT ) , (20)
di,t+1subscript𝑑𝑖𝑡1\displaystyle d_{i,t+1}italic_d start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =ωi+KRsin(ψxi),absentsubscript𝜔𝑖𝐾𝑅𝜓subscript𝑥𝑖\displaystyle=\omega_{i}+KR\sin(\psi-x_{i}),= italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_K italic_R roman_sin ( italic_ψ - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ,
xi,t+1subscript𝑥𝑖𝑡1\displaystyle x_{i,t+1}italic_x start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =xi,t+Δtdi,t+1+Δtvi,t+1,absentsubscript𝑥𝑖𝑡Δ𝑡subscript𝑑𝑖𝑡1Δ𝑡subscript𝑣𝑖𝑡1\displaystyle=x_{i,t}+\Delta td_{i,t+1}+\sqrt{\Delta t}v_{i,t+1},= italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT + roman_Δ italic_t italic_d start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT + square-root start_ARG roman_Δ italic_t end_ARG italic_v start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT ,
yi,t+1subscript𝑦𝑖𝑡1\displaystyle y_{i,t+1}italic_y start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT =xi,t+1+Δtri,t+1,absentsubscript𝑥𝑖𝑡1Δ𝑡subscript𝑟𝑖𝑡1\displaystyle=x_{i,t+1}+\sqrt{\Delta t}r_{i,t+1},= italic_x start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT + square-root start_ARG roman_Δ italic_t end_ARG italic_r start_POSTSUBSCRIPT italic_i , italic_t + 1 end_POSTSUBSCRIPT ,

for i{1,,Nx},𝑖1subscript𝑁𝑥i\in\{1,\dots,N_{x}\},italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT } ,where 𝐱𝐱{\mathbf{x}}bold_x denotes the phases in the discretised system to ease comparison with the rest of the work. We choose Nx=20subscript𝑁𝑥20N_{x}=20italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 20, and K=0.8𝐾0.8K=0.8italic_K = 0.8. We set 𝚺v=𝚺r=σ2𝐈𝐝Nxsubscript𝚺𝑣subscript𝚺𝑟superscript𝜎2subscript𝐈𝐝subscript𝑁𝑥\bm{\Sigma}_{v}=\bm{\Sigma}_{r}=\sigma^{2}\mathbf{Id}_{N_{x}}bold_Σ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = bold_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_Id start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT, with σ=0.1𝜎0.1\sigma=0.1italic_σ = 0.1. We discretise this model with a timestep of Δt=0.05Δ𝑡0.05\Delta t=0.05roman_Δ italic_t = 0.05. We sample ωi𝒩(0.5,0.52)similar-tosubscript𝜔𝑖𝒩0.5superscript0.52\omega_{i}\sim\mathcal{N}(0.5,0.5^{2})italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0.5 , 0.5 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) and 𝐱i,0U(π,π)i{1,,Nx}similar-tosubscript𝐱𝑖0𝑈𝜋𝜋for-all𝑖1subscript𝑁𝑥{\mathbf{x}}_{i,0}\sim U(-\pi,\pi)\ \forall i\in\{1,\dots,N_{x}\}bold_x start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT ∼ italic_U ( - italic_π , italic_π ) ∀ italic_i ∈ { 1 , … , italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT }. Our particle filter is initialised with p(𝐱0)=𝒩(𝐱0,0.2𝐈𝐝20)𝑝subscript𝐱0𝒩subscript𝐱00.2subscript𝐈𝐝20p({\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{0},0.2\mathbf{Id}_{20})italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , 0.2 bold_Id start_POSTSUBSCRIPT 20 end_POSTSUBSCRIPT ). We run the system until t=10𝑡10t=10italic_t = 10, and then begin collecting observations.

We note that this model cannot be represented exactly as a polynomial, and therefore the classification metrics used to illustrate sparsity recovery do not make sense. Therefore, we only present a normalised RMSE metric, where we compare the relative error in recovering the hidden state. nRMSE is given by nRMSE(𝐱EST,𝐱TM,𝐱GT)=RMSE(𝐱GG,𝐱GT)RMSE(𝐱TM,𝐱GT)nRMSEsubscript𝐱𝐸𝑆𝑇subscript𝐱𝑇𝑀subscript𝐱𝐺𝑇RMSEsubscript𝐱𝐺𝐺subscript𝐱𝐺𝑇RMSEsubscript𝐱𝑇𝑀subscript𝐱𝐺𝑇\mathrm{nRMSE}({\mathbf{x}}_{EST},{\mathbf{x}}_{TM},{\mathbf{x}}_{GT})=\frac{% \mathrm{RMSE}({\mathbf{x}}_{GG},{\mathbf{x}}_{GT})}{\mathrm{RMSE}({\mathbf{x}}% _{TM},{\mathbf{x}}_{GT})}roman_nRMSE ( bold_x start_POSTSUBSCRIPT italic_E italic_S italic_T end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_T italic_M end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_G italic_T end_POSTSUBSCRIPT ) = divide start_ARG roman_RMSE ( bold_x start_POSTSUBSCRIPT italic_G italic_G end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_G italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG roman_RMSE ( bold_x start_POSTSUBSCRIPT italic_T italic_M end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_G italic_T end_POSTSUBSCRIPT ) end_ARG, where 𝐱ESTsubscript𝐱𝐸𝑆𝑇{\mathbf{x}}_{EST}bold_x start_POSTSUBSCRIPT italic_E italic_S italic_T end_POSTSUBSCRIPT is the sequence of means recovered by an estimator, 𝐱TMsubscript𝐱𝑇𝑀{\mathbf{x}}_{TM}bold_x start_POSTSUBSCRIPT italic_T italic_M end_POSTSUBSCRIPT is the sequence of means recovered by a particle filter running with the true model, and 𝐱GTsubscript𝐱𝐺𝑇{\mathbf{x}}_{GT}bold_x start_POSTSUBSCRIPT italic_G italic_T end_POSTSUBSCRIPT is the sequence of states we generate when creating our synthetic data. Therefore, nRMSE=1nRMSE1\mathrm{nRMSE}=1roman_nRMSE = 1 indicates identical performance in terms of state recovery between the true model and an approximation.

We compare the performance of GraphGrad against the pMLE as above, but also against the true model, where we estimate the parameters using maximum likelihood, which we denote by TrueMLE. TrueMLE is an oracle algorithm, and serves as a baseline for the case there the form of the model is known. TrueMLE assumes the form the model in Eq. (20) is known, and estimates the 𝝎𝝎\bm{\omega}bold_italic_ω and K𝐾Kitalic_K parameters using maximum likelihood.

252525255050505010010010010020020020020011111.11.11.11.11.21.21.21.21.31.31.31.3Series lengthnRMSE
B-GraphGradpMLETrueMLE
Figure 8: Comparison of B-GraphGrad with pMLE and TrueMLE over variable series length on the Kuramoto oscillator. Quantities are divided by the RMSE of a filter utilising the true model. Markers denote mean performance, with the ribbons being symmetric 95% intervals.

We see in Figure 8 that TrueMLE gives the best results, which is expected given that it it is an oracle algorithm with respect to the form of the dynamics, whilst the other methods estimate both the form of the dynamics and the parameters thereof. Within these methods, GraphGrad performs significantly better than the polynomial MLE, and on average has only 20%percent2020\%20 % greater RMSE than state recovery with the true model and known parameters, whilst also assuming no knowledge of the underlying dynamics.

VI Conclusion

In this work, we have proposed GraphGrad, a method for fitting a sparse polynomial approximation to the transition function of a general state-space model. GraphGrad utilises a differentiable particle filter to learn a polynomial parameterisation of a general state-space model using gradient methods, from which we infer the connectivity of the state. GraphGrad promotes sparsity using a proximal operator, which is computationally efficient and stable. Our method is memory efficient, requiring only to keep track of a few parameters. Moreover, it is expressive, as many well known systems can be exactly represented by polynomial rates. Our method can utilise long observation series without vanishing gradients as we utilise observation batching to mitigate likelihood degeneracy. The method displays strong performance inferring the connectivity of a chaotic system, and keeps performing well when the true system cannot be exactly represented by the approximation.

References

  • [1] X. Wang, T. Li, S. Sun, and J. M. Corchado, “A survey of recent advances in particle filters and remaining challenges for multitarget tracking,” Sensors, vol. 17, no. 12, p. 2707, 2017.
  • [2] T. A. Patterson, A. Parton, R. Langrock, P. G. Blackwell, L. Thomas, and R. King, “Statistical modelling of individual animal movement: an overview of key methods and a discussion of practical challenges,” AStA Advances in Statistical Analysis, vol. 101, pp. 399–438, 2017.
  • [3] K. Newman, R. King, V. Elvira, P. de Valpine, R. S. McCrea, and B. J. Morgan, “State-space models for ecological time-series data: Practical model-fitting,” Methods in Ecology and Evolution, vol. 14, no. 1, pp. 26–42, 2023.
  • [4] A. Virbickaitė, H. F. Lopes, M. C. Ausín, and P. Galeano, “Particle learning for Bayesian semi-parametric stochastic volatility model,” Econometric Reviews, 2019.
  • [5] A. M. Clayton, A. C. Lorenc, and D. M. Barker, “Operational implementation of a hybrid ensemble/4d-Var global data assimilation system at the Met Office,” Quarterly Journal of the Royal Meteorological Society, vol. 139, no. 675, pp. 1445–1461, 2013.
  • [6] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
  • [7] H. E. Rauch, F. Tung, and C. T. Striebel, “Maximum likelihood estimates of linear dynamic systems,” AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965.
  • [8] E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter for nonlinear estimation,” in Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373).   IEEE, 2000, pp. 153–158.
  • [9] A. Doucet, A. M. Johansen et al., “A tutorial on particle filtering and smoothing: Fifteen years later,” Handbook of nonlinear filtering, vol. 12, no. 656-704, p. 3, 2009.
  • [10] P. Karkus, D. Hsu, and W. S. Lee, “Particle filter networks with application to visual localization,” in Proceedings of the Conference on Robot Learning.   PMLR, 2018, pp. 169–178.
  • [11] A. Corenflos, J. Thornton, A. Doucet, and G. Deligiannidis, “Differentiable particle filtering via entropy-regularized optimal transport,” arXiv preprint arXiv:2102.07850, 2021.
  • [12] A. Ścibior and F. Wood, “Differentiable particle filtering without modifying the forward pass,” arXiv preprint arXiv:2106.10314, 2021.
  • [13] E. N. Lorenz, “Deterministic nonperiodic flow,” Journal of atmospheric sciences, vol. 20, no. 2, pp. 130–141, 1963.
  • [14] ——, “Predictability: A problem partly solved,” in Proc. Seminar on predictability, vol. 1, no. 1.   Reading, 1996.
  • [15] S. Särkkä, Bayesian Filtering and Smoothing.   Cambridge University Press, 2013.
  • [16] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [17] V. Elvira and É. Chouzenoux, “Graphical inference in linear-Gaussian state-space models,” IEEE Transactions on Signal Processing, vol. 70, pp. 4757–4771, 2022.
  • [18] S. T. Tokdar and R. E. Kass, “Importance sampling: a review,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 1, pp. 54–60, 2010.
  • [19] C. Andrieu, A. Doucet, and R. Holenstein, “Particle Markov chain Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
  • [20] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
  • [21] B. Cox and V. Elvira, “Sparse bayesian estimation of parameters in linear-gaussian state-space models,” IEEE Transactions on Signal Processing, vol. 71, pp. 1922–1937, 2023.
  • [22] X. Chen and Y. Li, “An overview of differentiable particle filters for data-adaptive sequential Bayesian inference,” arXiv preprint arXiv:2302.09639, 2023.
  • [23] W. Li, X. Chen, W. Wang, V. Elvira, and Y. Li, “Differentiable bootstrap particle filters for regime-switching models,” arXiv preprint arXiv:2302.10319, 2023.
  • [24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [25] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
  • [26] E. Chouzenoux and V. Elvira, “Graphit: Iterative reweighted l1 algorithm for sparse graph inference in state-space models,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [27] ——, “Sparse graphical linear dynamical systems,” Journal of Machine Learning Research, vol. 25, no. 223, pp. 1–53, 2024.
  • [28] ——, “Graphical inference in non-markovian linear-gaussian state-space models,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 13 141–13 145.
  • [29] E. Tan, D. Corrêa, T. Stemler, and M. Small, “A backpropagation algorithm for inferring disentagled nodal dynamics and connectivity structure of dynamical networks,” IEEE Transactions on Network Science and Engineering, vol. 11, no. 1, pp. 613–624, 2024.
  • [30] B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, and J. M. Cohen, “Stochastic gradient methods with layer-wise adaptive moments for training of deep networks,” CoRR, vol. abs/1905.11286, 2019. [Online]. Available: https://arxiv.org/abs/1905.11286
  • [31] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212, 2011.
  • [32] L. Rosasco, S. Villa, and B. C. Vũ, “Convergence of stochastic proximal gradient algorithm,” Applied Mathematics & Optimization, vol. 82, pp. 891–917, 2020.
  • [33] P. L. Combettes and J.-C. Pesquet, “Stochastic approximations and perturbations in forward-backward splitting for monotone operators,” Pure and Applied Functional Analysis, vol. 1, no. 1, pp. 13–37, 2016.
  • [34] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: https://github.com/google/jax
  • [35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [36] F. Bach, R. Jenatton, J. Mairal, G. Obozinski et al., “Optimization with sparsity-inducing penalties,” Foundations and Trends® in Machine Learning, vol. 4, no. 1, pp. 1–106, 2012.
  • [37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [38] Y. Kuramoto, Chemical Oscillations, Waves, and Turbulence.   Springer, 1984.