Gaussian Processes with Noisy Regression Inputs for Dynamical Systems

Tobias M. Wolff, Victor G. Lopez, and Matthias A. Müller This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 948679).Tobias M. Wolff, Victor G. Lopez, and Matthias A. Müller are with the Leibniz University Hannover, Institute of Automatic Control, 30167 Hannover, Germany {wolff,lopez,mueller}@irt.uni-hannover.de
Abstract

This paper is centered around the approximation of dynamical systems by means of Gaussian processes. To this end, trajectories of such systems must be collected to be used as training data. The measurements of these trajectories are typically noisy, which implies that both the regression inputs and outputs are corrupted by noise. However, most of the literature considers only noise in the regression outputs. In this paper, we show how to account for the noise in the regression inputs in an extended Gaussian process framework to approximate scalar and multidimensional systems. We demonstrate the potential of our framework by comparing it to different state-of-the-art methods in several simulation examples.

I INTRODUCTION

The application of Gaussian process (GP) regression in the context of dynamical systems has received a substantial interest in recent years. It has been applied for a variety of applications such as, e.g., control [1, 2, 3] and state estimation [4, 5, 6]. The most common setup for GP regression considers two major assumptions on the measured data. First, it is assumed that the available regression input data are noise-free. Second, the measured regression output data are assumed to be corrupted by independent and identically distributed (iid) Gaussian noise.

One frequently applied approach to approximate dynamical systems by GPs is to model each component of the transition function f𝑓fitalic_f by the posterior means of independently learned GPs [1, 2, 4, 6]. To approximate these functions, it is assumed that the states (together with the control inputs) can be measured. Subsequently, the control input and state trajectory are used as regression input data, and the (by one time instant shifted) state trajectory is used as regression output data.

In most practical applications, the state measurements are corrupted by noise. On the one hand, this implies that the regression outputs are corrupted by noise, which is in accordance to the standard GP setting. On the other hand, this entails that the regression inputs are also corrupted by noise, which is not covered by the standard GP setting.

To cope with regression input noise in GP regression, one can use heteroscedastic GPs [7, 8] (where a second GP is used to model the noise variance and rather large amounts of data are needed [9]) or variational methods [10, 11]. An alternative, which is simple, but very effective has been proposed by [12, 9]. The key idea is to propagate the input noise to the output by using first order Taylor approximations of the posterior means (see Section II below for the details). In [12], the authors show that this approach can outperform variational methods, heteroscedastic GPs and standard GPs.

Our work can be considered as an extension of the framework suggested in [12] to dynamical systems. Here, one major difference is that one cannot arbitrarily sample training data points to set up a GP. Instead, one typically can only collect trajectories. We show that these trajectories induce correlations that must be taken into account when setting up a GP to correctly represent dynamical systems. Alongside these theoretical derivations, we illustrate the performance of our proposed extension by means of several simulation examples and compare it to the cases where a dynamical system is directly approximated using the method proposed in [12] and a standard GP [13].

This paper is organized as follows. In Section II, we explain some preliminaries and introduce the problem setting. In Sections III and IV, we introduce our framework for scalar and multidimensional systems, respectively. We close this paper with a conclusion in Section V.

II PRELIMINARIES AND PROBLEM SETTING

The set of real numbers is denoted by \mathbb{R}blackboard_R. The identity matrix of dimension N𝑁Nitalic_N is denoted by INsubscript𝐼𝑁I_{N}italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT. A diagonal matrix with q1,,qnsubscript𝑞1subscript𝑞𝑛q_{1},\dots,q_{n}italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT on its diagonal entries is denoted by diag(q1,,qn)diagsubscript𝑞1subscript𝑞𝑛\mathrm{diag}(q_{1},\dots,q_{n})roman_diag ( italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ). We denote the Kronecker product by tensor-product\otimes. We denote scalars by small letters, vectors by small bold letters and matrices by capital letters. A vector of zeros of length n𝑛nitalic_n is denoted by 𝟎nsubscript0𝑛\mathbf{0}_{n}bold_0 start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. A square matrix of zeros of dimension n𝑛nitalic_n is denoted by 0n×nsubscript0𝑛𝑛0_{n\times n}0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT.

We briefly review the fundamentals of standard Gaussian processes; a more detailed introduction to GPs can be found in [13]. GPs are commonly applied to approximate some nonlinear function f¯:n¯:¯𝑓superscript¯𝑛\bar{f}:\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}over¯ start_ARG italic_f end_ARG : blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_n end_ARG end_POSTSUPERSCRIPT → blackboard_R. They are fully described by a mean function m:n¯:𝑚superscript¯𝑛m:\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}italic_m : blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_n end_ARG end_POSTSUPERSCRIPT → blackboard_R and a covariance function (also referred to as kernel) k:n¯×n¯:𝑘superscript¯𝑛superscript¯𝑛k:\mathbb{R}^{\bar{n}}\times\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}italic_k : blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_n end_ARG end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_n end_ARG end_POSTSUPERSCRIPT → blackboard_R. For some 𝐱¯,𝐱¯n¯¯𝐱superscript¯𝐱superscript¯𝑛\bar{\mathbf{x}},\bar{\mathbf{x}}^{\prime}\in\mathbb{R}^{\bar{n}}over¯ start_ARG bold_x end_ARG , over¯ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_n end_ARG end_POSTSUPERSCRIPT, we write

f¯(𝐱¯)𝒢𝒫(m(𝐱¯),k(𝐱¯,𝐱¯))similar-to¯𝑓¯𝐱𝒢𝒫𝑚¯𝐱𝑘¯𝐱superscript¯𝐱\displaystyle\bar{f}(\bar{\mathbf{x}})\sim\mathcal{GP}(m(\bar{\mathbf{x}}),k(% \bar{\mathbf{x}},\bar{\mathbf{x}}^{\prime}))over¯ start_ARG italic_f end_ARG ( over¯ start_ARG bold_x end_ARG ) ∼ caligraphic_G caligraphic_P ( italic_m ( over¯ start_ARG bold_x end_ARG ) , italic_k ( over¯ start_ARG bold_x end_ARG , over¯ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) (1)

to denote that the function f¯¯𝑓\bar{f}over¯ start_ARG italic_f end_ARG follows a GP with mean function m𝑚mitalic_m and covariance function k𝑘kitalic_k. We collect N𝑁Nitalic_N regression input and output data points from the unknown function and use them to define X¯=(𝐱¯(0)𝐱¯(N1))¯𝑋matrix¯𝐱0¯𝐱𝑁1\bar{X}=\begin{pmatrix}\bar{\mathbf{x}}(0)&\dots&\bar{\mathbf{x}}(N-1)\end{pmatrix}over¯ start_ARG italic_X end_ARG = ( start_ARG start_ROW start_CELL over¯ start_ARG bold_x end_ARG ( 0 ) end_CELL start_CELL … end_CELL start_CELL over¯ start_ARG bold_x end_ARG ( italic_N - 1 ) end_CELL end_ROW end_ARG ) and Y¯=(y¯(0)y¯(N1))¯Ysuperscriptmatrix¯𝑦0¯𝑦𝑁1top\bar{\textbf{Y}}=\begin{pmatrix}\bar{y}(0)&\dots&\bar{y}(N-1)\end{pmatrix}^{\top}over¯ start_ARG Y end_ARG = ( start_ARG start_ROW start_CELL over¯ start_ARG italic_y end_ARG ( 0 ) end_CELL start_CELL … end_CELL start_CELL over¯ start_ARG italic_y end_ARG ( italic_N - 1 ) end_CELL end_ROW end_ARG ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, respectively. The regression outputs are given by y¯=f¯(𝐱¯)+ε¯¯𝑦¯𝑓¯𝐱¯𝜀\bar{y}=\bar{f}(\bar{\mathbf{x}})+\bar{\varepsilon}over¯ start_ARG italic_y end_ARG = over¯ start_ARG italic_f end_ARG ( over¯ start_ARG bold_x end_ARG ) + over¯ start_ARG italic_ε end_ARG with ε¯¯𝜀\bar{\varepsilon}over¯ start_ARG italic_ε end_ARG being iid Gaussian noise with zero mean and variance σε¯2superscriptsubscript𝜎¯𝜀2\sigma_{\bar{\varepsilon}}^{2}italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. The key idea of Gaussian processes is to condition the prior distribution on the training data, which results in a posterior distribution. For some test input 𝐱¯subscript¯𝐱\bar{\mathbf{x}}_{\ast}over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT, the mean and variance of the posterior distribution are given by [13, Ch. 2]

m¯+(𝐱¯|X¯,Y¯)=𝐤(𝐱¯,X¯)(K(X¯,X¯)+σε¯2IN)1𝐘¯subscript¯𝑚conditionalsubscript¯𝐱¯𝑋¯Y𝐤subscript¯𝐱¯𝑋superscript𝐾¯𝑋¯𝑋superscriptsubscript𝜎¯𝜀2subscript𝐼𝑁1¯𝐘\displaystyle\bar{m}_{+}(\bar{\mathbf{x}}_{\ast}|\bar{X},\bar{\textbf{Y}})=% \mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})(K(\bar{X},\bar{X})+\sigma_{\bar{% \varepsilon}}^{2}I_{N})^{-1}\bar{\mathbf{Y}}over¯ start_ARG italic_m end_ARG start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | over¯ start_ARG italic_X end_ARG , over¯ start_ARG Y end_ARG ) = bold_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG italic_X end_ARG ) ( italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) + italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG bold_Y end_ARG (2)
σ¯+2(𝐱¯|X¯,Y¯)=superscriptsubscript¯𝜎2conditionalsubscript¯𝐱¯𝑋¯Yabsent\displaystyle\bar{\sigma}_{+}^{2}(\bar{\mathbf{x}}_{\ast}|\bar{X},\bar{\textbf% {Y}})=over¯ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | over¯ start_ARG italic_X end_ARG , over¯ start_ARG Y end_ARG ) =
𝐤(𝐱¯,𝐱¯)𝐤(𝐱¯,X¯)(K(X¯,X¯)+σε¯2IN)1𝐤(X¯,𝐱¯),𝐤subscript¯𝐱subscript¯𝐱𝐤subscript¯𝐱¯𝑋superscript𝐾¯𝑋¯𝑋superscriptsubscript𝜎¯𝜀2subscript𝐼𝑁1𝐤¯𝑋subscript¯𝐱\displaystyle\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{\mathbf{x}}_{\ast})-% \mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})(K(\bar{X},\bar{X})+\sigma_{\bar{% \varepsilon}}^{2}I_{N})^{-1}\mathbf{k}(\bar{X},\bar{\mathbf{x}}_{\ast}),bold_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) - bold_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG italic_X end_ARG ) ( italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) + italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_k ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) , (3)

for 𝐤(𝐱¯,X¯)=(k(𝐱¯,𝐱¯i))𝐱¯iX¯=𝐤(X¯,𝐱¯)𝐤subscript¯𝐱¯𝑋subscriptmatrix𝑘subscript¯𝐱subscript¯𝐱𝑖subscript¯𝐱𝑖¯𝑋𝐤superscript¯𝑋subscript¯𝐱top\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})=\begin{pmatrix}k(\bar{\mathbf{x}}_% {\ast},\bar{\mathbf{x}}_{i})\end{pmatrix}_{\bar{\mathbf{x}}_{i}\in\bar{X}}=% \mathbf{k}(\bar{X},\bar{\mathbf{x}}_{\ast})^{\top}bold_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG italic_X end_ARG ) = ( start_ARG start_ROW start_CELL italic_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ) start_POSTSUBSCRIPT over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ over¯ start_ARG italic_X end_ARG end_POSTSUBSCRIPT = bold_k ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, with 𝐤(𝐱¯,X¯)1×N𝐤subscript¯𝐱¯𝑋superscript1𝑁\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})\in\mathbb{R}^{1\times N}bold_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , over¯ start_ARG italic_X end_ARG ) ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_N end_POSTSUPERSCRIPT, and K(X¯,X¯)=(k(𝐱¯i,𝐱¯j))𝐱¯i,𝐱¯jX¯𝐾¯𝑋¯𝑋subscript𝑘subscript¯𝐱𝑖subscript¯𝐱𝑗subscript¯𝐱𝑖subscript¯𝐱𝑗¯𝑋K(\bar{X},\bar{X})=(k(\bar{\mathbf{x}}_{i},\bar{\mathbf{x}}_{j}))_{\bar{% \mathbf{x}}_{i},\bar{\mathbf{x}}_{j}\in\bar{X}}italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) = ( italic_k ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ over¯ start_ARG italic_X end_ARG end_POSTSUBSCRIPT with K(X¯,X¯)N×N𝐾¯𝑋¯𝑋superscript𝑁𝑁K(\bar{X},\bar{X})\in\mathbb{R}^{N\times N}italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT. The kernel depends on hyperparameters (such as, e.g., the signal variance and the length scales in case of the squared exponential kernel) that are commonly determined by maximizing the log marginal likelihood, see, e.g., [13, Eq. (2.30)].

These standard results in Gaussian processes rely on the assumption that the regression input data are noise-free. In turn, if the regression input data points are affected by some noise such that only

𝐱¯ˇ𝐱¯+𝐫¯ˇ¯𝐱¯𝐱¯𝐫\check{\bar{\mathbf{x}}}\coloneqq\bar{\mathbf{x}}+\bar{\mathbf{r}}overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ≔ over¯ start_ARG bold_x end_ARG + over¯ start_ARG bold_r end_ARG (4)

is available with 𝐫¯¯𝐫\bar{\mathbf{r}}over¯ start_ARG bold_r end_ARG being some iid Gaussian noise with variance Σr¯=diag(σr¯2,,σr¯2)subscriptΣ¯𝑟diagsubscriptsuperscript𝜎2¯𝑟subscriptsuperscript𝜎2¯𝑟\Sigma_{\bar{r}}=\mathrm{diag}(\sigma^{2}_{\bar{r}},\dots,\sigma^{2}_{\bar{r}})roman_Σ start_POSTSUBSCRIPT over¯ start_ARG italic_r end_ARG end_POSTSUBSCRIPT = roman_diag ( italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over¯ start_ARG italic_r end_ARG end_POSTSUBSCRIPT , … , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over¯ start_ARG italic_r end_ARG end_POSTSUBSCRIPT ), we cannot use standard GP tools anymore, since the problem of exact GP regression based on noisy regression inputs is intractable [14, Sec. 2.3.2]. We here briefly review the work of [12, Ch. 2] (which is more detailed than the original work [9]) to handle this issue. First, a Taylor series expansion around the noisy regression input is done (and truncated after the first-order term), which results in

f¯(𝐱¯)=f¯(𝐱¯ˇ𝐫¯)f¯(𝐱¯ˇ)f¯(𝐱)𝐱|𝐱=𝐱¯ˇ𝐫¯.\displaystyle\bar{f}(\bar{\mathbf{x}})=\bar{f}(\check{\bar{\mathbf{x}}}-\bar{% \mathbf{r}})\approx\bar{f}(\check{\bar{\mathbf{x}}})-\frac{\partial\bar{f}(% \mathbf{x})}{\partial\mathbf{x}}\Big{\rvert}_{\mathbf{x}=\check{\bar{\mathbf{x% }}}}\bar{\mathbf{r}}.over¯ start_ARG italic_f end_ARG ( over¯ start_ARG bold_x end_ARG ) = over¯ start_ARG italic_f end_ARG ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG - over¯ start_ARG bold_r end_ARG ) ≈ over¯ start_ARG italic_f end_ARG ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ) - divide start_ARG ∂ over¯ start_ARG italic_f end_ARG ( bold_x ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG end_POSTSUBSCRIPT over¯ start_ARG bold_r end_ARG . (5)

The second term depends on the derivative of a GP, which is again a GP [15]. Although one can compute the first and second moment of this expression, it is much simpler to perform another approximation by replacing the derivative of the GP by the derivative of its posterior mean [12]. In this case, we consider the following model

y¯f¯(𝐱¯ˇ)m¯+(𝐱¯|X¯,Y¯)𝐱¯|𝐱¯=𝐱¯ˇ𝐫¯+ε¯.\displaystyle\bar{y}\approx\bar{f}(\check{\bar{\mathbf{x}}})-\frac{\partial% \bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{Y})}{\partial\bar{\mathbf{x}}}\Big{% \rvert}_{\bar{\mathbf{x}}=\check{\bar{\mathbf{x}}}}\bar{\mathbf{r}}+\bar{% \varepsilon}.over¯ start_ARG italic_y end_ARG ≈ over¯ start_ARG italic_f end_ARG ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ) - divide start_ARG ∂ over¯ start_ARG italic_m end_ARG start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( over¯ start_ARG bold_x end_ARG | over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_Y end_ARG ) end_ARG start_ARG ∂ over¯ start_ARG bold_x end_ARG end_ARG | start_POSTSUBSCRIPT over¯ start_ARG bold_x end_ARG = overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG end_POSTSUBSCRIPT over¯ start_ARG bold_r end_ARG + over¯ start_ARG italic_ε end_ARG . (6)

This model results in the following covariance matrix

Kˇ=ˇ𝐾absent\displaystyle\check{K}=overroman_ˇ start_ARG italic_K end_ARG = (k(𝐱¯ˇ(0),𝐱¯ˇ(0))k(𝐱¯ˇ(0),𝐱¯ˇ(N1))k(𝐱¯ˇ(N1),𝐱¯ˇ(0))k(𝐱¯ˇ(N1),𝐱¯ˇ(N1)))matrix𝑘ˇ¯𝐱0ˇ¯𝐱0𝑘ˇ¯𝐱0ˇ¯𝐱𝑁1𝑘ˇ¯𝐱𝑁1ˇ¯𝐱0𝑘ˇ¯𝐱𝑁1ˇ¯𝐱𝑁1\displaystyle\begin{pmatrix}k(\check{\bar{\mathbf{x}}}(0),\check{\bar{\mathbf{% x}}}(0))&\dots&k(\check{\bar{\mathbf{x}}}(0),\check{\bar{\mathbf{x}}}(N-1))\\ \vdots&\ddots&\vdots\\ k(\check{\bar{\mathbf{x}}}(N-1),\check{\bar{\mathbf{x}}}(0))&\dots&k(\check{% \bar{\mathbf{x}}}(N-1),\check{\bar{\mathbf{x}}}(N-1))\end{pmatrix}( start_ARG start_ROW start_CELL italic_k ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( 0 ) , overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( 0 ) ) end_CELL start_CELL … end_CELL start_CELL italic_k ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( 0 ) , overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_N - 1 ) ) end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_k ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_N - 1 ) , overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( 0 ) ) end_CELL start_CELL … end_CELL start_CELL italic_k ( overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_N - 1 ) , overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_N - 1 ) ) end_CELL end_ROW end_ARG )
+diag(σ¯out2(0),,σ¯out2(N1))diagsubscriptsuperscript¯𝜎2out0subscriptsuperscript¯𝜎2out𝑁1\displaystyle+\mathrm{diag}(\bar{\sigma}^{2}_{\mathrm{out}}(0),\dots,\bar{% \sigma}^{2}_{\mathrm{out}}(N-1))+ roman_diag ( over¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( 0 ) , … , over¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( italic_N - 1 ) ) (7)

with

σ¯out2(i)subscriptsuperscript¯𝜎2out𝑖absent\displaystyle\bar{\sigma}^{2}_{\mathrm{out}}(i)\coloneqqover¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( italic_i ) ≔
m¯+(𝐱¯|X¯,Y¯)𝐱¯|𝐱¯=𝐱¯ˇ(i)Σr¯m¯+(𝐱¯|X¯,Y¯)𝐱¯|𝐱¯=𝐱¯ˇ(i)+σε¯2.\displaystyle\frac{\partial\bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{Y})}{% \partial\bar{\mathbf{x}}}\Big{\rvert}_{\bar{\mathbf{x}}=\check{\bar{\mathbf{x}% }}(i)}\Sigma_{\bar{r}}\frac{\partial\bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{% Y})}{\partial\bar{\mathbf{x}}}\Big{\rvert}_{\bar{\mathbf{x}}=\check{\bar{% \mathbf{x}}}(i)}^{\top}+\sigma_{\bar{\varepsilon}}^{2}.divide start_ARG ∂ over¯ start_ARG italic_m end_ARG start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( over¯ start_ARG bold_x end_ARG | over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_Y end_ARG ) end_ARG start_ARG ∂ over¯ start_ARG bold_x end_ARG end_ARG | start_POSTSUBSCRIPT over¯ start_ARG bold_x end_ARG = overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_i ) end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT over¯ start_ARG italic_r end_ARG end_POSTSUBSCRIPT divide start_ARG ∂ over¯ start_ARG italic_m end_ARG start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( over¯ start_ARG bold_x end_ARG | over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_Y end_ARG ) end_ARG start_ARG ∂ over¯ start_ARG bold_x end_ARG end_ARG | start_POSTSUBSCRIPT over¯ start_ARG bold_x end_ARG = overroman_ˇ start_ARG over¯ start_ARG bold_x end_ARG end_ARG ( italic_i ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT + italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (8)

The expressions of the posterior mean and variance are analogous to (2) and (3), simply with K(X¯,X¯)+σε¯2IN𝐾¯𝑋¯𝑋superscriptsubscript𝜎¯𝜀2subscript𝐼𝑁K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}I_{N}italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) + italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT replaced by Kˇˇ𝐾\check{K}overroman_ˇ start_ARG italic_K end_ARG from (7). Note that we have one further hyperparameter to determine, which is the variance of the input noise. The optimization of the hyperparameters must be adapted, since the covariance matrix now depends on the derivatives of the posterior mean. Hence, [12] proposes to iterate the computations of the slopes of the posterior mean and the optimization of the hyperparameters. Note that the approach does not differ from a standard GP for (i) negligible input noise levels and (ii) constant posterior mean gradients [12]. Finally, in simulation examples this approach often outperforms heteroscedastic GPs, standard GPs, as well as variational methods [12].

In this work, we focus on discrete-time nonlinear dynamical systems of the following form111To simplify the notation, we do not consider control inputs in (9). However, the results of this paper can be straightforwardly extended to systems with control inputs.

𝐱(t+1)𝐱𝑡1\displaystyle\mathbf{x}(t+1)bold_x ( italic_t + 1 ) =𝐟(𝐱(t))+𝐰(t)absent𝐟𝐱𝑡𝐰𝑡\displaystyle=\mathbf{f}(\mathbf{x}(t))+\mathbf{w}(t)= bold_f ( bold_x ( italic_t ) ) + bold_w ( italic_t ) (9)

with states 𝐱n𝐱superscript𝑛\mathbf{x}\in\mathbb{R}^{n}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, process noise 𝐰n𝐰superscript𝑛\mathbf{w}\in\mathbb{R}^{n}bold_w ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT (sometimes also referred to as system noise), and 𝐟:nn:𝐟superscript𝑛superscript𝑛\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}bold_f : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The process noise 𝐰𝐰\mathbf{w}bold_w is assumed to be iid Gaussian noise with zero mean and variance Σw=diag(σw2,,σw2)subscriptΣ𝑤diagsuperscriptsubscript𝜎𝑤2superscriptsubscript𝜎𝑤2\Sigma_{w}=\mathrm{diag}(\sigma_{w}^{2},\dots,\sigma_{w}^{2})roman_Σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = roman_diag ( italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , … , italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Here, we assume the same noise variance among all components to simplify the analysis. The objective of this work is to approximate the function 𝐟𝐟\mathbf{f}bold_f by (the posterior means of) Gaussian processes. To this end, we collect a sufficiently long (or multiple shorter) trajectory from the system. In the here considered setting of dynamical systems, we cannot collect arbitrary data points. This is due to the recursive structure of (9): the (noisy) outputs of the function 𝐟𝐟\mathbf{f}bold_f at some time instant correspond to the function inputs at the next time instant.

K~=(k(x~(0),x~(0))+σout2(0)k(x~(0),x~(1))0σr2k(x~(0),x~(N1)k(x~(1),x~(0))0σr2k(x~(1),x~(1))+σout2(1)k(x~(1),x~(N1)k(x~(N2),x~(0))k(x~(N2),x~(1))k(x~(N2),x~(N1))N2σr2k(x~(N1),x~(0))k(x~(N1),x~(1))k(x~(N1),x~(N1))+σout2(N1))\displaystyle\tilde{K}=\begin{pmatrix}k(\tilde{x}(0),\tilde{x}(0))+\sigma^{2}_% {\mathrm{out}}(0)&k(\tilde{x}(0),\tilde{x}(1))-\nabla_{0}\sigma^{2}_{r}&\dots&% k(\tilde{x}(0),\tilde{x}(N-1)\\ k(\tilde{x}(1),\tilde{x}(0))-\nabla_{0}\sigma^{2}_{r}&k(\tilde{x}(1),\tilde{x}% (1))+\sigma^{2}_{\mathrm{out}}(1)&\dots&k(\tilde{x}(1),\tilde{x}(N-1)\\ \vdots&\vdots&\ddots&\vdots\\ k(\tilde{x}(N-2),\tilde{x}(0))&k(\tilde{x}(N-2),\tilde{x}(1))&&k(\tilde{x}(N-2% ),\tilde{x}(N-1))-\nabla_{N-2}\sigma^{2}_{r}\\ k(\tilde{x}(N-1),\tilde{x}(0))&k(\tilde{x}(N-1),\tilde{x}(1))&\dots&k(\tilde{x% }(N-1),\tilde{x}(N-1))+\sigma^{2}_{\mathrm{out}}(N-1)\end{pmatrix}over~ start_ARG italic_K end_ARG = ( start_ARG start_ROW start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 0 ) , over~ start_ARG italic_x end_ARG ( 0 ) ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( 0 ) end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 0 ) , over~ start_ARG italic_x end_ARG ( 1 ) ) - ∇ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 0 ) , over~ start_ARG italic_x end_ARG ( italic_N - 1 ) end_CELL end_ROW start_ROW start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 1 ) , over~ start_ARG italic_x end_ARG ( 0 ) ) - ∇ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 1 ) , over~ start_ARG italic_x end_ARG ( 1 ) ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( 1 ) end_CELL start_CELL … end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( 1 ) , over~ start_ARG italic_x end_ARG ( italic_N - 1 ) end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 2 ) , over~ start_ARG italic_x end_ARG ( 0 ) ) end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 2 ) , over~ start_ARG italic_x end_ARG ( 1 ) ) end_CELL start_CELL end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 2 ) , over~ start_ARG italic_x end_ARG ( italic_N - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_N - 2 end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 1 ) , over~ start_ARG italic_x end_ARG ( 0 ) ) end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 1 ) , over~ start_ARG italic_x end_ARG ( 1 ) ) end_CELL start_CELL … end_CELL start_CELL italic_k ( over~ start_ARG italic_x end_ARG ( italic_N - 1 ) , over~ start_ARG italic_x end_ARG ( italic_N - 1 ) ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( italic_N - 1 ) end_CELL end_ROW end_ARG ) (\star)

 

When measuring a trajectory from the system, one has (in most applications) only access to noisy measurements of the trajectories (due to, e.g., noise coming from the sensors). This means that only

𝐱~(0)~𝐱0\displaystyle\tilde{\mathbf{x}}(0)over~ start_ARG bold_x end_ARG ( 0 ) =𝐱(0)+𝐫(0)absent𝐱0𝐫0\displaystyle=\mathbf{x}(0)+\mathbf{r}(0)= bold_x ( 0 ) + bold_r ( 0 ) (10)
𝐱~(1)~𝐱1\displaystyle\tilde{\mathbf{x}}(1)over~ start_ARG bold_x end_ARG ( 1 ) =𝐱(1)+𝐫(1)=𝐟(𝐱(0))+𝐰(0)+𝐫(1)absent𝐱1𝐫1𝐟𝐱0𝐰0𝐫1\displaystyle=\mathbf{x}(1)+\mathbf{r}(1)=\mathbf{f}(\mathbf{x}(0))+\mathbf{w}% (0)+\mathbf{r}(1)= bold_x ( 1 ) + bold_r ( 1 ) = bold_f ( bold_x ( 0 ) ) + bold_w ( 0 ) + bold_r ( 1 ) (11)
𝐱~(2)~𝐱2\displaystyle\tilde{\mathbf{x}}(2)over~ start_ARG bold_x end_ARG ( 2 ) =𝐱(2)+𝐫(2)=𝐟(𝐱(1))+𝐰(1)+𝐫(2)absent𝐱2𝐫2𝐟𝐱1𝐰1𝐫2\displaystyle=\mathbf{x}(2)+\mathbf{r}(2)=\mathbf{f}(\mathbf{x}(1))+\mathbf{w}% (1)+\mathbf{r}(2)= bold_x ( 2 ) + bold_r ( 2 ) = bold_f ( bold_x ( 1 ) ) + bold_w ( 1 ) + bold_r ( 2 ) (12)
\displaystyle\>\>\>\vdots
𝐱~(N)~𝐱𝑁\displaystyle\tilde{\mathbf{x}}(N)over~ start_ARG bold_x end_ARG ( italic_N ) =𝐱(N)+𝐫(N)absent𝐱𝑁𝐫𝑁\displaystyle=\mathbf{x}(N)+\mathbf{r}(N)= bold_x ( italic_N ) + bold_r ( italic_N )
=𝐟(𝐱(N1))+𝐰(N1)+𝐫(N)absent𝐟𝐱𝑁1𝐰𝑁1𝐫𝑁\displaystyle=\mathbf{f}(\mathbf{x}(N-1))+\mathbf{w}(N-1)+\mathbf{r}(N)= bold_f ( bold_x ( italic_N - 1 ) ) + bold_w ( italic_N - 1 ) + bold_r ( italic_N ) (13)

can be measured with 𝐫𝐫\mathbf{r}bold_r being iid Gaussian noise with variance Σr=diag(σr2,,σr2)subscriptΣ𝑟diagsuperscriptsubscript𝜎𝑟2superscriptsubscript𝜎𝑟2\Sigma_{r}=\mathrm{diag}(\sigma_{r}^{2},\dots,\sigma_{r}^{2})roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = roman_diag ( italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , … , italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Note that we consider some measurement noise 𝐫𝐫\mathbf{r}bold_r in addition to the standard process noise 𝐰𝐰\mathbf{w}bold_w (which is often considered in the context of GP based control and estimation, compare, e.g., [1, 4]). The measurement noise 𝐫𝐫\mathbf{r}bold_r and the process noise 𝐰𝐰\mathbf{w}bold_w are assumed to be independent. To approximate the function 𝐟𝐟\mathbf{f}bold_f, we have 𝐱~(0),,𝐱~(N1)~𝐱0~𝐱𝑁1\tilde{\mathbf{x}}(0),\dots,\tilde{\mathbf{x}}(N-1)over~ start_ARG bold_x end_ARG ( 0 ) , … , over~ start_ARG bold_x end_ARG ( italic_N - 1 ) as regression input data and 𝐱~(1),,𝐱~(N)~𝐱1~𝐱𝑁\tilde{\mathbf{x}}(1),\dots,\tilde{\mathbf{x}}(N)over~ start_ARG bold_x end_ARG ( 1 ) , … , over~ start_ARG bold_x end_ARG ( italic_N ) as regression output data available. We do not have access to the true regression inputs, i.e., 𝐱(0),,𝐱(N1)𝐱0𝐱𝑁1\mathbf{x}(0),\dots,\mathbf{x}(N-1)bold_x ( 0 ) , … , bold_x ( italic_N - 1 ).

The subject of this work is to propose a framework to account for the input noise in the case of dynamical systems, where only noisy trajectories are available as training data.

III SCALAR SYSTEMS

III-A Analysis of regression input noise

In this section, we consider f::𝑓f:\mathbb{R}\rightarrow\mathbb{R}italic_f : blackboard_R → blackboard_R and x𝑥x\in\mathbb{R}italic_x ∈ blackboard_R. As training data, we assume that one trajectory of length N+1𝑁1N+1italic_N + 1 has been collected to set up the GP. We use the same approach as in (6) and introduce

im+(x|𝐗~in,𝐗~out)x|x=x~(i)\displaystyle\nabla_{i}\coloneqq\frac{\partial m_{+}(x|\tilde{\mathbf{X}}^{% \mathrm{in}},\tilde{\mathbf{X}}^{\mathrm{out}})}{\partial x}\Big{\rvert}_{x=% \tilde{x}(i)}∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x | over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_x end_ARG | start_POSTSUBSCRIPT italic_x = over~ start_ARG italic_x end_ARG ( italic_i ) end_POSTSUBSCRIPT (14)

with 𝐗~in=(x~(0)x~(N1))superscript~𝐗inmatrix~𝑥0~𝑥𝑁1\tilde{\mathbf{X}}^{\mathrm{in}}=\begin{pmatrix}\tilde{x}(0)&\dots&\tilde{x}(N% -1)\end{pmatrix}over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT = ( start_ARG start_ROW start_CELL over~ start_ARG italic_x end_ARG ( 0 ) end_CELL start_CELL … end_CELL start_CELL over~ start_ARG italic_x end_ARG ( italic_N - 1 ) end_CELL end_ROW end_ARG ) and 𝐗~out=(x~(1)x~(N))superscript~𝐗outmatrix~𝑥1~𝑥𝑁\tilde{\mathbf{X}}^{\mathrm{out}}=\begin{pmatrix}\tilde{x}(1)&\dots&\tilde{x}(% N)\end{pmatrix}over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT = ( start_ARG start_ROW start_CELL over~ start_ARG italic_x end_ARG ( 1 ) end_CELL start_CELL … end_CELL start_CELL over~ start_ARG italic_x end_ARG ( italic_N ) end_CELL end_ROW end_ARG ) to denote the derivative of the posterior mean approximating the function f𝑓fitalic_f at the location x~(i)~𝑥𝑖\tilde{x}(i)over~ start_ARG italic_x end_ARG ( italic_i ). Together with (10) - (13), this results in

x~(i)f(x~(i1))i1r(i1)+w(i1)+r(i).~𝑥𝑖𝑓~𝑥𝑖1subscript𝑖1𝑟𝑖1𝑤𝑖1𝑟𝑖\displaystyle\tilde{x}(i)\approx f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r% (i).over~ start_ARG italic_x end_ARG ( italic_i ) ≈ italic_f ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT italic_r ( italic_i - 1 ) + italic_w ( italic_i - 1 ) + italic_r ( italic_i ) .

The variance corresponds to

cov(x~(i),x~(i))cov~𝑥𝑖~𝑥𝑖\displaystyle\mathrm{cov}\big{(}\tilde{x}(i),\tilde{x}(i)\big{)}roman_cov ( over~ start_ARG italic_x end_ARG ( italic_i ) , over~ start_ARG italic_x end_ARG ( italic_i ) )
𝔼{(f(x~(i1))i1r(i1)+w(i1)+r(i)\displaystyle\approx\mathbb{E}\bigg{\{}\Big{(}f(\tilde{x}(i-1))-\nabla_{i-1}r(% i-1)+w(i-1)+r(i)≈ blackboard_E { ( italic_f ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT italic_r ( italic_i - 1 ) + italic_w ( italic_i - 1 ) + italic_r ( italic_i )
𝔼{f(x~(i1))i1r(i1)+w(i1)+r(i)})2}\displaystyle-\mathbb{E}\big{\{}f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(% i)\big{\}}\Big{)}^{2}\bigg{\}}- blackboard_E { italic_f ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT italic_r ( italic_i - 1 ) + italic_w ( italic_i - 1 ) + italic_r ( italic_i ) } ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT }
=k(x~(i1),x~(i1))+i1Σri1+σw2+σr2absent𝑘~𝑥𝑖1~𝑥𝑖1subscript𝑖1subscriptΣ𝑟subscript𝑖1superscriptsubscript𝜎𝑤2superscriptsubscript𝜎𝑟2\displaystyle=k(\tilde{x}(i-1),\tilde{x}(i-1))+\nabla_{i-1}\Sigma_{r}\nabla_{i% -1}+\sigma_{w}^{2}+\sigma_{r}^{2}= italic_k ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) , over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) + ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
k(x~(i1),x~(i1))+σout2(i1)absent𝑘~𝑥𝑖1~𝑥𝑖1subscriptsuperscript𝜎2out𝑖1\displaystyle\eqqcolon k(\tilde{x}(i-1),\tilde{x}(i-1))+\sigma^{2}_{\mathrm{% out}}(i-1)≕ italic_k ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) , over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_out end_POSTSUBSCRIPT ( italic_i - 1 ) (15)

for all i=1,,N𝑖1𝑁i=1,\dots,Nitalic_i = 1 , … , italic_N, since (i) 𝐰𝐰\mathbf{w}bold_w and 𝐫𝐫\mathbf{r}bold_r are independent and (ii) 𝐫𝐫\mathbf{r}bold_r is assumed to be iid.

We compute the covariance of two subsequent samples

cov(x~(i+1),x~(i))cov~𝑥𝑖1~𝑥𝑖absent\displaystyle\mathrm{cov}\big{(}\tilde{x}(i+1),\tilde{x}(i)\big{)}\approxroman_cov ( over~ start_ARG italic_x end_ARG ( italic_i + 1 ) , over~ start_ARG italic_x end_ARG ( italic_i ) ) ≈
𝔼{(f(x~(i))ir(i)+w(i)+r(i+1)\displaystyle\mathbb{E}\bigg{\{}\Big{(}f(\tilde{x}(i))-\nabla_{i}r(i)+w(i)+r(i% +1)blackboard_E { ( italic_f ( over~ start_ARG italic_x end_ARG ( italic_i ) ) - ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r ( italic_i ) + italic_w ( italic_i ) + italic_r ( italic_i + 1 )
𝔼{f(x~(i))ir(i)+w(i)+r(i+1)})\displaystyle-\mathbb{E}\big{\{}f(\tilde{x}(i))-\nabla_{i}r(i)+w(i)+r(i+1)\big% {\}}\Big{)}- blackboard_E { italic_f ( over~ start_ARG italic_x end_ARG ( italic_i ) ) - ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r ( italic_i ) + italic_w ( italic_i ) + italic_r ( italic_i + 1 ) } )
(f(x~(i1))i1r(i1)+w(i1)+r(i)\displaystyle\hskip 14.22636pt\Big{(}f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-% 1)+r(i)( italic_f ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT italic_r ( italic_i - 1 ) + italic_w ( italic_i - 1 ) + italic_r ( italic_i )
𝔼{f(x~(i1))i1r(i1)+w(i1)+r(i)})},\displaystyle-\mathbb{E}\big{\{}f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(% i)\big{\}}\Big{)}\bigg{\}},- blackboard_E { italic_f ( over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT italic_r ( italic_i - 1 ) + italic_w ( italic_i - 1 ) + italic_r ( italic_i ) } ) } ,

resulting in

cov(x~(i+1),x~(i))cov~𝑥𝑖1~𝑥𝑖\displaystyle\mathrm{cov}\big{(}\tilde{x}(i+1),\tilde{x}(i)\big{)}roman_cov ( over~ start_ARG italic_x end_ARG ( italic_i + 1 ) , over~ start_ARG italic_x end_ARG ( italic_i ) ) k(x~(i),x~(i1))𝔼{ir(i)r(i)}absent𝑘~𝑥𝑖~𝑥𝑖1𝔼subscript𝑖𝑟𝑖𝑟𝑖\displaystyle\approx k(\tilde{x}(i),\tilde{x}(i-1))-\mathbb{E}\big{\{}\nabla_{% i}r(i)r(i)\big{\}}≈ italic_k ( over~ start_ARG italic_x end_ARG ( italic_i ) , over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - blackboard_E { ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r ( italic_i ) italic_r ( italic_i ) }
=k(x~(i),x~(i1))iσr2absent𝑘~𝑥𝑖~𝑥𝑖1subscript𝑖superscriptsubscript𝜎𝑟2\displaystyle=k(\tilde{x}(i),\tilde{x}(i-1))-\nabla_{i}\sigma_{r}^{2}= italic_k ( over~ start_ARG italic_x end_ARG ( italic_i ) , over~ start_ARG italic_x end_ARG ( italic_i - 1 ) ) - ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (16)

for all i=1,,N1𝑖1𝑁1i=1,\dots,N-1italic_i = 1 , … , italic_N - 1 and similarly for cov(x~(i),x~(i+1))cov~𝑥𝑖~𝑥𝑖1\mathrm{cov}\big{(}\tilde{x}(i),\tilde{x}(i+1)\big{)}roman_cov ( over~ start_ARG italic_x end_ARG ( italic_i ) , over~ start_ARG italic_x end_ARG ( italic_i + 1 ) ). The term iσr2subscript𝑖superscriptsubscript𝜎𝑟2-\nabla_{i}\sigma_{r}^{2}- ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT in (16) appears only in the covariance of two consecutive data points (i.e., x(i)𝑥𝑖x(i)italic_x ( italic_i ) and x(i+1)𝑥𝑖1x(i+1)italic_x ( italic_i + 1 )) and is caused by the recursive nature of (9) and the propagation of the input noise to the output in (6). For this reason, this term does not appear in the developments in [12], where dynamical systems are not the central focus. In our case, the covariance matrix of the measured data corresponds to the expression given in (\starII) above, where the term iσr2subscript𝑖superscriptsubscript𝜎𝑟2-\nabla_{i}\sigma_{r}^{2}- ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT appears only in the entries immediately above and below the main diagonal.

If one does not consider consecutive samples in the training data, the additional term in (16) vanishes. In the context of dynamical systems, this could happen if (i) one only uses every second data point (which would be data-inefficient since half of the data are lost) or (ii) one performs one-step experiments, such that a regression output does not become a regression input. Intuitively, this means that one considers some initial condition, measures the next state and then considers a different initial condition, which is not meaningful/possible for many applications. Note that our above theoretical analysis focuses on collecting one single trajectory. If one considers multiple trajectories, the entries in the covariance matrix describing the transition from one trajectory to another do not contain the additional term iσr2subscript𝑖subscriptsuperscript𝜎2𝑟-\nabla_{i}\sigma^{2}_{r}- ∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT.

The last step is to set up the posterior mean and the posterior variance, which is once again analogous to (2) and (3) with K(X¯,X¯)+σε¯2𝐾¯𝑋¯𝑋superscriptsubscript𝜎¯𝜀2K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}italic_K ( over¯ start_ARG italic_X end_ARG , over¯ start_ARG italic_X end_ARG ) + italic_σ start_POSTSUBSCRIPT over¯ start_ARG italic_ε end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT replaced by K~~𝐾\tilde{K}over~ start_ARG italic_K end_ARG from (\starII).

III-B Application to logistic growth example

We evaluate the effect of the additional off-diagonal terms for a logistic growth example222The code of the simulations is available here: https://doi.org/10.25835/xwkni4f6. We use a zero prior mean and a squared exponential kernel. We consider the following (Euler-discretized) system

x(k+1)=x(k)+Tqx(k)(1x(k)C)+w(k)𝑥𝑘1𝑥𝑘𝑇𝑞𝑥𝑘1𝑥𝑘𝐶𝑤𝑘\displaystyle x(k+1)=x(k)+Tqx(k)\bigg{(}1-\frac{x(k)}{C}\bigg{)}+w(k)italic_x ( italic_k + 1 ) = italic_x ( italic_k ) + italic_T italic_q italic_x ( italic_k ) ( 1 - divide start_ARG italic_x ( italic_k ) end_ARG start_ARG italic_C end_ARG ) + italic_w ( italic_k ) (17)

with T=1,q=0.1,C=100formulae-sequence𝑇1formulae-sequence𝑞0.1𝐶100T=1,q=0.1,C=100italic_T = 1 , italic_q = 0.1 , italic_C = 100, which corresponds to a logistic growth example [16]. Note that the relatively small value of T𝑇Titalic_T and the rather large value of C𝐶Citalic_C imply that we only have to deal with a small nonlinearity and almost constant gradients. We collect three trajectories of length 100. We consider normally distributed process noise with mean μw=0subscript𝜇𝑤0\mu_{w}=0italic_μ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = 0 and variance σw2=103superscriptsubscript𝜎𝑤2superscript103\sigma_{w}^{2}=10^{-3}italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT (and in a second run normally distributed process noise with μw=0subscript𝜇𝑤0\mu_{w}=0italic_μ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = 0 and variance σw2=101superscriptsubscript𝜎𝑤2superscript101\sigma_{w}^{2}=10^{-1}italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) as well as normally distributed measurement noise with mean μr=0subscript𝜇𝑟0\mu_{r}=0italic_μ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = 0 and various variances as illustrated in Figure 1. We use five iterations of slope/hyperparameter computations, compare [12]. To test the performance of the GPs, we consider N=500subscript𝑁500N_{\ast}=500italic_N start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT = 500 random samples from a uniform distribution 𝒰(0,100)𝒰0100\mathcal{U}(0,100)caligraphic_U ( 0 , 100 ) and compute the posterior mean. We compare our method to the one proposed by [12] and to a standard GP [13]. In all cases, we then compute the mean squared error (MSE) defined as

MSE1Nk=1N||f(x(i))m+(x(i)|𝐗~in,𝐗~out)||2.\displaystyle\text{MSE}\coloneqq\frac{1}{N_{\ast}}\sum_{k=1}^{N_{\ast}}||f(x_{% \ast}(i))-m_{+}(x_{\ast}(i)|\tilde{\mathbf{X}}^{\mathrm{in}},\tilde{\mathbf{X}% }^{\mathrm{out}})||^{2}.MSE ≔ divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | | italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( italic_i ) ) - italic_m start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( italic_i ) | over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG bold_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) | | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (18)
Refer to caption
Figure 1: Simulation results of example (17) considering two different process noise variances as indicated in the titles of the plots. We implement the here proposed extension (referred to as “CCS” standing for “covariance of consecutive samples”), a standard GP (called “ST”) and the approach proposed by [12] (called “NI” standing for “noisy inputs”, which is the abbreviation given by the authors in [12] to describe their framework). We report the MSE as defined in (18), respectively.

The simulation results are displayed in Figure 1. We observe that our proposed extension substantially outperforms the other approaches for both process noise variances, in particular for large measurement noise variances. This is due to the explicit consideration of the covariance between two consecutive samples, which is not considered in the framework from [12] and a standard GP [13].

When a larger process noise variance is considered (compare Fig. 1, right plot), our proposed approach still outperforms the other two, although the difference becomes slightly smaller. This is due to the fact that in this case, the diagonal terms in the covariance matrix become more dominant and the advantage of our approach (that considers the noise variance in the entries immediately above and below the main diagonal) becomes less prominent.

Moreover, we observe that the performance of the standard GP and the method proposed by [12] is similar (with a slight advantage for the standard GP). This is due to the considered system which is almost linear. In this case, the gradients of the posterior mean are almost constant and a standard GP can achieve a similar effect (than the extension of [12]) by simply increasing the noise variance. The slight advantage for the standard GP may be due to a minor overfitting in case of the approach proposed by [12], where we have one more hyperparameter to determine.

IV MULTIDIMENSIONAL SYSTEMS

IV-A Analysis of regression input noise

In this section, we now focus on multidimensional systems. This means that we consider some function 𝐟:nn:𝐟superscript𝑛superscript𝑛\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}bold_f : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with 𝐱n𝐱superscript𝑛\mathbf{x}\in\mathbb{R}^{n}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The most common approach to approximate these systems is to consider the individual components of the function 𝐟𝐟\mathbf{f}bold_f to be independent [1, 2, 4]. In this case, scalar GPs are used to approximate each component of the function 𝐟𝐟\mathbf{f}bold_f. Alternatively, one can use a linear model of coregionalization [17], where all components are learned jointly (and hence also correlations among the components can be learned).

Refer to caption
Figure 2: Simulation results for a batch reactor, a two-link planar robot, and a cart pole system. The figures show the performances of the here proposed framework (referred to as “CCS” standing for “covariance of consecutive samples”), the extension by [12] (called “NI” standing for “noisy inputs”, which is the abbreviation given by the authors in [12] to describe their framework), and a standard GP (called “ST”) for randomly sampled test data. We report the MSE as defined in (18).

However, the above works rely on the assumption that the regression input data are noise-free. As mentioned in the previous section, this is rarely the case in the context of dynamical systems, since the measurements of the states are corrupted by some noise. In the following, we again consider the input noise by applying the approach proposed in [12] (compare (6)) to dynamical systems. As shown in the following derivation, analogous to Section III, we obtain additional terms in the covariance between two consecutive observations. Moreover, in addition to the scalar case, we also obtain covariance terms between the regression outputs corresponding to the different components of f𝑓fitalic_f. In particular, since

x~j(i)subscript~𝑥𝑗𝑖absent\displaystyle\tilde{x}_{j}(i)\approxover~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) ≈ fj(𝐱~(i1))m+,j(𝐱|X~in,X~out)𝐱|𝐱=𝐱~(i1)×\displaystyle f_{j}(\tilde{\mathbf{x}}(i-1))-\frac{\partial m_{+,j}(\mathbf{x}% |\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big{|}% _{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}\timesitalic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) - divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , italic_j end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i - 1 ) end_POSTSUBSCRIPT ×
r(i1)+wj(i1)+rj(i)r𝑖1subscript𝑤𝑗𝑖1subscript𝑟𝑗𝑖\displaystyle\textbf{r}(i-1)+w_{j}(i-1)+r_{j}(i)r ( italic_i - 1 ) + italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i - 1 ) + italic_r start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) (19)

for all j=1,,n𝑗1𝑛j=1,\dots,nitalic_j = 1 , … , italic_n, we obtain

cov(x~j(i),x~(i))covsubscript~𝑥𝑗𝑖subscript~𝑥𝑖\displaystyle\mathrm{cov}\big{(}\tilde{x}_{j}(i),\tilde{x}_{\ell}(i)\big{)}roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_i ) )
𝔼{(fj(𝐱~(i1))𝔼{fj(𝐱~(i1))})×\displaystyle\approx\mathbb{E}\Bigg{\{}\bigg{(}f_{j}(\tilde{\mathbf{x}}(i-1))-% \mathbb{E}\Big{\{}f_{j}(\tilde{\mathbf{x}}(i-1))\Big{\}}\bigg{)}\times≈ blackboard_E { ( italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) - blackboard_E { italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) } ) ×
(f(𝐱~(i1))𝔼{f(𝐱~(i1))})}\displaystyle\qquad\bigg{(}f_{\ell}(\tilde{\mathbf{x}}(i-1))-\mathbb{E}\Big{\{% }f_{\ell}(\tilde{\mathbf{x}}(i-1))\Big{\}}\bigg{)}^{\top}\Bigg{\}}( italic_f start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) - blackboard_E { italic_f start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) } ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT }
+𝔼{m+,j(𝐱|X~in,X~out)𝐱|𝐱=𝐱~(i1)r(i1)×\displaystyle+\mathbb{E}\Bigg{\{}\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{% \mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big{|}_{\mathbf{x}% =\tilde{\mathbf{x}}(i-1)}\textbf{r}(i-1)\times+ blackboard_E { divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , italic_j end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i - 1 ) end_POSTSUBSCRIPT r ( italic_i - 1 ) ×
𝐫(i1)m+,(𝐱|X~in,X~out)𝐱|𝐱=𝐱~(i1)}\displaystyle\qquad\mathbf{r}(i-1)^{\top}\frac{\partial m_{+,{\ell}}(\mathbf{x% }|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big{|% }_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}^{\top}\Bigg{\}}bold_r ( italic_i - 1 ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , roman_ℓ end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i - 1 ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT }
+(σr2+σw2)δjsubscriptsuperscript𝜎2𝑟subscriptsuperscript𝜎2𝑤subscript𝛿𝑗\displaystyle+(\sigma^{2}_{r}+\sigma^{2}_{w})\delta_{j\ell}+ ( italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) italic_δ start_POSTSUBSCRIPT italic_j roman_ℓ end_POSTSUBSCRIPT (20)

with δ,jsubscript𝛿𝑗\delta_{\ell,j}italic_δ start_POSTSUBSCRIPT roman_ℓ , italic_j end_POSTSUBSCRIPT denoting the Kronecker delta. To simplify the analysis, we assume that the different GPs modeling the different components are mutually independent (as commonly done in the context of GP based control/estimation [1, 2, 4]). Consequently, it holds that

𝔼{fj(𝐱~)f(𝐱~)}=𝔼{fj(𝐱~)}𝔼{f(𝐱~)}𝔼subscript𝑓𝑗~𝐱subscript𝑓~𝐱𝔼subscript𝑓𝑗~𝐱𝔼subscript𝑓~𝐱\displaystyle\mathbb{E}\{f_{j}(\tilde{\mathbf{x}})f_{\ell}(\tilde{\mathbf{x}})% \}=\mathbb{E}\{f_{j}(\tilde{\mathbf{x}})\}\mathbb{E}\{f_{\ell}(\tilde{\mathbf{% x}})\}blackboard_E { italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ) italic_f start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ) } = blackboard_E { italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ) } blackboard_E { italic_f start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( over~ start_ARG bold_x end_ARG ) } (21)

and therefore

cov(x~j(i),x~(i))(k(𝐱~(i1),𝐱~(i1))+σr2+σw2)δjcovsubscript~𝑥𝑗𝑖subscript~𝑥𝑖𝑘~𝐱𝑖1~𝐱𝑖1subscriptsuperscript𝜎2𝑟subscriptsuperscript𝜎2𝑤subscript𝛿𝑗\displaystyle\mathrm{cov}\big{(}\tilde{x}_{j}(i),\tilde{x}_{\ell}(i)\big{)}% \approx\big{(}k(\tilde{\mathbf{x}}(i-1),\tilde{\mathbf{x}}(i-1))+\sigma^{2}_{r% }+\sigma^{2}_{w}\big{)}\delta_{j\ell}roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_i ) ) ≈ ( italic_k ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) , over~ start_ARG bold_x end_ARG ( italic_i - 1 ) ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) italic_δ start_POSTSUBSCRIPT italic_j roman_ℓ end_POSTSUBSCRIPT
+𝔼{m+,j(𝐱|X~in,X~out)𝐱|𝐱=𝐱~(i1)Σr×\displaystyle+\mathbb{E}\Bigg{\{}\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{% \mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big{|}_{\mathbf{x}% =\tilde{\mathbf{x}}(i-1)}\Sigma_{r}\times+ blackboard_E { divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , italic_j end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i - 1 ) end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ×
m+,(𝐱|X~in,X~out)𝐱|𝐱=𝐱~(i1)}\displaystyle\hskip 56.9055pt\frac{\partial m_{+,{\ell}}(\mathbf{x}|\tilde{X}^% {\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big{|}_{\mathbf{x% }=\tilde{\mathbf{x}}(i-1)}^{\top}\Bigg{\}}divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , roman_ℓ end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ bold_x end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i - 1 ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT } (22)

for all i=1,,N𝑖1𝑁i=1,\dots,Nitalic_i = 1 , … , italic_N and j,=1,,nformulae-sequence𝑗1𝑛j,\ell=1,\dots,nitalic_j , roman_ℓ = 1 , … , italic_n. Hence, although assuming independence among the different GPs (modeling the different components), the observations covary due to the input noise. Moreover, as in the previous section (compare (16)), we need to consider the covariance within the same component, but for subsequent time instants

Kmd=subscript𝐾mdabsent\displaystyle K_{\mathrm{md}}=italic_K start_POSTSUBSCRIPT roman_md end_POSTSUBSCRIPT = (Kx+σr2IN+σw2IN)In+(0Σr01σr20n×n0n×n1σr21Σr12σr20n×n0n×n2σr22Σr20n×n0n×n0n×n0n×nN1ΣrN1)tensor-productsubscript𝐾𝑥superscriptsubscript𝜎𝑟2subscript𝐼𝑁superscriptsubscript𝜎𝑤2subscript𝐼𝑁subscript𝐼𝑛matrixsubscript0subscriptΣ𝑟superscriptsubscript0topsuperscriptsubscript1topsuperscriptsubscript𝜎𝑟2subscript0𝑛𝑛subscript0𝑛𝑛subscript1superscriptsubscript𝜎𝑟2subscript1subscriptΣ𝑟superscriptsubscript1topsuperscriptsubscript2topsuperscriptsubscript𝜎𝑟2subscript0𝑛𝑛subscript0𝑛𝑛subscript2superscriptsubscript𝜎𝑟2subscript2subscriptΣ𝑟superscriptsubscript2topsubscript0𝑛𝑛subscript0𝑛𝑛subscript0𝑛𝑛subscript0𝑛𝑛subscript𝑁1subscriptΣ𝑟superscriptsubscript𝑁1top\displaystyle(K_{x}+\sigma_{r}^{2}I_{N}+\sigma_{w}^{2}I_{N})\otimes I_{n}+% \begin{pmatrix}\nabla_{0}\Sigma_{r}\nabla_{0}^{\top}&-\nabla_{1}^{\top}\sigma_% {r}^{2}&0_{n\times n}&\dots&0_{n\times n}\\ -\nabla_{1}\sigma_{r}^{2}&\nabla_{1}\Sigma_{r}\nabla_{1}^{\top}&-\nabla_{2}^{% \top}\sigma_{r}^{2}&\dots&0_{n\times n}\\ 0_{n\times n}&-\nabla_{2}\sigma_{r}^{2}&\nabla_{2}\Sigma_{r}\nabla_{2}^{\top}&% \dots&0_{n\times n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0_{n\times n}&0_{n\times n}&0_{n\times n}&\dots&\nabla_{N-1}\Sigma_{r}\nabla_{% N-1}^{\top}\\ \end{pmatrix}( italic_K start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + ( start_ARG start_ROW start_CELL ∇ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL - ∇ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL - ∇ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL ∇ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL - ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL … end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL start_CELL - ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL … end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_n × italic_n end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL start_CELL ∇ start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) (\star\star⋆ ⋆)

 

cov(x~j(i),\displaystyle\mathrm{cov}\big{(}\tilde{x}_{j}(i),roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) , x~j(i+1))k(𝐱~(i1),𝐱~(i))\displaystyle\tilde{x}_{j}(i+1)\big{)}\approx k(\tilde{\mathbf{x}}(i-1),\tilde% {\mathbf{x}}(i))over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i + 1 ) ) ≈ italic_k ( over~ start_ARG bold_x end_ARG ( italic_i - 1 ) , over~ start_ARG bold_x end_ARG ( italic_i ) )
m+,j(𝐱|X~in,X~out)xj|𝐱=𝐱~(i)σr2evaluated-atsubscript𝑚𝑗conditional𝐱superscript~𝑋insuperscript~𝑋outsubscript𝑥𝑗𝐱~𝐱𝑖superscriptsubscript𝜎𝑟2\displaystyle-\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde% {X}^{\mathrm{out}})}{\partial x_{j}}\Big{|}_{\mathbf{x}=\tilde{\mathbf{x}}(i)}% \sigma_{r}^{2}- divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , italic_j end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (23)

for all j=1,,n𝑗1𝑛j=1,\dots,nitalic_j = 1 , … , italic_n and i=1,,N1𝑖1𝑁1i=1,\dots,N-1italic_i = 1 , … , italic_N - 1 and similarly for cov(x~j(i+1),x~j(i))covsubscript~𝑥𝑗𝑖1subscript~𝑥𝑗𝑖\mathrm{cov}\big{(}\tilde{x}_{j}(i+1),\tilde{x}_{j}(i)\big{)}roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i + 1 ) , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) ). Finally, we need to consider the covariance between observations of two different (not necessarily adjacent) components and subsequent time instants as, e.g.,

cov(x~j(i),x~(i+1))m+,(𝐱|X~in,X~out)xj|𝐱=𝐱~(i)σr2.covsubscript~𝑥𝑗𝑖subscript~𝑥𝑖1evaluated-atsubscript𝑚conditional𝐱superscript~𝑋insuperscript~𝑋outsubscript𝑥𝑗𝐱~𝐱𝑖subscriptsuperscript𝜎2𝑟\displaystyle\mathrm{cov}\big{(}\tilde{x}_{j}(i),\tilde{x}_{\ell}(i+1)\big{)}% \approx-\frac{\partial m_{+,{\ell}}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{% X}^{\mathrm{out}})}{\partial x_{j}}\Big{|}_{\mathbf{x}=\tilde{\mathbf{x}}(i)}% \sigma^{2}_{r}.roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i ) , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_i + 1 ) ) ≈ - divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , roman_ℓ end_POSTSUBSCRIPT ( bold_x | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT . (24)

for all j,=1,,nformulae-sequence𝑗1𝑛j,\ell=1,\dots,nitalic_j , roman_ℓ = 1 , … , italic_n (but j𝑗j\neq\ellitalic_j ≠ roman_ℓ) and i=1,,N1𝑖1𝑁1i=1,\dots,N-1italic_i = 1 , … , italic_N - 1 and similarly for cov(x~j(i+1),x~(i))covsubscript~𝑥𝑗𝑖1subscript~𝑥𝑖\mathrm{cov}(\tilde{x}_{j}(i+1),\tilde{x}_{\ell}(i))roman_cov ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_i + 1 ) , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_i ) ).

To set up a GP for this case, we cannot proceed in the standard way by simply learning individual GPs. This is due to the correlations among the different components, compare (22), which cannot be considered by learning the components individually. Instead, we here learn all the components of the GP jointly. To this end, we set up the vector of observations

Xout=superscript𝑋outabsent\displaystyle X^{\mathrm{out}}=italic_X start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT =
(x~1(1)x~n(1)x~1(2)x~n(N)).superscriptmatrixsubscript~𝑥11subscript~𝑥𝑛1subscript~𝑥12subscript~𝑥𝑛𝑁top\displaystyle\begin{pmatrix}\tilde{x}_{1}(1)&\dots&\tilde{x}_{n}(1)&\tilde{x}_% {1}(2)&\dots&\tilde{x}_{n}(N)\end{pmatrix}^{\top}.( start_ARG start_ROW start_CELL over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 1 ) end_CELL start_CELL … end_CELL start_CELL over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( 1 ) end_CELL start_CELL over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 2 ) end_CELL start_CELL … end_CELL start_CELL over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_N ) end_CELL end_ROW end_ARG ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT . (25)

The covariance matrix corresponds to the expression given in (\star\star⋆ ⋆IV-A) below with Kx=(k(𝐱~(i),𝐱~(j)))𝐱~(i),𝐱~(j)X~insubscript𝐾𝑥subscript𝑘~𝐱𝑖~𝐱𝑗~𝐱𝑖~𝐱𝑗superscript~𝑋inK_{x}=(k(\tilde{\mathbf{x}}(i),\tilde{\mathbf{x}}(j)))_{\tilde{\mathbf{x}}(i),% \tilde{\mathbf{x}}(j)\in\tilde{X}^{\mathrm{in}}}italic_K start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = ( italic_k ( over~ start_ARG bold_x end_ARG ( italic_i ) , over~ start_ARG bold_x end_ARG ( italic_j ) ) ) start_POSTSUBSCRIPT over~ start_ARG bold_x end_ARG ( italic_i ) , over~ start_ARG bold_x end_ARG ( italic_j ) ∈ over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT end_POSTSUBSCRIPT and isubscript𝑖\mathbf{\nabla}_{i}∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT defined as

i(m+,1x1|𝐱=𝐱~(i)m+,1x2|𝐱=𝐱~(i)m+,2x1|𝐱=𝐱~(i)m+,2x2|𝐱=𝐱~(i)).subscript𝑖matrixevaluated-atsubscript𝑚1subscript𝑥1𝐱~𝐱𝑖evaluated-atsubscript𝑚1subscript𝑥2𝐱~𝐱𝑖evaluated-atsubscript𝑚2subscript𝑥1𝐱~𝐱𝑖evaluated-atsubscript𝑚2subscript𝑥2𝐱~𝐱𝑖\displaystyle\mathbf{\nabla}_{i}\coloneqq\begin{pmatrix}\frac{\partial m_{+,1}% }{\partial x_{1}}|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\frac{\partial m_{+,1}}{% \partial x_{2}}|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\dots\\ \frac{\partial m_{+,2}}{\partial x_{1}}|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&% \frac{\partial m_{+,2}}{\partial x_{2}}|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&% \dots\\ \vdots&\vdots&\ddots\\ \end{pmatrix}.∇ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≔ ( start_ARG start_ROW start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , 1 end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , 1 end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL end_ROW start_ROW start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , 2 end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT end_CELL start_CELL divide start_ARG ∂ italic_m start_POSTSUBSCRIPT + , 2 end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG | start_POSTSUBSCRIPT bold_x = over~ start_ARG bold_x end_ARG ( italic_i ) end_POSTSUBSCRIPT end_CELL start_CELL … end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL end_ROW end_ARG ) .

The predictive mean and variance are given by

𝐦+(𝐱|X~in,X~out)=(𝐤(𝐱,Xin)In)Kmd1Xoutsubscript𝐦conditionalsubscript𝐱superscript~𝑋insuperscript~𝑋outtensor-product𝐤subscript𝐱superscript𝑋insubscript𝐼𝑛superscriptsubscript𝐾md1superscript𝑋out\displaystyle\mathbf{m}_{+}(\mathbf{x}_{\ast}|\tilde{X}^{\mathrm{in}},\tilde{X% }^{\mathrm{out}})=(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{n})% K_{\mathrm{md}}^{-1}X^{\mathrm{out}}bold_m start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) = ( bold_k ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_X start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) italic_K start_POSTSUBSCRIPT roman_md end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT (26)
Σ+(𝐱|X~in,X~out)=k(𝐱,𝐱)InsubscriptΣconditionalsubscript𝐱superscript~𝑋insuperscript~𝑋outtensor-product𝑘subscript𝐱subscript𝐱subscript𝐼𝑛\displaystyle\Sigma_{+}(\mathbf{x}_{\ast}|\tilde{X}^{\mathrm{in}},\tilde{X}^{% \mathrm{out}})=k(\mathbf{x}_{\ast},\mathbf{x}_{\ast})\otimes I_{n}roman_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT , over~ start_ARG italic_X end_ARG start_POSTSUPERSCRIPT roman_out end_POSTSUPERSCRIPT ) = italic_k ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT
(𝐤(𝐱,Xin)In)Kmd1(𝐤(𝐱,Xin)In).tensor-product𝐤subscript𝐱superscript𝑋insubscript𝐼𝑛superscriptsubscript𝐾md1superscripttensor-product𝐤subscript𝐱superscript𝑋insubscript𝐼𝑛top\displaystyle\quad-(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{n}% )K_{\mathrm{md}}^{-1}(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{% n})^{\top}.- ( bold_k ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_X start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) italic_K start_POSTSUBSCRIPT roman_md end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_k ( bold_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_X start_POSTSUPERSCRIPT roman_in end_POSTSUPERSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT . (27)

The above derivation focuses once again on one single trajectory as offline data. If multiple trajectories have been collected, no covariance is needed at the transition between the different trajectories, as in the previous section.

Remark 1

In this paper, we assume independence among the different GPs modeling the different components of the unknown function 𝐟𝐟\mathbf{f}bold_f. One interesting subject for future work is to omit this assumption. In this case, one could combine the here proposed approach with an intrinsic coregionalization method or a linear model of coregionalization [17].

Remark 2

Within this paper, we only focused on the state transition function 𝐟𝐟\mathbf{f}bold_f. The approximation of an output map 𝐡𝐡\mathbf{h}bold_h is more straightforward since the noise affecting the regression inputs does not get propagated through the GP. In this case, we can apply the common approach to learn each component of the function 𝐡𝐡\mathbf{h}bold_h individually, since there are no covariances among the components and use the approach suggested in [12].

IV-B Application to batch reactor, two-link planar robot, and cart-pole system

We evaluate our approach in several numerical examples. For all numerical examples, we use a zero prior mean and a squared exponential kernel. Once again, for space reasons, we only explain the simulation and evaluation setting in detail for the first example. We consider the following dynamics

x1(t+1)subscript𝑥1𝑡1\displaystyle x_{1}(t+1)italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_t + 1 ) =x1(t)+T(2c1x12(t)+2c2x2(t))+w1(t)absentsubscript𝑥1𝑡𝑇2subscript𝑐1superscriptsubscript𝑥12𝑡2subscript𝑐2subscript𝑥2𝑡subscript𝑤1𝑡\displaystyle=x_{1}(t)+T(-2c_{1}x_{1}^{2}(t)+2c_{2}x_{2}(t))+w_{1}(t)= italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_t ) + italic_T ( - 2 italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t ) + 2 italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_t ) ) + italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_t ) (28a)
x2(t+1)subscript𝑥2𝑡1\displaystyle x_{2}(t+1)italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_t + 1 ) =x2(t)+T(c1x12(t)c2x2(t))+w2(t),absentsubscript𝑥2𝑡𝑇subscript𝑐1superscriptsubscript𝑥12𝑡subscript𝑐2subscript𝑥2𝑡subscript𝑤2𝑡\displaystyle=x_{2}(t)+T(c_{1}x_{1}^{2}(t)-c_{2}x_{2}(t))+w_{2}(t),= italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_t ) + italic_T ( italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_t ) - italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_t ) ) + italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_t ) , (28b)

which corresponds to a discretized batch reactor [18]. We consider T=0.1𝑇0.1T=0.1italic_T = 0.1, c1=0.16subscript𝑐10.16c_{1}=0.16italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.16, c2=0.0064subscript𝑐20.0064c_{2}=0.0064italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.0064, normally distributed process noise with mean μw=0subscript𝜇𝑤0\mu_{w}=0italic_μ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = 0 and variance σw2=106Insubscriptsuperscript𝜎2𝑤superscript106subscript𝐼𝑛\sigma^{2}_{w}=10^{-6}I_{n}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, and normally distributed measurement noise with mean μr=0subscript𝜇𝑟0\mu_{r}=0italic_μ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = 0 and different variances as shown in Figure 2. We collect three trajectories containing 50 samples.

Next, we test our approach for two four-dimensional systems with highly complex nonlinear dynamics. We consider a two-link planar robot with the dynamics and numerical parameter values as given in [19] and a cart-pole system with the numerical parameters values from [20]. The considered measurement noise variances are illustrated in Figure 2 (middle and right plot). In both cases, we collect three trajectories containing 50 samples.

In all examples, we use five iterations of slope/hyperparameter computations, see [12]. Furthermore, we implement a standard GP as introduced at the beginning of this section and the method proposed by [12] (by assuming that the different components are independent). We evaluate the performance for N=500subscript𝑁500N_{\star}=500italic_N start_POSTSUBSCRIPT ⋆ end_POSTSUBSCRIPT = 500 random test data points sampled from a uniform distribution over some operating region of interest. More details can be found in the code of the simulations, which is provided under the link in footnote 2.

From Figure 2, one can see that the method proposed in this paper again outperforms the alternatives in terms of the MSE in all tested setting. Overall, the difference is more pronounced for larger noise levels. Furthermore, we can observe that the extension by [12] performs slightly better compared to the scalar case presented in the previous section. A reason for this observation may be that the extension proposed by [12] allows to learn the regression input noise variance using all outputs, which is not possible for a standard GP.

V CONCLUSION

In this work, we analyzed the impact of regression input noise in case of dynamical systems modeled by Gaussian processes and introduced approaches to account for this noise in case of scalar and multidimensional systems. In several numerical examples, we showed that the consideration of the proposed extension substantially improves the performance compared to the state-of-the-art approaches.

Several topics are left for future research. One could refine the framework by using second order approximations (as also suggested by [12]), which is likely to improve the performance further, although inducing a larger computational complexity. We expect that the method proposed in this paper will be beneficial for designing GP-based controllers and state estimators for nonlinear dynamical systems with improved performance.

References

  • [1] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2019.
  • [2] T. Beckers, D. Kulić, and S. Hirche, “Stable Gaussian process based tracking control of Euler–Lagrange systems,” Automatica, vol. 103, pp. 390–397, 2019.
  • [3] M. Maiworm, D. Limon, and R. Findeisen, “Online learning-based model predictive control with Gaussian process models and stability guarantees,” International Journal of Robust and Nonlinear Control, vol. 31, no. 18, pp. 8785–8812, 2021.
  • [4] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
  • [5] M. Buisson-Fenet, V. Morgenthaler, S. Trimpe, and F. Di Meglio, “Joint state and dynamics estimation with high-gain observers and Gaussian process models,” in 2021 American Control Conference (ACC).   IEEE, 2021, pp. 4027–4032.
  • [6] T. M. Wolff, V. G. Lopez, and M. A. Müller, “Gaussian process-based nonlinear moving horizon estimation,” arXiv preprint arXiv:2402.04665, 2024.
  • [7] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 393–400.
  • [8] P. Goldberg, C. Williams, and C. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” Advances in neural information processing systems, vol. 10, 1997.
  • [9] A. McHutchon and C. Rasmussen, “Gaussian process training with input noise,” Advances in neural information processing systems, vol. 24, 2011.
  • [10] M. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 844–851.
  • [11] A. Doerr, C. Daniel, M. Schiegg, N.-T. Duy, S. Schaal, M. Toussaint, and S. Trimpe, “Probabilistic recurrent state-space models,” in International conference on machine learning.   PMLR, 2018, pp. 1280–1289.
  • [12] A. J. McHutchon, “Nonlinear modelling and control using Gaussian processes,” Ph.D. dissertation, Citeseer, 2015.
  • [13] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning.   Springer, 2006, vol. 1.
  • [14] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes.   KIT Scientific Publishing, 2010, vol. 9.
  • [15] E. Solak, R. Murray-Smith, W. Leithead, D. Leith, and C. Rasmussen, “Derivative observations in Gaussian process models of dynamic systems,” in Advances in Neural Information Processing Systems, vol. 15.   MIT Press, 2002.
  • [16] A. Tsoularis and J. Wallace, “Analysis of logistic growth models,” Mathematical Biosciences, vol. 179, no. 1, pp. 21–55, 2002.
  • [17] M. A. Alvarez, L. Rosasco, N. D. Lawrence et al., “Kernels for vector-valued functions: A review,” Foundations and Trends® in Machine Learning, vol. 4, no. 3, pp. 195–266, 2012.
  • [18] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model predictive control: theory, computation, and design.   Nob Hill Publishing Madison, WI, 2017, vol. 2.
  • [19] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, “Actively learning Gaussian process dynamics,” in Learning for dynamics and control.   PMLR, 2020, pp. 5–15.
  • [20] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE transactions on systems, man, and cybernetics, no. 5, pp. 834–846, 1983.