Gaussian Processes with Noisy Regression Inputs for Dynamical Systems

Tobias M. Wolff, Victor G. Lopez, and Matthias A. Müller

Tobias M. Wolff, Victor G. Lopez, and Matthias A. Müller are with the Leibniz University Hannover, Institute of Automatic Control, 30167 Hannover, Germany, {wolff,lopez,mueller}@irt.uni-hannover.de. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 948679).

Abstract

This paper is centered around the approximation of dynamical systems by means of Gaussian processes. To this end, trajectories of such systems must be collected to be used as training data. The measurements of these trajectories are typically noisy, which implies that both the regression inputs and outputs are corrupted by noise. However, most of the literature considers only noise in the regression outputs. In this paper, we show how to account for the noise in the regression inputs in an extended Gaussian process framework to approximate scalar and multidimensional systems. We demonstrate the potential of our framework by comparing it to different state-of-the-art methods in several simulation examples.

I INTRODUCTION

The application of Gaussian process (GP) regression in the context of dynamical systems has received substantial interest in recent years. It has been used in a variety of applications such as control [1, 2, 3] and state estimation [4, 5, 6]. The most common setup for GP regression makes two major assumptions on the measured data. First, it is assumed that the available regression input data are noise-free. Second, the measured regression output data are assumed to be corrupted by independent and identically distributed (iid) Gaussian noise.

One frequently applied approach to approximate dynamical systems by GPs is to model each component of the transition function $f$ by the posterior means of independently learned GPs [1, 2, 4, 6]. To approximate these functions, it is assumed that the states (together with the control inputs) can be measured. Subsequently, the control input and state trajectory are used as regression input data, and the (by one time instant shifted) state trajectory is used as regression output data.

In most practical applications, the state measurements are corrupted by noise. On the one hand, this implies that the regression outputs are corrupted by noise, which is in accordance with the standard GP setting. On the other hand, this entails that the regression inputs are also corrupted by noise, which is not covered by the standard GP setting.

To cope with regression input noise in GP regression, one can use heteroscedastic GPs [7, 8] (where a second GP is used to model the noise variance and rather large amounts of data are needed [9]) or variational methods [10, 11]. An alternative that is simple but very effective has been proposed in [12, 9]. The key idea is to propagate the input noise to the output by using first-order Taylor approximations of the posterior means (see Section II below for the details). In [12], the authors show that this approach can outperform variational methods, heteroscedastic GPs, and standard GPs.

Our work can be considered as an extension of the framework suggested in [12] to dynamical systems. Here, one major difference is that one cannot arbitrarily sample training data points to set up a GP. Instead, one typically can only collect trajectories. We show that these trajectories induce correlations that must be taken into account when setting up a GP to correctly represent dynamical systems. Alongside these theoretical derivations, we illustrate the performance of our proposed extension by means of several simulation examples and compare it to the cases where a dynamical system is directly approximated using the method proposed in [12] and a standard GP [13].

This paper is organized as follows. In Section II, we explain some preliminaries and introduce the problem setting. In Sections III and IV, we introduce our framework for scalar and multidimensional systems, respectively. We close this paper with a conclusion in Section V.

II PRELIMINARIES AND PROBLEM SETTING

The set of real numbers is denoted by $\mathbb{R}$ . The identity matrix of dimension $N$ is denoted by $I_{N}$ . A diagonal matrix with $q_{1},\dots,q_{n}$ on its diagonal entries is denoted by $\mathrm{diag}(q_{1},\dots,q_{n})$ . We denote the Kronecker product by $\otimes$ . We denote scalars by small letters, vectors by small bold letters and matrices by capital letters. A vector of zeros of length $n$ is denoted by $\mathbf{0}_{n}$ . A square matrix of zeros of dimension $n$ is denoted by $0_{n\times n}$ .

We briefly review the fundamentals of standard Gaussian processes; a more detailed introduction to GPs can be found in [13]. GPs are commonly applied to approximate some nonlinear function $\bar{f}:\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}$ . They are fully described by a mean function $m:\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}$ and a covariance function (also referred to as kernel) $k:\mathbb{R}^{\bar{n}}\times\mathbb{R}^{\bar{n}}\rightarrow\mathbb{R}$ . For some $\bar{\mathbf{x}},\bar{\mathbf{x}}^{\prime}\in\mathbb{R}^{\bar{n}}$ , we write

$$\bar{f}(\bar{\mathbf{x}})\sim\mathcal{GP}\big(m(\bar{\mathbf{x}}),k(\bar{\mathbf{x}},\bar{\mathbf{x}}^{\prime})\big) \tag{1}$$

to denote that the function $\bar{f}$ follows a GP with mean function $m$ and covariance function $k$ . We collect $N$ regression input and output data points from the unknown function and use them to define $\bar{X}=\begin{pmatrix}\bar{\mathbf{x}}(0)&\dots&\bar{\mathbf{x}}(N-1)\end{pmatrix}$ and $\bar{\textbf{Y}}=\begin{pmatrix}\bar{y}(0)&\dots&\bar{y}(N-1)\end{pmatrix}^{\top}$ , respectively. The regression outputs are given by $\bar{y}=\bar{f}(\bar{\mathbf{x}})+\bar{\varepsilon}$ with $\bar{\varepsilon}$ being iid Gaussian noise with zero mean and variance $\sigma_{\bar{\varepsilon}}^{2}$ . The key idea of Gaussian processes is to condition the prior distribution on the training data, which results in a posterior distribution. For some test input $\bar{\mathbf{x}}_{\ast}$ , the mean and variance of the posterior distribution are given by [13, Ch. 2]

$$\bar{m}_{+}(\bar{\mathbf{x}}_{\ast}|\bar{X},\bar{\mathbf{Y}})=\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})\big(K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}I_{N}\big)^{-1}\bar{\mathbf{Y}} \tag{2}$$

$$\bar{\sigma}_{+}^{2}(\bar{\mathbf{x}}_{\ast}|\bar{X},\bar{\mathbf{Y}})=\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{\mathbf{x}}_{\ast})-\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})\big(K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}I_{N}\big)^{-1}\mathbf{k}(\bar{X},\bar{\mathbf{x}}_{\ast}), \tag{3}$$

for $\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})=\begin{pmatrix}k(\bar{\mathbf{x}}_{\ast},\bar{\mathbf{x}}_{i})\end{pmatrix}_{\bar{\mathbf{x}}_{i}\in\bar{X}}=\mathbf{k}(\bar{X},\bar{\mathbf{x}}_{\ast})^{\top}$, with $\mathbf{k}(\bar{\mathbf{x}}_{\ast},\bar{X})\in\mathbb{R}^{1\times N}$, and $K(\bar{X},\bar{X})=(k(\bar{\mathbf{x}}_{i},\bar{\mathbf{x}}_{j}))_{\bar{\mathbf{x}}_{i},\bar{\mathbf{x}}_{j}\in\bar{X}}$ with $K(\bar{X},\bar{X})\in\mathbb{R}^{N\times N}$. The kernel depends on hyperparameters (e.g., the signal variance and the length scales in the case of the squared exponential kernel) that are commonly determined by maximizing the log marginal likelihood, see, e.g., [13, Eq. (2.30)].
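As an illustration of (2) and (3), the following minimal sketch evaluates the posterior mean and variance with NumPy. It is not taken from the authors' code: the function names `se_kernel` and `gp_posterior`, the fixed hyperparameters, the sine test function, and the row-wise storage of the data points (the text stores them as columns of $\bar{X}$) are assumptions made for this example, and no hyperparameter optimization is performed.

```python
import numpy as np

def se_kernel(A, B, signal_var=1.0, length_scale=1.0):
    """Squared exponential kernel between the rows of A (N x n) and B (M x n)."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(X, Y, x_star, noise_var, **kernel_args):
    """Posterior mean (2) and variance (3) at the test inputs x_star (M x n)."""
    K = se_kernel(X, X, **kernel_args) + noise_var * np.eye(X.shape[0])
    k_star = se_kernel(x_star, X, **kernel_args)            # M x N
    L = np.linalg.cholesky(K)                                # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))      # K^{-1} Y
    v = np.linalg.solve(L, k_star.T)                         # N x M
    mean = k_star @ alpha
    var = np.diag(se_kernel(x_star, x_star, **kernel_args)) - np.sum(v**2, axis=0)
    return mean, var

# Hypothetical example: noisy samples of a sine function, noise-free inputs.
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 30)[:, None]
Y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(30)
x_star = np.array([[0.5], [1.5]])
mean, var = gp_posterior(X, Y, x_star, noise_var=0.05**2)
```

The Cholesky factorization is used instead of an explicit matrix inverse only for numerical robustness; the result is the same expression as in (2) and (3).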

These standard results in Gaussian processes rely on the assumption that the regression input data are noise-free. In turn, if the regression input data points are affected by some noise such that only

$$\check{\bar{\mathbf{x}}}\coloneqq\bar{\mathbf{x}}+\bar{\mathbf{r}} \tag{4}$$

is available with $\bar{\mathbf{r}}$ being some iid Gaussian noise with variance $\Sigma_{\bar{r}}=\mathrm{diag}(\sigma^{2}_{\bar{r}},\dots,\sigma^{2}_{\bar{r}})$ , we cannot use standard GP tools anymore, since the problem of exact GP regression based on noisy regression inputs is intractable [14, Sec. 2.3.2]. We here briefly review the work of [12, Ch. 2] (which is more detailed than the original work [9]) to handle this issue. First, a Taylor series expansion around the noisy regression input is done (and truncated after the first-order term), which results in

$$\bar{f}(\bar{\mathbf{x}})=\bar{f}(\check{\bar{\mathbf{x}}}-\bar{\mathbf{r}})\approx\bar{f}(\check{\bar{\mathbf{x}}})-\frac{\partial\bar{f}(\mathbf{x})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\check{\bar{\mathbf{x}}}}\bar{\mathbf{r}}. \tag{5}$$

The second term depends on the derivative of a GP, which is again a GP [15]. Although one can compute the first and second moment of this expression, it is much simpler to perform another approximation by replacing the derivative of the GP by the derivative of its posterior mean [12]. In this case, we consider the following model

$$\bar{y}\approx\bar{f}(\check{\bar{\mathbf{x}}})-\frac{\partial\bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{\mathbf{Y}})}{\partial\bar{\mathbf{x}}}\Big|_{\bar{\mathbf{x}}=\check{\bar{\mathbf{x}}}}\bar{\mathbf{r}}+\bar{\varepsilon}. \tag{6}$$

This model results in the following covariance matrix

$$\check{K}=\begin{pmatrix}k(\check{\bar{\mathbf{x}}}(0),\check{\bar{\mathbf{x}}}(0))&\dots&k(\check{\bar{\mathbf{x}}}(0),\check{\bar{\mathbf{x}}}(N-1))\\ \vdots&\ddots&\vdots\\ k(\check{\bar{\mathbf{x}}}(N-1),\check{\bar{\mathbf{x}}}(0))&\dots&k(\check{\bar{\mathbf{x}}}(N-1),\check{\bar{\mathbf{x}}}(N-1))\end{pmatrix}+\mathrm{diag}\big(\bar{\sigma}^{2}_{\mathrm{out}}(0),\dots,\bar{\sigma}^{2}_{\mathrm{out}}(N-1)\big) \tag{7}$$

with

$$\bar{\sigma}^{2}_{\mathrm{out}}(i)\coloneqq\frac{\partial\bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{\mathbf{Y}})}{\partial\bar{\mathbf{x}}}\Big|_{\bar{\mathbf{x}}=\check{\bar{\mathbf{x}}}(i)}\Sigma_{\bar{r}}\,\frac{\partial\bar{m}_{+}(\bar{\mathbf{x}}|\bar{X},\bar{\mathbf{Y}})}{\partial\bar{\mathbf{x}}}\Big|_{\bar{\mathbf{x}}=\check{\bar{\mathbf{x}}}(i)}^{\top}+\sigma_{\bar{\varepsilon}}^{2}. \tag{8}$$

The expressions of the posterior mean and variance are analogous to (2) and (3), simply with $K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}I_{N}$ replaced by $\check{K}$ from (7). Note that we have one further hyperparameter to determine, which is the variance of the input noise. The optimization of the hyperparameters must be adapted, since the covariance matrix now depends on the derivatives of the posterior mean. Hence, [12] proposes to iterate the computations of the slopes of the posterior mean and the optimization of the hyperparameters. Note that the approach does not differ from a standard GP for (i) negligible input noise levels and (ii) constant posterior mean gradients [12]. Finally, in simulation examples this approach often outperforms heteroscedastic GPs, standard GPs, as well as variational methods [12].
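The alternation between slope computation and covariance update can be sketched for scalar inputs as follows. This is a simplified illustration under stated assumptions, not the implementation of [12]: the hyperparameters (signal variance `sf2`, length scale `ell`, and both noise variances) are held fixed instead of being re-optimized in every iteration, and the function names and the `tanh` test function are hypothetical.

```python
import numpy as np

def se(a, b, sf2=1.0, ell=1.0):
    """Squared exponential kernel for scalar inputs (1-D arrays a and b)."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def mean_slopes(x_in, y, K_noisy, sf2=1.0, ell=1.0):
    """Derivative of the posterior mean, evaluated at every training input."""
    alpha = np.linalg.solve(K_noisy, y)
    d = x_in[:, None] - x_in[None, :]
    dk = -d / ell**2 * se(x_in, x_in, sf2, ell)    # d k(x, x_j) / dx at x = x_i
    return dk @ alpha

def nigp_covariance(x_in, y, var_in, var_out, n_iter=5, sf2=1.0, ell=1.0):
    """Iterate slope computation and covariance update, cf. (7) and (8)."""
    K = se(x_in, x_in, sf2, ell)
    sigma2_out = np.full(x_in.size, var_out)       # first pass: standard GP
    for _ in range(n_iter):
        slopes = mean_slopes(x_in, y, K + np.diag(sigma2_out), sf2, ell)
        sigma2_out = slopes**2 * var_in + var_out  # scalar version of (8)
    return K + np.diag(sigma2_out), slopes

# Hypothetical data: noisy inputs and noisy outputs of a tanh function.
rng = np.random.default_rng(1)
x_true = np.linspace(-2.0, 2.0, 40)
x_noisy = x_true + 0.1 * rng.standard_normal(40)
y = np.tanh(2.0 * x_true) + 0.05 * rng.standard_normal(40)
K_check, slopes = nigp_covariance(x_noisy, y, var_in=0.1**2, var_out=0.05**2)
```

Where the posterior mean is steep, the corresponding diagonal entry of the returned matrix is inflated, so those data points are trusted less; where the slope is close to zero, the matrix coincides with the standard GP case.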

In this work, we focus on discrete-time nonlinear dynamical systems of the following form¹

¹ To simplify the notation, we do not consider control inputs in (9). However, the results of this paper can be straightforwardly extended to systems with control inputs.

$$\mathbf{x}(t+1)=\mathbf{f}(\mathbf{x}(t))+\mathbf{w}(t) \tag{9}$$

with states $\mathbf{x}\in\mathbb{R}^{n}$, process noise $\mathbf{w}\in\mathbb{R}^{n}$ (sometimes also referred to as system noise), and $\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$. The process noise $\mathbf{w}$ is assumed to be iid Gaussian noise with zero mean and variance $\Sigma_{w}=\mathrm{diag}(\sigma_{w}^{2},\dots,\sigma_{w}^{2})$. Here, we assume the same noise variance among all components to simplify the analysis. The objective of this work is to approximate the function $\mathbf{f}$ by (the posterior means of) Gaussian processes. To this end, we collect one sufficiently long trajectory (or multiple shorter trajectories) from the system. In the setting of dynamical systems considered here, we cannot collect arbitrary data points. This is due to the recursive structure of (9): the (noisy) outputs of the function $\mathbf{f}$ at some time instant correspond to the function inputs at the next time instant.

$$\tilde{K}=\begin{pmatrix}k(\tilde{x}(0),\tilde{x}(0))+\sigma^{2}_{\mathrm{out}}(0)&k(\tilde{x}(0),\tilde{x}(1))-\nabla_{1}\sigma^{2}_{r}&\dots&k(\tilde{x}(0),\tilde{x}(N-1))\\ k(\tilde{x}(1),\tilde{x}(0))-\nabla_{1}\sigma^{2}_{r}&k(\tilde{x}(1),\tilde{x}(1))+\sigma^{2}_{\mathrm{out}}(1)&\dots&k(\tilde{x}(1),\tilde{x}(N-1))\\ \vdots&\vdots&\ddots&\vdots\\ k(\tilde{x}(N-2),\tilde{x}(0))&k(\tilde{x}(N-2),\tilde{x}(1))&\dots&k(\tilde{x}(N-2),\tilde{x}(N-1))-\nabla_{N-1}\sigma^{2}_{r}\\ k(\tilde{x}(N-1),\tilde{x}(0))&k(\tilde{x}(N-1),\tilde{x}(1))&\dots&k(\tilde{x}(N-1),\tilde{x}(N-1))+\sigma^{2}_{\mathrm{out}}(N-1)\end{pmatrix} \tag{$\star$}$$

When measuring a trajectory from the system, one has (in most applications) only access to noisy measurements of the trajectories (due to, e.g., noise coming from the sensors). This means that only

$$\begin{aligned}
\tilde{\mathbf{x}}(0)&=\mathbf{x}(0)+\mathbf{r}(0) &&(10)\\
\tilde{\mathbf{x}}(1)&=\mathbf{x}(1)+\mathbf{r}(1)=\mathbf{f}(\mathbf{x}(0))+\mathbf{w}(0)+\mathbf{r}(1) &&(11)\\
\tilde{\mathbf{x}}(2)&=\mathbf{x}(2)+\mathbf{r}(2)=\mathbf{f}(\mathbf{x}(1))+\mathbf{w}(1)+\mathbf{r}(2) &&(12)\\
&\;\;\vdots\\
\tilde{\mathbf{x}}(N)&=\mathbf{x}(N)+\mathbf{r}(N)=\mathbf{f}(\mathbf{x}(N-1))+\mathbf{w}(N-1)+\mathbf{r}(N) &&(13)
\end{aligned}$$

can be measured with $\mathbf{r}$ being iid Gaussian noise with variance $\Sigma_{r}=\mathrm{diag}(\sigma_{r}^{2},\dots,\sigma_{r}^{2})$ . Note that we consider some measurement noise $\mathbf{r}$ in addition to the standard process noise $\mathbf{w}$ (which is often considered in the context of GP based control and estimation, compare, e.g., [1, 4]). The measurement noise $\mathbf{r}$ and the process noise $\mathbf{w}$ are assumed to be independent. To approximate the function $\mathbf{f}$ , we have $\tilde{\mathbf{x}}(0),\dots,\tilde{\mathbf{x}}(N-1)$ as regression input data and $\tilde{\mathbf{x}}(1),\dots,\tilde{\mathbf{x}}(N)$ as regression output data available. We do not have access to the true regression inputs, i.e., $\mathbf{x}(0),\dots,\mathbf{x}(N-1)$ .
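The construction of the training data from one noisy trajectory, as described in (10) to (13), can be sketched as follows. The example system `f`, the initial state, and the noise variances are hypothetical placeholders chosen only for illustration.

```python
import numpy as np

def simulate_noisy_trajectory(f, x0, N, var_w, var_r, rng):
    """Simulate (9) for N steps and return the N + 1 noisy measurements."""
    n = x0.size
    x = np.empty((N + 1, n))
    x[0] = x0
    for t in range(N):
        x[t + 1] = f(x[t]) + np.sqrt(var_w) * rng.standard_normal(n)
    # Measurement noise r is added after the simulation, cf. (10)-(13).
    return x + np.sqrt(var_r) * rng.standard_normal((N + 1, n))

def regression_data(x_tilde):
    """Inputs are samples 0..N-1, outputs are the same samples shifted by one step."""
    return x_tilde[:-1], x_tilde[1:]

rng = np.random.default_rng(2)
f = lambda x: 0.9 * x + 0.1 * np.sin(x)          # hypothetical example system
x_tilde = simulate_noisy_trajectory(f, np.array([1.0, -1.0]), N=50,
                                    var_w=1e-3, var_r=1e-2, rng=rng)
X_in, X_out = regression_data(x_tilde)
```

Every regression output except the last one reappears as the next regression input, including its measurement noise. This overlap is the source of the correlations analyzed in the following sections.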

The subject of this work is to propose a framework to account for the input noise in the case of dynamical systems, where only noisy trajectories are available as training data.

III SCALAR SYSTEMS

III-A Analysis of regression input noise

In this section, we consider $f:\mathbb{R}\rightarrow\mathbb{R}$ and $x\in\mathbb{R}$ . As training data, we assume that one trajectory of length $N+1$ has been collected to set up the GP. We use the same approach as in (6) and introduce

$$\nabla_{i}\coloneqq\frac{\partial m_{+}(x|\tilde{\mathbf{X}}^{\mathrm{in}},\tilde{\mathbf{X}}^{\mathrm{out}})}{\partial x}\Big|_{x=\tilde{x}(i)} \tag{14}$$

with $\tilde{\mathbf{X}}^{\mathrm{in}}=\begin{pmatrix}\tilde{x}(0)&\dots&\tilde{x}(N-1)\end{pmatrix}$ and $\tilde{\mathbf{X}}^{\mathrm{out}}=\begin{pmatrix}\tilde{x}(1)&\dots&\tilde{x}(N)\end{pmatrix}$ to denote the derivative of the posterior mean approximating the function $f$ at the location $\tilde{x}(i)$. Together with (10)–(13), this results in

$$\tilde{x}(i)\approx f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(i).$$

The variance corresponds to

$$\begin{aligned}
\mathrm{cov}\big(\tilde{x}(i),\tilde{x}(i)\big)&\approx\mathbb{E}\bigg\{\Big(f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(i)\\
&\qquad-\mathbb{E}\big\{f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(i)\big\}\Big)^{2}\bigg\}\\
&=k(\tilde{x}(i-1),\tilde{x}(i-1))+\nabla_{i-1}\Sigma_{r}\nabla_{i-1}+\sigma_{w}^{2}+\sigma_{r}^{2}\\
&\eqqcolon k(\tilde{x}(i-1),\tilde{x}(i-1))+\sigma^{2}_{\mathrm{out}}(i-1)
\end{aligned} \tag{15}$$

for all $i=1,\dots,N$ , since (i) $\mathbf{w}$ and $\mathbf{r}$ are independent and (ii) $\mathbf{r}$ is assumed to be iid.

We compute the covariance of two subsequent samples

$$\begin{aligned}
\mathrm{cov}\big(\tilde{x}(i+1),\tilde{x}(i)\big)\approx\mathbb{E}\bigg\{&\Big(f(\tilde{x}(i))-\nabla_{i}r(i)+w(i)+r(i+1)\\
&\quad-\mathbb{E}\big\{f(\tilde{x}(i))-\nabla_{i}r(i)+w(i)+r(i+1)\big\}\Big)\\
&\times\Big(f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(i)\\
&\quad-\mathbb{E}\big\{f(\tilde{x}(i-1))-\nabla_{i-1}r(i-1)+w(i-1)+r(i)\big\}\Big)\bigg\},
\end{aligned}$$

resulting in

$$\begin{aligned}
\mathrm{cov}\big(\tilde{x}(i+1),\tilde{x}(i)\big)&\approx k(\tilde{x}(i),\tilde{x}(i-1))-\mathbb{E}\big\{\nabla_{i}r(i)r(i)\big\}\\
&=k(\tilde{x}(i),\tilde{x}(i-1))-\nabla_{i}\sigma_{r}^{2}
\end{aligned} \tag{16}$$

for all $i=1,\dots,N-1$ and similarly for $\mathrm{cov}\big(\tilde{x}(i),\tilde{x}(i+1)\big)$. The term $-\nabla_{i}\sigma_{r}^{2}$ in (16) appears only in the covariance of two consecutive data points (i.e., $x(i)$ and $x(i+1)$) and is caused by the recursive nature of (9) and the propagation of the input noise to the output in (6). For this reason, this term does not appear in the developments in [12], where dynamical systems are not the central focus. In our case, the covariance matrix of the measured data corresponds to the expression given in ($\star$) above, where the term $-\nabla_{i}\sigma_{r}^{2}$ appears only in the entries immediately above and below the main diagonal.
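The additional term in (16) can be checked numerically. The following Monte Carlo sketch uses a hypothetical linear system $f(x)=ax$, for which the first-order expansion is exact and the slope equals $a$; all numerical values are assumptions chosen for illustration. The noise part of each regression output, $e(i)=\tilde{x}(i)-f(\tilde{x}(i-1))$, should then have variance $a^{2}\sigma_{r}^{2}+\sigma_{w}^{2}+\sigma_{r}^{2}$, and two consecutive noise parts should have covariance $-a\sigma_{r}^{2}$.

```python
import numpy as np

a, var_w, var_r = 0.8, 0.01, 0.04    # hypothetical slope and noise variances
rng = np.random.default_rng(3)
M = 200_000                           # number of independent short trajectories

# True states x(0), x(1), x(2) of the linear system x(t+1) = a x(t) + w(t).
x0 = rng.standard_normal(M)
w = np.sqrt(var_w) * rng.standard_normal((2, M))
r = np.sqrt(var_r) * rng.standard_normal((3, M))
x1 = a * x0 + w[0]
x2 = a * x1 + w[1]

# Noisy measurements, cf. (10)-(12).
xt0, xt1, xt2 = x0 + r[0], x1 + r[1], x2 + r[2]

# Noise part of each regression output: e(i) = x_tilde(i) - f(x_tilde(i-1)).
e1 = xt1 - a * xt0
e2 = xt2 - a * xt1

emp_cov = np.cov(e2, e1)[0, 1]        # compare with -a * var_r
emp_var = np.var(e1)                  # compare with a**2 * var_r + var_w + var_r
```

The empirical covariance is nonzero because both noise parts contain the measurement noise $r(1)$, once directly and once multiplied by $-a$.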

If one does not consider consecutive samples in the training data, the additional term in (16) vanishes. In the context of dynamical systems, this could happen if (i) one only uses every second data point (which would be data-inefficient since half of the data are lost) or (ii) one performs one-step experiments, such that a regression output does not become a regression input. Intuitively, this means that one considers some initial condition, measures the next state and then considers a different initial condition, which is not meaningful/possible for many applications. Note that our above theoretical analysis focuses on collecting one single trajectory. If one considers multiple trajectories, the entries in the covariance matrix describing the transition from one trajectory to another do not contain the additional term $-\nabla_{i}\sigma^{2}_{r}$ .

The last step is to set up the posterior mean and the posterior variance, which is once again analogous to (2) and (3) with $K(\bar{X},\bar{X})+\sigma_{\bar{\varepsilon}}^{2}I_{N}$ replaced by $\tilde{K}$ from ($\star$).
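A possible way to assemble this covariance matrix for scalar systems, including the case of several trajectories, is sketched below. The function name, its arguments, and the placeholder values are hypothetical. The sketch follows the index convention of (16), i.e., the entry coupling the regression pairs $i-1$ and $i$ uses the slope $\nabla_{i}$, and it leaves the entries at the transition between two trajectories unchanged.

```python
import numpy as np

def consecutive_sample_covariance(K, slopes, var_r, var_w, traj_lengths):
    """Assemble the covariance matrix for scalar systems from the kernel matrix K.

    K            : N x N kernel matrix of the noisy regression inputs
    slopes       : posterior-mean derivative at each regression input
    traj_lengths : number of regression pairs contributed by each trajectory
    """
    N = K.shape[0]
    assert sum(traj_lengths) == N
    # Diagonal entries, cf. (15).
    K_tilde = K + np.diag(slopes**2 * var_r + var_w + var_r)
    start = 0
    for length in traj_lengths:
        for i in range(start + 1, start + length):
            # Pairs i-1 and i share one measurement noise sample, cf. (16).
            K_tilde[i, i - 1] -= slopes[i] * var_r
            K_tilde[i - 1, i] -= slopes[i] * var_r
        start += length
    return K_tilde

K = np.eye(5)                                   # placeholder kernel matrix
slopes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # placeholder slopes
K_tilde = consecutive_sample_covariance(K, slopes, var_r=0.1, var_w=0.01,
                                        traj_lengths=[3, 2])
```

In this toy call, the first three regression pairs come from one trajectory and the last two from another, so the entry coupling pairs 2 and 3 receives no correction.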

III-B Application to logistic growth example

We evaluate the effect of the additional off-diagonal terms for a logistic growth example.² We use a zero prior mean and a squared exponential kernel. We consider the following (Euler-discretized) system

² The code of the simulations is available here: https://doi.org/10.25835/xwkni4f6.

$$x(k+1)=x(k)+Tqx(k)\bigg(1-\frac{x(k)}{C}\bigg)+w(k) \tag{17}$$

with $T=1,q=0.1,C=100$ , which corresponds to a logistic growth example [16]. Note that the relatively small value of $T$ and the rather large value of $C$ imply that we only have to deal with a small nonlinearity and almost constant gradients. We collect three trajectories of length 100. We consider normally distributed process noise with mean $\mu_{w}=0$ and variance $\sigma_{w}^{2}=10^{-3}$ (and in a second run normally distributed process noise with $\mu_{w}=0$ and variance $\sigma_{w}^{2}=10^{-1}$ ) as well as normally distributed measurement noise with mean $\mu_{r}=0$ and various variances as illustrated in Figure 1. We use five iterations of slope/hyperparameter computations, compare [12]. To test the performance of the GPs, we consider $N_{\ast}=500$ random samples from a uniform distribution $\mathcal{U}(0,100)$ and compute the posterior mean. We compare our method to the one proposed by [12] and to a standard GP [13]. In all cases, we then compute the mean squared error (MSE) defined as

$$\text{MSE}\coloneqq\frac{1}{N_{\ast}}\sum_{i=1}^{N_{\ast}}\big\lVert f(x_{\ast}(i))-m_{+}(x_{\ast}(i)|\tilde{\mathbf{X}}^{\mathrm{in}},\tilde{\mathbf{X}}^{\mathrm{out}})\big\rVert^{2}. \tag{18}$$
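The evaluation procedure can be sketched as follows. The code implements the noise-free part of (17) with the stated parameter values and the error measure (18); the helper `mse` and the two reference predictors are hypothetical and only serve to show how a learned posterior mean would be plugged in.

```python
import numpy as np

T, q, C = 1.0, 0.1, 100.0

def f(x):
    """Noise-free right-hand side of (17)."""
    return x + T * q * x * (1.0 - x / C)

def mse(predict, x_test):
    """Mean squared error (18) of a predictor with respect to the true function f."""
    return np.mean((f(x_test) - predict(x_test))**2)

rng = np.random.default_rng(4)
x_test = rng.uniform(0.0, 100.0, size=500)     # N_* = 500 samples from U(0, 100)

mse_exact = mse(f, x_test)                     # reference: perfect model
mse_identity = mse(lambda x: x, x_test)        # reference: naive model x(k+1) = x(k)
```

A trained GP would be evaluated in the same way by passing its posterior mean as `predict`.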

Figure 1: Simulation results of example (17) for two different process noise variances, as indicated in the titles of the plots. We implement the extension proposed here (referred to as “CCS”, standing for “covariance of consecutive samples”), a standard GP (called “ST”), and the approach proposed by [12] (called “NI”, standing for “noisy inputs”, which is the abbreviation given by the authors in [12] to describe their framework). In each case, we report the MSE as defined in (18).

The simulation results are displayed in Figure 1. We observe that our proposed extension substantially outperforms the other approaches for both process noise variances, in particular for large measurement noise variances. This is due to the explicit consideration of the covariance between two consecutive samples, which is not considered in the framework from [12] and a standard GP [13].

When a larger process noise variance is considered (compare Fig. 1, right plot), our proposed approach still outperforms the other two, although the difference becomes slightly smaller. This is due to the fact that in this case, the diagonal terms in the covariance matrix become more dominant and the advantage of our approach (that considers the noise variance in the entries immediately above and below the main diagonal) becomes less prominent.

Moreover, we observe that the performance of the standard GP and the method proposed by [12] is similar (with a slight advantage for the standard GP). This is because the considered system is almost linear. In this case, the gradients of the posterior mean are almost constant, and a standard GP can achieve an effect similar to that of the extension of [12] by simply increasing the noise variance. The slight advantage for the standard GP may be due to minor overfitting in the approach proposed by [12], where one more hyperparameter has to be determined.

IV MULTIDIMENSIONAL SYSTEMS

IV-A Analysis of regression input noise

In this section, we now focus on multidimensional systems. This means that we consider some function $\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ with $\mathbf{x}\in\mathbb{R}^{n}$ . The most common approach to approximate these systems is to consider the individual components of the function $\mathbf{f}$ to be independent [1, 2, 4]. In this case, scalar GPs are used to approximate each component of the function $\mathbf{f}$ . Alternatively, one can use a linear model of coregionalization [17], where all components are learned jointly (and hence also correlations among the components can be learned).

However, the above works rely on the assumption that the regression input data are noise-free. As mentioned in the previous section, this is rarely the case in the context of dynamical systems, since the measurements of the states are corrupted by some noise. In the following, we again consider the input noise by applying the approach proposed in [12] (compare (6)) to dynamical systems. As shown in the following derivation, analogous to Section III, we obtain additional terms in the covariance between two consecutive observations. Moreover, in addition to the scalar case, we also obtain covariance terms between the regression outputs corresponding to the different components of $f$ . In particular, since

$$\tilde{x}_{j}(i)\approx f_{j}(\tilde{\mathbf{x}}(i-1))-\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}\mathbf{r}(i-1)+w_{j}(i-1)+r_{j}(i) \tag{19}$$

for all $j=1,\dots,n$ , we obtain

$$\begin{aligned}
\mathrm{cov}\big(\tilde{x}_{j}(i),\tilde{x}_{\ell}(i)\big)\approx{}&\mathbb{E}\Bigg\{\bigg(f_{j}(\tilde{\mathbf{x}}(i-1))-\mathbb{E}\Big\{f_{j}(\tilde{\mathbf{x}}(i-1))\Big\}\bigg)\bigg(f_{\ell}(\tilde{\mathbf{x}}(i-1))-\mathbb{E}\Big\{f_{\ell}(\tilde{\mathbf{x}}(i-1))\Big\}\bigg)^{\top}\Bigg\}\\
&+\mathbb{E}\Bigg\{\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}\mathbf{r}(i-1)\mathbf{r}(i-1)^{\top}\frac{\partial m_{+,\ell}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}^{\top}\Bigg\}\\
&+(\sigma^{2}_{r}+\sigma^{2}_{w})\delta_{j\ell}
\end{aligned} \tag{20}$$

with $\delta_{j\ell}$ denoting the Kronecker delta. To simplify the analysis, we assume that the different GPs modeling the different components are mutually independent (as commonly done in the context of GP-based control/estimation [1, 2, 4]). Consequently, it holds that

$$\mathbb{E}\{f_{j}(\tilde{\mathbf{x}})f_{\ell}(\tilde{\mathbf{x}})\}=\mathbb{E}\{f_{j}(\tilde{\mathbf{x}})\}\mathbb{E}\{f_{\ell}(\tilde{\mathbf{x}})\} \tag{21}$$

and therefore

$$\begin{aligned}
\mathrm{cov}\big(\tilde{x}_{j}(i),\tilde{x}_{\ell}(i)\big)\approx{}&\big(k(\tilde{\mathbf{x}}(i-1),\tilde{\mathbf{x}}(i-1))+\sigma^{2}_{r}+\sigma^{2}_{w}\big)\delta_{j\ell}\\
&+\mathbb{E}\Bigg\{\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}\Sigma_{r}\frac{\partial m_{+,\ell}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i-1)}^{\top}\Bigg\}
\end{aligned} \tag{22}$$

for all $i=1,\dots,N$ and $j,\ell=1,\dots,n$ . Hence, although assuming independence among the different GPs (modeling the different components), the observations covary due to the input noise. Moreover, as in the previous section (compare (16)), we need to consider the covariance within the same component, but for subsequent time instants

$$K_{\mathrm{md}}=(K_{x}+\sigma_{r}^{2}I_{N}+\sigma_{w}^{2}I_{N})\otimes I_{n}+\begin{pmatrix}\nabla_{0}\Sigma_{r}\nabla_{0}^{\top}&-\nabla_{1}^{\top}\sigma_{r}^{2}&0_{n\times n}&\dots&0_{n\times n}\\ -\nabla_{1}\sigma_{r}^{2}&\nabla_{1}\Sigma_{r}\nabla_{1}^{\top}&-\nabla_{2}^{\top}\sigma_{r}^{2}&\dots&0_{n\times n}\\ 0_{n\times n}&-\nabla_{2}\sigma_{r}^{2}&\nabla_{2}\Sigma_{r}\nabla_{2}^{\top}&\dots&0_{n\times n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0_{n\times n}&0_{n\times n}&0_{n\times n}&\dots&\nabla_{N-1}\Sigma_{r}\nabla_{N-1}^{\top}\end{pmatrix} \tag{$\star\star$}$$

$$\mathrm{cov}\big(\tilde{x}_{j}(i),\tilde{x}_{j}(i+1)\big)\approx k(\tilde{\mathbf{x}}(i-1),\tilde{\mathbf{x}}(i))-\frac{\partial m_{+,j}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial x_{j}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}\sigma_{r}^{2} \tag{23}$$

for all $j=1,\dots,n$ and $i=1,\dots,N-1$ and similarly for $\mathrm{cov}\big{(}\tilde{x}_{j}(i+1),\tilde{x}_{j}(i)\big{)}$ . Finally, we need to consider the covariance between observations of two different (not necessarily adjacent) components and subsequent time instants as, e.g.,

$$\mathrm{cov}\big(\tilde{x}_{j}(i),\tilde{x}_{\ell}(i+1)\big)\approx-\frac{\partial m_{+,\ell}(\mathbf{x}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})}{\partial x_{j}}\Big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}\sigma^{2}_{r} \tag{24}$$

for all $j,\ell=1,\dots,n$ (but $j\neq\ell$ ) and $i=1,\dots,N-1$ and similarly for $\mathrm{cov}(\tilde{x}_{j}(i+1),\tilde{x}_{\ell}(i))$ .

To set up a GP for this case, we cannot proceed in the standard way by simply learning individual GPs. This is due to the correlations among the different components, compare (22), which cannot be considered by learning the components individually. Instead, we here learn all the components of the GP jointly. To this end, we set up the vector of observations

$$X^{\mathrm{out}}=\begin{pmatrix}\tilde{x}_{1}(1)&\dots&\tilde{x}_{n}(1)&\tilde{x}_{1}(2)&\dots&\tilde{x}_{n}(N)\end{pmatrix}^{\top}. \tag{25}$$

The covariance matrix corresponds to the expression given in ($\star\star$) with $K_{x}=(k(\tilde{\mathbf{x}}(i),\tilde{\mathbf{x}}(j)))_{\tilde{\mathbf{x}}(i),\tilde{\mathbf{x}}(j)\in\tilde{X}^{\mathrm{in}}}$ and $\mathbf{\nabla}_{i}$ defined as

$$\mathbf{\nabla}_{i}\coloneqq\begin{pmatrix}\frac{\partial m_{+,1}}{\partial x_{1}}\big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\frac{\partial m_{+,1}}{\partial x_{2}}\big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\dots\\ \frac{\partial m_{+,2}}{\partial x_{1}}\big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\frac{\partial m_{+,2}}{\partial x_{2}}\big|_{\mathbf{x}=\tilde{\mathbf{x}}(i)}&\dots\\ \vdots&\vdots&\ddots\end{pmatrix}.$$

The predictive mean and variance are given by

$$\mathbf{m}_{+}(\mathbf{x}_{\ast}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})=(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{n})K_{\mathrm{md}}^{-1}X^{\mathrm{out}} \tag{26}$$

$$\Sigma_{+}(\mathbf{x}_{\ast}|\tilde{X}^{\mathrm{in}},\tilde{X}^{\mathrm{out}})=k(\mathbf{x}_{\ast},\mathbf{x}_{\ast})\otimes I_{n}-(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{n})K_{\mathrm{md}}^{-1}(\mathbf{k}(\mathbf{x}_{\ast},X^{\mathrm{in}})\otimes I_{n})^{\top}. \tag{27}$$
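The joint covariance matrix and the predictive mean (26) can be assembled with Kronecker products as sketched below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the Jacobians of the posterior mean are passed in as a precomputed array, a single trajectory is assumed, and the kernel matrix and Jacobians in the usage example are random placeholders. The observations are assumed to be stacked time-major as in (25).

```python
import numpy as np

def assemble_K_md(K_x, grads, var_r, var_w):
    """Joint covariance matrix for one trajectory.

    K_x   : N x N kernel matrix of the noisy regression inputs
    grads : N x n x n array, grads[i] is the Jacobian of the posterior mean at input i
    """
    N, n = K_x.shape[0], grads.shape[1]
    K_md = np.kron(K_x + (var_r + var_w) * np.eye(N), np.eye(n))
    for i in range(N):
        rows = slice(i * n, (i + 1) * n)
        # Diagonal block: propagated input noise with Sigma_r = var_r * I_n.
        K_md[rows, rows] += var_r * grads[i] @ grads[i].T
        if i > 0:
            prev = slice((i - 1) * n, i * n)
            # Off-diagonal blocks for consecutive samples, cf. (23) and (24).
            K_md[rows, prev] -= var_r * grads[i]
            K_md[prev, rows] -= var_r * grads[i].T
    return K_md

def predictive_mean(k_star, K_md, x_out_stacked, n):
    """Posterior mean (26); k_star holds k(x_*, x_tilde(i)) for i = 0..N-1."""
    return np.kron(k_star[None, :], np.eye(n)) @ np.linalg.solve(K_md, x_out_stacked)

N, n = 4, 2
rng = np.random.default_rng(5)
K_x = np.eye(N)                                  # placeholder kernel matrix
grads = 0.1 * rng.standard_normal((N, n, n))     # placeholder Jacobians
K_md = assemble_K_md(K_x, grads, var_r=0.01, var_w=0.001)
x_out = rng.standard_normal(N * n)               # stacked as in (25)
m_star = predictive_mean(np.full(N, 0.5), K_md, x_out, n)
```

Since the correction is block tridiagonal, its storage and assembly cost grow only linearly in $N$, whereas solving with the full matrix still scales with the size $Nn$ of the joint problem.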

The above derivation focuses once again on one single trajectory as offline data. If multiple trajectories have been collected, no covariance is needed at the transition between the different trajectories, as in the previous section.

Remark 1

In this paper, we assume independence among the different GPs modeling the different components of the unknown function $\mathbf{f}$ . One interesting subject for future work is to omit this assumption. In this case, one could combine the here proposed approach with an intrinsic coregionalization method or a linear model of coregionalization [17].

Remark 2

Within this paper, we only focused on the state transition function $\mathbf{f}$ . The approximation of an output map $\mathbf{h}$ is more straightforward since the noise affecting the regression inputs does not get propagated through the GP. In this case, we can apply the common approach to learn each component of the function $\mathbf{h}$ individually, since there are no covariances among the components and use the approach suggested in [12].

IV-B Application to batch reactor, two-link planar robot, and cart-pole system

We evaluate our approach in several numerical examples. For all numerical examples, we use a zero prior mean and a squared exponential kernel. Once again, for space reasons, we only explain the simulation and evaluation setting in detail for the first example. We consider the following dynamics


$\displaystyle x_{1}(t+1)$	$\displaystyle=x_{1}(t)+T(-2c_{1}x_{1}^{2}(t)+2c_{2}x_{2}(t))+w_{1}(t)$	(28a)
$\displaystyle x_{2}(t+1)$	$\displaystyle=x_{2}(t)+T(c_{1}x_{1}^{2}(t)-c_{2}x_{2}(t))+w_{2}(t),$	(28b)

which corresponds to a discretized batch reactor [18]. We consider $T=0.1$ , $c_{1}=0.16$ , $c_{2}=0.0064$ , normally distributed process noise with mean $\mu_{w}=0$ and variance $\sigma^{2}_{w}=10^{-6}I_{n}$ , and normally distributed measurement noise with mean $\mu_{r}=0$ and different variances as shown in Figure 2. We collect three trajectories containing 50 samples.
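Data collection for this example can be sketched as follows. The dynamics and parameter values are those of (28a) and (28b); the initial states, the measurement noise variance, and the interpretation that each of the three trajectories contains 50 samples are assumptions made only for this illustration.

```python
import numpy as np

T, c1, c2 = 0.1, 0.16, 0.0064

def batch_reactor_step(x):
    """Noise-free right-hand side of (28a) and (28b)."""
    x1, x2 = x
    return np.array([
        x1 + T * (-2.0 * c1 * x1**2 + 2.0 * c2 * x2),
        x2 + T * (c1 * x1**2 - c2 * x2),
    ])

def collect_trajectory(x0, n_samples, var_w, var_r, rng):
    """Simulate the reactor with process noise and return noisy measurements."""
    x = np.empty((n_samples, 2))
    x[0] = x0
    for t in range(n_samples - 1):
        x[t + 1] = batch_reactor_step(x[t]) + np.sqrt(var_w) * rng.standard_normal(2)
    return x + np.sqrt(var_r) * rng.standard_normal((n_samples, 2))

rng = np.random.default_rng(6)
# Hypothetical initial states; the values used in the paper are not given here.
initial_states = [np.array([1.0, 0.0]), np.array([2.0, 0.5]), np.array([3.0, 1.0])]
trajectories = [collect_trajectory(x0, 50, var_w=1e-6, var_r=1e-4, rng=rng)
                for x0 in initial_states]
```

A quick sanity check of the dynamics is that the noise-free step leaves $x_{1}+2x_{2}$ unchanged, since the reaction terms in (28a) and (28b) cancel in this combination.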

Next, we test our approach for two four-dimensional systems with highly complex nonlinear dynamics. We consider a two-link planar robot with the dynamics and numerical parameter values as given in [19] and a cart-pole system with the numerical parameter values from [20]. The considered measurement noise variances are illustrated in Figure 2 (middle and right plot). In both cases, we collect three trajectories containing 50 samples.

In all examples, we use five iterations of slope/hyperparameter computations, see [12]. For comparison, we implement a standard GP as introduced at the beginning of this section and the method proposed in [12] (assuming that the different components are independent). We evaluate the performance for $N_{\star}=500$ random test data points sampled from a uniform distribution over an operating region of interest. More details can be found in the code of the simulations, which is provided under the link in footnote 2.
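The evaluation step can be sketched as follows. This is only an illustration under stated assumptions: the box bounds of the operating region, the averaging of the squared error over all output components, the comparison against the noise-free map, and the names `mse_on_uniform_test_points`, `predict_mean`, and `true_map` are hypothetical and not taken from the paper's code.

```python
import numpy as np


def mse_on_uniform_test_points(predict_mean, true_map, lower, upper,
                               n_test=500, seed=0):
    """MSE between a predictor and the true map on uniform test points.

    predict_mean and true_map are callables mapping one input vector to one
    output vector. lower and upper are the (assumed) box bounds of the
    operating region of interest.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Draw n_test points uniformly from the box [lower, upper].
    x_test = rng.uniform(lower, upper, size=(n_test, lower.size))
    y_true = np.array([true_map(x) for x in x_test])
    y_pred = np.array([predict_mean(x) for x in x_test])
    # Assumption: average the squared error over points and components.
    return float(np.mean((y_pred - y_true) ** 2))


# Usage with a toy map and a predictor that is off by a constant 0.1.
toy_map = lambda x: np.array([x[0] ** 2, x[0] + x[1]])
biased = lambda x: toy_map(x) + 0.1
mse = mse_on_uniform_test_points(biased, toy_map, [0.0, 0.0], [1.0, 1.0])
```

In practice, `predict_mean` would be replaced by the posterior mean of the respective learned GP model, so that the same test points can be reused for all compared methods.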

From Figure 2, one can see that the method proposed in this paper again outperforms the alternatives in terms of the MSE in all tested settings. Overall, the difference is more pronounced for larger noise levels. Furthermore, we can observe that the extension by [12] performs slightly better compared to the scalar case presented in the previous section. A reason for this observation may be that the extension proposed by [12] allows learning the regression input noise variance using all outputs, which is not possible for a standard GP.

V CONCLUSION

In this work, we analyzed the impact of regression input noise for dynamical systems modeled by Gaussian processes and introduced approaches to account for this noise for scalar and multidimensional systems. In several numerical examples, we showed that the proposed extension substantially improves the performance compared to state-of-the-art approaches.

Several topics are left for future research. One could refine the framework by using second-order approximations (as also suggested by [12]), which is likely to improve the performance further, albeit at a higher computational complexity. We expect that the method proposed in this paper will be beneficial for designing GP-based controllers and state estimators for nonlinear dynamical systems with improved performance.

References

[1] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2019.
[2] T. Beckers, D. Kulić, and S. Hirche, “Stable Gaussian process based tracking control of Euler–Lagrange systems,” Automatica, vol. 103, pp. 390–397, 2019.
[3] M. Maiworm, D. Limon, and R. Findeisen, “Online learning-based model predictive control with Gaussian process models and stability guarantees,” International Journal of Robust and Nonlinear Control, vol. 31, no. 18, pp. 8785–8812, 2021.
[4] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
[5] M. Buisson-Fenet, V. Morgenthaler, S. Trimpe, and F. Di Meglio, “Joint state and dynamics estimation with high-gain observers and Gaussian process models,” in 2021 American Control Conference (ACC). IEEE, 2021, pp. 4027–4032.
[6] T. M. Wolff, V. G. Lopez, and M. A. Müller, “Gaussian process-based nonlinear moving horizon estimation,” arXiv preprint arXiv:2402.04665, 2024.
[7] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 393–400.
[8] P. Goldberg, C. Williams, and C. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” Advances in neural information processing systems, vol. 10, 1997.
[9] A. McHutchon and C. Rasmussen, “Gaussian process training with input noise,” Advances in neural information processing systems, vol. 24, 2011.
[10] M. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 844–851.
[11] A. Doerr, C. Daniel, M. Schiegg, N.-T. Duy, S. Schaal, M. Toussaint, and S. Trimpe, “Probabilistic recurrent state-space models,” in International conference on machine learning. PMLR, 2018, pp. 1280–1289.
[12] A. J. McHutchon, “Nonlinear modelling and control using Gaussian processes,” Ph.D. dissertation, Citeseer, 2015.
[13] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. MIT Press, 2006.
[14] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes. KIT Scientific Publishing, 2010, vol. 9.
[15] E. Solak, R. Murray-Smith, W. Leithead, D. Leith, and C. Rasmussen, “Derivative observations in Gaussian process models of dynamic systems,” in Advances in Neural Information Processing Systems, vol. 15. MIT Press, 2002.
[16] A. Tsoularis and J. Wallace, “Analysis of logistic growth models,” Mathematical Biosciences, vol. 179, no. 1, pp. 21–55, 2002.
[17] M. A. Alvarez, L. Rosasco, N. D. Lawrence et al., “Kernels for vector-valued functions: A review,” Foundations and Trends® in Machine Learning, vol. 4, no. 3, pp. 195–266, 2012.
[18] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model predictive control: theory, computation, and design. Nob Hill Publishing Madison, WI, 2017, vol. 2.
[19] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, “Actively learning Gaussian process dynamics,” in Learning for dynamics and control. PMLR, 2020, pp. 5–15.
[20] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE transactions on systems, man, and cybernetics, no. 5, pp. 834–846, 1983.