Gaussian Processes with Noisy Regression Inputs for Dynamical Systems
Abstract
This paper is centered around the approximation of dynamical systems by means of Gaussian processes. To this end, trajectories of such systems must be collected to be used as training data. The measurements of these trajectories are typically noisy, which implies that both the regression inputs and outputs are corrupted by noise. However, most of the literature considers only noise in the regression outputs. In this paper, we show how to account for the noise in the regression inputs in an extended Gaussian process framework to approximate scalar and multidimensional systems. We demonstrate the potential of our framework by comparing it to different state-of-the-art methods in several simulation examples.
I INTRODUCTION
The application of Gaussian process (GP) regression in the context of dynamical systems has received a substantial interest in recent years. It has been applied for a variety of applications such as, e.g., control [1, 2, 3] and state estimation [4, 5, 6]. The most common setup for GP regression considers two major assumptions on the measured data. First, it is assumed that the available regression input data are noise-free. Second, the measured regression output data are assumed to be corrupted by independent and identically distributed (iid) Gaussian noise.
One frequently applied approach to approximate dynamical systems by GPs is to model each component of the transition function by the posterior means of independently learned GPs [1, 2, 4, 6]. To approximate these functions, it is assumed that the states (together with the control inputs) can be measured. Subsequently, the control input and state trajectory are used as regression input data, and the (by one time instant shifted) state trajectory is used as regression output data.
In most practical applications, the state measurements are corrupted by noise. On the one hand, this implies that the regression outputs are corrupted by noise, which is in accordance to the standard GP setting. On the other hand, this entails that the regression inputs are also corrupted by noise, which is not covered by the standard GP setting.
To cope with regression input noise in GP regression, one can use heteroscedastic GPs [7, 8] (where a second GP is used to model the noise variance and rather large amounts of data are needed [9]) or variational methods [10, 11]. An alternative, which is simple, but very effective has been proposed by [12, 9]. The key idea is to propagate the input noise to the output by using first order Taylor approximations of the posterior means (see Section II below for the details). In [12], the authors show that this approach can outperform variational methods, heteroscedastic GPs and standard GPs.
Our work can be considered as an extension of the framework suggested in [12] to dynamical systems. Here, one major difference is that one cannot arbitrarily sample training data points to set up a GP. Instead, one typically can only collect trajectories. We show that these trajectories induce correlations that must be taken into account when setting up a GP to correctly represent dynamical systems. Alongside these theoretical derivations, we illustrate the performance of our proposed extension by means of several simulation examples and compare it to the cases where a dynamical system is directly approximated using the method proposed in [12] and a standard GP [13].
II PRELIMINARIES AND PROBLEM SETTING
The set of real numbers is denoted by . The identity matrix of dimension is denoted by . A diagonal matrix with on its diagonal entries is denoted by . We denote the Kronecker product by . We denote scalars by small letters, vectors by small bold letters and matrices by capital letters. A vector of zeros of length is denoted by . A square matrix of zeros of dimension is denoted by .
We briefly review the fundamentals of standard Gaussian processes; a more detailed introduction to GPs can be found in [13]. GPs are commonly applied to approximate some nonlinear function . They are fully described by a mean function and a covariance function (also referred to as kernel) . For some , we write
(1) |
to denote that the function follows a GP with mean function and covariance function . We collect regression input and output data points from the unknown function and use them to define and , respectively. The regression outputs are given by with being iid Gaussian noise with zero mean and variance . The key idea of Gaussian processes is to condition the prior distribution on the training data, which results in a posterior distribution. For some test input , the mean and variance of the posterior distribution are given by [13, Ch. 2]
(2) | |||
(3) |
for , with , and with . The kernel depends on hyperparameters (such as, e.g., the signal variance and the length scales in case of the squared exponential kernel) that are commonly determined by maximizing the log marginal likelihood, see, e.g., [13, Eq. (2.30)].
These standard results in Gaussian processes rely on the assumption that the regression input data are noise-free. In turn, if the regression input data points are affected by some noise such that only
(4) |
is available with being some iid Gaussian noise with variance , we cannot use standard GP tools anymore, since the problem of exact GP regression based on noisy regression inputs is intractable [14, Sec. 2.3.2]. We here briefly review the work of [12, Ch. 2] (which is more detailed than the original work [9]) to handle this issue. First, a Taylor series expansion around the noisy regression input is done (and truncated after the first-order term), which results in
(5) |
The second term depends on the derivative of a GP, which is again a GP [15]. Although one can compute the first and second moment of this expression, it is much simpler to perform another approximation by replacing the derivative of the GP by the derivative of its posterior mean [12]. In this case, we consider the following model
(6) |
This model results in the following covariance matrix
(7) |
with
(8) |
The expressions of the posterior mean and variance are analogous to (2) and (3), simply with replaced by from (7). Note that we have one further hyperparameter to determine, which is the variance of the input noise. The optimization of the hyperparameters must be adapted, since the covariance matrix now depends on the derivatives of the posterior mean. Hence, [12] proposes to iterate the computations of the slopes of the posterior mean and the optimization of the hyperparameters. Note that the approach does not differ from a standard GP for (i) negligible input noise levels and (ii) constant posterior mean gradients [12]. Finally, in simulation examples this approach often outperforms heteroscedastic GPs, standard GPs, as well as variational methods [12].
In this work, we focus on discrete-time nonlinear dynamical systems of the following form111To simplify the notation, we do not consider control inputs in (9). However, the results of this paper can be straightforwardly extended to systems with control inputs.
(9) |
with states , process noise (sometimes also referred to as system noise), and . The process noise is assumed to be iid Gaussian noise with zero mean and variance . Here, we assume the same noise variance among all components to simplify the analysis. The objective of this work is to approximate the function by (the posterior means of) Gaussian processes. To this end, we collect a sufficiently long (or multiple shorter) trajectory from the system. In the here considered setting of dynamical systems, we cannot collect arbitrary data points. This is due to the recursive structure of (9): the (noisy) outputs of the function at some time instant correspond to the function inputs at the next time instant.
() |
When measuring a trajectory from the system, one has (in most applications) only access to noisy measurements of the trajectories (due to, e.g., noise coming from the sensors). This means that only
(10) | ||||
(11) | ||||
(12) | ||||
(13) |
can be measured with being iid Gaussian noise with variance . Note that we consider some measurement noise in addition to the standard process noise (which is often considered in the context of GP based control and estimation, compare, e.g., [1, 4]). The measurement noise and the process noise are assumed to be independent. To approximate the function , we have as regression input data and as regression output data available. We do not have access to the true regression inputs, i.e., .
The subject of this work is to propose a framework to account for the input noise in the case of dynamical systems, where only noisy trajectories are available as training data.
III SCALAR SYSTEMS
III-A Analysis of regression input noise
In this section, we consider and . As training data, we assume that one trajectory of length has been collected to set up the GP. We use the same approach as in (6) and introduce
(14) |
with and to denote the derivative of the posterior mean approximating the function at the location . Together with (10) - (13), this results in
The variance corresponds to
(15) |
for all , since (i) and are independent and (ii) is assumed to be iid.
We compute the covariance of two subsequent samples
resulting in
(16) |
for all and similarly for . The term in (16) appears only in the covariance of two consecutive data points (i.e., and ) and is caused by the recursive nature of (9) and the propagation of the input noise to the output in (6). For this reason, this term does not appear in the developments in [12], where dynamical systems are not the central focus. In our case, the covariance matrix of the measured data corresponds to the expression given in ( ‣ II) above, where the term appears only in the entries immediately above and below the main diagonal.
If one does not consider consecutive samples in the training data, the additional term in (16) vanishes. In the context of dynamical systems, this could happen if (i) one only uses every second data point (which would be data-inefficient since half of the data are lost) or (ii) one performs one-step experiments, such that a regression output does not become a regression input. Intuitively, this means that one considers some initial condition, measures the next state and then considers a different initial condition, which is not meaningful/possible for many applications. Note that our above theoretical analysis focuses on collecting one single trajectory. If one considers multiple trajectories, the entries in the covariance matrix describing the transition from one trajectory to another do not contain the additional term .
III-B Application to logistic growth example
We evaluate the effect of the additional off-diagonal terms for a logistic growth example222The code of the simulations is available here: https://doi.org/10.25835/xwkni4f6. We use a zero prior mean and a squared exponential kernel. We consider the following (Euler-discretized) system
(17) |
with , which corresponds to a logistic growth example [16]. Note that the relatively small value of and the rather large value of imply that we only have to deal with a small nonlinearity and almost constant gradients. We collect three trajectories of length 100. We consider normally distributed process noise with mean and variance (and in a second run normally distributed process noise with and variance ) as well as normally distributed measurement noise with mean and various variances as illustrated in Figure 1. We use five iterations of slope/hyperparameter computations, compare [12]. To test the performance of the GPs, we consider random samples from a uniform distribution and compute the posterior mean. We compare our method to the one proposed by [12] and to a standard GP [13]. In all cases, we then compute the mean squared error (MSE) defined as
(18) |
The simulation results are displayed in Figure 1. We observe that our proposed extension substantially outperforms the other approaches for both process noise variances, in particular for large measurement noise variances. This is due to the explicit consideration of the covariance between two consecutive samples, which is not considered in the framework from [12] and a standard GP [13].
When a larger process noise variance is considered (compare Fig. 1, right plot), our proposed approach still outperforms the other two, although the difference becomes slightly smaller. This is due to the fact that in this case, the diagonal terms in the covariance matrix become more dominant and the advantage of our approach (that considers the noise variance in the entries immediately above and below the main diagonal) becomes less prominent.
Moreover, we observe that the performance of the standard GP and the method proposed by [12] is similar (with a slight advantage for the standard GP). This is due to the considered system which is almost linear. In this case, the gradients of the posterior mean are almost constant and a standard GP can achieve a similar effect (than the extension of [12]) by simply increasing the noise variance. The slight advantage for the standard GP may be due to a minor overfitting in case of the approach proposed by [12], where we have one more hyperparameter to determine.
IV MULTIDIMENSIONAL SYSTEMS
IV-A Analysis of regression input noise
In this section, we now focus on multidimensional systems. This means that we consider some function with . The most common approach to approximate these systems is to consider the individual components of the function to be independent [1, 2, 4]. In this case, scalar GPs are used to approximate each component of the function . Alternatively, one can use a linear model of coregionalization [17], where all components are learned jointly (and hence also correlations among the components can be learned).
However, the above works rely on the assumption that the regression input data are noise-free. As mentioned in the previous section, this is rarely the case in the context of dynamical systems, since the measurements of the states are corrupted by some noise. In the following, we again consider the input noise by applying the approach proposed in [12] (compare (6)) to dynamical systems. As shown in the following derivation, analogous to Section III, we obtain additional terms in the covariance between two consecutive observations. Moreover, in addition to the scalar case, we also obtain covariance terms between the regression outputs corresponding to the different components of . In particular, since
(19) |
for all , we obtain
(20) |
with denoting the Kronecker delta. To simplify the analysis, we assume that the different GPs modeling the different components are mutually independent (as commonly done in the context of GP based control/estimation [1, 2, 4]). Consequently, it holds that
(21) |
and therefore
(22) |
for all and . Hence, although assuming independence among the different GPs (modeling the different components), the observations covary due to the input noise. Moreover, as in the previous section (compare (16)), we need to consider the covariance within the same component, but for subsequent time instants
() |
(23) |
for all and and similarly for . Finally, we need to consider the covariance between observations of two different (not necessarily adjacent) components and subsequent time instants as, e.g.,
(24) |
for all (but ) and and similarly for .
To set up a GP for this case, we cannot proceed in the standard way by simply learning individual GPs. This is due to the correlations among the different components, compare (22), which cannot be considered by learning the components individually. Instead, we here learn all the components of the GP jointly. To this end, we set up the vector of observations
(25) |
The covariance matrix corresponds to the expression given in ( ‣ IV-A) below with and defined as
The predictive mean and variance are given by
(26) | |||
(27) |
The above derivation focuses once again on one single trajectory as offline data. If multiple trajectories have been collected, no covariance is needed at the transition between the different trajectories, as in the previous section.
Remark 1
In this paper, we assume independence among the different GPs modeling the different components of the unknown function . One interesting subject for future work is to omit this assumption. In this case, one could combine the here proposed approach with an intrinsic coregionalization method or a linear model of coregionalization [17].
Remark 2
Within this paper, we only focused on the state transition function . The approximation of an output map is more straightforward since the noise affecting the regression inputs does not get propagated through the GP. In this case, we can apply the common approach to learn each component of the function individually, since there are no covariances among the components and use the approach suggested in [12].
IV-B Application to batch reactor, two-link planar robot, and cart-pole system
We evaluate our approach in several numerical examples. For all numerical examples, we use a zero prior mean and a squared exponential kernel. Once again, for space reasons, we only explain the simulation and evaluation setting in detail for the first example. We consider the following dynamics
(28a) | ||||
(28b) |
which corresponds to a discretized batch reactor [18]. We consider , , , normally distributed process noise with mean and variance , and normally distributed measurement noise with mean and different variances as shown in Figure 2. We collect three trajectories containing 50 samples.
Next, we test our approach for two four-dimensional systems with highly complex nonlinear dynamics. We consider a two-link planar robot with the dynamics and numerical parameter values as given in [19] and a cart-pole system with the numerical parameters values from [20]. The considered measurement noise variances are illustrated in Figure 2 (middle and right plot). In both cases, we collect three trajectories containing 50 samples.
In all examples, we use five iterations of slope/hyperparameter computations, see [12]. Furthermore, we implement a standard GP as introduced at the beginning of this section and the method proposed by [12] (by assuming that the different components are independent). We evaluate the performance for random test data points sampled from a uniform distribution over some operating region of interest. More details can be found in the code of the simulations, which is provided under the link in footnote 2.
From Figure 2, one can see that the method proposed in this paper again outperforms the alternatives in terms of the MSE in all tested setting. Overall, the difference is more pronounced for larger noise levels. Furthermore, we can observe that the extension by [12] performs slightly better compared to the scalar case presented in the previous section. A reason for this observation may be that the extension proposed by [12] allows to learn the regression input noise variance using all outputs, which is not possible for a standard GP.
V CONCLUSION
In this work, we analyzed the impact of regression input noise in case of dynamical systems modeled by Gaussian processes and introduced approaches to account for this noise in case of scalar and multidimensional systems. In several numerical examples, we showed that the consideration of the proposed extension substantially improves the performance compared to the state-of-the-art approaches.
Several topics are left for future research. One could refine the framework by using second order approximations (as also suggested by [12]), which is likely to improve the performance further, although inducing a larger computational complexity. We expect that the method proposed in this paper will be beneficial for designing GP-based controllers and state estimators for nonlinear dynamical systems with improved performance.
References
- [1] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using Gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2019.
- [2] T. Beckers, D. Kulić, and S. Hirche, “Stable Gaussian process based tracking control of Euler–Lagrange systems,” Automatica, vol. 103, pp. 390–397, 2019.
- [3] M. Maiworm, D. Limon, and R. Findeisen, “Online learning-based model predictive control with Gaussian process models and stability guarantees,” International Journal of Robust and Nonlinear Control, vol. 31, no. 18, pp. 8785–8812, 2021.
- [4] J. Ko and D. Fox, “GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models,” Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
- [5] M. Buisson-Fenet, V. Morgenthaler, S. Trimpe, and F. Di Meglio, “Joint state and dynamics estimation with high-gain observers and Gaussian process models,” in 2021 American Control Conference (ACC). IEEE, 2021, pp. 4027–4032.
- [6] T. M. Wolff, V. G. Lopez, and M. A. Müller, “Gaussian process-based nonlinear moving horizon estimation,” arXiv preprint arXiv:2402.04665, 2024.
- [7] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 393–400.
- [8] P. Goldberg, C. Williams, and C. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” Advances in neural information processing systems, vol. 10, 1997.
- [9] A. McHutchon and C. Rasmussen, “Gaussian process training with input noise,” Advances in neural information processing systems, vol. 24, 2011.
- [10] M. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 844–851.
- [11] A. Doerr, C. Daniel, M. Schiegg, N.-T. Duy, S. Schaal, M. Toussaint, and S. Trimpe, “Probabilistic recurrent state-space models,” in International conference on machine learning. PMLR, 2018, pp. 1280–1289.
- [12] A. J. McHutchon, “Nonlinear modelling and control using Gaussian processes,” Ph.D. dissertation, Citeseer, 2015.
- [13] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. Springer, 2006, vol. 1.
- [14] M. P. Deisenroth, Efficient reinforcement learning using Gaussian processes. KIT Scientific Publishing, 2010, vol. 9.
- [15] E. Solak, R. Murray-Smith, W. Leithead, D. Leith, and C. Rasmussen, “Derivative observations in Gaussian process models of dynamic systems,” in Advances in Neural Information Processing Systems, vol. 15. MIT Press, 2002.
- [16] A. Tsoularis and J. Wallace, “Analysis of logistic growth models,” Mathematical Biosciences, vol. 179, no. 1, pp. 21–55, 2002.
- [17] M. A. Alvarez, L. Rosasco, N. D. Lawrence et al., “Kernels for vector-valued functions: A review,” Foundations and Trends® in Machine Learning, vol. 4, no. 3, pp. 195–266, 2012.
- [18] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model predictive control: theory, computation, and design. Nob Hill Publishing Madison, WI, 2017, vol. 2.
- [19] M. Buisson-Fenet, F. Solowjow, and S. Trimpe, “Actively learning Gaussian process dynamics,” in Learning for dynamics and control. PMLR, 2020, pp. 5–15.
- [20] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE transactions on systems, man, and cybernetics, no. 5, pp. 834–846, 1983.