CN110908280B - Optimization control method for trolley-two-stage inverted pendulum system - Google Patents

Optimization control method for trolley-two-stage inverted pendulum system Download PDF

Info

Publication number
CN110908280B
CN110908280B CN201911043225.4A CN201911043225A CN110908280B CN 110908280 B CN110908280 B CN 110908280B CN 201911043225 A CN201911043225 A CN 201911043225A CN 110908280 B CN110908280 B CN 110908280B
Authority
CN
China
Prior art keywords
control
state
inverted pendulum
value
interval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911043225.4A
Other languages
Chinese (zh)
Other versions
CN110908280A (en
Inventor
卢荣华
陈特欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo University
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201911043225.4A priority Critical patent/CN110908280B/en
Publication of CN110908280A publication Critical patent/CN110908280A/en
Application granted granted Critical
Publication of CN110908280B publication Critical patent/CN110908280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The embodiment of the invention discloses a vehicle-two-stage inverted pendulum system optimization control method, which comprises the following steps: s10, setting a Gaussian kernel function, and training values of hyper-parameters of the Gaussian kernel function by taking a current state control pair as input and a state variable as output; obtaining the relation between the distribution of the state control pairs at the ith moment and the state distribution at the (i + 1) th moment by combining probability distribution functions, namely a Gaussian process model; s20, determining an optimal control interval and a better discrete control sequence; s30, after a Gaussian process model and an optimal control interval are obtained, a total objective function value and a gradient value are obtained by combining iterative calculation of the Gaussian process model on the formula Gaussian process model and initial condition variation; and S40, calling an optimization solver based on the gradient, taking the better discrete control sequence obtained by learning as an initial guess of optimization control, and carrying out iterative solution to obtain an optimal control force sequence through calculation of the gradient value and the total objective function value.

Description

Optimization control method for trolley-two-stage inverted pendulum system
Technical Field
The invention relates to the field of control of a trolley-two-stage inverted pendulum system, in particular to an optimal control method of the trolley-two-stage inverted pendulum system.
Background
The vehicle-two-stage inverted pendulum system is a classic rapid, multivariable, nonlinear and unstable system and is a classic control object in the control field. Many control algorithms, including PID, fuzzy PID, robust control, etc., have been implemented in this system. However, the design control is premised on modeling. At present, the control of a trolley-secondary inverted pendulum system is based on a mechanism model, and the control method is a deterministic model deduced through a physical principle. And the model parameters relate to the size of the vehicle, the size of the inverted pendulum, and so on.
With the development of intelligent algorithms, particularly algorithms such as reinforcement learning, control models have slowly shifted from certain mechanistic models to model-less control. The model-free control has certain disadvantages, such as too many learning times, difficult quantitative analysis of control effect, slow design efficiency of the controller and the like.
Disclosure of Invention
In view of the above technical problems, the present invention is directed to providing a vehicle-two-stage inverted pendulum system optimization control method.
In order to solve the technical problems, the invention adopts the following technical scheme:
a vehicle-secondary inverted pendulum system optimization control method is applied to a vehicle-secondary inverted pendulum system comprising a vehicle, a first inverted pendulum and a second inverted pendulum, and comprises the following steps:
s10, determining 6 states of the trolley-secondary inverted pendulum system, and obtaining the change conditions of the 6 states along with time after applying a certain acting force; setting a Gaussian kernel function, and training values of hyper-parameters of the Gaussian kernel function by taking a current state control pair as input and taking state variable quantity as output; obtaining the relation between the distribution of the state control pairs at the ith moment and the state distribution at the (i + 1) th moment, namely a Gaussian process model, through a joint probability distribution function;
s20, setting the state, the time step length and the discrete value of a control variable of the trolley-secondary inverted pendulum system; defining an expected form of a cost function, setting a reward punishment mechanism, and giving an updating rule of a Q value of each time step; the value range of the control quantity is continuously reduced in the early stage of the learning process, and the value range of the control quantity is continuously translated in the later stage of the learning process; when the convergence reaches a smaller value of the generated cost function, determining an optimal control interval and a better discrete control sequence;
s30, after obtaining the Gaussian process model and the optimal control interval, carrying out variation on the formula Gaussian process model and the initial condition to obtain delta mu i And Δ η i The iteration form of (1) is combined with the iterative calculation of a Gaussian process model to obtain a total objective function value and a gradient value;
and S40, calling an optimization solver based on the gradient, taking the better discrete control sequence obtained by learning as an initial guess of optimization control, and carrying out iterative solution to obtain an optimal control force sequence through calculation of the gradient value and the total objective function value.
Preferably, in S10, the 6 states of the vehicle-secondary inverted pendulum system are determined as follows: displacement x of the car 1 Speed x of the trolley 2 Angular velocity x of first inverted pendulum 3 And angle x 5 Angular velocity x of the second inverted pendulum 4 And angle x 6 Let x = [ x ] 1 x 2 ... x 6 ] T T is matrix transposition, the acting force is set to be u (T), the trolley and the two inverted pendulums are in a static state at the initial time (0 moment), namely x 1 (0)=0,x 2 (0)=0,x 3 (0)=0,x 4 (0)=0,x 5 (0)=π,x 6 (0) = pi.wholeThe control time scale is T =1.2 seconds, each control action is 0.02 seconds, and the whole control time requires 60 control sequences, namely discrete time i =0,1.
Preferably, the gaussian modeling process is as follows: the mapping relationship to be learned by the gaussian process is the relationship between the current state control pair xp = { x (0), u (0),. Times, x (59), u (59) } to the state change amount Δ x = { Δ x (0),. Times, Δ x (59) }, where the symbol Δ is the change amount of the state, Δ x (i) = x (i + 1) -x (i),
control pairs xp for a given 1 state, according to Gaussian process modeling * ={x * ,u * A relationship between it and 60 state control pairs xp = { x (0), u (0),. ·, x (59), u (59) } is as follows:
Figure GDA0003895584350000031
in the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000032
representing a Gaussian probability distribution, I being the identity matrix, σ w Is a hyper-parameter related to noise, and K, K are Gaussian kernel functions. When the vector is operated with the vector, the operation is marked as K; when the number and the number or the number and the vector are operated, the operation is marked as k; f is the dynamic model of the unknown vehicle-two-stage inverted pendulum system, and the kernel function generally defines the square exponential covariance function
Figure GDA0003895584350000033
In the formula, y m ,y n May be a fixed number or a matrix; sigma f A hyper-parameter related to variance; w is a weight matrix, only values exist on a diagonal line, the values are all hyper-parameters, and the specific values of the hyper-parameters are obtained through optimization through the input of specific control state pairs and the output of state variable quantities;
f (x (i)) is obtained by combining probability distributions (1), u (u)i) Mean value of)
Figure GDA0003895584350000034
Sum variance
Figure GDA0003895584350000035
Figure GDA0003895584350000037
Figure GDA0003895584350000036
Therefore, according to the distribution of the state control pairs at the ith time, the predicted value of the state distribution at the (i + 1) th time is obtained through accurate calculation:
Figure GDA0003895584350000041
Figure GDA0003895584350000042
in the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000043
the covariance between the state and the corresponding state transition can be obtained by a matching method, so as to obtain a Gaussian process model, namely the distribution of the state control pair at the ith moment, and calculate the distribution of the state at the (i + 1) th moment.
Preferably, S20 specifically includes:
in the control process of the trolley-secondary inverted pendulum system, optimally designing control force to ensure that the secondary inverted pendulum system meets the condition that the angle between the first inverted pendulum and the second inverted pendulum is 2 pi at the moment T, and firstly adopting reinforcement learning to approach a globally optimal control interval as much as possible and obtain a better discrete control sequence;
in reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system meets the Markov decision process (discrete process) of M = < X, U, P, R, gamma >. X, U, P, R, γ are each as defined below:
x is the 6 state vectors X of the vehicle-two-stage inverted pendulum system, i.e.
X=<x(j)>=<μ j ,η j >,j=0,1,…,N, (5)
Where < > denotes a set, j is a time step, N =10,
u is a discrete value (finite number) containing all possible operating action spaces, i.e., applied forces, and a reasonable control discrete value is set as a m Then, then
U={a m },m=1,2,...,M, (6)
Where M is the number of control quantities taken, M =100 each time, P is the state transition probability function, P [ x (j + 1) = x '| x (j) = x, u (j) = a ] = P (x, a, x'),
r is a reward of the state and the action of N at each time step j =0,1.. The objective of the learning control is to make the two-stage inverted pendulum system satisfy the angle of the first inverted pendulum and the second inverted pendulum at time T2 pi, the reward function can be defined by a control cost function, and the control cost function at time j is defined as:
Figure GDA0003895584350000051
in the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000052
z is a weight coefficient of the state and the control respectively, and is given in advance according to the actual situation; x is the number of tg (j) Is the target location;
due to the state quantity x j The control cost function at the jth moment can be obtained by including mean value and variance and calculating expectation at two sides
Figure GDA0003895584350000053
Wherein, L represents an objective function, and when the value far exceeds or is far less than 2 pi, a punishment (a negative value of a reward function) should be given to the optimization control system of the vehicle-secondary inverted pendulum system; when approaching 2 pi, a reward should be given to the optimization control system of the vehicle-secondary inverted pendulum system, and the index C approaching 2 pi and far away from 2 pi j =x(j)-Z j Wherein Z is j Set value for approaching 2 pi trend at each moment
Figure GDA0003895584350000054
In the formula, taking lambda as pi/10;
gamma is the discount coefficient, and the corresponding reward is assumed to be discounted over time, the reward for each time step being based on the reward R of the previous step j And a discount coefficient gamma (gamma is more than or equal to 0 and less than or equal to 1), and taking gamma as 0.85; the jackpot is expressed as
Figure GDA0003895584350000055
Adopting a Q learning algorithm, and observing the current state and taking action on the state of the maximized reward by the optimization control system of the vehicle-secondary inverted pendulum system at each discrete time step; updating the Q value at each time step using rules
Q j+1 (x(j),a j )=Q j (x(j),a j )+∈×[R j+1 +γ(maxQ J (x(j+1),a j+1 )-Q J (x(j).a j ))], (11)
Wherein Q is j (x(j),a j ) Is the state x (j) and action a at time j j The value of Q of (A) is,
Figure GDA0003895584350000061
is described
Figure GDA0003895584350000062
All combination training is explored under a large number of training by using the learning rate of the possibility in the algorithmIn practice, the Q learning algorithm should generate Q values for all state-action combinations, and in each state, the action with the largest Q value is selected as the best action, and further, a control interval adaptive reinforcement learning method is proposed for continuously narrowing the value range of the control amount in the learning process,
argmin IN E[L(x,a)],
IN=[min{a m },max{a m }], (12)
wherein, E [ L (x, a) ] is the sum of the cost (total cost) at each moment, IN represents the interval of adaptively optimizing and selecting the control quantity, a control interval IN a wider range is set to start the training process, and according to the result of the cost function, the control interval IN the action space generating the minimum cost is selected as the new control interval of the next learning, IN the process, the control interval is reduced while keeping the discrete number M of the control unchanged, the control is more and more refined, namely the action number of each control interval is constant, each interval is gradually reduced and the interval is gradually reduced, namely the control interval of the first learning and the control interval of the (L + 1) learning meet the requirement that the control interval of the first learning and the control interval of the (L + 1) learning meet the requirement
Figure GDA0003895584350000063
Figure GDA0003895584350000075
[min{a m },max{a m }](l) Is equal to IN equation (12) after one training time IN the control interval,
when a group of actions converge to a certain control interval, the optimization control system of the trolley-secondary inverted pendulum system starts to continuously test adjacent values through a mobile control interval, namely the control interval learned at the nth time and the control interval learned at the (n + 1) th time meet
Figure GDA0003895584350000073
Figure GDA0003895584350000074
Is the translation parameter(s) of the image,
the process of optimizing the adaptive selection of the control space with the cost function is iterative and once the control interval converges to a time interval that yields a smaller value of the cost function, the control interval becomes the optimal control interval and a better discrete control sequence is determined.
Preferably, in order to obtain an optimal control sequence, the optimal control of the Gaussian process model of the trolley-secondary inverted pendulum system is provided, and firstly, the variation of the formula (4) is obtained
Figure GDA0003895584350000071
Figure GDA0003895584350000072
Where Δ represents a micro variable, and since the initial condition is a fixed value, the variation of the initial condition is all 0, that is, Δ μ 0 =0,Δη 0 =0, due to Δ u i May be any value, which is set to Δ u for simplicity of form i =1, yield Δ μ i And Δ η i Of (2) iterative form
Figure GDA0003895584350000081
Figure GDA0003895584350000082
Expected value of the overall objective function
Figure GDA0003895584350000083
To the ith stepDesired E [ L (x (i), u) i )]Is subjected to variation to obtain
Figure GDA0003895584350000084
Variation of the overall objective function over the time interval, i.e. the gradient of the objective function becomes
Figure GDA0003895584350000085
Wherein, is i And Δ η i Given by equation (16), μ i And u i Given by equation (4).
The beneficial effects of the invention are:
(1) A data-driven Gaussian process model is provided for the trolley-secondary inverted pendulum system, the method is different from the traditional deterministic models such as mechanism modeling, the running state of the trolley-secondary inverted pendulum system is represented by means of the mean value and the variance, and the method is closer to the actual motion process of the system.
(2) Aiming at a trolley-two-stage inverted pendulum system (an uncertainty system), self-adaptive reinforcement learning and optimization control of a control interval are designed, and a learning and control method is popularized to the field of uncertainty systems.
(3) Considering the problem of Learning efficiency of the conventional reinforcement Learning, such as Q Learning, a control interval adaptive reinforcement Learning is proposed, that is, the range of control decision is continuously narrowed, an optimal control interval is adaptively selected, and a better discrete control sequence is determined.
(4) Aiming at the problem that the optimization problem is easy to fall into local optimum, the optimal control interval is determined through the proposed reinforcement learning, the optimal control curve (value) is determined by adopting optimization control by taking the obtained better discrete control sequence as an initial guess, and the global optimum solution is ensured to be searched to the maximum extent.
(5) Aiming at the problem that the traditional reinforcement learning control quantity is limited, after the reinforcement learning decision is made, an optimal control input is finally obtained by adopting an optimal control algorithm with continuous control intervals.
(6) The experimental effect of the invention is given by combining the experiment, the control interval before Q learning is [ -250,250], and the control interval after Q learning is [ -116,93]. After the optimization control, the optimal objective function value is 9338, and an objective function mean and variance graph and an optimal control curve at each moment under the optimal control are provided.
Drawings
FIG. 1 is a simplified diagram of experimental equipment for a two-stage inverted pendulum system of the present invention.
FIG. 2 is a Gaussian process modeling flow chart of the vehicle-two stage inverted pendulum system of the present invention.
FIG. 3 is a flow chart of the control interval adaptive reinforcement learning of the vehicle-two-stage inverted pendulum system of the present invention.
FIG. 4 is an optimized control flow chart of the vehicle-two-stage inverted pendulum system of the present invention.
Fig. 5 is a flow chart of gaussian process modeling, reinforcement learning of adaptive intervals and optimization control of the trolley-secondary inverted pendulum system of the invention.
FIG. 6 is a graph of the mean and variance of the objective function at various times under initial guessing for the vehicle-two stage inverted pendulum system of the present invention.
FIG. 7 is a graph of the mean and variance of the objective function of the vehicle-two stage inverted pendulum system of the present invention at various times under optimal control.
Fig. 8 is an optimal control curve for the trolley-two stage inverted pendulum system of the present invention.
FIG. 9 is a graph showing the variation of the angle-averaged value of the first inverted pendulum of the vehicle-two-stage inverted pendulum system of the present invention.
FIG. 10 is a graph showing the variation of the angle-averaged value of the second inverted pendulum of the vehicle-two-stage inverted pendulum system of the present invention.
Detailed Description
As shown in fig. 1: the invention discloses a simplified diagram of experimental equipment of a trolley-two-stage inverted pendulum system, which sequentially comprises a trolley 1, a 1 st 2 nd vertical pendulum, a second inverted pendulum 3 and the like. The arrow on the right is the force applied to the cart, i.e., the control input to the system. The curved arrow represents the rotation angle of the inverted pendulum.
The optimization control method of the trolley-secondary inverted pendulum system provided by the embodiment of the invention is applied to the trolley-secondary inverted pendulum system shown in figure 1, and specifically comprises the following steps:
s10, determining 6 states of the trolley-secondary inverted pendulum system, and obtaining the change conditions of the 6 states along with time after applying a certain acting force; setting a Gaussian kernel function, and training values of hyper-parameters of the Gaussian kernel function by taking a current state control pair as input and taking state variable quantity as output; obtaining the relation between the distribution of the state control pairs at the ith moment and the state distribution at the (i + 1) th moment, namely a Gaussian process model, through a joint probability distribution function;
s20, setting the state, the time step length and the discrete value of a control variable of the trolley-secondary inverted pendulum system; defining an expected form of a cost function, setting a reward punishment mechanism, and giving an updating rule of a Q value of each time step; the value range of the control quantity is continuously reduced in the early stage of the learning process, and the value range of the control quantity is continuously translated in the later stage of the learning process; when the convergence reaches a smaller value of the generated cost function, determining an optimal control interval and a better discrete control sequence;
s30, after obtaining the Gaussian process model (4) and the optimal control interval, carrying out variation on the formula Gaussian process model and the initial condition to obtain delta mu i And Δ η i In an iterative fashion. And (4) combining iterative calculation of a Gaussian process model to obtain a total objective function value (17) and a gradient value (19).
And S40, calling a gradient-based optimization solver, such as SQP in matlab, taking the better discrete control sequence obtained by learning as an initial guess of optimization control, and iteratively solving to obtain an optimal control force sequence through calculation of a gradient value (19) and a total objective function value (17).
As shown in FIG. 2, the Gaussian process modeling step of the vehicle-two-stage inverted pendulum system of the invention is as follows: the two-stage inverted pendulum system of the trolley comprises 6 states which are respectively the displacement x of the trolley 1 Speed x of the trolley 2 Angular velocity x of first inverted pendulum 3 And angle x 5 Angular velocity x of the second inverted pendulum 4 And angle x 6 . Let x = [ x ] 1 x 2 ... x 6 ] · And is a matrix transpose, and the force is set to u (t). The trolley and the two inverted pendulums are in a static state at the initial time (time 0), namely x 1 (0)=0,x 2 (0)=0,x 3 (0)=0,x 4 (0)=0,x 5 (0)=π,x 6 (0) The overall control timescale was T =1.2 seconds, with each control action being 0.02 seconds. Thus the total control time requires 60 control sequences, i.e. discrete time instants i =0,1, 60, the states and controls corresponding to the time instants are x (i) and u (i).
Process of gaussian process modeling: the dynamic process of the system, which is actually the current time state and current control, is the process of the state/state change at the next time. Therefore, the mapping relationship to be learned by the gaussian process is the relationship between the current state control pair xp = { x (0), u (0),. Times, x (59), u (59) } to the state change amount Δ x = { Δ x (0),. Times, Δ x (59) }, where the symbol Δ is the change amount of the state, and Δ x (i) = x (i + 1) -x (i).
Control pairs xp for a given 1 state, according to Gaussian process modeling * ={x * ,u * A relationship between it and 60 state control pairs xp = { x (0), u (0),.., x (59), u (59) } is as follows
Figure GDA0003895584350000121
In the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000122
representing a Gaussian probability distribution, I being the identity matrix, σ w Is a hyper-parameter related to noise, and K, K are Gaussian kernel functions. When the vector and vector operation is, it is denoted as K. When the number and number or the number and vector are operated on, it is denoted as k. f is the unknown dynamic model of the vehicle-two stage inverted pendulum system. Kernel function generally defines a squared exponential covariance function
Figure GDA0003895584350000123
In the formula, y m ,y n May be a fixed number or a matrix; sigma f A hyper-parameter related to variance; w is the weight matrix, and there are values only on the diagonal, and these values are all hyperparameters. And optimizing to obtain a specific numerical value of the hyper-parameter through the input of the specific control state pair and the output of the state variable quantity.
Obtaining the mean of f (x (i), u (i)) by combining probability distributions (1)
Figure GDA0003895584350000131
Sum variance
Figure GDA0003895584350000132
Figure GDA0003895584350000133
Figure GDA0003895584350000134
Therefore, the predicted value of the state distribution at the (i + 1) th time is obtained by accurate calculation according to the distribution of the state control pairs at the (i) th time
Figure GDA0003895584350000135
Figure GDA0003895584350000136
In the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000137
is the covariance between the state and the corresponding state transition, which can be obtained by a matching method. Therefore, a Gaussian process model, namely the distribution of the state control pairs at the ith moment is obtained, and the distribution of the state at the (i + 1) th moment is calculated.
The self-adaptive reinforcement learning step of the control interval of the trolley-two-stage inverted pendulum system comprises the following steps: in the control process of the trolley-secondary inverted pendulum system, the control force is optimally designed, so that the secondary inverted pendulum system meets the condition that the angle of the first inverted pendulum and the second inverted pendulum is 2 pi at the time T, namely the 1 st standing pendulum and the 2 nd standing pendulum are vertical. Since the traditional optimization is easy to fall into local optimization, reinforcement learning is firstly adopted to get as close as possible to the globally optimal control interval and obtain a better discrete control sequence.
In reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system meets the Markov decision process (discrete process) of M = < X, U, P, R, gamma >. X, U, P, R, γ are each as defined below:
x is the 6 state vectors X of the vehicle-two-stage inverted pendulum system, i.e.
X=<x(j)>=<μ j ,η j >,j=0,1,…,N, (5)
Where < > denotes a set, j is a time step, and considering that setting of the step may seriously affect the learning calculation efficiency, it is different from a time discrete number of 60 parts in the gaussian process, where N =10, i.e., 10 parts in time discrete.
U is a discrete value (finite number) containing all possible operating action spaces, i.e., applied forces, and a reasonable control discrete value is set as a m Then, it is
U={a m },m=1,2,...,M, (6)
In the formula, M is the number of control variables, and M =100 is taken each time. P is a state transition probability function, P [ x (j + 1) = x '| x (j) = x, u (j) = a ] = P (x, a, x').
R is a reward of the state and action of N at each time step j =0,1. The reward function of this patent may be defined by a control cost function. The control cost function at the jth time is defined as:
Figure GDA0003895584350000141
in the formula (I), the compound is shown in the specification,
Figure GDA0003895584350000142
z is a weight coefficient of the state and the control respectively, and is given in advance according to the actual situation; x is the number of tg (j) Is the target location.
Due to the state quantity x j The control cost function at the jth moment can be obtained by including mean value and variance and calculating expectation at two sides
Figure GDA0003895584350000143
Wherein L represents the objective function. When the value is far beyond or far below 2 pi, a penalty (negative value of a reward function) should be given to the optimization control system of the vehicle-secondary inverted pendulum system. When the distance is close to 2 pi, a reward is given to the optimization control system of the vehicle-secondary inverted pendulum system. In view of this, the index C is close to 2 π and far from 2 π j =x(j)-Z j Wherein Z is j Set value for approaching 2 pi trend at each moment
Figure GDA0003895584350000151
In the formula, lambda is pi/10.
γ is the discount coefficient. It is assumed that over time the corresponding rewards are discounted. Thus, the reward for each time step is based on the reward R of the previous step j And a discount coefficient gamma (gamma is more than or equal to 0 and less than or equal to 1), wherein gamma is taken as 0.85. The jackpot is expressed as
Figure GDA0003895584350000152
A Q learning algorithm is employed. At each discrete time step, the vehicle-secondary inverted pendulum system optimization control system observes the current state and takes action on the state that maximizes the reward. Updating the Q value at each time step using rules
Q j+1 (x(j),a j )=Q j (x(j),a j )+∈×[R j+1 +γ(maxQ J (x(j+1),a j+1 )-Q J (x(j).a j ))], (11)
Wherein Q is j (x(j),a j ) Is the state x (j) and action a at time j j The Q value of (1).
Figure GDA0003895584350000153
Is described
Figure GDA0003895584350000154
The learning rate of the likelihood is utilized in the algorithm. Under a large number of training heuristics for all combinations of training, the Q learning algorithm should generate Q values for all state-action combinations. In each state, the action having the largest Q value is selected as the optimum action.
However, the learning of Q is very slow due to the setting of X, U, P, R, and γ, mainly because the discrete range of the applied force is very wide, and the discrete force intervals are very many, which results in the Q list having very high dimension and easily causing dimension disaster. However, if the discrete points of force are set to be small, the learning effect is poor, and a better control strategy is difficult to obtain. The range (control interval) and interval of the control quantity severely limit the efficiency of Q learning. Therefore, a control interval adaptive reinforcement learning method is provided for continuously reducing the value range of the control quantity in the learning process.
argmin IN E[L(x,a)],
IN=[min{a m },max{a m }], (12)
Where E [ L (x, a) ] is the sum of the costs at each time (total cost), and IN represents the interval for adaptively optimally selecting the control amount. To approximate the global optimal solution as closely as possible, a larger range of control intervals is set to start the training process. According to the result of the cost function, the control interval in the motion space that yields the minimum cost is selected as a new control interval for the next learning. In this process, the control section is in the process of narrowing down while keeping the discrete number M of controls constant, and the control becomes finer, that is, the number of actions per control section is constant, each section is gradually reduced, and the interval is gradually reduced. That is, the control interval of the l-th learning and the control interval of the l + 1-th learning satisfy
Figure GDA0003895584350000161
[min{a m },max{a m }](l) It is equal to IN equation (12) after i times of control interval training, which can be seen as a control interval adaptively "shrinking" step.
When a group of actions converge to a certain control interval, the optimization control system of the trolley-secondary inverted pendulum system starts to continuously test adjacent values through a mobile control interval, namely the control interval of the nth learning and the control interval of the (n + 1) th learning meet
Figure GDA0003895584350000162
Figure GDA0003895584350000175
Figure GDA0003895584350000171
Is a translation parameter. This can be seen as an adaptive "panning" step of the control interval.
The process of optimizing the adaptive selection of the control space with a cost function is iterative. And once the control interval converges to a time interval that yields a smaller value of the cost function, the control interval becomes the optimal control interval and a better discrete control sequence is determined.
The optimization control steps of the trolley-two-stage inverted pendulum system of the invention are as follows: aiming at a Gaussian process model, only a limited control decision set can be obtained by utilizing reinforcement learning of an adaptive control interval, the control decision set is related to the discrete degree of specific control quantity, and an optimal control sequence in a continuous control set cannot be obtained. In order to obtain an optimal control sequence, the optimization control of a Gaussian process model of the trolley-secondary inverted pendulum system is provided. First, the formula (4) is subjected to variation to obtain
Figure GDA0003895584350000172
Figure GDA0003895584350000173
Where Δ represents a micro-variable. Since the initial conditions are fixed values, the variation of the initial conditions is all 0, i.e., Δ μ 0 =0,Δη 0 =0. Due to delta u i May be any value, and for simplicity of form, Δ u may be expressed i Is set to delta u i =1, yield Δ μ i And Δ η i Iterative form of (2)
Figure GDA0003895584350000174
Figure GDA0003895584350000181
Expected value of the overall objective function
Figure GDA0003895584350000182
For the expectation of step i [ L (x (i), u) i )]Is subjected to variation to obtain
Figure GDA0003895584350000183
Variation of the overall objective function over the time interval, i.e. the gradient of the objective function becomes
Figure GDA0003895584350000184
Wherein, Δ μ i And Δ η i Given by equation (16), μ i And u i Given by equation (4). The optimization control method of the trolley-two-stage inverted pendulum system is implemented and verified, and the parameter data of the trolley is as follows: the weight of the trolley is 0.5kg, the weight of the 1 st and second inverted pendulums is 0.5kg, the length is 0.1m, and the friction coefficient of the trolley and the ground is 0.1. State weight coefficient
Figure GDA0003895584350000185
Is [ 01 11 15 ] or] · . The weight coefficient Z is controlled to be 0.01. The control interval before Q learning is [ -250,250 [ -250 [ ]]And the control interval after Q learning is [ -116,93 [ -116 [ ]]. After the optimization control, the optimal objective function value is 9338. FIG. 6 is a graph of the mean and variance of the objective function at various times under initial guessing for the Cart-secondary inverted pendulum system of the present invention. FIG. 7 is a graph of the mean and variance of the objective function at various times under optimal control by the system. Fig. 8 is an optimal control curve of the system. Fig. 9 is a graph showing the change of the angle average value of the first inverted pendulum of the system. FIG. 10 is a graph showing the variation of the average angle value of the second inverted pendulum of the system.

Claims (4)

1. A vehicle-secondary inverted pendulum system optimization control method is applied to a vehicle-secondary inverted pendulum system comprising a vehicle, a first inverted pendulum and a second inverted pendulum, and is characterized by comprising the following steps:
s10, determining 6 states of the trolley-secondary inverted pendulum system, and obtaining the change conditions of the 6 states along with time after applying a certain acting force; setting a Gaussian kernel function, and training values of hyper-parameters of the Gaussian kernel function by taking a current state control pair as input and taking state variable quantity as output; obtaining the relation between the distribution of the state control pairs at the ith moment and the state distribution at the (i + 1) th moment by combining probability distribution functions, namely a Gaussian process model;
s20, setting the state, the time step length and the discrete value of a control variable of the trolley-secondary inverted pendulum system; defining an expected form of a cost function, setting a reward punishment mechanism, and giving an updating rule of a Q value of each time step; the value range of the control quantity is continuously reduced in the early stage of the learning process, and the value range of the control quantity is continuously translated in the later stage of the learning process; when the convergence reaches a smaller value of the generated cost function, determining an optimal control interval and a better discrete control sequence;
s30, after obtaining the Gaussian process model and the optimal control interval, carrying out variation on the formula Gaussian process model and the initial condition to obtain delta mu i And Δ η i The iteration form of (1) is combined with the iterative calculation of a Gaussian process model to obtain a total objective function value and a gradient value;
s40, calling an optimization solver based on gradient, taking the better discrete control sequence obtained by learning as an initial guess of optimization control, and obtaining an optimal control force sequence by iterative solution through calculation of gradient values and total objective function values;
in the control process of the trolley-secondary inverted pendulum system, optimally designing control force to ensure that the secondary inverted pendulum system meets the requirement that the angle of a first inverted pendulum and a second inverted pendulum is 2 pi at the moment T, and firstly adopting reinforcement learning to approach a globally optimal control interval as much as possible and obtain a better discrete control sequence;
in reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system meets the decision process that M = < X, U, P, R, gamma > Markov, and the definitions of X, U, P, R, gamma are as follows:
x is the 6 state vectors X of the vehicle-two-stage inverted pendulum system, i.e.
X=<x(j)>=<μ jj >,j=0,1,...,N, (5)
Where < > denotes the set, j is the time step, N =10,
u is a discrete value containing all possible operating action spaces, i.e. applied force, and a reasonable control discrete value is set as a m Then, then
U={a m },m=1,2,...,M, (6)
Where M is the number of control variables, M =100, P is the state transition probability function, P [ x (j + 1) = x '| x (j) = x, u (j) = a ] = P (x, a, x'),
r is a reward of the state and the action of N at each time step j =0,1.. The objective of the learning control is to make the two-stage inverted pendulum system satisfy the angle of the first inverted pendulum and the second inverted pendulum at time T2 pi, the reward function can be defined by a control cost function, and the control cost function at time j is defined as:
Figure FDA0003887320220000021
in the formula (I), the compound is shown in the specification,
Figure FDA0003887320220000022
and R' are weight coefficients of state and control respectively, and are given in advance according to actual conditions; x is the number of tg (j) Is the target location;
due to the state quantity x j The control cost function at the jth moment can be obtained by including mean value and variance and calculating expectation at two sides
Figure FDA0003887320220000031
Wherein, L represents an objective function, and when the L far exceeds or is far less than 2 pi, a punishment is given to a vehicle-secondary inverted pendulum system optimization control system; when approaching 2 pi, a reward should be given to the optimization control system of the vehicle-secondary inverted pendulum system, and the index C approaching 2 pi and far away from 2 pi j =x(j)-Z j Wherein Z is j Set value for approaching 2 pi trend at each moment
Figure FDA0003887320220000032
In the formula, taking lambda as pi/10;
gamma is the discount coefficient, assuming that over timeThe corresponding reward is discounted, the reward for each time step being based on the reward R of the previous step j And the discount coefficient gamma is more than or equal to gamma and less than or equal to 1, and the gamma is 0.85; the jackpot is expressed as
Figure FDA0003887320220000033
Adopting a Q learning algorithm, and observing the current state and taking action on the state of the maximized reward by the optimization control system of the vehicle-secondary inverted pendulum system at each discrete time step; updating the Q value at each time step using rules
Q j+1 (x(j),a j )=Q j (x(j),a j )+∈×[R j+1 +γ(maxQ j (x(j+1),a j+1 )-Q j (x(j).a j ))],
(11)
Wherein Q is j (x(j),a j ) Is the state x (j) and action a at time j j The method comprises the following steps that (1) the Q value belongs to a learning rate for describing the utilization possibility in an E-green algorithm, all combination training is searched under a large amount of training, the E belongs to a learning rate for describing the utilization possibility in the E-green algorithm, all combination training is searched under the large amount of training, the Q learning algorithm generates Q values for all state-action combinations, and the action with the maximum Q value is selected as the best action in each state;
s20 specifically comprises the following steps:
the Q learning algorithm should generate Q values for all the state-action combinations, and in each state, the action with the maximum Q value is selected as the best action, and further provides a control interval self-adaptive reinforcement learning method for continuously reducing the value range of the control quantity in the learning process,
argmin IN E[L(x,a)],
IN=[min{a m },max{a m }], (12)
wherein E [ L (x, a) ] is the sum of the cost at each moment, IN represents the interval for adaptively optimizing and selecting the control quantity, a larger range of control intervals is set to start the training process, and the control interval IN the action space generating the minimum cost is selected as a new control interval for next learning according to the result of the cost function, IN the process, the control interval is reduced while keeping the discrete number M of control unchanged, the control is more and more refined, namely the action number of each control interval is constant, each interval is gradually reduced and gradually reduced, namely the control interval of the first learning and the control interval of the (L + 1) th learning meet the requirement that the control interval of the first learning and the control interval of the (L + 1) th learning meet the requirement of the training process
Figure FDA0003887320220000041
[min{a m },max{a m }](l) Is equal to IN equation (12) after one training of the control interval,
when a group of actions converge to a certain control interval, the optimization control method of the trolley-two-stage inverted pendulum system starts to continuously test adjacent values through a mobile control interval, namely the control interval learned for the nth time and the control interval learned for the (n + 1) th time meet
Figure FDA0003887320220000051
Figure FDA0003887320220000052
Is the translation parameter(s) of the image,
the process of optimizing the adaptive selection of the control space with the cost function is iterative and once the control interval converges to a time interval that yields a smaller value of the cost function, the control interval becomes the optimal control interval and a better discrete control sequence is determined.
2. The optimization control method of the vehicle-secondary inverted pendulum system according to claim 1, wherein in S10, it is determined that the 6 states of the vehicle-secondary inverted pendulum system are respectively: displacement x of the car 1 Speed x of the trolley 2 Angular velocity x of first inverted pendulum 3 Angle of harmonyDegree x 5 Angular velocity x of the second inverted pendulum 4 And angle x 6 Let us order
Figure FDA0003887320220000053
Figure FDA0003887320220000054
Is matrix transposing, and the acting force is set as u (t), the trolley and the two inverted pendulums are in static state at the initial moment, namely x 1 (0)=0,x 2 (0)=0,x 3 (0)=0,x 4 (0)=0,x 5 (0)=π,x 6 (0) And = pi, the whole control time scale is T =1.2 seconds, each control action is 0.02 seconds, the whole control time needs 60 control sequences, namely discrete time i =0,1,. Once.., 60, and the states and controls corresponding to the time are x (i) and u (i).
3. The vehicle-two-stage inverted pendulum system optimization control method of claim 2, wherein the gaussian modeling process is as follows:
the mapping relationship to be learned by the gaussian process is the relationship between the current state control pair xp = { x (0), u (0),.., x (59), u (59) } to the state change amount Δ x = { Δ x (0),.., Δ x (59) }, where the symbol Δ is the change amount of the state, Δ x (i) = x (i + 1) -x (i),
control pairs xp for a given 1 state, according to Gaussian process modeling * ={x * ,u * A relationship between it and 60 state control pairs xp = { x (0), u (0),. ·, x (59), u (59) } is as follows:
Figure FDA0003887320220000061
in the formula, GP represents a Gaussian probability distribution, I is an identity matrix, sigma w Is a hyper-parameter related to noise, K, K is a Gaussian kernel function, and is marked as K when the vector is operated with the vector; when the number and the number or the number and the vector are operated, the operation is marked as k; f is the unknown dynamic model of the vehicle-two-stage inverted pendulum system, and the kernel function is generally fixedMean square exponential covariance function
Figure FDA0003887320220000062
In the formula, y m ,y n Is a fixed value or matrix; sigma f A hyper-parameter related to variance; w is a weight matrix, only values exist on a diagonal line, the values are all hyper-parameters, and the specific values of the hyper-parameters are obtained through optimization through the input of specific control state pairs and the output of state variable quantities;
obtaining the mean of f (x (i), u (i)) by combining probability distributions (1)
Figure FDA0003887320220000063
Sum variance
Figure FDA0003887320220000064
Figure FDA0003887320220000065
Figure FDA0003887320220000066
Therefore, according to the distribution of the state control pair at the ith time, a predicted value of the state distribution at the (i + 1) th time is obtained through accurate calculation:
Figure FDA0003887320220000067
Figure FDA0003887320220000068
in the formula (I), the compound is shown in the specification,
Figure FDA0003887320220000071
the covariance between the states and the corresponding state transitions is obtained by a matching method, thereby obtaining a gaussian process model, i.e. the distribution of the state control pairs at the ith moment, and calculating the distribution of the state at the (i + 1) th moment.
4. The optimization control method of the vehicle-secondary inverted pendulum system as claimed in claim 3, wherein in order to obtain the optimal control sequence, the optimization control of the Gaussian process model of the vehicle-secondary inverted pendulum system is proposed, and first, the formula (4) is subjected to variation to obtain
Figure FDA0003887320220000072
Figure FDA0003887320220000073
Where Δ represents a micro variable, and since the initial condition is a fixed value, the variables of the initial condition are all 0, that is, Δ μ 0 =0,Δη 0 =0, due to Δ u i Is an arbitrary number, which is set to Δ u for simplicity of form i =1, yield Δ μ i And Δ η i Of (2) iterative form
Figure FDA0003887320220000074
Figure FDA0003887320220000075
Expected value of the overall objective function
Figure FDA0003887320220000081
Expectation for step i [ L (x (i), u) i )]Is subjected to variation to obtain
Figure FDA0003887320220000082
Variation of the overall objective function over the time interval, i.e. the gradient of the objective function becomes
Figure FDA0003887320220000083
Wherein, Δ μ i And Δ η i Given by equation (16), μ i And η i Given by equation (4).
CN201911043225.4A 2019-10-30 2019-10-30 Optimization control method for trolley-two-stage inverted pendulum system Active CN110908280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911043225.4A CN110908280B (en) 2019-10-30 2019-10-30 Optimization control method for trolley-two-stage inverted pendulum system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911043225.4A CN110908280B (en) 2019-10-30 2019-10-30 Optimization control method for trolley-two-stage inverted pendulum system

Publications (2)

Publication Number Publication Date
CN110908280A CN110908280A (en) 2020-03-24
CN110908280B true CN110908280B (en) 2023-01-03

Family

ID=69814671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911043225.4A Active CN110908280B (en) 2019-10-30 2019-10-30 Optimization control method for trolley-two-stage inverted pendulum system

Country Status (1)

Country Link
CN (1) CN110908280B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111580392B (en) * 2020-07-14 2021-06-15 江南大学 Finite frequency range robust iterative learning control method of series inverted pendulum

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105549384A (en) * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural network and reinforced learning
CN110134011A (en) * 2019-04-23 2019-08-16 浙江工业大学 A kind of inverted pendulum adaptive iteration study back stepping control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011068222A (en) * 2009-09-24 2011-04-07 Honda Motor Co Ltd Control device of inverted pendulum type vehicle

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105549384A (en) * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural network and reinforced learning
CN110134011A (en) * 2019-04-23 2019-08-16 浙江工业大学 A kind of inverted pendulum adaptive iteration study back stepping control method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
直线二级倒立摆系统的H_∞鲁棒最优控制;王春平等;《机电工程》;20170520(第05期);全文 *

Also Published As

Publication number Publication date
CN110908280A (en) 2020-03-24

Similar Documents

Publication Publication Date Title
Li et al. A policy search method for temporal logic specified reinforcement learning tasks
Chiou et al. A PSO-based adaptive fuzzy PID-controllers
Juang Combination of online clustering and Q-value based GA for reinforcement fuzzy system design
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
CN111241952A (en) Reinforced learning reward self-learning method in discrete manufacturing scene
CN108520155B (en) Vehicle behavior simulation method based on neural network
CN101566829A (en) Method for computer-aided open loop and/or closed loop control of a technical system
CN110442129A (en) A kind of control method and system that multiple agent is formed into columns
CN114384931B (en) Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN116892866B (en) Rocket sublevel recovery track planning method, rocket sublevel recovery track planning equipment and storage medium
CN110908280B (en) Optimization control method for trolley-two-stage inverted pendulum system
Pazis et al. Binary action search for learning continuous-action control policies
CN116587275A (en) Mechanical arm intelligent impedance control method and system based on deep reinforcement learning
CN111930010A (en) LSTM network-based general MFA controller design method
CN115765050A (en) Power system safety correction control method, system, equipment and storage medium
CN114330119A (en) Deep learning-based pumped storage unit adjusting system identification method
Ahmad et al. Enhancing time series forecasting with an optimized binary gravitational search algorithm for echo state networks
CN111240201B (en) Disturbance suppression control method
Yang et al. Continuous control for searching and planning with a learned model
CN117818643A (en) Speed and acceleration prediction-based man-vehicle collaborative driving method
CN117787384A (en) Reinforced learning model training method for unmanned aerial vehicle air combat decision
Jung et al. Learning robocup-keepaway with kernels
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN110450164A (en) Robot control method, device, robot and storage medium
Li et al. Morphing Strategy Design for UAV based on Prioritized Sweeping Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant