CN108803321B - Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
- Publication number
- CN108803321B CN108803321B CN201810535773.8A CN201810535773A CN108803321B CN 108803321 B CN108803321 B CN 108803321B CN 201810535773 A CN201810535773 A CN 201810535773A CN 108803321 B CN108803321 B CN 108803321B
- Authority
- CN
- China
- Prior art keywords
- auv
- network
- strategy
- evaluation
- tracking control
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention provides an autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, belonging to the fields of deep reinforcement learning and intelligent control. First, the AUV trajectory tracking control problem is defined; a Markov decision process model of the AUV trajectory tracking problem is then established; next, a hybrid policy-evaluation network consisting of several policy networks and several evaluation networks is constructed; finally, the target policy for AUV trajectory tracking control is solved with the constructed hybrid policy-evaluation network. For the multiple evaluation networks, the performance of each evaluation network is measured by a defined expected Bellman absolute error, and only the worst-performing evaluation network is updated at each time step; for the multiple policy networks, one policy network is selected at random at each time step and updated with a deterministic policy gradient, and the finally learned policy is the mean of all policy networks. The method is not easily affected by poor historical AUV tracking trajectories and achieves high accuracy.
Description
Technical Field
The invention belongs to the field of deep reinforcement learning and intelligent control, and relates to an Autonomous Underwater Vehicle (AUV) trajectory tracking control method based on deep reinforcement learning.
Background
The development of deep-sea science depends heavily on deep-sea detection technology and equipment. Because of the complex deep-sea environment and its extreme conditions, deep-sea work-class autonomous underwater vehicles are currently the main means of replacing or assisting humans in detecting, observing, and sampling the deep sea. For task scenarios that humans cannot reach for on-site operation, such as marine resource exploration, seabed survey, and ocean surveying and mapping, autonomous and controllable underwater motion of the AUV is the most basic and important functional requirement and the prerequisite for carrying out all kinds of complex operation tasks. However, many offshore applications of AUVs (e.g., trajectory tracking control and target tracking control) are extremely challenging, mainly because of three characteristics of the AUV system. First, the AUV is a multi-input multi-output system, and its dynamics and kinematics model (hereinafter referred to as the model) is complex, exhibiting high nonlinearity, strong coupling, input or state constraints, and time-varying behavior. Second, uncertainty in the model parameters or the hydrodynamic environment makes the AUV system difficult to model. Third, most current AUVs are under-actuated systems, i.e., the number of degrees of freedom is greater than the number of independent actuators (each independent actuator corresponds to one degree of freedom). In general, the model and parameters of an AUV are determined by combining mathematical and physical derivation, numerical simulation, and physical experiments, and the uncertain parts of the model are characterized as well as possible. The complexity of the model also makes the control problem of the AUV very complicated. Moreover, as the application scenarios of AUVs keep expanding, higher requirements are placed on the accuracy and stability of AUV motion control, and improving the control performance of the AUV in various motion scenarios has become an important research direction.
Over the past decades, researchers have designed various AUV motion control methods and verified their effectiveness for different application scenarios such as trajectory tracking, waypoint tracking, path planning, and formation control. A representative example is the model-based output feedback control method proposed by Refsnes et al., which adopts two decoupled system models: a three-degree-of-freedom current-induced hull model describing the ocean current load, and a five-degree-of-freedom model describing the system dynamics. In addition, Healey et al. designed a state-feedback tracking control method that assumes a fixed forward speed, linearizes the system model, and uses three decoupled models: a speed (surge) model, a horizontal steering model (sway and yaw), and a vertical diving model (heave and pitch). However, all of these methods decouple or linearize the system model, so it is difficult for them to meet the requirement of high-precision AUV control in specific application scenarios.
Because of the limitations of classical motion control methods and the strong self-learning ability of reinforcement learning, researchers have in recent years shown great interest in intelligent control methods represented by reinforcement learning. Various intelligent control methods based on reinforcement learning techniques (such as Q-learning, direct policy search, policy-evaluation networks, and adaptive reinforcement learning) have been proposed and successfully applied to different complex application scenarios, such as robot motion control, unmanned aerial vehicle flight control, hypersonic aircraft tracking control, and traffic signal control. The core idea of reinforcement-learning-based control is to optimize the performance of the control system without prior knowledge. For AUV systems, many researchers have designed reinforcement-learning-based control methods and verified their feasibility in practice. For autonomous underwater cable tracking control, El-Fakdi et al. adopted a direct policy search technique to learn a state/action mapping, but the method is only applicable when both the state and action spaces are discrete; for continuous action spaces, Paula et al. approximated the policy function with a radial basis function network, yet the weak approximation capability of the radial basis function network prevents this controller from guaranteeing high tracking accuracy.
In recent years, with the development of deep neural network (DNN) training techniques such as batch learning, experience replay, and batch normalization, deep reinforcement learning has shown excellent performance in complex tasks such as robot motion control, autonomous ground vehicle control, quadrotor control, and autonomous driving. In particular, the recently proposed Deep Q Network (DQN) achieves human-level control in a number of very challenging tasks. However, DQN cannot handle problems that have both a high-dimensional state space and a continuous action space. Building on DQN, the deep deterministic policy gradient (DDPG) algorithm was further proposed and achieves continuous control. However, DDPG estimates the target value of the evaluation network with a separate target evaluation network, so the evaluation network cannot effectively evaluate the policy learned by the policy network and the learned action value function has a large variance; as a result, when DDPG is applied to the AUV trajectory tracking control problem, the requirements of high tracking accuracy and stable learning cannot be satisfied.
Disclosure of Invention
The invention aims to provide an AUV trajectory tracking control method based on deep reinforcement learning that adopts a hybrid policy-evaluation network structure and trains the evaluation networks and policy networks with multiple quasi-Q learning and deterministic policy gradients, respectively. The method overcomes the problems of existing reinforcement-learning-based methods, such as low control accuracy, inability to realize continuous control, and unstable learning, and achieves high-precision AUV trajectory tracking control with stable learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning comprises the following steps:
1) Defining the AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem includes four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control target. The specific steps are as follows:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step; the values of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) Determining the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the k-th time step, and ψ_k is the angle between the heading direction of the AUV and the X axis at the k-th time step;
1-3) Defining the tracking control error
A reference trajectory d_k is selected according to the travel path of the AUV, and the AUV trajectory tracking control error at the k-th time step is defined as:
e_k = η_k − d_k
1-4) Establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) of the following form is selected, where γ is a discount factor and H is a weight matrix;
The AUV trajectory tracking control target is to find an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimal:
τ* = argmin_τ P_0(τ)
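For orientation only, the Python sketch below illustrates this control objective under the assumption of a discounted quadratic tracking cost; the patent's exact objective expression is not reproduced in the text above, so the quadratic form and the function name are illustrative assumptions.

```python
import numpy as np

def tracking_cost(errors, H, gamma):
    """Illustrative discounted tracking cost P_0: sum over k of gamma^k * e_k^T H e_k.

    errors : sequence of tracking errors e_k = eta_k - d_k
    H      : weight matrix of matching dimension (assumed quadratic weighting)
    gamma  : discount factor in (0, 1)
    """
    cost = 0.0
    for k, e in enumerate(np.asarray(errors, dtype=float)):
        cost += (gamma ** k) * float(e @ H @ e)
    return cost

# The control target tau* = argmin_tau P_0(tau) is the input sequence that
# minimizes this cost along the reference trajectory d_k.
```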
2) Establishing a Markov decision process model of the AUV trajectory tracking problem
Markov decision process modeling is performed on the AUV trajectory tracking problem of step 1), with the following specific steps:
2-1) Defining the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, and χ_k is the heading angular velocity of the AUV at the k-th time step;
From the AUV system output vector η_k determined in step 1-2) and the reference trajectory d_k defined in step 1-3), the state vector s_k of the k-th time step is defined as follows:
2-2) Defining the action vector
Define the action vector of the k-th time step as the AUV system input vector of that time step, i.e., a_k = τ_k;
2-3) Defining the reward function
The reward function of the k-th time step characterizes the immediate return of taking action a_k in state s_k. Based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
2-4) Converting the AUV trajectory tracking control target τ* established in step 1-4) into a trajectory tracking control target under the reinforcement learning framework
Define a policy π as the probability of selecting each possible action in a given state; the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k, π ]
where E[·] denotes the expectation over the reward function, states, and actions, and K is the maximum time step;
The action value function describes the expected cumulative discounted reward obtained when policy π is followed in the current and all subsequent states. Under the reinforcement learning framework, the AUV trajectory tracking control target is therefore to learn an optimal target policy π* through interaction with the environment of the AUV, such that the action value at the initial time is maximal:
π* = argmax_π E_{s_0∼p(s_0), a_0}[ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The AUV trajectory tracking control target τ* established in step 1-4) is thereby converted into solving for π*;
2-5) Simplifying the AUV trajectory tracking control target under the reinforcement learning framework
The action value function in step 2-4) is solved by iterating the following Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ]
Assuming that the policy π is deterministic, i.e., a fixed mapping, denoted μ, from the state vector space of the AUV to its action vector space, the iterative Bellman equation simplifies to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
For the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0∼p(s_0)}[ Q^μ(s_0, μ(s_0)) ]
3) Constructing the hybrid policy-evaluation network
A hybrid policy-evaluation network is constructed to estimate the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k) separately. The construction of the hybrid policy-evaluation network comprises three parts: constructing the policy networks, constructing the evaluation networks, and determining the target policy. The specific steps are as follows:
3-1) Constructing the policy networks
In the hybrid policy-evaluation network structure, n policy networks μ(s_k | θ_p) are constructed to estimate the deterministic optimal target policy μ*, where θ_p is the weight parameter of the p-th policy network, p = 1, …, n. Each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer; the input of each policy network is the state vector s_k, and the output is the action vector a_k;
3-2) Constructing the evaluation networks
In the hybrid policy-evaluation network structure, m evaluation networks Q(s_k, a_k | w_q) are constructed to estimate the optimal action value function Q*(s_k, a_k), where w_q is the weight parameter of the q-th evaluation network, q = 1, …, m. Each evaluation network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer; the inputs of each evaluation network are the state vector s_k and the action vector a_k, with the state vector s_k fed in at the input layer and the action vector a_k fed in at the first hidden layer, and the output of each evaluation network is the action value of taking action a_k in state s_k;
3-3) Determining the target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) learned for AUV trajectory tracking control at the k-th time step is defined as the mean of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k | θ_p)
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control, with the following specific steps:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K per iteration, the training batch size N drawn by experience replay, the learning rate α_ω of each evaluation network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function;
4-2) Initializing the hybrid policy-evaluation network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks μ(s_k | θ_p) and the m evaluation networks Q(s_k, a_k | w_q); randomly select the d-th policy network μ(s_k | θ_d), d = 1, …, n, from the n policy networks;
Establish an experience replay buffer R, set its maximum capacity to B, and initialize it to be empty;
4-3) Start the iteration to train the hybrid policy-evaluation network, and initialize the iteration count episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate exploration noise Noise_k;
4-5) From the n current policy networks μ(s_k | θ_p) and the exploration noise Noise_k, determine the action vector a_k of the current time step as:
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes the new state s_{k+1}; record e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample. If the number of samples in the experience replay buffer R has reached the maximum capacity B, delete the earliest stored sample and then store the experience sample e_k into R; otherwise, store the experience sample e_k into R directly;
Select A experience samples from the experience replay buffer R as follows: when the number of samples in R does not exceed N, select all experience samples in R; when the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each evaluation network, which characterizes the performance of that evaluation network, according to the following formula:
Select the evaluation network with the worst performance; its index, denoted c, is obtained as:
c = argmax_q EBAE_q
4-8) Using the c-th evaluation network Q(s_k, a_k | w_c), obtain the action vector of each experience sample at the next time step through the following greedy strategy:
4-9) Compute the target value y_l of the c-th evaluation network by the multiple quasi-Q learning method, with the following formula:
4-10) Compute the loss function L(w_c) of the c-th evaluation network:
L(w_c) = (1/A) Σ_l ( y_l − Q(s_l, a_l | w_c) )^2
4-11) Update the weight parameters of the c-th evaluation network by descending the gradient of the loss function L(w_c) with respect to w_c:
w_c ← w_c − α_ω ∇_{w_c} L(w_c)
The weight parameters of the remaining evaluation networks are kept unchanged;
4-12) Randomly select one of the n policy networks to reset the d-th policy network μ(s_k | θ_d);
4-13) According to the updated c-th evaluation network, compute the deterministic policy gradient of the d-th policy network μ(s_k | θ_d) and update its weight parameter θ_d accordingly:
∇_{θ_d} J ≈ (1/A) Σ_l ∇_a Q(s_l, a | w_c)|_{a=μ(s_l|θ_d)} ∇_{θ_d} μ(s_l | θ_d),  θ_d ← θ_d + α_θ ∇_{θ_d} J
The weight parameters of the remaining policy networks are kept unchanged;
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues tracking the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV starts the next iteration; otherwise, go to step 4-16);
4-16) End the iteration and terminate the training of the hybrid policy-evaluation network. The output values of the n policy networks at the end of the iteration give, through the formula in step 3-3), the final target policy μ_f(s_k) of AUV trajectory tracking control, and this target policy realizes the trajectory tracking control of the AUV.
The characteristics and beneficial effects of the invention are as follows:
The proposed method adopts multiple policy networks and multiple evaluation networks. For the evaluation networks, the performance of each one is measured by a defined expected Bellman absolute error, and only the worst-performing evaluation network is updated at each time step; unlike existing reinforcement-learning-based control methods, the invention proposes a multiple quasi-Q learning method to compute a more accurate target value for the evaluation network. For the policy networks, one policy network is selected at random at each time step and updated with a deterministic policy gradient. The finally learned policy is the mean of all policy networks.
1) The AUV trajectory tracking control method provided by the invention does not depend on a model; the target policy that optimizes the control objective is learned automatically from data sampled while the AUV is operating, and no assumption about the AUV model is needed in this process. The method is therefore particularly suitable for AUVs working in complex deep-sea environments and has high practical value.
2) The method uses multiple quasi-Q learning to obtain evaluation network target values that are more accurate than those of existing methods, which both reduces the variance of the action value function approximated by the evaluation networks and alleviates over-estimation of the action value function, thereby yielding a better target policy and realizing high-precision AUV trajectory tracking control.
3) The method determines which evaluation network should be updated at each time step based on the expected Bellman absolute error, and the updating rule can weaken the influence of poor evaluation networks, thereby ensuring the rapid convergence of the learning process.
4) Because the method adopts multiple evaluation networks, the learning process is not easily affected by poor historical AUV tracking trajectories, giving good robustness and a stable learning process.
5) The method combines reinforcement learning with deep neural networks, has strong self-learning capability, can realize high-precision adaptive control of the AUV in uncertain deep-sea environments, and has good application prospects in scenarios such as AUV trajectory tracking and underwater obstacle avoidance.
Drawings
FIG. 1 is a graph comparing the performance of the proposed method of the present invention with the existing DDPG method; wherein, the graph (a) is a comparison graph of learning curves, and the graph (b) is a comparison graph of AUV trajectory tracking effect.
FIG. 2 is a graph comparing the performance of the proposed method of the present invention with a neural network PID method; wherein, the graph (a) is a comparison graph of the coordinate track tracking effect of the AUV along the X, Y direction, and the graph (b) is a comparison graph of the tracking error of the AUV in the X, Y direction.
Detailed Description
The invention provides an autonomous underwater vehicle track tracking control method based on deep reinforcement learning, which is further described in detail below by combining the accompanying drawings and specific embodiments.
The invention provides an autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, which mainly comprises four parts: defining the AUV trajectory tracking control problem, establishing a Markov decision process model of the AUV trajectory tracking problem, constructing the hybrid policy-evaluation network structure, and solving the target policy of AUV trajectory tracking control.
1) Defining AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem includes four components: determining AUV system input, determining AUV system output, defining a trajectory tracking control error and establishing an AUV trajectory tracking control target; the method comprises the following specific steps:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and rudder angle of the AUV, respectively, and the subscript k denotes the value at the k-th time step, i.e., at time k·t, where t is the time step length (the same applies below). The values of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, which are determined by the model of propeller adopted by the AUV.
1-2) Determining the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the k-th time step, and ψ_k is the angle between the heading direction of the AUV and the X axis at the k-th time step.
1-3) Defining the tracking control error
A reference trajectory d_k is selected according to the travel path of the AUV, and the AUV trajectory tracking control error at the k-th time step is defined as:
e_k = η_k − d_k
1-4) Establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) of the following form is selected, where γ is a discount factor and H is a weight matrix;
The AUV trajectory tracking control target is to find an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimal:
τ* = argmin_τ P_0(τ)
2) Establishing a Markov decision process model of the AUV trajectory tracking problem
The Markov decision process (MDP) is the foundation of reinforcement learning theory, so MDP modeling is required for the AUV trajectory tracking problem of step 1). The main elements of reinforcement learning are the agent, the environment, the state, the action, and the reward function; the agent learns an optimal action (or control input) sequence through interaction with the environment of the AUV so as to maximize the accumulated reward (or minimize the accumulated tracking control error), thereby solving the AUV trajectory tracking objective. The specific steps are as follows:
2-1) Defining the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, and χ_k is the heading angular velocity of the AUV at the k-th time step.
From the AUV system output vector η_k determined in step 1-2) and the reference trajectory d_k defined in step 1-3), the state vector s_k of the k-th time step is defined as follows:
2-2) Defining the action vector
Define the action vector of the k-th time step as the AUV system input vector of that time step, i.e., a_k = τ_k.
2-3) Defining the reward function
The reward function of the k-th time step characterizes the immediate return of taking action a_k in state s_k. Based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
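The exact reward expression is not reproduced in the text above. As a hedged illustration only, the sketch below assumes a negative quadratic penalty on the tracking error with an optional control-effort term; the function name, the weighting by H, and the coefficient rho are illustrative assumptions, not the patent's formula.

```python
import numpy as np

def reward(e_k, a_k, H, rho=0.0):
    """Illustrative reward r_{k+1}: larger (less negative) when the tracking error
    e_k is small; rho optionally penalizes large control actions a_k."""
    e_k = np.asarray(e_k, dtype=float)
    a_k = np.asarray(a_k, dtype=float)
    return -float(e_k @ H @ e_k) - rho * float(a_k @ a_k)
```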
2-4) Converting the AUV trajectory tracking control target τ* established in step 1-4) into a trajectory tracking control target under the reinforcement learning framework
Define a policy π as the probability of selecting each possible action in a given state; the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k, π ]
where E[·] denotes the expectation over the reward function, states, and actions (the same below), and K is the maximum time step;
The action value function describes the expected cumulative discounted reward obtained when policy π is followed in the current and all subsequent states. Under the reinforcement learning framework, the AUV trajectory tracking control target (i.e., the target of the agent) is therefore to learn an optimal target policy π* through interaction with the environment of the AUV, such that the action value at the initial time is maximal:
π* = argmax_π E_{s_0∼p(s_0), a_0}[ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector.
Thus, the AUV trajectory tracking control target τ* established in step 1-4) can be converted into solving for π*.
2-5) Simplifying the AUV trajectory tracking control target under the reinforcement learning framework
Similar to dynamic programming, many reinforcement learning methods solve the action value function in step 2-4) with the following iterative Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ]
Assuming that the policy π is deterministic, i.e., a fixed mapping, denoted μ, from the state vector space of the AUV to its action vector space, the above iterative Bellman equation can be simplified to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
Furthermore, for the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0∼p(s_0)}[ Q^μ(s_0, μ(s_0)) ]
3) Constructing the hybrid policy-evaluation network
From step 2-5), the core of solving the AUV trajectory tracking problem with reinforcement learning is solving for the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k). The method estimates μ* and Q*(s_k, a_k) with a hybrid policy-evaluation network, whose construction comprises three parts: constructing the policy networks, constructing the evaluation networks, and determining the target policy. The specific steps are as follows:
3-1) Constructing the policy networks
In the hybrid policy-evaluation network structure, n policy networks μ(s_k | θ_p) are constructed to estimate the deterministic optimal target policy μ* (n should be neither too large nor too small, in order to balance the tracking control accuracy of the algorithm against the network training speed). Here θ_p is the weight parameter of the p-th policy network, p = 1, …, n. Each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer; the input of each policy network is the state vector s_k, the output is the action vector a_k, and the two hidden layers contain 400 and 300 units, respectively.
3-2) Constructing the evaluation networks
In the hybrid policy-evaluation network structure, m evaluation networks Q(s_k, a_k | w_q) are constructed to estimate the optimal action value function Q*(s_k, a_k) (the number of evaluation networks is chosen on the same basis as the number of policy networks). Here w_q is the weight parameter of the q-th evaluation network, q = 1, …, m. Each evaluation network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer, and the two hidden layers contain 400 and 300 units, respectively. The inputs of each evaluation network are the state vector s_k and the action vector a_k, with the state vector s_k fed in at the input layer and the action vector a_k fed in at the first hidden layer; the output of each evaluation network is the action value of taking action a_k in state s_k.
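A minimal PyTorch sketch of one policy network and one evaluation network with this layout is given below. PyTorch itself, the ReLU/tanh activations, the state dimension, and the output scaling are illustrative assumptions not specified in the patent; the 400/300 hidden sizes and the injection of the action after the first hidden layer follow the description above, and the action bounds use the REMUS values quoted later in the embodiment.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network: state s_k -> action a_k, hidden layers of 400 and 300 units."""
    def __init__(self, state_dim, action_dim=2, action_bound=(86.0, 0.24)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.register_buffer("bound", torch.tensor(action_bound))  # max thrust / rudder angle

    def forward(self, s):
        return self.net(s) * self.bound  # bounded propeller thrust and rudder angle

class EvalNet(nn.Module):
    """Evaluation network: (s_k, a_k) -> action value; the action joins after the first hidden layer."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```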
3-3) Determining the target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) learned for AUV trajectory tracking control at the k-th time step is defined as the mean of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k | θ_p)
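A short sketch of this averaging, assuming a set of callables such as the PolicyNet instances sketched above:

```python
import torch

def target_policy(policy_nets, s):
    """mu_f(s_k): mean of the outputs of the n policy networks (step 3-3)."""
    with torch.no_grad():
        outputs = torch.stack([net(s) for net in policy_nets], dim=0)
    return outputs.mean(dim=0)
```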
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control, with the following specific steps:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K per iteration, the training batch size N drawn by experience replay, the learning rate α_ω of each evaluation network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function. In this embodiment, M = 1500, K = 1000 (each time step t = 0.2 s), N = 64, α_ω = 0.01 for each evaluation network, α_θ = 0.001 for each policy network, γ = 0.99, and H = [0.001, 0; 0, 0.001];
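These embodiment values can be collected in a small configuration object; the sketch below simply records the numbers listed above (the class and field names are illustrative).

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MPQDPGConfig:
    """Hyperparameters of the embodiment (step 4-1)."""
    max_episodes: int = 1500          # M
    max_steps: int = 1000             # K, with a control period t = 0.2 s
    batch_size: int = 64              # N, samples drawn by experience replay
    critic_lr: float = 0.01           # alpha_omega, evaluation network learning rate
    actor_lr: float = 0.001           # alpha_theta, policy network learning rate
    gamma: float = 0.99               # discount factor
    H: np.ndarray = field(default_factory=lambda: np.diag([0.001, 0.001]))  # reward weight matrix
```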
4-2) Initializing the hybrid policy-evaluation network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks μ(s_k | θ_p) and the m evaluation networks Q(s_k, a_k | w_q); randomly select the d-th (d = 1, …, n) policy network μ(s_k | θ_d) from the n policy networks;
Establish an experience replay buffer R, set its maximum capacity to B (B = 10000 in this embodiment), and initialize it to be empty;
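A minimal sketch of such a buffer, assuming a simple deque that discards the oldest sample once the capacity B is reached (the class name is illustrative):

```python
from collections import deque

class ReplayBuffer:
    """Experience replay buffer R with maximum capacity B (step 4-2)."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)   # when full, the earliest sample is dropped first

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))  # experience sample e_k = (s_k, a_k, r_{k+1}, s_{k+1})

    def __len__(self):
        return len(self.storage)
```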
4-3) Start the iteration to train the hybrid policy-evaluation network, and initialize the iteration count episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate exploration noise Noise_k (this embodiment uses Ornstein-Uhlenbeck exploration noise);
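A standard Ornstein-Uhlenbeck noise generator is sketched below; the parameters theta, sigma, and dt are illustrative values, not taken from the patent.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise (step 4-4)."""
    def __init__(self, dim=2, theta=0.15, sigma=0.2, dt=0.2):
        self.dim, self.theta, self.sigma, self.dt = dim, theta, sigma, dt
        self.reset()

    def reset(self):
        self.x = np.zeros(self.dim)   # re-initialize at the start of each episode

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(self.dim)
        self.x = self.x + dx
        return self.x
```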
4-5) From the n current policy networks μ(s_k | θ_p) and the exploration noise Noise_k, determine the action vector a_k of the current time step as:
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes the new state s_{k+1}; record e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample. If the number of samples in the experience replay buffer R has reached the maximum capacity B, delete the earliest stored sample and then store the experience sample e_k into R; otherwise, store the experience sample e_k into R directly;
Select A experience samples from the experience replay buffer R, with A ≤ N, as follows: when the number of samples in R does not exceed N, select all experience samples in R; when the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R, where l is the time step of the selected experience sample;
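The selection rule of step 4-6 can be written as a small helper (the function name is illustrative):

```python
import random

def select_experience(samples, n_batch=64):
    """Return all stored samples while there are at most N of them; otherwise draw
    N samples (s_l, a_l, r_{l+1}, s_{l+1}) uniformly at random (step 4-6)."""
    samples = list(samples)
    if len(samples) <= n_batch:
        return samples
    return random.sample(samples, n_batch)
```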
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each evaluation network, which characterizes the performance of that evaluation network, according to the following formula:
Select the evaluation network with the worst performance; its index, denoted c, is obtained as:
c = argmax_q EBAE_q
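Since the EBAE formula itself is not reproduced in the text above, the sketch below shows only one plausible reading: the mean absolute one-step Bellman residual of each evaluation network over the selected samples, with the next action taken from the mean policy. Every function and variable name here is an illustrative assumption, not the patent's definition.

```python
import torch

def expected_bellman_abs_error(critics, policy_nets, batch, gamma=0.99):
    """One plausible EBAE_q: mean |r + gamma*Q(s', a') - Q(s, a)| per evaluation network."""
    s, a, r, s_next = (torch.as_tensor(x, dtype=torch.float32) for x in batch)
    with torch.no_grad():
        a_next = torch.stack([p(s_next) for p in policy_nets], dim=0).mean(dim=0)
        errors = []
        for q in critics:
            residual = r.reshape(-1, 1) + gamma * q(s_next, a_next) - q(s, a)
            errors.append(residual.abs().mean())
    return torch.stack(errors)

# Worst-performing network (step 4-7): c = int(torch.argmax(expected_bellman_abs_error(...)))
```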
4-8) Using the c-th evaluation network Q(s_k, a_k | w_c), obtain the action vector of each experience sample at the next time step through the following greedy strategy:
4-9) Compute the target value y_l of the c-th evaluation network by the multiple quasi-Q learning method, with the following formula:
4-10) Compute the loss function L(w_c) of the c-th evaluation network:
L(w_c) = (1/A) Σ_l ( y_l − Q(s_l, a_l | w_c) )^2
4-11) Update the weight parameters of the c-th evaluation network by descending the gradient of the loss function L(w_c) with respect to w_c:
w_c ← w_c − α_ω ∇_{w_c} L(w_c)
The weight parameters of the remaining evaluation networks are kept unchanged;
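A sketch of the corresponding evaluation-network update in PyTorch, assuming the target values y_l of step 4-9 have already been computed (the multiple quasi-Q target construction itself is not reproduced here); the optimizer choice and function name are illustrative.

```python
import torch

def update_worst_critic(critic_c, optimizer_c, states, actions, y_target):
    """Steps 4-10 / 4-11: mean-squared loss against the quasi-Q targets, one gradient step.
    Only the worst-performing (c-th) evaluation network is updated; the others are untouched."""
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.float32)
    loss = torch.mean((y_target.detach() - critic_c(s, a)) ** 2)  # L(w_c)
    optimizer_c.zero_grad()
    loss.backward()
    optimizer_c.step()   # w_c <- w_c - alpha_omega * grad L(w_c) for plain SGD
    return float(loss)
```

Here optimizer_c would be constructed once beforehand, e.g. torch.optim.SGD(critic_c.parameters(), lr=0.01), to match the plain gradient step written above.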
4-12) Randomly select one of the n policy networks to reset the d-th policy network μ(s_k | θ_d);
4-13) According to the updated c-th evaluation network, compute the deterministic policy gradient of the d-th policy network μ(s_k | θ_d) and update its weight parameter θ_d accordingly:
∇_{θ_d} J ≈ (1/A) Σ_l ∇_a Q(s_l, a | w_c)|_{a=μ(s_l|θ_d)} ∇_{θ_d} μ(s_l | θ_d),  θ_d ← θ_d + α_θ ∇_{θ_d} J
The weight parameters of the remaining policy networks remain unchanged.
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues tracking the reference trajectory; otherwise, go to step 4-15).
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV starts the next iteration; otherwise, go to step 4-16).
4-16) End the iteration and terminate the training of the hybrid policy-evaluation network. The output values of the n policy networks at the end of the iteration give, through the formula in step 3-3), the final target policy μ_f(s_k) of AUV trajectory tracking control, and this target policy realizes the trajectory tracking control of the AUV.
Validity verification of the embodiment of the invention
The performance of the proposed deep-reinforcement-learning-based AUV trajectory tracking control method (abbreviated MPQ-DPG) is analyzed below. All comparison experiments are based on the widely used REMUS autonomous underwater vehicle, whose maximum propeller thrust and maximum rudder angle are 86 N and 0.24 rad, respectively, and the following reference trajectory is adopted:
In addition, in the embodiment of the invention the number of evaluation networks m equals the number of policy networks n, and both are referred to below as n.
1) Comparison analysis of MPQ-DPG and existing DDPG method
FIG. 1 compares the learning curves and the trajectory tracking performance of the proposed deep-reinforcement-learning-based AUV trajectory tracking control method (MPQ-DPG) and the existing DDPG method during training. The learning curves in graph (a) are obtained from five independent experiments, and Ref in graph (b) denotes the reference trajectory.
Analyzing fig. 1, the following conclusions can be drawn:
a) Compared with the DDPG method, MPQ-DPG learns more stably, because its multiple evaluation networks and policy networks reduce the influence of poor samples on learning stability.
b) The average accumulated reward of final convergence of the MPQ-DPG method is obviously higher than that of the DDPG method, which shows that the tracking control precision of the MPQ-DPG method is obviously higher than that of the DDPG method.
c) It can be observed from fig. 1(b) that the tracking trajectory obtained by the MPQ-DPG method almost coincides with the reference trajectory, which shows that the MPQ-DPG method can realize high-precision AUV tracking control.
d) With the increase of the number of strategy networks and evaluation networks, the tracking control precision of the MPQ-DPG method can be gradually improved, but the improvement amplitude is not obvious after n > 4.
2) Comparison analysis of MPQ-DPG method and existing neural network PID method
FIG. 2 compares the proposed MPQ-DPG method with the neural network PID method for AUV trajectory tracking control in terms of the coordinate tracking curves and the coordinate tracking errors. In the figure, Ref denotes the reference coordinate trajectory, PIDNN denotes the neural network PID algorithm, and n = 4.
Analysis of FIG. 2 shows that the tracking performance of the neural network PID control method is clearly inferior to that of the proposed MPQ-DPG method. Moreover, the tracking errors in FIG. 2(b) show that the MPQ-DPG method drives the error to converge faster; in particular, in the initial stage the MPQ-DPG method still achieves fast, high-precision tracking, while the response time of the neural network PID method is significantly longer and its tracking error converges poorly.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (1)
1. An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, characterized by comprising the following steps:
1) Defining the AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem includes four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control target; the specific steps are as follows:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and rudder angle of the AUV, respectively, and the subscript k denotes the k-th time step; the values of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) Determining the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the k-th time step, and ψ_k is the angle between the heading direction of the AUV and the X axis at the k-th time step;
1-3) Defining the tracking control error
A reference trajectory d_k is selected according to the travel path of the AUV, and the AUV trajectory tracking control error at the k-th time step is defined as:
e_k = η_k − d_k
1-4) Establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_k(τ) of the following form is selected, where γ is a discount factor and H is a weight matrix;
The AUV trajectory tracking control target is to find an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimal:
τ* = argmin_τ P_0(τ)
2) Establishing a Markov decision process model of the AUV trajectory tracking problem
Markov decision process modeling is performed on the AUV trajectory tracking problem of step 1), with the following specific steps:
2-1) Defining the state vector
Define the velocity vector of the AUV system as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the heading direction at the k-th time step, and χ_k is the heading angular velocity of the AUV at the k-th time step;
From the AUV system output vector η_k determined in step 1-2) and the reference trajectory d_k defined in step 1-3), the state vector s_k of the k-th time step is defined as follows:
2-2) Defining the action vector
Define the action vector of the k-th time step as the AUV system input vector of that time step, i.e., a_k = τ_k;
2-3) Defining the reward function
The reward function of the k-th time step characterizes the immediate return of taking action a_k in state s_k; based on the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
2-4) Converting the AUV trajectory tracking control target τ* established in step 1-4) into a trajectory tracking control target under the reinforcement learning framework
Define a policy π as the probability of selecting each possible action in a given state; the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^{i−k} r_{i+1} | s_k, a_k, π ]
where E[·] denotes the expectation over the reward function, states, and actions, and K is the maximum time step;
The action value function describes the expected cumulative discounted reward obtained when policy π is followed in the current and all subsequent states; under the reinforcement learning framework, the AUV trajectory tracking control target is therefore to learn an optimal target policy π* through interaction with the environment of the AUV, such that the action value at the initial time is maximal:
π* = argmax_π E_{s_0∼p(s_0), a_0}[ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The AUV trajectory tracking control target τ* established in step 1-4) is thereby converted into solving for π*;
2-5) Simplifying the AUV trajectory tracking control target under the reinforcement learning framework
The action value function in step 2-4) is solved by iterating the following Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1}∼π}[ Q^π(s_{k+1}, a_{k+1}) ] ]
Assuming that the policy π is deterministic, i.e., a fixed mapping, denoted μ, from the state vector space of the AUV to its action vector space, the iterative Bellman equation simplifies to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
For the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0∼p(s_0)}[ Q^μ(s_0, μ(s_0)) ]
3) Constructing the hybrid policy-evaluation network
A hybrid policy-evaluation network is constructed to estimate the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k) separately; the construction of the hybrid policy-evaluation network comprises three parts: constructing the policy networks, constructing the evaluation networks, and determining the target policy; the specific steps are as follows:
3-1) Constructing the policy networks
In the hybrid policy-evaluation network structure, n policy networks μ(s_k | θ_p) are constructed to estimate the deterministic optimal target policy μ*, where θ_p is the weight parameter of the p-th policy network, p = 1, …, n; each policy network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer; the input of each policy network is the state vector s_k, and the output is the action vector a_k;
3-2) Constructing the evaluation networks
In the hybrid policy-evaluation network structure, m evaluation networks Q(s_k, a_k | w_q) are constructed to estimate the optimal action value function Q*(s_k, a_k), where w_q is the weight parameter of the q-th evaluation network, q = 1, …, m; each evaluation network is implemented as a fully connected deep neural network comprising an input layer, two hidden layers, and an output layer; the inputs of each evaluation network are the state vector s_k and the action vector a_k, with the state vector s_k fed in at the input layer and the action vector a_k fed in at the first hidden layer, and the output of each evaluation network is the action value of taking action a_k in state s_k;
3-3) Determining the target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) learned for AUV trajectory tracking control at the k-th time step is defined as the mean of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k | θ_p)
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control, with the following specific steps:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K per iteration, the training batch size N drawn by experience replay, the learning rate α_ω of each evaluation network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function;
4-2) Initializing the hybrid policy-evaluation network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks μ(s_k | θ_p) and the m evaluation networks Q(s_k, a_k | w_q); randomly select the d-th policy network μ(s_k | θ_d), d = 1, …, n, from the n policy networks;
Establish an experience replay buffer R, set its maximum capacity to B, and initialize it to be empty;
4-3) Start the iteration to train the hybrid policy-evaluation network, and initialize the iteration count episode = 1;
4-4) Set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate exploration noise Noise_k;
4-5) From the n current policy networks μ(s_k | θ_p) and the exploration noise Noise_k, determine the action vector a_k of the current time step as:
4-6) The AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3), and observes the new state s_{k+1}; record e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience replay buffer R has reached the maximum capacity B, delete the earliest stored sample and then store the experience sample e_k into R; otherwise, store the experience sample e_k into R directly;
Select A experience samples from the experience replay buffer R as follows: when the number of samples in R does not exceed N, select all experience samples in R; when the number of samples in R exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R;
4-7) From the selected A experience samples, compute the expected Bellman absolute error EBAE_q of each evaluation network, which characterizes the performance of that evaluation network, according to the following formula:
Select the evaluation network with the worst performance; its index, denoted c, is obtained as:
c = argmax_q EBAE_q
4-8) Using the c-th evaluation network Q(s_k, a_k | w_c), obtain the action vector of each experience sample at the next time step through the following greedy strategy:
4-9) Compute the target value y_l of the c-th evaluation network by the multiple quasi-Q learning method, with the following formula:
4-10) Compute the loss function L(w_c) of the c-th evaluation network:
L(w_c) = (1/A) Σ_l ( y_l − Q(s_l, a_l | w_c) )^2
4-11) Update the weight parameters of the c-th evaluation network by descending the gradient of the loss function L(w_c) with respect to w_c:
w_c ← w_c − α_ω ∇_{w_c} L(w_c)
The weight parameters of the remaining evaluation networks are kept unchanged;
4-12) Randomly select one of the n policy networks to reset the d-th policy network μ(s_k | θ_d);
4-13) According to the updated c-th evaluation network, compute the deterministic policy gradient of the d-th policy network μ(s_k | θ_d) and update its weight parameter θ_d accordingly:
∇_{θ_d} J ≈ (1/A) Σ_l ∇_a Q(s_l, a | w_c)|_{a=μ(s_l|θ_d)} ∇_{θ_d} μ(s_l | θ_d),  θ_d ← θ_d + α_θ ∇_{θ_d} J
The weight parameters of the remaining policy networks are kept unchanged;
4-14) Let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues tracking the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV starts the next iteration; otherwise, go to step 4-16);
4-16) End the iteration and terminate the training of the hybrid policy-evaluation network; the output values of the n policy networks at the end of the iteration give, through the formula in step 3-3), the final target policy μ_f(s_k) of AUV trajectory tracking control, and this target policy realizes the trajectory tracking control of the AUV.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810535773.8A CN108803321B (en) | 2018-05-30 | 2018-05-30 | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810535773.8A CN108803321B (en) | 2018-05-30 | 2018-05-30 | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108803321A CN108803321A (en) | 2018-11-13 |
CN108803321B true CN108803321B (en) | 2020-07-10 |
Family
ID=64089259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810535773.8A Active CN108803321B (en) | 2018-05-30 | 2018-05-30 | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108803321B (en) |
Families Citing this family (77)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109361700A (en) * | 2018-12-06 | 2019-02-19 | 郑州航空工业管理学院 | A kind of unmanned plane self-organizing network system adaptive recognition protocol frame |
CN111338333B (en) * | 2018-12-18 | 2021-08-31 | 北京航迹科技有限公司 | System and method for autonomous driving |
US10955853B2 (en) | 2018-12-18 | 2021-03-23 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
CN109719721B (en) * | 2018-12-26 | 2020-07-24 | 北京化工大学 | Adaptive gait autonomous emerging method of snake-like search and rescue robot |
CN109726866A (en) * | 2018-12-27 | 2019-05-07 | 浙江农林大学 | Unmanned boat paths planning method based on Q learning neural network |
CN109696830B (en) * | 2019-01-31 | 2021-12-03 | 天津大学 | Reinforced learning self-adaptive control method of small unmanned helicopter |
CN109960259B (en) * | 2019-02-15 | 2021-09-24 | 青岛大学 | Multi-agent reinforcement learning unmanned guided vehicle path planning method based on gradient potential |
CN109828463A (en) * | 2019-02-18 | 2019-05-31 | 哈尔滨工程大学 | A kind of adaptive wave glider bow of ocean current interference is to control method |
CN109828467B (en) * | 2019-03-01 | 2021-09-07 | 大连海事大学 | Data-driven unmanned ship reinforcement learning controller structure and design method |
CN109765916A (en) * | 2019-03-26 | 2019-05-17 | 武汉欣海远航科技研发有限公司 | A kind of unmanned surface vehicle path following control device design method |
CN109870162B (en) * | 2019-04-04 | 2020-10-30 | 北京航空航天大学 | Unmanned aerial vehicle flight path planning method based on competition deep learning network |
CN110019151B (en) * | 2019-04-11 | 2024-03-15 | 深圳市腾讯计算机系统有限公司 | Database performance adjustment method, device, equipment, system and storage medium |
CN110083064B (en) * | 2019-04-29 | 2022-02-15 | 辽宁石油化工大学 | Network optimal tracking control method based on non-strategy Q-learning |
CN110045614A (en) * | 2019-05-16 | 2019-07-23 | 河海大学常州校区 | A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning |
CN110428615B (en) * | 2019-07-12 | 2021-06-22 | 中国科学院自动化研究所 | Single intersection traffic signal control method, system and device based on deep reinforcement learning |
CN110362089A (en) * | 2019-08-02 | 2019-10-22 | 大连海事大学 | A method of the unmanned boat independent navigation based on deeply study and genetic algorithm |
CN110321666B (en) * | 2019-08-09 | 2022-05-03 | 重庆理工大学 | Multi-robot path planning method based on priori knowledge and DQN algorithm |
CN110333739B (en) * | 2019-08-21 | 2020-07-31 | 哈尔滨工程大学 | AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning |
CN110806756B (en) * | 2019-09-10 | 2022-08-02 | 西北工业大学 | Unmanned aerial vehicle autonomous guidance control method based on DDPG |
CN110716574B (en) * | 2019-09-29 | 2023-05-02 | 哈尔滨工程大学 | UUV real-time collision avoidance planning method based on deep Q network |
CN110673602B (en) * | 2019-10-24 | 2022-11-25 | 驭势科技(北京)有限公司 | Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment |
CN110806759B (en) * | 2019-11-12 | 2020-09-08 | 清华大学 | Aircraft route tracking method based on deep reinforcement learning |
CN110989576B (en) * | 2019-11-14 | 2022-07-12 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
CN111027677B (en) * | 2019-12-02 | 2023-05-23 | 西安电子科技大学 | Multi-moving target tracking method based on depth deterministic strategy gradient DDPG |
CN111091710A (en) * | 2019-12-18 | 2020-05-01 | 上海天壤智能科技有限公司 | Traffic signal control method, system and medium |
CN111061277B (en) * | 2019-12-31 | 2022-04-05 | 歌尔股份有限公司 | Unmanned vehicle global path planning method and device |
CN111310384B (en) * | 2020-01-16 | 2024-05-21 | 香港中文大学(深圳) | Wind field cooperative control method, terminal and computer readable storage medium |
CN111240345B (en) * | 2020-02-11 | 2023-04-07 | 哈尔滨工程大学 | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework |
CN111580544B (en) * | 2020-03-25 | 2021-05-07 | 北京航空航天大学 | Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm |
CN111813143B (en) * | 2020-06-09 | 2022-04-19 | 天津大学 | Underwater glider intelligent control system and method based on reinforcement learning |
CN111736617B (en) * | 2020-06-09 | 2022-11-04 | 哈尔滨工程大学 | Track tracking control method for preset performance of benthonic underwater robot based on speed observer |
CN111856936B (en) * | 2020-07-21 | 2023-06-02 | 天津蓝鳍海洋工程有限公司 | Control method for cabled underwater high-flexibility operation platform |
CN112100834A (en) * | 2020-09-06 | 2020-12-18 | 西北工业大学 | Underwater glider attitude control method based on deep reinforcement learning |
CN112132263B (en) * | 2020-09-11 | 2022-09-16 | 大连理工大学 | Multi-agent autonomous navigation method based on reinforcement learning |
CN112162555B (en) * | 2020-09-23 | 2021-07-16 | 燕山大学 | Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet |
CN112148025A (en) * | 2020-09-24 | 2020-12-29 | 东南大学 | Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning |
CN112179367B (en) * | 2020-09-25 | 2023-07-04 | 广东海洋大学 | Intelligent autonomous navigation method based on deep reinforcement learning |
CN112241176B (en) * | 2020-10-16 | 2022-10-28 | 哈尔滨工程大学 | Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment |
CN112558465B (en) * | 2020-12-03 | 2022-11-01 | 大连海事大学 | Unknown unmanned ship finite time reinforcement learning control method with input limitation |
CN112506210B (en) * | 2020-12-04 | 2022-12-27 | 东南大学 | Unmanned aerial vehicle control method for autonomous target tracking |
CN112462792B (en) * | 2020-12-09 | 2022-08-09 | 哈尔滨工程大学 | Actor-Critic algorithm-based underwater robot motion control method |
CN112698572B (en) * | 2020-12-22 | 2022-08-16 | 西安交通大学 | Structural vibration control method, medium and equipment based on reinforcement learning |
CN112929900B (en) * | 2021-01-21 | 2022-08-02 | 华侨大学 | MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network |
CN113029123A (en) * | 2021-03-02 | 2021-06-25 | 西北工业大学 | Multi-AUV collaborative navigation method based on reinforcement learning |
CN113052372B (en) * | 2021-03-17 | 2022-08-02 | 哈尔滨工程大学 | Dynamic AUV tracking path planning method based on deep reinforcement learning |
CN113095463A (en) * | 2021-03-31 | 2021-07-09 | 南开大学 | Robot confrontation method based on evolution reinforcement learning |
CN113095500B (en) * | 2021-03-31 | 2023-04-07 | 南开大学 | Robot tracking method based on multi-agent reinforcement learning |
CN113370205B (en) * | 2021-05-08 | 2022-06-17 | 浙江工业大学 | Baxter mechanical arm track tracking control method based on machine learning |
CN113359448A (en) * | 2021-06-03 | 2021-09-07 | 清华大学 | Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics |
CN113595768A (en) * | 2021-07-07 | 2021-11-02 | 西安电子科技大学 | Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system |
CN113467248A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning |
WO2023019536A1 (en) * | 2021-08-20 | 2023-02-23 | 上海电气电站设备有限公司 | Deep reinforcement learning-based photovoltaic module intelligent sun tracking method |
CN113821035A (en) * | 2021-09-22 | 2021-12-21 | 北京邮电大学 | Unmanned ship trajectory tracking control method and device |
CN113829351B (en) * | 2021-10-13 | 2023-08-01 | 广西大学 | Cooperative control method of mobile mechanical arm based on reinforcement learning |
CN113885330B (en) * | 2021-10-26 | 2022-06-17 | 哈尔滨工业大学 | Information physical system safety control method based on deep reinforcement learning |
CN114089633B (en) * | 2021-11-19 | 2024-04-26 | 江苏科技大学 | Multi-motor coupling driving control device and method for underwater robot |
CN114020001A (en) * | 2021-12-17 | 2022-02-08 | 中国科学院国家空间科学中心 | Mars unmanned aerial vehicle intelligent control method based on depth certainty strategy gradient learning |
CN114357884B (en) * | 2022-01-05 | 2022-11-08 | 厦门宇昊软件有限公司 | Reaction temperature control method and system based on deep reinforcement learning |
CN114527642B (en) * | 2022-03-03 | 2024-04-02 | 东北大学 | Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning |
CN114721408A (en) * | 2022-04-18 | 2022-07-08 | 哈尔滨理工大学 | Underwater robot path tracking method based on reinforcement learning |
CN114954840B (en) * | 2022-05-30 | 2023-09-05 | 武汉理工大学 | Method, system and device for controlling stability of ship |
CN114995137B (en) * | 2022-06-01 | 2023-04-28 | 哈尔滨工业大学 | Rope-driven parallel robot control method based on deep reinforcement learning |
CN115016496A (en) * | 2022-06-30 | 2022-09-06 | 重庆大学 | Water surface unmanned ship path tracking method based on deep reinforcement learning |
CN114839884B (en) * | 2022-07-05 | 2022-09-30 | 山东大学 | Underwater vehicle bottom layer control method and system based on deep reinforcement learning |
CN114967713B (en) * | 2022-07-28 | 2022-11-29 | 山东大学 | Underwater vehicle buoyancy discrete change control method based on reinforcement learning |
CN115366099B (en) * | 2022-08-18 | 2024-05-28 | 江苏科技大学 | Mechanical arm depth deterministic strategy gradient training method based on forward kinematics |
CN115330276B (en) * | 2022-10-13 | 2023-01-06 | 北京云迹科技股份有限公司 | Method and device for robot to automatically select elevator based on reinforcement learning |
CN115657477A (en) * | 2022-10-13 | 2023-01-31 | 北京理工大学 | Dynamic environment robot self-adaptive control method based on offline reinforcement learning |
CN115562345B (en) * | 2022-10-28 | 2023-06-27 | 北京理工大学 | Unmanned aerial vehicle detection track planning method based on deep reinforcement learning |
CN115657683B (en) * | 2022-11-14 | 2023-05-02 | 中国电子科技集团公司第十研究所 | Unmanned cable-free submersible real-time obstacle avoidance method capable of being used for inspection operation task |
CN115857556B (en) * | 2023-01-30 | 2023-07-14 | 中国人民解放军96901部队 | Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning |
CN115826594B (en) * | 2023-02-23 | 2023-05-30 | 北京航空航天大学 | Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters |
CN115855226B (en) * | 2023-02-24 | 2023-05-30 | 青岛科技大学 | Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion |
CN116295449B (en) * | 2023-05-25 | 2023-09-12 | 吉林大学 | Method and device for indicating path of autonomous underwater vehicle |
CN116578102B (en) * | 2023-07-13 | 2023-09-19 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
CN116827685B (en) * | 2023-08-28 | 2023-11-14 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
CN117826860B (en) * | 2024-03-04 | 2024-06-21 | 北京航空航天大学 | Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8456523B2 (en) * | 2009-07-20 | 2013-06-04 | Precitec Kg | Laser processing head and method for compensating for the change in focus position in a laser processing head |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101545731B1 (en) * | 2014-04-30 | 2015-08-20 | 인하대학교 산학협력단 | System and method for video tracking |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
CN107102644A (en) * | 2017-06-22 | 2017-08-29 | 华南师范大学 | The underwater robot method for controlling trajectory and control system learnt based on deeply |
CN107368076A (en) * | 2017-07-31 | 2017-11-21 | 中南大学 | Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment |
CN107856035A (en) * | 2017-11-06 | 2018-03-30 | 深圳市唯特视科技有限公司 | A kind of robustness dynamic motion method based on intensified learning and whole body controller |
Non-Patent Citations (4)
Title |
---|
AUV Based Source Seeking with Estimated Gradients; Li Zhou et al.; Journal of Systems Science & Complexity; 2018-02-28; No. 1; pp. 262-275 *
Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle; Runsheng Yu et al.; Proceedings of the 36th Chinese Control Conference; 2017-07-31; pp. 4958-4965 *
Optimal trajectory control of underwater robots based on deep reinforcement learning; Ma Qiongxiong et al.; Journal of South China Normal University (Natural Science Edition); 2018-02-28; Vol. 50, No. 1; pp. 118-123 *
Evolutionary reinforcement learning and its application in robot path tracking; Duan Yong et al.; Control and Decision; 2009-04-30; Vol. 24, No. 4; pp. 532-536, 541 *
Also Published As
Publication number | Publication date |
---|---|
CN108803321A (en) | 2018-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108803321B (en) | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning | |
CN107168312B (en) | Space trajectory tracking control method for compensating UUV kinematic and dynamic interference | |
CN107748566B (en) | Underwater autonomous robot fixed depth control method based on reinforcement learning | |
CN111966118B (en) | ROV thrust distribution and reinforcement learning-based motion control method | |
Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
CN111142522B (en) | Method for controlling agent of hierarchical reinforcement learning | |
CN111650948B (en) | Quick tracking control method for horizontal plane track of benthonic AUV | |
CN111240345B (en) | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework | |
CN113052372B (en) | Dynamic AUV tracking path planning method based on deep reinforcement learning | |
CN115016496A (en) | Water surface unmanned ship path tracking method based on deep reinforcement learning | |
CN114879671B (en) | Unmanned ship track tracking control method based on reinforcement learning MPC | |
CN109189103B (en) | Under-actuated AUV trajectory tracking control method with transient performance constraint | |
CN114199248B (en) | AUV co-location method for optimizing ANFIS based on mixed element heuristic algorithm | |
Mousavian et al. | Identification-based robust motion control of an AUV: optimized by particle swarm optimization algorithm | |
CN114115262B (en) | Multi-AUV actuator saturation cooperative formation control system and method based on azimuth information | |
Wang et al. | Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning | |
Liu et al. | Deep reinforcement learning for vectored thruster autonomous underwater vehicle control | |
Gao et al. | Command filtered path tracking control of saturated ASVs based on time‐varying disturbance observer | |
Fan et al. | Path-Following Control of Unmanned Underwater Vehicle Based on an Improved TD3 Deep Reinforcement Learning | |
Song et al. | Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning | |
Sola et al. | Evaluation of a deep-reinforcement-learning-based controller for the control of an autonomous underwater vehicle | |
Zhang et al. | Novel TD3 Based AUV Path Tracking Control | |
CN116578102B (en) | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium | |
Liu et al. | Research on obstacle avoidance of underactuated autonomous underwater vehicle based on offline reinforcement learning | |
Frafjord | Target Tracking Control for an Unmanned Surface Vessel: Optimal Control vs Reinforcement Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||