CN108803321B - Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning - Google Patents

Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning

Info

Publication number
CN108803321B
Authority
CN
China
Prior art keywords
auv
network
strategy
evaluation
tracking control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810535773.8A
Other languages
Chinese (zh)
Other versions
CN108803321A (en)
Inventor
宋士吉
石文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810535773.8A priority Critical patent/CN108803321B/en
Publication of CN108803321A publication Critical patent/CN108803321A/en
Application granted granted Critical
Publication of CN108803321B publication Critical patent/CN108803321B/en

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an autonomous underwater vehicle (AUV) trajectory tracking control method based on deep reinforcement learning, belonging to the fields of deep reinforcement learning and intelligent control. First, the AUV trajectory tracking control problem is defined; then a Markov decision process model of the AUV trajectory tracking problem is established; next, a hybrid policy-evaluation network consisting of several policy networks and several evaluation networks is constructed; finally, the constructed hybrid policy-evaluation network is used to solve the target policy of AUV trajectory tracking control. For the multiple evaluation networks, the performance of each evaluation network is measured by a defined expected Bellman absolute error, and only the worst-performing evaluation network is updated at each time step; for the multiple policy networks, one policy network is randomly selected at each time step and updated with a deterministic policy gradient. The finally learned policy is the mean of all policy networks. The method is not easily affected by poor historical AUV tracking trajectories and achieves high accuracy.

Description

Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
Technical Field
The invention belongs to the field of deep reinforcement learning and intelligent control, and relates to an Autonomous Underwater Vehicle (AUV) trajectory tracking control method based on deep reinforcement learning.
Background
The development of deep-sea science depends heavily on deep-sea detection technology and equipment. Because of the complex deep-sea environment and its extreme conditions, deep-sea operation-type autonomous underwater vehicles (AUVs) are currently the main means of replacing or assisting humans in detecting, observing and sampling the deep sea. For task scenarios that humans cannot reach for field operation, such as ocean resource exploration, seabed investigation and ocean surveying and mapping, autonomous and controllable underwater motion of the AUV is the most basic and important functional requirement and the prerequisite for carrying out various complex operation tasks. However, many offshore applications of AUVs (e.g., trajectory tracking control and target tracking control) are extremely challenging, mainly because of three characteristics of the AUV system. First, the AUV is a multi-input multi-output system whose dynamics and kinematics model (hereinafter referred to as the model) is complex, exhibiting strong nonlinearity, strong coupling, input or state constraints, and time-varying behavior. Second, uncertainty in the model parameters or in the hydrodynamic environment makes the AUV system difficult to model. Third, most current AUVs are under-actuated systems, i.e. the number of degrees of freedom is greater than the number of independent actuators (each independent actuator corresponds to one degree of freedom). In general, the model and parameters of the AUV are determined by a combination of mathematical and physical derivation, numerical simulation and physical experiments, and the uncertain parts of the model are characterized in a reasonable way. The complexity of the model also makes the control problem of the AUV very complicated. Moreover, as the application scenarios of AUVs continue to expand, higher requirements are placed on the accuracy and stability of AUV motion control, and how to improve the control performance of the AUV in various motion scenarios has become an important research direction.
In the past decades, researchers have designed various AUV motion control methods for different application scenarios such as trajectory tracking, waypoint tracking, path planning and formation control, and have verified their effectiveness. A representative example is the model-based output feedback control method proposed by Refsnes et al., which adopts two decoupled system models: a three-degree-of-freedom current-induced hull model to characterize the ocean-current load, and a five-degree-of-freedom model to describe the system dynamics. In addition, Healey et al. designed a state-feedback-based tracking control method that assumes a fixed forward speed, linearizes the system model and uses three decoupled models: a surge model, a horizontal steering model (sway and yaw) and a vertical model (heave and pitch). However, all of these methods decouple or linearize the system model, so it is difficult for them to meet the requirement of high-precision AUV control in specific application scenarios.
Owing to the limitations of classical motion control methods and the strong self-learning ability of reinforcement learning, researchers have in recent years shown great interest in intelligent control methods represented by reinforcement learning. Various intelligent control methods based on reinforcement learning techniques (such as Q-learning, direct policy search, policy-evaluation networks and adaptive reinforcement learning) have been proposed and successfully applied to complex scenarios such as robot motion control, unmanned aerial vehicle flight control, hypersonic aircraft tracking control and traffic signal control. The core idea of reinforcement-learning-based control is to optimize the performance of the control system without prior knowledge. For AUV systems, many researchers have designed reinforcement-learning-based control methods and verified their feasibility in practice. For the autonomous underwater cable tracking control problem, El-Fakdi et al. adopted a direct policy search technique to learn a state/action mapping, but that method is only applicable when both the state and action spaces are discrete; for a continuous action space, Paula et al. approximated the policy function with a radial basis function network, but the weak function approximation capability of the radial basis network cannot guarantee high tracking control accuracy.
In recent years, with the development of deep neural network (DNN) training techniques such as batch learning, experience replay and batch normalization, deep reinforcement learning has shown excellent performance in complex tasks such as robot motion control, autonomous ground vehicle motion control, quadrotor control and autonomous driving. In particular, the recently proposed deep Q-network (DQN) exhibits human-level control performance in a number of very challenging tasks. However, DQN cannot handle problems that have both a high-dimensional state space and a continuous action space. Building on DQN, the deep deterministic policy gradient (DDPG) algorithm was further proposed and realizes continuous control. However, DDPG estimates the target value of the evaluation network with a target evaluation network, so the evaluation network cannot effectively evaluate the policy learned by the policy network and the learned action value function has a large variance; when DDPG is applied to the AUV trajectory tracking control problem, the requirements of high tracking control accuracy and stable learning therefore cannot be satisfied.
Disclosure of Invention
The invention aims to provide an AUV trajectory tracking control method based on deep reinforcement learning. The method adopts a hybrid policy-evaluation network structure in which the evaluation networks and the policy networks are trained with multiple quasi-Q learning and deterministic policy gradients, respectively. It overcomes problems of existing reinforcement-learning-based methods such as low control accuracy, inability to realize continuous control and unstable learning, and achieves high-precision AUV trajectory tracking control with a stable learning process.
In order to achieve the purpose, the invention adopts the following technical scheme:
An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning comprises the following steps:
1) defining the AUV (autonomous underwater vehicle) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error and establishing the AUV trajectory tracking control target; the specific steps are as follows:
1-1) determining AUV System inputs
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are respectively the propeller thrust and the rudder angle of the AUV and the subscript k denotes the kth time step; the value ranges of ξ_k and δ_k are bounded by ξ_max and δ_max, the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) determining AUV System output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the kth time step, and ψ_k is the angle between the advancing direction of the AUV and the X axis at the kth time step;
1-3) defining a tracking control error
A reference trajectory d_k, denoting the reference position at the kth time step, is selected according to the driving path of the AUV, and the AUV trajectory tracking control error of the kth time step is defined as:
e_k = [x_k, y_k]^T - d_k
1-4) establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_0(τ) is selected in the form of a discounted accumulation of the weighted squared tracking errors e_k^T H e_k, where γ is the discount factor and H is a weight matrix;
the AUV trajectory tracking control target is established as finding an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimized, i.e.:
τ* = argmin_τ P_0(τ)
2) establishing the Markov decision process model of the AUV trajectory tracking problem
Markov decision process modeling is performed on the AUV trajectory tracking problem of step 1); the specific steps are as follows:
2-1) defining a state vector
The velocity vector of the AUV system is defined as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the advancing direction at the kth time step, and χ_k is the angular velocity of the AUV about the advancing direction at the kth time step;
according to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector s_k of the kth time step is defined as the combination of the velocity vector φ_k, the output vector η_k and the reference trajectory d_k;
2-2) defining the action vector
The action vector of the kth time step is defined as the AUV system input vector of that time step, i.e. a_k = τ_k;
2-3) defining a reward function
The reward function of the kth time step is used to characterize the immediate payoff of taking action a_k in state s_k; according to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function r_{k+1} of the kth time step is defined in terms of the weighted tracking error e_k^T H e_k and the action vector a_k, with smaller tracking errors yielding larger rewards;
2-4) converting the AUV trajectory tracking control target τ* established in step 1-4) into an AUV trajectory tracking control target under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state, and the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^(i-k) r_{i+1} | s_k, a_k ]
where E[·] denotes the expectation over the reward functions, states and actions, and K is the maximum time step;
the action value function describes the expected accumulated discounted reward when the policy π is followed in the current and all subsequent states; under the reinforcement learning framework, the AUV trajectory tracking control target is therefore to learn, through interaction with the environment of the AUV, an optimal target policy π* that maximizes the action value at the initial time:
π* = argmax_π E_{s_0 ~ p(s_0), a_0} [ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
the AUV trajectory tracking control target τ* established in step 1-4) is thereby converted into solving for π*;
2-5) simplifying AUV trajectory tracking control target under reinforcement learning framework
The action value function in step 2-4) is solved through the following iterative Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1} ~ π} [ Q^π(s_{k+1}, a_{k+1}) ] ]
assuming that the policy π is deterministic, i.e. a mapping from the state vector space of the AUV to its action vector space, denoted μ, the iterative Bellman equation simplifies to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
for the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0 ~ p(s_0)} [ Q^μ(s_0, μ(s_0)) ]
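As an illustration of the Markov decision process model above, the following Python sketch assembles the state from the quantities defined in steps 2-1) to 2-3) and computes a quadratic tracking-error reward. The concatenation order and the quadratic penalty r = -e_k^T H e_k are assumptions made for illustration only; the exact reward expression of the patent may also involve the action vector a_k.

    import numpy as np

    H = np.diag([0.001, 0.001])  # weight matrix H given in the embodiment (step 4-1)

    def state_vector(phi_k, eta_k, d_k):
        """Combine velocity vector phi_k, output vector eta_k and reference point d_k
        into the state s_k of step 2-1); the concatenation order is an assumption."""
        return np.concatenate([phi_k, eta_k, d_k])

    def reward(eta_k, d_k):
        """Illustrative reward of step 2-3): the negative weighted squared tracking
        error -e_k^T H e_k; the exact expression in the patent may also involve a_k."""
        e_k = eta_k[:2] - d_k      # position tracking error e_k of step 1-3)
        return float(-e_k @ H @ e_k)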
3) Constructing hybrid policy-evaluation networks
A hybrid policy-evaluation network is constructed to separately estimate the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k);
the construction of the hybrid policy-evaluation network comprises three parts: constructing the policy networks, constructing the evaluation networks and determining the target policy; the specific steps are as follows:
3-1) constructing a policy network
The hybrid policy-evaluation network structure estimates the deterministic optimal target policy μ* by constructing n policy networks μ(s_k|θ^p), where θ^p is the weight parameter of the pth policy network, p = 1, …, n; each policy network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and its output is an action vector a_k;
3-2) construction of evaluation network
The hybrid policy-evaluation network structure estimates the optimal action value function Q*(s_k, a_k) by constructing m evaluation networks Q(s_k, a_k|w^q), where w^q is the weight parameter of the qth evaluation network, q = 1, …, m; each evaluation network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer; the inputs of each evaluation network are the state vector s_k and the action vector a_k, where s_k enters the network at the input layer and a_k enters at the first hidden layer; the output of each evaluation network is the action value of taking action a_k in state s_k;
3-3) determining a target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) of the AUV trajectory tracking control at the kth time step is defined as the average of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k|θ^p)
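For illustration, a minimal PyTorch sketch of the hybrid policy-evaluation structure of steps 3-1) to 3-3) is given below. It follows the stated layout (two hidden layers, the state entering the evaluation network at the input layer and the action at the first hidden layer, and the target policy as the mean of the policy networks); the hidden-layer sizes of 400 and 300 units are taken from the embodiment, while the activations, the tanh output scaling and the use of PyTorch are assumptions of this sketch.

    import torch
    import torch.nn as nn

    class PolicyNetwork(nn.Module):
        """One of the n policy networks mu(s_k | theta^p): state -> bounded action (step 3-1))."""
        def __init__(self, state_dim, action_dim, action_bound):
            super().__init__()
            self.fc1 = nn.Linear(state_dim, 400)   # hidden sizes 400/300 from the embodiment
            self.fc2 = nn.Linear(400, 300)
            self.out = nn.Linear(300, action_dim)
            self.action_bound = action_bound        # e.g. torch.tensor([xi_max, delta_max])
        def forward(self, s):
            h = torch.relu(self.fc1(s))
            h = torch.relu(self.fc2(h))
            return torch.tanh(self.out(h)) * self.action_bound

    class EvaluationNetwork(nn.Module):
        """One of the m evaluation networks Q(s_k, a_k | w^q); the action enters at the
        first hidden layer, as stated in step 3-2)."""
        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.fc1 = nn.Linear(state_dim, 400)
            self.fc2 = nn.Linear(400 + action_dim, 300)
            self.out = nn.Linear(300, 1)
        def forward(self, s, a):
            h = torch.relu(self.fc1(s))
            h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
            return self.out(h)

    def target_policy(policies, s):
        """Target policy mu_f(s_k): the average of the n policy network outputs (step 3-3))."""
        return torch.stack([pi(s) for pi in policies]).mean(dim=0)

With the REMUS limits quoted in the validation section, action_bound could for instance be torch.tensor([86.0, 0.24]).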
4) solving the target policy μ_f(s_k) of the AUV trajectory tracking control; the specific steps are as follows:
4-1) parameter setting
Set the maximum number of iterations M, the maximum time step K of each iteration, the mini-batch size N drawn by experience replay, the learning rate α_ω of each evaluation network, the learning rate α_θ of each policy network, the discount factor γ and the weight matrix H in the reward function;
4-2) initializing the hybrid policy-evaluation network
Randomly initialize the weight parameters θ^p and w^q of the n policy networks μ(s_k|θ^p) and the m evaluation networks Q(s_k, a_k|w^q); randomly select the dth policy network μ(s_k|θ^d), d ∈ {1, …, n}, from the n policy networks;
establish an experience replay buffer R, set its maximum capacity to B and initialize it as empty;
4-3) start the iteration to train the hybrid policy-evaluation network, initializing the episode counter episode = 1;
4-4) set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate the exploration noise Noise_k;
4-5) based on the n current policy networks μ(s_k|θ^p) and the exploration noise Noise_k, determine the action vector of the current time step as:
a_k = (1/n) Σ_{p=1}^{n} μ(s_k|θ^p) + Noise_k
4-6) the AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3) and observes the new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience replay buffer R has reached the maximum capacity B, delete the earliest added sample and then store the experience sample e_k into R; otherwise, store e_k into R directly;
select A experience samples (A ≤ N) from the experience replay buffer R as follows: when the number of samples in R does not exceed N, select all experience samples in R; when it exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R, where l denotes the time step of a selected experience sample;
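A minimal sketch of the experience replay buffer R of steps 4-2) and 4-6), with maximum capacity B and the sampling rule of step 4-6): all samples are used while the buffer holds no more than N samples, otherwise N samples are drawn at random. The use of a Python deque is an implementation choice of this sketch; the capacity and batch size shown are the values of the embodiment.

    import random
    from collections import deque

    class ReplayBuffer:
        """Experience replay buffer R with maximum capacity B (steps 4-2) and 4-6))."""
        def __init__(self, capacity_B=10000):
            self.buffer = deque(maxlen=capacity_B)   # oldest sample dropped automatically
        def store(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))    # experience sample e_k
        def sample(self, N=64):
            if len(self.buffer) <= N:                # use all samples when |R| <= N
                return list(self.buffer)
            return random.sample(self.buffer, N)     # otherwise draw N samples at random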
4-7) according to the A selected experience samples, calculate the expected Bellman absolute error EBAE_q of each evaluation network, defined as the average over the A selected samples of the absolute Bellman error of the qth evaluation network, which characterizes the performance of that evaluation network;
select the evaluation network with the worst performance; its index, denoted c, is obtained as:
c = argmax_{q ∈ {1, …, m}} EBAE_q
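The selection of the evaluation network to update in step 4-7) can be sketched as follows. Here the expected Bellman absolute error is taken as the mean absolute temporal-difference error of each evaluation network over the sampled batch, with the next action supplied by the averaged policy; the precise next-action term of the patent's EBAE formula is an assumption of this sketch.

    import torch

    def worst_critic_index(critics, policies, batch, gamma=0.99):
        """Index c of the evaluation network with the largest expected Bellman absolute
        error (step 4-7)); using the mean policy for the next action is an assumption."""
        s, a, r, s_next = batch                      # tensors over the A sampled transitions
        with torch.no_grad():
            a_next = torch.stack([pi(s_next) for pi in policies]).mean(dim=0)
            ebae = []
            for Q in critics:
                td = r + gamma * Q(s_next, a_next).squeeze(-1) - Q(s, a).squeeze(-1)
                ebae.append(td.abs().mean())         # EBAE_q of the q-th evaluation network
        return int(torch.stack(ebae).argmax())       # worst-performing network c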
4-8) using the cth evaluation network Q(s_k, a_k|w^c), obtain the action vector of each experience sample at the next time step through a greedy strategy, i.e. for each sample select the action that maximizes the action value given by the cth evaluation network at the next state s_{l+1};
4-9) calculate the target value y_l of the cth evaluation network by the multiple quasi-Q learning method, i.e. from the reward r_{l+1} and the discounted action value of the next state-action pair obtained in step 4-8);
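One plausible reading of the greedy strategy of step 4-8) and the multiple quasi-Q learning target of step 4-9) is sketched below: the greedy next action of each sample is chosen among the n policy-network proposals using the cth evaluation network, and the target bootstraps from that same network. Both choices are assumptions of this sketch; the exact aggregation used in the patent's formulas may differ.

    import torch

    def quasi_q_target(critic_c, policies, r, s_next, gamma=0.99):
        """Greedy next action chosen among the n policy proposals (step 4-8)) and the
        bootstrap target y_l for the c-th evaluation network (step 4-9)); both the
        proposal set and the bootstrap source are assumptions of this sketch."""
        with torch.no_grad():
            proposals = torch.stack([pi(s_next) for pi in policies])        # (n, A, action_dim)
            q_vals = torch.stack([critic_c(s_next, a).squeeze(-1)           # (n, A)
                                  for a in proposals])
            best = q_vals.argmax(dim=0)                                     # best proposal per sample
            a_next = proposals[best, torch.arange(proposals.shape[1])]      # greedy a_{l+1}
            y = r + gamma * q_vals.max(dim=0).values                        # target value y_l
        return y, a_next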
4-10) calculate the loss function L(w^c) of the cth evaluation network as the mean squared error between the target values y_l and the action values Q(s_l, a_l|w^c) over the A selected samples:
L(w^c) = (1/A) Σ_l ( y_l - Q(s_l, a_l|w^c) )^2
4-11) update the weight parameter of the cth evaluation network by a gradient step of the loss function L(w^c) with respect to w^c, using the learning rate α_ω:
w^c ← w^c - α_ω ∇_{w^c} L(w^c)
the weight parameters of the remaining evaluation networks remain unchanged;
4-12) randomly select one policy network from the n policy networks and reset it as the dth policy network μ(s_k|θ^d);
4-13) according to the updated cth evaluation network, calculate the deterministic policy gradient ∇_{θ^d}J of the dth policy network μ(s_k|θ^d) and update its weight parameter θ^d accordingly:
∇_{θ^d}J = (1/A) Σ_l ∇_a Q(s_l, a|w^c)|_{a = μ(s_l|θ^d)} ∇_{θ^d} μ(s_l|θ^d)
θ^d ← θ^d + α_θ ∇_{θ^d}J
the weight parameters of the remaining policy networks remain unchanged;
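Steps 4-10) to 4-13) amount to one gradient step on the selected evaluation network and one deterministic-policy-gradient step on the randomly chosen policy network. A minimal PyTorch sketch under the standard forms (mean-squared Bellman loss, gradient ascent on Q(s, μ_d(s))) is given below; it is a sketch, not the literal expressions of the patent.

    import torch

    def update_critic_and_actor(critic_c, actor_d, critic_opt, actor_opt, s, a, y):
        """One update of the worst evaluation network (steps 4-10)/4-11)) and one
        deterministic policy gradient update of the selected policy network
        (step 4-13)); standard MSE and DPG forms are assumed."""
        # evaluation network: minimize L(w^c) = mean (y_l - Q(s_l, a_l | w^c))^2
        critic_loss = ((y - critic_c(s, a).squeeze(-1)) ** 2).mean()
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # policy network: ascend the deterministic policy gradient, i.e. maximize Q(s, mu_d(s))
        actor_loss = -critic_c(s, actor_d(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

Here critic_opt and actor_opt could be created, for example, as torch.optim.Adam(critic_c.parameters(), lr=0.01) and torch.optim.Adam(actor_d.parameters(), lr=0.001), matching the learning rates α_ω and α_θ of the embodiment.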
4-14) let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15);
4-15) let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV carries out the next iteration; otherwise, go to step 4-16);
4-16) the iteration ends and the training process of the hybrid policy-evaluation network terminates; the final target policy μ_f(s_k) of the AUV trajectory tracking control is obtained from the output values of the n policy networks at the end of the iteration through the calculation formula of step 3-3), and this target policy realizes the trajectory tracking control of the AUV.
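Tying the pieces together, the outer loop of steps 4-3) to 4-16) can be sketched as follows. The environment object env, with reset() and step(a) returning (s_next, r), and the exploration-noise object are hypothetical stand-ins for a simulation of the AUV dynamics of section 2); ReplayBuffer, worst_critic_index, quasi_q_target and update_critic_and_actor refer to the sketches given after steps 4-6), 4-7), 4-9) and 4-13) above.

    import torch

    def train_mpq_dpg(env, policies, critics, policy_opts, critic_opts, buffer, noise,
                      M=1500, K=1000, N=64, gamma=0.99):
        """Outer loop of steps 4-3) to 4-16); env and the noise object are hypothetical."""
        for episode in range(M):                                         # step 4-15)
            s = env.reset()                                              # step 4-4)
            noise.reset()
            for k in range(K):                                           # step 4-14)
                with torch.no_grad():                                    # step 4-5): mean policy + noise
                    a = torch.stack([pi(s) for pi in policies]).mean(dim=0) \
                        + torch.as_tensor(noise(), dtype=torch.float32)
                s_next, r = env.step(a)                                  # step 4-6)
                buffer.store(s, a, r, s_next)
                samples = buffer.sample(N)
                s_b = torch.stack([e[0] for e in samples])
                a_b = torch.stack([e[1] for e in samples])
                r_b = torch.as_tensor([float(e[2]) for e in samples])
                sn_b = torch.stack([e[3] for e in samples])
                c = worst_critic_index(critics, policies, (s_b, a_b, r_b, sn_b), gamma)   # step 4-7)
                y, _ = quasi_q_target(critics[c], policies, r_b, sn_b, gamma)             # steps 4-8)/4-9)
                d = torch.randint(len(policies), (1,)).item()            # step 4-12)
                update_critic_and_actor(critics[c], policies[d], critic_opts[c],
                                        policy_opts[d], s_b, a_b, y)     # steps 4-10) to 4-13)
                s = s_next
        # step 4-16): the learned target policy mu_f is the mean of the n policy networks
        return lambda state: torch.stack([pi(state) for pi in policies]).mean(dim=0)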
The characteristics and beneficial effects of the invention are as follows:
The method of the invention adopts multiple policy networks and multiple evaluation networks. For the evaluation networks, the performance of each evaluation network is measured by a defined expected Bellman absolute error, and only the worst-performing evaluation network is updated at each time step; unlike existing reinforcement-learning-based control methods, the invention proposes a multiple quasi-Q learning method to compute a more accurate target value for the evaluation network. For the policy networks, one policy network is randomly selected at each time step and updated with a deterministic policy gradient. The finally learned policy is the average of all policy networks.
1) The AUV trajectory tracking control method of the invention does not depend on a model: the target policy that optimizes the control objective is learned automatically from data sampled while the AUV is driving, and no assumption about the AUV model is needed in the process. The method is therefore particularly suitable for AUVs working in complex deep-sea environments and has high practical value.
2) The method uses multiple quasi-Q learning to obtain evaluation-network target values that are more accurate than those of existing methods, which both reduces the variance of the action value function approximated by the evaluation networks and alleviates over-estimation of the action value function, thereby yielding a better target policy and realizing high-precision AUV trajectory tracking control.
3) The method decides, based on the expected Bellman absolute error, which evaluation network should be updated at each time step; this update rule weakens the influence of poor evaluation networks and thus ensures fast convergence of the learning process.
4) Because multiple evaluation networks are used, the learning process is not easily affected by poor historical AUV tracking trajectories; the method is robust and the learning process is stable.
5) The method combines reinforcement learning with deep neural networks and has strong self-learning ability; it can realize high-precision adaptive control of the AUV in uncertain deep-sea environments, and has good application prospects in scenarios such as AUV trajectory tracking and underwater obstacle avoidance.
Drawings
FIG. 1 compares the performance of the method proposed by the invention with the existing DDPG method, where graph (a) compares the learning curves and graph (b) compares the AUV trajectory tracking performance.
FIG. 2 compares the performance of the method proposed by the invention with the neural network PID method, where graph (a) compares the X- and Y-coordinate trajectory tracking of the AUV and graph (b) compares the tracking errors of the AUV in the X and Y directions.
Detailed Description
The invention provides an autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, which is described in further detail below with reference to the accompanying drawings and a specific embodiment.
The autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning provided by the invention mainly comprises four parts: defining the AUV trajectory tracking control problem, establishing the Markov decision process model of the AUV trajectory tracking problem, constructing the hybrid policy-evaluation network structure, and solving the target policy of the AUV trajectory tracking control.
1) Defining AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error and establishing the AUV trajectory tracking control target; the specific steps are as follows:
1-1) determining AUV System inputs
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are respectively the propeller thrust and the rudder angle of the AUV, and the subscript k denotes the value at the kth time step, i.e. at time k·t, where t is the length of one time step (the same applies below); the value ranges of ξ_k and δ_k are bounded by ξ_max and δ_max, the maximum propeller thrust and the maximum rudder angle, which are determined by the model of propeller adopted by the AUV.
1-2) determining the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the kth time step, and ψ_k is the angle between the advancing direction of the AUV and the X axis at the kth time step.
1-3) defining a tracking control error
A reference trajectory d_k, denoting the reference position at the kth time step, is selected according to the driving path of the AUV, and the AUV trajectory tracking control error of the kth time step is defined as:
e_k = [x_k, y_k]^T - d_k
1-4) establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_0(τ) is selected in the form of a discounted accumulation of the weighted squared tracking errors e_k^T H e_k, where γ is the discount factor and H is a weight matrix;
the AUV trajectory tracking control target is established as finding an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimized, i.e.:
τ* = argmin_τ P_0(τ)
2) establishing the Markov decision process model of the AUV trajectory tracking problem
The Markov decision process (MDP) is the basis of reinforcement learning theory, so MDP modeling is performed for the AUV trajectory tracking problem of step 1). The main elements of reinforcement learning are the agent, the environment, the state, the action and the reward function; the agent learns an optimal action (or control input) sequence through interaction with the environment of the AUV so as to maximize the accumulated reward (or, equivalently, minimize the accumulated tracking control error), thereby solving the AUV trajectory tracking objective. The specific steps are as follows:
2-1) defining a state vector
The velocity vector of the AUV system is defined as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the advancing direction at the kth time step, and χ_k is the angular velocity of the AUV about the advancing direction at the kth time step.
According to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector s_k of the kth time step is defined as the combination of the velocity vector φ_k, the output vector η_k and the reference trajectory d_k.
2-2) defining the action vector
The action vector of the kth time step is defined as the AUV system input vector of that time step, i.e. a_k = τ_k.
2-3) defining a reward function
The reward function of the kth time step is used to characterize the immediate payoff of taking action a_k in state s_k; according to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function r_{k+1} of the kth time step is defined in terms of the weighted tracking error e_k^T H e_k and the action vector a_k, with smaller tracking errors yielding larger rewards.
2-4) converting the AUV trajectory tracking control target τ* established in step 1-4) into an AUV trajectory tracking control target under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state, and the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^(i-k) r_{i+1} | s_k, a_k ]
where E[·] denotes the expectation over the reward functions, states and actions (the same below), and K is the maximum time step.
The action value function describes the expected accumulated discounted reward when the policy π is followed in the current and all subsequent states. Under the reinforcement learning framework, the AUV trajectory tracking control target (i.e. the target of the agent) is therefore to learn, through interaction with the environment of the AUV, an optimal target policy π* that maximizes the action value at the initial time:
π* = argmax_π E_{s_0 ~ p(s_0), a_0} [ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector.
Therefore, the AUV trajectory tracking control target τ* established in step 1-4) can be converted into solving for π*.
2-5) simplifying the AUV trajectory tracking control target under the reinforcement learning framework
Similar to dynamic programming, many reinforcement learning methods solve the action value function in step 2-4) using the following iterative Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1} ~ π} [ Q^π(s_{k+1}, a_{k+1}) ] ]
Assuming that the policy π is deterministic, i.e. a mapping from the state vector space of the AUV to its action vector space, denoted μ, the above iterative Bellman equation can be simplified to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
Furthermore, for the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0 ~ p(s_0)} [ Q^μ(s_0, μ(s_0)) ]
3) Constructing hybrid policy-evaluation networks
From step 2-5), the core of solving the AUV trajectory tracking problem with reinforcement learning is how to solve for the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k). The invention adopts a hybrid policy-evaluation network to estimate μ* and Q*(s_k, a_k) respectively.
The construction of the hybrid policy-evaluation network comprises three parts: constructing the policy networks, constructing the evaluation networks and determining the target policy; the specific steps are as follows:
3-1) constructing the policy networks
The hybrid policy-evaluation network structure estimates the deterministic optimal target policy μ* by constructing n policy networks μ(s_k|θ^p) (the value of n should be neither too large nor too small, so as to balance the tracking control accuracy of the algorithm against the network training speed), where θ^p is the weight parameter of the pth policy network, p = 1, …, n. Each policy network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and its output is an action vector a_k; the two hidden layers contain 400 and 300 units, respectively.
3-2) constructing the evaluation networks
The hybrid policy-evaluation network structure estimates the optimal action value function Q*(s_k, a_k) by constructing m evaluation networks Q(s_k, a_k|w^q) (the number of evaluation networks is chosen on the same basis as the number of policy networks), where w^q is the weight parameter of the qth evaluation network, q = 1, …, m. Each evaluation network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer, the two hidden layers containing 400 and 300 units, respectively. The inputs of each evaluation network are the state vector s_k and the action vector a_k, where s_k enters the network at the input layer and a_k enters at the first hidden layer; the output of each evaluation network is the action value of taking action a_k in state s_k.
3-3) determining the target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) of the AUV trajectory tracking control at the kth time step is defined as the average of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k|θ^p)
4) solving the target policy μ_f(s_k) of the AUV trajectory tracking control; the specific steps are as follows:
4-1) parameter setting
Set the maximum number of iterations M, the maximum time step K of each iteration, the mini-batch size N drawn by experience replay, the learning rate α_ω of each evaluation network, the learning rate α_θ of each policy network, the discount factor γ and the weight matrix H in the reward function. In this embodiment, M = 1500, K = 1000 (each time step t = 0.2 s), N = 64, α_ω = 0.01 for each evaluation network, α_θ = 0.001 for each policy network, γ = 0.99, and H = [0.001, 0; 0, 0.001];
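For reference, the hyperparameters of this embodiment can be collected in a plain configuration; the values below are those stated in steps 4-1) and 4-2) of the embodiment, the dictionary form itself being only a convenience.

    # Hyperparameters of this embodiment (step 4-1); control time step t = 0.2 s.
    CONFIG = {
        "M": 1500,              # maximum number of iterations (episodes)
        "K": 1000,              # maximum time steps per iteration
        "N": 64,                # mini-batch size for experience replay
        "alpha_omega": 0.01,    # learning rate of each evaluation network
        "alpha_theta": 0.001,   # learning rate of each policy network
        "gamma": 0.99,          # discount factor
        "H": [[0.001, 0.0], [0.0, 0.001]],  # weight matrix in the reward function
        "B": 10000,             # capacity of the experience replay buffer (step 4-2)
    }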
4-2) initializing the hybrid policy-evaluation network
Randomly initialize the weight parameters θ^p and w^q of the n policy networks μ(s_k|θ^p) and the m evaluation networks Q(s_k, a_k|w^q); randomly select the dth (d = 1, …, n) policy network μ(s_k|θ^d) from the n policy networks;
establish an experience replay buffer R, set its maximum capacity to B (B = 10000 in this embodiment) and initialize it as empty;
4-3) start the iteration to train the hybrid policy-evaluation network, initializing the episode counter episode = 1;
4-4) set the current time step k = 0, randomly initialize the AUV state variable s_0, let the state variable of the current time step be s_k = s_0, and generate the exploration noise Noise_k (this embodiment uses Ornstein-Uhlenbeck exploration noise);
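The Ornstein-Uhlenbeck exploration noise used in this embodiment can be generated as in the sketch below; the time step dt = 0.2 s matches the embodiment, while the parameters theta and sigma are illustrative values not specified in the patent.

    import numpy as np

    class OUNoise:
        """Ornstein-Uhlenbeck exploration noise Noise_k (step 4-4)); theta and sigma are
        illustrative values not specified in the patent, dt matches the 0.2 s time step."""
        def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.2):
            self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
            self.x = np.full(action_dim, mu, dtype=float)
        def reset(self):
            self.x[:] = self.mu
        def __call__(self):
            dx = self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
            self.x = self.x + dx
            return self.x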
4-5) based on the n current policy networks μ(s_k|θ^p) and the exploration noise Noise_k, determine the action vector of the current time step as:
a_k = (1/n) Σ_{p=1}^{n} μ(s_k|θ^p) + Noise_k
4-6) the AUV executes action a_k in the current state s_k, obtains the reward r_{k+1} according to step 2-3) and observes the new state s_{k+1}; denote e_k = (s_k, a_k, r_{k+1}, s_{k+1}) as an experience sample; if the number of samples in the experience replay buffer R has reached the maximum capacity B, delete the earliest added sample and then store the experience sample e_k into R; otherwise, store e_k into R directly;
select A experience samples (A ≤ N) from the experience replay buffer R as follows: when the number of samples in R does not exceed N, select all experience samples in R; when it exceeds N, randomly select N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) from R, where l denotes the time step of a selected experience sample;
4-7) according to the A selected experience samples, calculate the expected Bellman absolute error EBAE_q of each evaluation network, defined as the average over the A selected samples of the absolute Bellman error of the qth evaluation network, which characterizes the performance of that evaluation network;
select the evaluation network with the worst performance; its index, denoted c, is obtained as:
c = argmax_{q ∈ {1, …, m}} EBAE_q
4-8) using the cth evaluation network Q(s_k, a_k|w^c), obtain the action vector of each experience sample at the next time step through a greedy strategy, i.e. for each sample select the action that maximizes the action value given by the cth evaluation network at the next state s_{l+1};
4-9) calculate the target value y_l of the cth evaluation network by the multiple quasi-Q learning method, i.e. from the reward r_{l+1} and the discounted action value of the next state-action pair obtained in step 4-8);
4-10) calculate the loss function L(w^c) of the cth evaluation network as the mean squared error between the target values y_l and the action values Q(s_l, a_l|w^c) over the A selected samples:
L(w^c) = (1/A) Σ_l ( y_l - Q(s_l, a_l|w^c) )^2
4-11) update the weight parameter of the cth evaluation network by a gradient step of the loss function L(w^c) with respect to w^c, using the learning rate α_ω:
w^c ← w^c - α_ω ∇_{w^c} L(w^c)
the weight parameters of the remaining evaluation networks remain unchanged;
4-12) randomly select one policy network from the n policy networks and reset it as the dth policy network μ(s_k|θ^d);
4-13) according to the updated cth evaluation network, calculate the deterministic policy gradient ∇_{θ^d}J of the dth policy network μ(s_k|θ^d) and update its weight parameter θ^d accordingly:
∇_{θ^d}J = (1/A) Σ_l ∇_a Q(s_l, a|w^c)|_{a = μ(s_l|θ^d)} ∇_{θ^d} μ(s_l|θ^d)
θ^d ← θ^d + α_θ ∇_{θ^d}J
The weight parameters of the remaining policy networks remain unchanged.
4-14) let k = k + 1 and check k: if k < K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15).
4-15) let episode = episode + 1 and check episode: if episode < M, return to step 4-4) and the AUV carries out the next iteration; otherwise, go to step 4-16).
4-16) the iteration ends and the training process of the hybrid policy-evaluation network terminates; the final target policy μ_f(s_k) of the AUV trajectory tracking control is obtained from the output values of the n policy networks at the end of the iteration through the calculation formula of step 3-3), and this target policy realizes the trajectory tracking control of the AUV.
Validation of the effectiveness of the embodiment of the invention
The performance of the deep-reinforcement-learning-based AUV trajectory tracking control method provided by the invention (abbreviated MPQ-DPG) is analyzed below. All comparison experiments are based on the widely used REMUS autonomous underwater vehicle, whose maximum propeller thrust ξ_max and maximum rudder angle δ_max are 86 N and 0.24 rad, respectively, and a prescribed reference trajectory is adopted for tracking.
in addition, in the embodiment of the present invention, the evaluation network number m is the same as the policy network number n, and is hereinafter collectively referred to as n.
1) Comparison analysis of MPQ-DPG and existing DDPG method
Fig. 1 compares the learning curves and the trajectory tracking performance of the proposed deep-reinforcement-learning-based AUV trajectory tracking control method (MPQ-DPG) and the existing DDPG method during training. The learning curves in graph (a) are obtained from five independent experiments, and Ref in graph (b) denotes the reference trajectory.
Analyzing fig. 1, the following conclusions can be drawn:
a) Compared with the DDPG method, the learning of MPQ-DPG is more stable, because MPQ-DPG adopts multiple evaluation networks and policy networks, which reduces the influence of poor samples on learning stability.
b) The average accumulated reward at final convergence of the MPQ-DPG method is clearly higher than that of the DDPG method, indicating that the tracking control accuracy of MPQ-DPG is significantly higher than that of DDPG.
c) It can be observed from Fig. 1(b) that the tracking trajectory obtained by the MPQ-DPG method almost coincides with the reference trajectory, showing that MPQ-DPG can realize high-precision AUV tracking control.
d) As the number of policy networks and evaluation networks increases, the tracking control accuracy of the MPQ-DPG method gradually improves, but the improvement is no longer obvious for n > 4.
2) Comparison analysis of MPQ-DPG method and existing neural network PID method
FIG. 2 compares the MPQ-DPG method proposed in the invention with the neural network PID method on the coordinate trajectory tracking curves and the coordinate tracking errors of the underwater vehicle. In the figure, Ref denotes the reference coordinate trajectory, PIDNN denotes the neural network PID algorithm, and n = 4.
As shown in Fig. 2, the tracking performance of the neural network PID control method is clearly inferior to that of the proposed MPQ-DPG method. In addition, the tracking errors in Fig. 2(b) show that the MPQ-DPG method achieves faster error convergence; in particular, in the initial stage the MPQ-DPG method still achieves fast, high-precision tracking, whereas the response time of the neural network PID method is significantly longer and its tracking error converges poorly.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (1)

1. An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, characterized by comprising the following steps:
1) defining the AUV (autonomous underwater vehicle) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error and establishing the AUV trajectory tracking control target; the specific steps are as follows:
1-1) determining the AUV system input
Let the AUV system input vector be τ_k = [ξ_k, δ_k]^T, where ξ_k and δ_k are respectively the propeller thrust and the rudder angle of the AUV and the subscript k denotes the kth time step; the value ranges of ξ_k and δ_k are bounded by ξ_max and δ_max, the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) determining the AUV system output
Let the AUV system output vector be η_k = [x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial frame I-XYZ at the kth time step, and ψ_k is the angle between the advancing direction of the AUV and the X axis at the kth time step;
1-3) defining a tracking control error
A reference trajectory d_k, denoting the reference position at the kth time step, is selected according to the driving path of the AUV, and the AUV trajectory tracking control error of the kth time step is defined as:
e_k = [x_k, y_k]^T - d_k
1-4) establishing the AUV trajectory tracking control target
For the reference trajectory d_k in step 1-3), an objective function P_0(τ) is selected in the form of a discounted accumulation of the weighted squared tracking errors e_k^T H e_k, where γ is the discount factor and H is a weight matrix;
the AUV trajectory tracking control target is established as finding an optimal system input sequence τ* such that the objective function P_0(τ) at the initial time is minimized, i.e.:
τ* = argmin_τ P_0(τ)
2) establishing the Markov decision process model of the AUV trajectory tracking problem
Markov decision process modeling is performed on the AUV trajectory tracking problem of step 1); the specific steps are as follows:
2-1) defining a state vector
The velocity vector of the AUV system is defined as φ_k = [u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along and perpendicular to the advancing direction at the kth time step, and χ_k is the angular velocity of the AUV about the advancing direction at the kth time step;
according to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector s_k of the kth time step is defined as the combination of the velocity vector φ_k, the output vector η_k and the reference trajectory d_k;
2-2) defining the action vector
The action vector of the kth time step is defined as the AUV system input vector of that time step, i.e. a_k = τ_k;
2-3) defining a reward function
The reward function of the kth time step is used to characterize the immediate payoff of taking action a_k in state s_k; according to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function r_{k+1} of the kth time step is defined in terms of the weighted tracking error e_k^T H e_k and the action vector a_k, with smaller tracking errors yielding larger rewards;
2-4) converting the AUV trajectory tracking control target τ* established in step 1-4) into an AUV trajectory tracking control target under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state, and the action value function is then defined as:
Q^π(s_k, a_k) = E[ Σ_{i=k}^{K} γ^(i-k) r_{i+1} | s_k, a_k ]
where E[·] denotes the expectation over the reward functions, states and actions, and K is the maximum time step;
the action value function describes the expected accumulated discounted reward when the policy π is followed in the current and all subsequent states; under the reinforcement learning framework, the AUV trajectory tracking control target is therefore to learn, through interaction with the environment of the AUV, an optimal target policy π* that maximizes the action value at the initial time:
π* = argmax_π E_{s_0 ~ p(s_0), a_0} [ Q^π(s_0, a_0) ]
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
the AUV trajectory tracking control target τ* established in step 1-4) is thereby converted into solving for π*;
2-5) simplifying the AUV trajectory tracking control target under the reinforcement learning framework
The action value function in step 2-4) is solved through the following iterative Bellman equation:
Q^π(s_k, a_k) = E[ r_{k+1} + γ E_{a_{k+1} ~ π} [ Q^π(s_{k+1}, a_{k+1}) ] ]
assuming that the policy π is deterministic, i.e. a mapping from the state vector space of the AUV to its action vector space, denoted μ, the iterative Bellman equation simplifies to:
Q^μ(s_k, a_k) = E[ r_{k+1} + γ Q^μ(s_{k+1}, μ(s_{k+1})) ]
for the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*:
μ* = argmax_μ E_{s_0 ~ p(s_0)} [ Q^μ(s_0, μ(s_0)) ]
3) Constructing hybrid policy-evaluation networks
A hybrid policy-evaluation network is constructed to separately estimate the deterministic optimal target policy μ* and the corresponding optimal action value function Q*(s_k, a_k);
the construction of the hybrid policy-evaluation network comprises three parts: constructing the policy networks, constructing the evaluation networks and determining the target policy; the specific steps are as follows:
3-1) constructing the policy networks
The hybrid policy-evaluation network structure estimates the deterministic optimal target policy μ* by constructing n policy networks μ(s_k|θ^p), where θ^p is the weight parameter of the pth policy network, p = 1, …, n; each policy network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and its output is an action vector a_k;
3-2) constructing the evaluation networks
The hybrid policy-evaluation network structure estimates the optimal action value function Q*(s_k, a_k) by constructing m evaluation networks Q(s_k, a_k|w^q), where w^q is the weight parameter of the qth evaluation network, q = 1, …, m; each evaluation network is implemented as a fully-connected deep neural network comprising an input layer, two hidden layers and an output layer; the inputs of each evaluation network are the state vector s_k and the action vector a_k, where s_k enters the network at the input layer and a_k enters at the first hidden layer; the output of each evaluation network is the action value of taking action a_k in state s_k;
3-3) determining the target policy
According to the constructed hybrid policy-evaluation network, the target policy μ_f(s_k) of the AUV trajectory tracking control at the kth time step is defined as the average of the n policy network outputs:
μ_f(s_k) = (1/n) Σ_{p=1}^{n} μ(s_k|θ^p)
4) target strategy mu for solving AUV (autonomous Underwater vehicle) trajectory tracking controlf(sk) The method comprises the following specific steps:
4-1) parameter setting
Respectively setting a maximum iteration number M, a maximum time step K of each iteration, a training set size N extracted by experience playback and learning rates α of each evaluation networkωLearning rate α of each policy networkθA weight matrix H in the discount factor gamma and the reward function;
4-2) initializing hybrid policies-evaluating networks
Randomly initializing n policy networks
Figure FDA0001678083490000035
And m evaluation networks
Figure FDA0001678083490000036
Weight parameter θ ofpAnd wq(ii) a Randomly selecting the d-th policy network from the n policy networks as
Figure FDA0001678083490000037
d=1,…,n;
Establishing an experience queue set R, setting the maximum capacity of the experience queue set R as B, and initializing the experience queue set R as empty;
4-3) starting iteration, training the mixed strategy-evaluation network, and initializing iteration frequency epsilon to be 1;
4-4) setting the current time step k to 0, and randomly initializing the state variable s of the AUV0Let the state variable s of the current time stepk=s0(ii) a And generates an exploratory Noisek
4-5) network based on n current policies
Figure FDA0001678083490000038
And exploration NoisekDetermining a motion vector a for a current time stepkComprises the following steps:
Figure FDA0001678083490000041
4-6) AUV in Current State skLower execution action akObtaining the reward function r according to the step 2-3)k+1And observe a new state sk+1(ii) a Note ek=(sk,ak,rk+1,sk+1) Is an experience sample; if the number of samples of the empirical queuing set R has reached the maximum capacity B, thenDeleting the first added sample and then using the experience sample ekStoring the data into an experience queue set R; otherwise, the experience sample e is directly usedkStoring the data into an experience queue set R;
a experience samples are selected from the experience queue set R, and the details are as follows: when the number of the samples in the experience queuing set R does not exceed N, selecting all experience samples in the experience queuing set R; when the experience queuing set R exceeds N, N experience samples(s) are randomly selected from the experience queuing set Rl,al,rl+1,sl+1);
4-7) calculating the expected Bellman absolute error EBAE of each evaluation network according to the selected A empirical samplesqFor characterizing the performance of each evaluation network, the formula is as follows:
Figure FDA0001678083490000042
selecting the evaluation network with the worst performance, and obtaining the serial number of the evaluation network with the worst performance according to the following formula, wherein the serial number is marked as c:
Figure FDA0001678083490000043
4-8) evaluating the network by the c
Figure FDA0001678083490000044
The motion vector of each experience sample at the next time step is obtained through the following greedy strategy:
Figure FDA0001678083490000045
4-9) calculating the target value of the c-th evaluation network by a plurality of quasi-Q learning methods
Figure FDA0001678083490000046
The formula is as follows:
4-10) calculating the loss function L (w) for the c-th evaluation networkc) The formula is as follows:
Figure FDA0001678083490000048
4-11) through a loss function L (w)c) For the weight parameter wcTo update the weight parameters of the c-th evaluation network, the formula is as follows:
Figure FDA00016780834900000414
the weight parameters of the rest evaluation networks are kept unchanged;
4-12) randomly selecting one policy network from the n policy networks to reset the d policy network
Figure FDA0001678083490000049
4-13) calculating the d policy network according to the updated c evaluation network
Figure FDA00016780834900000410
Deterministic strategy gradients
Figure FDA00016780834900000411
And update the d-th policy network accordingly
Figure FDA00016780834900000412
Weight parameter θ ofdThe calculation formulas are respectively as follows:
Figure FDA00016780834900000413
Figure FDA0001678083490000051
the weight parameters of the rest strategy networks are kept unchanged;
4-14) let k be k +1 and decide k: if K is less than K, returning to the step 4-5), and continuing to track the reference track by the AUV; otherwise, entering the step 4-15);
4-15) let episode = episode + 1 and judge episode: if episode < M, return to step 4-4) and the AUV carries out the next iteration; otherwise, go to step 4-16);
4-16) ending the iteration and terminating the training process of the hybrid strategy-evaluation network; the final target strategy μ_f(s_k) for AUV trajectory tracking control is obtained from the output values of the n policy networks at the end of the iteration through the calculation formula in step 3-3), and this target strategy realizes the trajectory tracking control of the AUV.
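After training (step 4-16)), the deployed target strategy is the mean of the n policy networks' outputs with the exploration noise removed; a minimal sketch, reusing the PyTorch-style policy networks assumed in the earlier sketches:

```python
import torch

def target_policy(policies, s_k):
    # mu_f(s_k): average the outputs of the n trained policy networks, no exploration noise
    with torch.no_grad():
        return torch.stack([mu(s_k) for mu in policies]).mean(dim=0)
```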
CN201810535773.8A 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning Active CN108803321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810535773.8A CN108803321B (en) 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108803321A (en) 2018-11-13
CN108803321B (en) 2020-07-10

Family

ID=64089259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810535773.8A Active CN108803321B (en) 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108803321B (en)

Families Citing this family (77)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109361700A (en) * 2018-12-06 2019-02-19 郑州航空工业管理学院 A kind of unmanned plane self-organizing network system adaptive recognition protocol frame
CN111338333B (en) * 2018-12-18 2021-08-31 北京航迹科技有限公司 System and method for autonomous driving
US10955853B2 (en) 2018-12-18 2021-03-23 Beijing Voyager Technology Co., Ltd. Systems and methods for autonomous driving
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
CN109960259B (en) * 2019-02-15 2021-09-24 青岛大学 Multi-agent reinforcement learning unmanned guided vehicle path planning method based on gradient potential
CN109828463A (en) * 2019-02-18 2019-05-31 哈尔滨工程大学 A kind of adaptive wave glider bow of ocean current interference is to control method
CN109828467B (en) * 2019-03-01 2021-09-07 大连海事大学 Data-driven unmanned ship reinforcement learning controller structure and design method
CN109765916A (en) * 2019-03-26 2019-05-17 武汉欣海远航科技研发有限公司 A kind of unmanned surface vehicle path following control device design method
CN109870162B (en) * 2019-04-04 2020-10-30 北京航空航天大学 Unmanned aerial vehicle flight path planning method based on competition deep learning network
CN110019151B (en) * 2019-04-11 2024-03-15 深圳市腾讯计算机系统有限公司 Database performance adjustment method, device, equipment, system and storage medium
CN110083064B (en) * 2019-04-29 2022-02-15 辽宁石油化工大学 Network optimal tracking control method based on non-strategy Q-learning
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110321666B (en) * 2019-08-09 2022-05-03 重庆理工大学 Multi-robot path planning method based on priori knowledge and DQN algorithm
CN110333739B (en) * 2019-08-21 2020-07-31 哈尔滨工程大学 AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
CN110806756B (en) * 2019-09-10 2022-08-02 西北工业大学 Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110716574B (en) * 2019-09-29 2023-05-02 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep Q network
CN110673602B (en) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110806759B (en) * 2019-11-12 2020-09-08 清华大学 Aircraft route tracking method based on deep reinforcement learning
CN110989576B (en) * 2019-11-14 2022-07-12 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111027677B (en) * 2019-12-02 2023-05-23 西安电子科技大学 Multi-moving target tracking method based on depth deterministic strategy gradient DDPG
CN111091710A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic signal control method, system and medium
CN111061277B (en) * 2019-12-31 2022-04-05 歌尔股份有限公司 Unmanned vehicle global path planning method and device
CN111310384B (en) * 2020-01-16 2024-05-21 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111240345B (en) * 2020-02-11 2023-04-07 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN111580544B (en) * 2020-03-25 2021-05-07 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111813143B (en) * 2020-06-09 2022-04-19 天津大学 Underwater glider intelligent control system and method based on reinforcement learning
CN111736617B (en) * 2020-06-09 2022-11-04 哈尔滨工程大学 Track tracking control method for preset performance of benthonic underwater robot based on speed observer
CN111856936B (en) * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for cabled underwater high-flexibility operation platform
CN112100834A (en) * 2020-09-06 2020-12-18 西北工业大学 Underwater glider attitude control method based on deep reinforcement learning
CN112132263B (en) * 2020-09-11 2022-09-16 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning
CN112162555B (en) * 2020-09-23 2021-07-16 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112148025A (en) * 2020-09-24 2020-12-29 东南大学 Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning
CN112179367B (en) * 2020-09-25 2023-07-04 广东海洋大学 Intelligent autonomous navigation method based on deep reinforcement learning
CN112241176B (en) * 2020-10-16 2022-10-28 哈尔滨工程大学 Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN112558465B (en) * 2020-12-03 2022-11-01 大连海事大学 Unknown unmanned ship finite time reinforcement learning control method with input limitation
CN112506210B (en) * 2020-12-04 2022-12-27 东南大学 Unmanned aerial vehicle control method for autonomous target tracking
CN112462792B (en) * 2020-12-09 2022-08-09 哈尔滨工程大学 Actor-Critic algorithm-based underwater robot motion control method
CN112698572B (en) * 2020-12-22 2022-08-16 西安交通大学 Structural vibration control method, medium and equipment based on reinforcement learning
CN112929900B (en) * 2021-01-21 2022-08-02 华侨大学 MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network
CN113029123A (en) * 2021-03-02 2021-06-25 西北工业大学 Multi-AUV collaborative navigation method based on reinforcement learning
CN113052372B (en) * 2021-03-17 2022-08-02 哈尔滨工程大学 Dynamic AUV tracking path planning method based on deep reinforcement learning
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113095500B (en) * 2021-03-31 2023-04-07 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113370205B (en) * 2021-05-08 2022-06-17 浙江工业大学 Baxter mechanical arm track tracking control method based on machine learning
CN113359448A (en) * 2021-06-03 2021-09-07 清华大学 Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics
CN113595768A (en) * 2021-07-07 2021-11-02 西安电子科技大学 Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system
CN113467248A (en) * 2021-07-22 2021-10-01 南京大学 Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method
CN113821035A (en) * 2021-09-22 2021-12-21 北京邮电大学 Unmanned ship trajectory tracking control method and device
CN113829351B (en) * 2021-10-13 2023-08-01 广西大学 Cooperative control method of mobile mechanical arm based on reinforcement learning
CN113885330B (en) * 2021-10-26 2022-06-17 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN114089633B (en) * 2021-11-19 2024-04-26 江苏科技大学 Multi-motor coupling driving control device and method for underwater robot
CN114020001A (en) * 2021-12-17 2022-02-08 中国科学院国家空间科学中心 Mars unmanned aerial vehicle intelligent control method based on depth certainty strategy gradient learning
CN114357884B (en) * 2022-01-05 2022-11-08 厦门宇昊软件有限公司 Reaction temperature control method and system based on deep reinforcement learning
CN114527642B (en) * 2022-03-03 2024-04-02 东北大学 Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning
CN114721408A (en) * 2022-04-18 2022-07-08 哈尔滨理工大学 Underwater robot path tracking method based on reinforcement learning
CN114954840B (en) * 2022-05-30 2023-09-05 武汉理工大学 Method, system and device for controlling stability of ship
CN114995137B (en) * 2022-06-01 2023-04-28 哈尔滨工业大学 Rope-driven parallel robot control method based on deep reinforcement learning
CN115016496A (en) * 2022-06-30 2022-09-06 重庆大学 Water surface unmanned ship path tracking method based on deep reinforcement learning
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114967713B (en) * 2022-07-28 2022-11-29 山东大学 Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN115330276B (en) * 2022-10-13 2023-01-06 北京云迹科技股份有限公司 Method and device for robot to automatically select elevator based on reinforcement learning
CN115657477A (en) * 2022-10-13 2023-01-31 北京理工大学 Dynamic environment robot self-adaptive control method based on offline reinforcement learning
CN115562345B (en) * 2022-10-28 2023-06-27 北京理工大学 Unmanned aerial vehicle detection track planning method based on deep reinforcement learning
CN115657683B (en) * 2022-11-14 2023-05-02 中国电子科技集团公司第十研究所 Unmanned cable-free submersible real-time obstacle avoidance method capable of being used for inspection operation task
CN115857556B (en) * 2023-01-30 2023-07-14 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning
CN115826594B (en) * 2023-02-23 2023-05-30 北京航空航天大学 Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters
CN115855226B (en) * 2023-02-24 2023-05-30 青岛科技大学 Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion
CN116295449B (en) * 2023-05-25 2023-09-12 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116578102B (en) * 2023-07-13 2023-09-19 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN116827685B (en) * 2023-08-28 2023-11-14 成都乐超人科技有限公司 Dynamic defense strategy method of micro-service system based on deep reinforcement learning
CN117826860B (en) * 2024-03-04 2024-06-21 北京航空航天大学 Fixed wing unmanned aerial vehicle control strategy determination method based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8456523B2 (en) * 2009-07-20 2013-06-04 Precitec Kg Laser processing head and method for compensating for the change in focus position in a laser processing head

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101545731B1 (en) * 2014-04-30 2015-08-20 인하대학교 산학협력단 System and method for video tracking
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN107368076A (en) * 2017-07-31 2017-11-21 中南大学 Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment
CN107856035A (en) * 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AUV Based Source Seeking with Estimated Gradients; Li Zhou et al.; Journal of Systems Science & Complexity; 2018-02-28; No. 1; pp. 262-275 *
Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle; Runsheng Yu et al.; Proceedings of the 36th Chinese Control Conference; 2017-07-31; pp. 4958-4965 *
Optimal trajectory control of underwater robots based on deep reinforcement learning; Ma Qiongxiong et al.; Journal of South China Normal University (Natural Science Edition); 2018-02-28; Vol. 50, No. 1; pp. 118-123 *
Evolutionary reinforcement learning and its application in robot path tracking; Duan Yong et al.; Control and Decision; 2009-04-30; Vol. 24, No. 4; pp. 532-536, 541 *

Also Published As

Publication number Publication date
CN108803321A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN107168312B (en) Space trajectory tracking control method for compensating UUV kinematic and dynamic interference
CN107748566B (en) Underwater autonomous robot fixed depth control method based on reinforcement learning
CN111966118B (en) ROV thrust distribution and reinforcement learning-based motion control method
Sun et al. Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning
CN111142522B (en) Method for controlling agent of hierarchical reinforcement learning
CN111650948B (en) Quick tracking control method for horizontal plane track of benthonic AUV
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
CN114879671B (en) Unmanned ship track tracking control method based on reinforcement learning MPC
CN109189103B (en) Under-actuated AUV trajectory tracking control method with transient performance constraint
CN114199248B (en) AUV co-location method for optimizing ANFIS based on mixed element heuristic algorithm
Mousavian et al. Identification-based robust motion control of an AUV: optimized by particle swarm optimization algorithm
CN114115262B (en) Multi-AUV actuator saturation cooperative formation control system and method based on azimuth information
Wang et al. Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning
Liu et al. Deep reinforcement learning for vectored thruster autonomous underwater vehicle control
Gao et al. Command filtered path tracking control of saturated ASVs based on time‐varying disturbance observer
Fan et al. Path-Following Control of Unmanned Underwater Vehicle Based on an Improved TD3 Deep Reinforcement Learning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning
Sola et al. Evaluation of a deep-reinforcement-learning-based controller for the control of an autonomous underwater vehicle
Zhang et al. Novel TD3 Based AUV Path Tracking Control
CN116578102B (en) Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
Liu et al. Research on obstacle avoidance of underactuated autonomous underwater vehicle based on offline reinforcement learning
Frafjord Target Tracking Control for an Unmanned Surface Vessel: Optimal Control vs Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant