CN108803321B

CN108803321B - Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Info

Publication number: CN108803321B
Application number: CN201810535773.8A
Authority: CN
Inventors: 宋士吉; 石文杰
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2020-07-10
Anticipated expiration: 2038-05-30
Also published as: CN108803321A

Abstract

The invention provides an autonomous underwater vehicle track tracking control method based on deep reinforcement learning, and belongs to the field of deep reinforcement learning and intelligent control. Firstly, defining an AUV trajectory tracking control problem; then establishing a Markov decision process model of the AUV trajectory tracking problem; then constructing a mixed strategy-evaluation network, wherein the network consists of a plurality of strategy networks and evaluation networks; and finally, solving an AUV trajectory tracking control target strategy by the constructed hybrid strategy-evaluation network, evaluating the performance of each evaluation network by defining expected Bellman absolute errors for a plurality of evaluation networks, only updating the evaluation network with the worst performance at each time step, randomly selecting a strategy network at each time step for the plurality of strategy networks, updating by adopting a deterministic strategy gradient, and finally learning the strategy which is the mean value of all the strategy networks. The method is not easily influenced by the historical tracking track of the bad AUV, and has high precision.

Description

Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Technical Field

The invention belongs to the field of deep reinforcement learning and intelligent control, and relates to an Autonomous Underwater Vehicle (AUV) trajectory tracking control method based on deep reinforcement learning.

Background

The development of deep sea seabed science highly depends on deep sea detection technology and equipment, and due to the complex deep sea environment and extreme conditions, deep sea operation type autonomous underwater vehicles are mainly adopted to replace or assist people in detecting, observing and sampling deep sea at present. Aiming at task scenes that humans cannot reach field operation, such as ocean resource exploration, submarine investigation, ocean surveying and mapping, the autonomous and controllable AUV underwater motion is guaranteed to be the most basic and important functional requirement, and the premise is that various complex operation tasks are realized. However, many offshore applications of AUVs (e.g., trajectory tracking control, target tracking control, etc.) are extremely challenging, which is mainly caused by the characteristics of the AUV system in three ways. Firstly, an AUV (autonomous Underwater vehicle) is used as a multi-input multi-output system, and a dynamics and kinematics model (hereinafter referred to as a model) of the AUV is complex and has the characteristics of high nonlinearity, strong coupling, existence of input or state constraint, time variation and the like; secondly, uncertainty exists in model parameters or hydrodynamic environment, so that the AUV system is difficult to model; thirdly, most of the AUVs currently belong to under-actuated systems, i.e. the degrees of freedom are greater than the number of independent actuators (each independent actuator corresponds to one degree of freedom). Generally, the model and parameters of the AUV are determined by a method combining mathematical and physical mechanism derivation, numerical simulation and physical experiment, and the uncertain parts in the model are reasonably depicted. The control problem of the AUV is also very complicated due to the complex model. Moreover, with the continuous expansion of the application scenarios of the AUV, people put higher requirements on the accuracy and stability of the motion control of the AUV, and how to improve the control effect of the AUV in various motion scenarios has become an important research direction.

In the past decades, researchers have designed various AUV motion control methods and verified their effectiveness for different application scenarios such as trajectory tracking, waypoint tracking, path planning, and formation control. The representative of the method is a model-based output feedback control method proposed by Refsnes et al, which adopts two decoupled system models: the three-degree-of-freedom ocean current induced ship body model is used for depicting ocean current load, and the five-degree-of-freedom model is used for describing system dynamics. In addition, Healey et al have designed a state feedback-based tracking control method that employs a fixed forward motion velocity and linearizes the system model, and that employs three decoupled models: a surge model, a horizontal steering model (surge and yaw), and a vertical model (heave and pitch). However, these methods all perform decoupling or linearization processing on the system model, so it is difficult to meet the requirement of high-precision control of the AUV in a specific application scenario.

Due to the limitations of the classical motion control methods and the strong self-learning ability of reinforcement learning, in recent years, researchers have shown great research interest in intelligent control methods represented by reinforcement learning. Various intelligent control methods based on reinforcement learning technologies (such as Q learning, direct strategy search, strategy-evaluation network and adaptive reinforcement learning) are also continuously proposed and successfully applied to different complex application scenarios, such as robot motion control, unmanned plane flight control, hypersonic aircraft tracking control, and road signal lamp control. The core idea of the control method based on reinforcement learning is to realize the performance optimization of the control system on the premise of no prior knowledge. For AUV systems, many researchers have designed various reinforcement learning based control methods and actually verified their feasibility. For the problem of autonomous underwater cable tracking control, EI-Fakdi et al adopt a direct strategy search technique to learn a state/action mapping relation, but the method is only suitable for the case that both state and action spaces are discrete; for a continuous motion space, Paula et al adopts a radial basis network to approximate a strategy function, however, the control method cannot guarantee high tracking control accuracy due to the weak function approximation capability of the radial basis network.

In recent years, with the development of Deep Neural Network (DNN) training technologies such as batch learning, empirical review, and batch regularization, deep reinforcement learning shows excellent performance in complex tasks such as robot motion control, autonomous ground vehicle motion control, quad-rotor control, and autopilot. In particular, the recently proposed Deep Q Networks (DQN) exhibit human-level control accuracy in a number of very challenging tasks. However, DQN does not address the problem of having both a high dimensional state space and a continuous motion space. On the basis of DQN, a depth deterministic strategy gradient (DDPG) algorithm is further proposed and realizes continuous control. However, the DDPG estimates the target value of the evaluation network using the target evaluation network, so that the evaluation network cannot effectively evaluate the strategy learned by the strategy network, and the learned action value function has a large variance, and thus when the DDPG is applied to the AUV trajectory tracking control problem, the requirements of high tracking control accuracy and stable learning cannot be satisfied.

Disclosure of Invention

The invention aims to provide an AUV (autonomous Underwater vehicle) trajectory tracking control method based on deep reinforcement learning, which adopts a mixed strategy-evaluation network structure and adopts a plurality of quasi-Q learning and deterministic strategy gradients to respectively train an evaluation network and a strategy network, overcomes the problems that the conventional reinforcement learning-based method is low in control precision, cannot realize continuous control, is unstable in learning process and the like, and realizes high-precision AUV trajectory tracking control and stable learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

an autonomous underwater vehicle track tracking control method based on deep reinforcement learning comprises the following steps:

1) defining AUV (autonomous Underwater vehicle) track tracking control problem

Defining the AUV trajectory tracking control problem includes four parts: determining AUV system input, determining AUV system output, defining a trajectory tracking control error and establishing an AUV trajectory tracking control target; the method comprises the following specific steps:

1-1) determining AUV System inputs

Let AUV system input vector be tau_k＝[ξ_k,_k]^TWherein ξ_k、_kPropeller thrust and rudder angle of the AUV, respectively, with the subscript k denoting the kth time step ξ_k、_kRespectively have a value range of

And

maximum propeller thrust and maximum rudder angle respectively;

1-2) determining AUV System output

Let AUV system output vector be η_k＝[x_k,y_k,ψ_k]^TWherein x is_k、y_kThe coordinates along the X, Y axis, # at the kth time step AUV in the inertial frame I-XYZ, respectively_kThe included angle between the advancing direction of the AUV at the kth time step and the X axis;

1-3) defining a tracking control error

According to the line of AUVDriving path selection reference track

Defining the AUV trajectory tracking control error of the kth time step as:

1-4) establishing AUV track tracking control target

For the reference track d in step 1-3)_kAn objective function of the form:

wherein γ is a discount factor and H is a weight matrix;

establishing AUV track tracking control target to find an optimal system input sequence tau^*So that the target function P at the initial moment₀(τ) is minimal, the calculation formula is as follows:

2) markov decision process model for establishing AUV trajectory tracking problem

Performing Markov decision process modeling on the AUV trajectory tracking problem in the step 1), and specifically performing the following steps:

2-1) defining a state vector

Defining the velocity vector of the AUV system as phi_k＝[u_k,v_k,χ_k]^TWherein u is_k、v_kLinear velocity along and perpendicular to the advancing direction, χ, of the kth time step AUV_kThe angular velocity of the k time step AUV around the advancing direction;

AUV system output vector η determined according to step 1-2)_kAnd the reference track defined in step 1-3), the state vector defining the kth time step is as follows:

2-2) defining motion vectors

Defining the motion vector of the k time step as the AUV system input vector of the time step, namely a_k＝τ_k；

2-3) defining a reward function

The reward function at the kth time step being used to characterize the state s_kTaking action a_kAccording to the trajectory tracking control error e defined in the step 1-3)_kAnd the motion vector a defined in step 2-2)_kThe AUV reward function defining the kth time step is as follows:

2-4) tracking and controlling the target tau of the AUV track established in the step 1-4)^*AUV trajectory tracking control target converted into reinforcement learning framework

Defining a policy pi as the probability of selecting each possible action in a certain state, then defining an action value function as follows:

wherein,

expected values representing reward functions, states and actions; k is the maximum time step;

the action value function is used for describing expected accumulated discount rewards when a strategy pi is adopted in all the current and the following states, so that an AUV (autonomous Underwater vehicle) track tracking control target learns an optimal target strategy pi through interaction with the environment where the AUV is located in an enhanced learning framework^*The calculation formula is as follows, so that the action value at the initial time is maximum:

wherein, p(s)₀) Is in an initial state s₀The distribution of (a); a is₀Is an initial motion vector;

tracking and controlling the target tau of the AUV track established in the step 1-4)^*Is converted into pi^*Solving;

2-5) simplifying AUV trajectory tracking control target under reinforcement learning framework

Solving the action value function in step 2-4) by iterating the Bellman equation as follows:

assuming that the policy pi is deterministic, that is, the state vector space of the AUV is in one-to-one mapping relation to the motion vector space of the AUV, and is recorded as μ, the iterative bellman equation is simplified as follows:

for the deterministic strategy mu, the optimal target strategy pi in the step 2-4) is used^*Reduction to deterministic optimal target strategy mu^*：

3) Constructing hybrid policy-evaluation networks

Separately estimating deterministic optimal target strategies mu by constructing hybrid strategy-evaluation networks^*And corresponding optimal action value function

The construction of the hybrid strategy-evaluation network comprises three parts: the method comprises the following steps of constructing a strategy network, constructing an evaluation network and determining a target strategy, wherein the method comprises the following specific steps:

3-1) constructing a policy network

Hybrid policy-evaluation network architectureBy constructing n policy networks

To estimate a deterministic optimal target strategy mu^*(ii) a Wherein, theta_pThe weight parameter of the p-th policy network is p 1, …, n; each strategy network is realized by using a fully-connected deep neural network respectively, and comprises an input layer, two hidden layers and an output layer; the input of each policy network is a state vector s_kThe output of each policy network is an action vector a_k；

3-2) construction of evaluation network

The mixed strategy-evaluation network structure is realized by constructing m evaluation networks

To estimate the optimal action value function

Wherein, w_qThe weighting parameter for the qth evaluation network, q is 1, …, m; each evaluation network is realized by using a fully-connected deep neural network, and comprises an input layer, two hidden layers and an output layer; the input to each evaluation network is a state vector s_kAnd motion vector a_kWherein the state vector s_kThe motion vector a is input from the input layer to each evaluation network_kFrom the first hidden layer to each evaluation network, the output of each evaluation network is the state vector s_kLower taken action vector a_kAn action value of (d);

3-3) determining a target policy

According to the constructed hybrid strategy-evaluation network, learning the target strategy mu of AUV trajectory tracking control at the kth time step_f(s_k) Defined as the average of n strategic network outputs, the calculation formula is as follows:

4) target strategy mu for solving AUV (autonomous Underwater vehicle) trajectory tracking control_f(s_k) The method comprises the following specific steps:

4-1) parameter setting

Respectively setting a maximum iteration number M, a maximum time step K of each iteration, a training set size N extracted by experience playback and learning rates α of each evaluation network_ωLearning rate α of each policy network_θA weight matrix H in the discount factor gamma and the reward function;

4-2) initializing hybrid policies-evaluating networks

Randomly initializing n policy networks

And m evaluation networks

Weight parameter θ of_pAnd w_q(ii) a Randomly selecting the d-th policy network from the n policy networks as

d＝1,…,n；

Establishing an experience queue set R, setting the maximum capacity of the experience queue set R as B, and initializing the experience queue set R as empty;

4-3) starting iteration, training the mixed strategy-evaluation network, and initializing iteration frequency epsilon to be 1;

4-4) setting the current time step k to 0, and randomly initializing the state variable s of the AUV₀Let the state variable s of the current time step_k＝s₀(ii) a And generates an exploratory Noise_k；

4-5) network based on n current policies

And exploration Noise_kDetermining a motion vector a for a current time step_kComprises the following steps:

4-6) AUV in Current State s_kLower execution action a_kObtaining the reward function r according to the step 2-3)_k+1And observe a new state s_k+1(ii) a Note e_k＝(s_k,a_k,r_k+1,s_k+1) Is an experience sample; if the number of samples in the experience queuing set R reaches the maximum capacity B, deleting the first added sample, and then adding the experience sample e_kStoring the data into an experience queue set R; otherwise, the experience sample e is directly used_kStoring the data into an experience queue set R;

a experience samples are selected from the experience queue set R, and the details are as follows: when the number of the samples in the experience queuing set R does not exceed N, selecting all experience samples in the experience queuing set R; when the experience queuing set R exceeds N, N experience samples(s) are randomly selected from the experience queuing set R_l,a_l,r_l+1,s_l+1)；

4-7) calculating the expected Bellman absolute error EBAE of each evaluation network according to the selected A empirical samples_qFor characterizing the performance of each evaluation network, the formula is as follows:

selecting the evaluation network with the worst performance, and obtaining the serial number of the evaluation network with the worst performance according to the following formula, wherein the serial number is marked as c:

4-8) evaluating the network by the c

The motion vector of each experience sample at the next time step is obtained through the following greedy strategy:

4-9) calculating the target value of the c-th evaluation network by a plurality of quasi-Q learning methods

The formula is as follows:

4-10) calculating the loss function L (w) for the c-th evaluation network_c) The formula is as follows:

4-11) through a loss function L (w)_c) For the weight parameter w_cTo update the weight parameters of the c-th evaluation network, the formula is as follows:

the weight parameters of the rest evaluation networks are kept unchanged;

4-12) randomly selecting one policy network from the n policy networks to reset the d policy network

4-13) calculating the d policy network according to the updated c evaluation network

Deterministic strategy gradients

And update the d-th policy network accordingly

Weight parameter θ of_dThe calculation formulas are respectively as follows:

the weight parameters of the rest strategy networks are kept unchanged;

4-14) let k be k +1 and decide k: if K is less than K, returning to the step 4-5), and continuing to track the reference track by the AUV; otherwise, entering the step 4-15);

4-15) let epamode be equal to epamode +1 and determine epamode: if the epicode is less than M, returning to the step 4-4), and carrying out the next iterative process by the AUV; otherwise, entering the step 4-16);

4-16) finishing iteration, terminating the training process of the hybrid strategy-evaluation network, and obtaining the final target strategy mu controlled by AUV trajectory tracking by the output values of the n strategy networks at the end of iteration through the calculation formula in the step 3-3)_f(s_k) And the target strategy realizes the trajectory tracking control of the AUV.

The invention has the characteristics and beneficial effects that:

the method provided by the invention adopts a plurality of strategy networks and evaluation networks. For a plurality of evaluation networks, the performance of each evaluation network is evaluated by defining an expected Bellman absolute error, only one evaluation network with the worst performance is updated at each time step, and different from the existing control method based on reinforcement learning, the invention provides a plurality of quasi-Q learning methods to calculate a more accurate evaluation network target value. For multiple policy networks, one policy network is randomly selected at each time step and updated with a deterministic policy gradient. The final learned strategy is the average of all strategy networks.

1) The AUV trajectory tracking control method provided by the invention does not depend on a model, a target strategy for enabling a control target to reach the optimum is automatically learned through sampling data of the AUV in the driving process, no assumption needs to be made on the AUV model in the process, and the method is particularly suitable for the AUV working in a complex deep sea environment and has high practical application value.

2) The method adopts a plurality of quasi-Q learning to obtain the evaluation network target value which is more accurate than that of the existing method, thereby not only reducing the variance of the action value function obtained by the evaluation network approximation, but also solving the problem of over-estimation of the action value function, further obtaining a more optimal target strategy and realizing high-precision AUV trajectory tracking control.

3) The method determines which evaluation network should be updated at each time step based on the expected Bellman absolute error, and the updating rule can weaken the influence of poor evaluation networks, thereby ensuring the rapid convergence of the learning process.

4) The method adopts a plurality of evaluation networks, so that the learning process is not easily influenced by the severe AUV historical tracking track, the robustness is good, and the learning process is stable.

5) The method combines reinforcement learning and a deep neural network, has strong self-learning capability, can realize high-precision self-adaptive control on the AUV in an uncertain deep sea environment, and has good application prospects in scenes such as AUV track tracking, underwater obstacle avoidance and the like.

Drawings

FIG. 1 is a graph comparing the performance of the proposed method of the present invention with the existing DDPG method; wherein, the graph (a) is a comparison graph of learning curves, and the graph (b) is a comparison graph of AUV trajectory tracking effect.

FIG. 2 is a graph comparing the performance of the proposed method of the present invention with a neural network PID method; wherein, the graph (a) is a comparison graph of the coordinate track tracking effect of the AUV along the X, Y direction, and the graph (b) is a comparison graph of the tracking error of the AUV in the X, Y direction.

Detailed Description

The invention provides an autonomous underwater vehicle track tracking control method based on deep reinforcement learning, which is further described in detail below by combining the accompanying drawings and specific embodiments.

The invention provides an autonomous underwater vehicle tracking control algorithm based on deep reinforcement learning, which mainly comprises four parts: defining an AUV trajectory tracking control problem, establishing a Markov decision process model of the AUV trajectory tracking problem, constructing a hybrid strategy-evaluation network structure and solving a target strategy of AUV trajectory tracking control.

1) Defining AUV trajectory tracking control problem

Defining the AUV trajectory tracking control problem includes four components: determining AUV system input, determining AUV system output, defining a trajectory tracking control error and establishing an AUV trajectory tracking control target; the method comprises the following specific steps:

1-1) determining AUV System inputs

Let AUV system input vector be tau_k＝[ξ_k,_k]^TWherein ξ_k、_kRespectively the propeller thrust and rudder angle of the AUV, subscript k indicating the value of the kth time step, i.e. time k.t, where t is the time step, the same applies below ξ_k、_kRespectively have a value range of

And

wherein

The maximum propeller thrust and the maximum rudder angle are respectively determined according to the model of the propeller adopted by the AUV.

1-2) determining AUV System output

Let AUV system output vector be η_k＝[x_k,y_k,ψ_k]^TWherein x is_k、y_kThe coordinates along the X, Y axis, # at the kth time step AUV in the inertial frame I-XYZ, respectively_kIs the angle between the advancing direction of the AUV at the kth time step and the X axis.

1-3) defining a tracking control error

Selecting a reference trajectory according to the driving path of the AUV

Defining the AUV trajectory tracking control error of the kth time step as:

1-4) establishing AUV track tracking control target

For the reference track d in step 1-3)_kAn objective function of the form:

wherein γ is a discount factor and H is a weight matrix;

The Markov Decision Process (MDP) is the basis of the reinforcement learning theory, and therefore MDP modeling is required for the AUV trajectory tracking problem in step 1). The main elements of reinforcement learning comprise an agent, an environment, a state, an action and an incentive function, and the agent learns an optimal action (or control input) sequence through interaction with the environment of the AUV to maximize accumulated incentive (or minimize accumulated tracking control error), so that the AUV trajectory tracking objective is solved. The method comprises the following specific steps:

2-1) defining a state vector

Defining the velocity vector of the AUV system as phi_k＝[u_k,v_k,χ_k]^TWherein u is_k、v_kLinear velocity along and perpendicular to the advancing direction, χ, of the kth time step AUV_kIs the k-thTime step AUV surrounds the angular velocity of the heading.

2-2) defining motion vectors

Defining the motion vector of the kth time step as the AUV system input vector of the time step, namely: a is_k＝τ_k。

2-3) defining a reward function

wherein,

expected values representing reward functions, states and actions (the same below); k is the maximum time step;

the action value function is used to describe the expected cumulative discount reward when strategy pi is taken in all states, current and later, and thus, under the reinforcement learning frameworkThe AUV track tracking control target (namely the target of the agent) learns an optimal target strategy pi through interaction with the environment where the AUV is located^*So that the action value at the initial time is maximum, namely:

wherein, p(s)₀) Is in an initial state s₀The distribution of (a); a is₀Is the initial motion vector.

Therefore, the target tau of the AUV track tracking control established in the step 1-4)^*Can be converted into pi^*And (4) solving.

Similar to dynamic programming, many reinforcement learning methods solve the action value function in step 2-4) using the following iterative bellman equation:

assuming that the policy pi is deterministic, i.e. the state vector space of the AUV is a one-to-one mapping relation to the action vector space of the AUV, and is denoted as μ, then the above iterative bellman equation can be simplified as:

furthermore, for a deterministic strategy μ, the optimal target strategy π in step 2-4) is set^*Reduction to deterministic optimal target strategy mu^*：

3) Constructing hybrid policy-evaluation networks

From the step 2-5), the core of solving the AUV trajectory tracking problem by using reinforcement learning is how to solve the deterministic optimal target strategy mu^*And the corresponding mostFunction of optimal action value

The method adopts a mixed strategy-evaluation network to respectively estimate mu^*And

3-1) constructing a policy network

The mixed strategy-evaluation network structure is characterized in that n strategy networks (the values of which are not too large or too small in order to balance the tracking control precision of the algorithm and the network training speed) are constructed

To estimate a deterministic optimal target strategy mu^*. Wherein, theta_pThe weight parameter of the p-th policy network is p 1, …, n; each strategy network is realized by using a fully-connected deep neural network, each strategy network comprises an input layer, two hidden layers and an output layer, and the input of each strategy network is a state vector s_kThe output of each policy network is an action vector a_kThe two hidden layers contain 400 and 300 cells, respectively.

3-2) construction of evaluation network

The mixed strategy-evaluation network structure is characterized in that m (the selection basis of the number of evaluation networks is the same as that of the strategy networks) evaluation networks are constructed

To estimate the optimal action value function

Wherein,^w _qthe weighting parameter for the qth evaluation network, q is 1, …, m; each evaluation network is realized by using a fully-connected deep neural network respectively, and each evaluation network is realized by using a fully-connected deep neural networkRespectively comprises an input layer, two hidden layers and an output layer, wherein the two hidden layers respectively comprise 400 units and 300 units; the input to each evaluation network is a state vector s_kAnd motion vector a_kWherein the state vector s_kThe motion vector a is input from the input layer to each evaluation network_kFrom the first hidden layer to each evaluation network, the output of each evaluation network is the state vector s_kLower taken action vector a_kThe action value of (1).

3-3) determining a target policy

4-1) parameter setting

Respectively setting a maximum iteration number M, a maximum time step K of each iteration, a training set size N extracted by experience playback and learning rates α of each evaluation network_ωLearning rate α of each policy network_θThe discount factor y and the weight matrix H in the reward function, in this example, M is 1500, K is 1000 (each time step t is 0.2s), N is 64, α for each evaluation network_ωα for each policy network ═ 0.01_θ＝0.001，γ＝0.99，H＝[0.001,0；0,0.001]；

4-2) initializing hybrid policies-evaluating networks

Randomly initializing n policy networks

And m evaluation networks

Weight parameter θ of_pAnd w_q(ii) a Randomly selecting the d (d is 1, …, n) th policy network from the n policy networks and recording the d (d is 1, …, n) th policy network

Constructing an empirical queuing set R, setting the maximum capacity of the empirical queuing set R as B (in this embodiment, B is 10000), and initializing to be empty;

4-4) setting the current time step k to 0, and randomly initializing the state variable s of the AUV₀Let the state variable s of the current time step_k＝^s ₀(ii) a And generates an exploratory Noise_k(this example uses Ornstein-Uhlenbeck to explore noise);

4-5) network based on n current policies

selecting A experience samples from an experience queue set R, wherein A is less than or equal to N, and the details are as follows: when the number of the samples in the experience queuing set R does not exceed N, selecting all experience samples in the experience queuing set R; when the experience is queuedWhen the set R exceeds N, N experience samples(s) are randomly selected from the experience queuing set R_l,a_l,r_l+1,s_l+1) L is the time step of the selected experience sample;

4-8) evaluating the network by the c

The formula is as follows:

the weight parameters of the rest evaluation networks are kept unchanged;

Deterministic strategy gradients

And update the d-th policy network accordingly

Weight parameter θ of_dThe calculation formulas are respectively as follows:

the weight parameters of the remaining policy networks remain unchanged.

4-14) let k be k +1 and decide k: if K is less than K, returning to the step 4-5), and continuing to track the reference track by the AUV; otherwise, go to step 4-15).

4-15) let epamode be equal to epamode +1 and determine epamode: if the epicode is less than M, returning to the step 4-4), and carrying out the next iterative process by the AUV; otherwise, go to step 4-16).

Validity verification of embodiments of the invention

The performance analysis of the AUV trajectory tracking control method (MPQ-DPG for short) based on deep reinforcement learning provided by the invention is shown as follows, and all comparison experiments are based on the widely used REMUS autonomous unmanned aircraft, and the maximum propeller thrust of the REMUS autonomous unmanned aircraft is

Rudder angle

86N and 0.24rad, respectively; and the following reference trajectories are adopted:

in addition, in the embodiment of the present invention, the evaluation network number m is the same as the policy network number n, and is hereinafter collectively referred to as n.

1) Comparison analysis of MPQ-DPG and existing DDPG method

Fig. 1 is a comparison between the learning curve and the trajectory tracking effect of the AUV-proposed trajectory tracking control method (MPQ-DPG) for deep reinforcement learning proposed by the present invention and the existing DDPG method in the training process. The learning curve in graph (a) is obtained by five independent experiments, and Ref in graph (b) represents a reference trajectory.

Analyzing fig. 1, the following conclusions can be drawn:

a) compared with a DDPG method, the learning stability of the MPQ-DPG is better, because the MPQ-DPG adopts a plurality of evaluation networks and strategy networks, the influence of poor samples on the learning stability can be reduced.

b) The average accumulated reward of final convergence of the MPQ-DPG method is obviously higher than that of the DDPG method, which shows that the tracking control precision of the MPQ-DPG method is obviously higher than that of the DDPG method.

c) It can be observed from fig. 1(b) that the tracking trajectory obtained by the MPQ-DPG method almost coincides with the reference trajectory, which shows that the MPQ-DPG method can realize high-precision AUV tracking control.

d) With the increase of the number of strategy networks and evaluation networks, the tracking control precision of the MPQ-DPG method can be gradually improved, but the improvement amplitude is not obvious after n > 4.

2) Comparison analysis of MPQ-DPG method and existing neural network PID method

FIG. 2 is a comparison between the MPQ-DPG method and the neural network PID method provided for the underwater unmanned vehicle trajectory tracking control in the invention on a coordinate trajectory tracking curve and a coordinate trajectory tracking error. In the figure, Ref represents a reference coordinate track, PIDNN represents a neural network PID algorithm, and n is 4.

Analyzing the figure 2, the tracking performance of the neural network PID control method is obviously inferior to that of the MPQ-DPG method provided by the invention; in addition, the tracking error in fig. 2(b) shows that the MPQ-DPG method can achieve faster convergence of the error, and particularly in the initial stage, the MPQ-DPG method can still achieve fast and high-precision tracking performance, while the response time of the neural network PID method is significantly longer than that of the MPQ-DPG method, and the convergence of the tracking error is poor.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. An autonomous underwater vehicle track tracking control method based on deep reinforcement learning is characterized by comprising the following steps:

1) defining AUV (autonomous Underwater vehicle) track tracking control problem

1-1) determining AUV System inputs

And

maximum propeller thrust and maximum rudder angle respectively;

1-2) determining AUV System output

1-3) defining a tracking control error

Selecting a reference trajectory according to the driving path of the AUV

Defining the AUV trajectory tracking control error of the kth time step as:

1-4) establishing AUV track tracking control target

For the reference track d in step 1-3)_kSelecting an objective function of the formNumber:

wherein γ is a discount factor and H is a weight matrix;

2-1) defining a state vector

2-2) defining motion vectors

2-3) defining a reward function

The reward function at the kth time step being used to characterize the state s_kTaking action a_kAccording to the track defined in step 1-3)Tracking control error e_kAnd the motion vector a defined in step 2-2)_kThe AUV reward function defining the kth time step is as follows:

wherein,

3) Constructing hybrid policy-evaluation networks

3-1) constructing a policy network

Hybrid policy-evaluation network architecture by constructing n policy networks

To estimate a deterministic optimal target strategy mu^*(ii) a Wherein, theta_pThe weight parameter of the p-th policy network is p 1, …, n; each strategy network is realized by using a fully-connected deep neural network respectively, and comprises an input layer, two hidden layers and an output layer; the input of each policy network is a state vectors_kThe output of each policy network is an action vector a_k；

3-2) construction of evaluation network

To estimate the optimal action value function

3-3) determining a target policy

4-1) parameter setting

4-2) initializing hybrid policies-evaluating networks

Randomly initializing n policy networks

And m evaluation networks

d＝1,…,n；

4-5) network based on n current policies

4-6) AUV in Current State s_kLower execution action a_kObtaining the reward function r according to the step 2-3)_k+1And observe a new state s_k+1(ii) a Note e_k＝(s_k,a_k,r_k+1,s_k+1) Is an experience sample; if the number of samples of the empirical queuing set R has reached the maximum capacity B, thenDeleting the first added sample and then using the experience sample e_kStoring the data into an experience queue set R; otherwise, the experience sample e is directly used_kStoring the data into an experience queue set R;