JP4586129B2 - Controller, control method and control program

Info

Publication number: JP4586129B2
Application number: JP2008077671A
Authority: JP
Inventors: 哲郎森村; 英治内部; 潤一郎吉本; 賢治銅谷
Original assignee: OKINAWA INSTITUTE OF SCIENCE AND TECHNOLOGY
Current assignee: OKINAWA INSTITUTE OF SCIENCE AND TECHNOLOGY
Priority date: 2008-03-25
Filing date: 2008-03-25
Publication date: 2010-11-24
Anticipated expiration: 2028-03-25
Also published as: JP2009230645A

Description

The present invention relates to the configuration of a controller, a control method, and a control program that control a controlled object by a policy gradient method.

Control problems formulated as a "Markov decision process" are an important class of problems with a wide range of applications in the autonomous control of robots, plants, and mobile machines (trains, automobiles).

So-called "reinforcement learning" is a conventional technique for optimal control of Markov decision processes.

"Reinforcement learning" is a theoretical framework in which an agent, through trial and error in interaction with its environment, learns a behavioral rule called a "policy" that maximizes the cumulative reward obtained; when applied to a control problem, the policy is a "control rule". This learning method has attracted attention from various fields because it requires almost no a priori knowledge about the environment or the agent itself.

Reinforcement learning methods can be roughly classified into two types. The first is the "value function update method", which represents the policy indirectly through a value function, so that updating the value function also updates the policy. The second is the "direct policy update method (policy gradient method)", which holds the policy explicitly and updates it along the gradient of an objective function.

The policy gradient method has attracted particular attention because it can acquire a stochastic policy by including the parameters that control the randomness of actions among the policy parameters, and because it is well suited to continuous systems. In general, however, when it is applied to a real task, the time needed to acquire an appropriate behavioral rule can become unrealistic. Research on shortening the learning time with auxiliary mechanisms, such as the simultaneous use of multiple learners, the use of models, and the use of teaching signals, has therefore been active and has produced notable results.

Policy gradient reinforcement learning (PGRL) is a general class of reinforcement learning (RL) algorithms that improve the policy parameters so as to maximize the average reward, using the partial derivative of the average reward with respect to the policy parameters. This partial derivative of the average reward is called the policy gradient (PG). Conventional PGRL algorithms (PG algorithms) [Non-Patent Document 1, Non-Patent Document 2] ignore the term of the policy gradient that depends on the change in the stationary distribution caused by a change in the policy parameters, because the partial derivative of the stationary state distribution is difficult to compute. These derivatives are called the partial derivatives of the logarithm of the stationary distribution, or LSDG ((Log) Stationary Distribution Gradients). The bias introduced by this omission decreases as the discount rate γ approaches 1, but the variance of the estimated derivative then increases. This trade-off makes it difficult in practice to find an appropriate γ. The mixing time of the Markov chain is a measure for determining an appropriate γ [Non-Patent Document 1, Non-Patent Document 3]. However, the mixing time depends on the policy and is therefore generally difficult to predict before learning is complete.

Average reward PG algorithms [Non-Patent Document 4, Non-Patent Document 5] avoid the discount rate by introducing a differential cost function as the solution of a Poisson equation. However, it has been suggested that, in theory, there is no significant difference in performance between the ordinary PG with a discount rate close to 1 and the average reward PG [Non-Patent Document 6].

A learning method is therefore needed that resolves the trade-off described above, in which reducing the bias in the estimated policy parameters increases the variance.

Prior art documents related to the policy gradient learning method that are cited in the text are listed below.
[Non-Patent Document 1] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[Non-Patent Document 2] H. Kimura and S. Kobayashi. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In International Conference on Machine Learning, 1998.
[Non-Patent Document 3] S. Kakade. Optimizing average reward using discounted rewards. In Annual Conference on Computational Learning Theory, volume 14. MIT Press, 2001.
[Non-Patent Document 4] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999.
[Non-Patent Document 5] V. S. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143-1166, 2001.
[Non-Patent Document 6] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2):179-191, 2002.
[Non-Patent Document 7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75-84, 1991.
[Non-Patent Document 8] R. Y. Rubinstein. How to optimize discrete-event system from a single sample path by the score function method. Annals of Operations Research, 27(1):175-212, 1991.
[Non-Patent Document 9] A. Y. Ng, R. Parr, and D. Koller. Policy search via density estimation. In Advances in Neural Information Processing Systems. MIT Press, 2000.
[Non-Patent Document 10] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.
[Non-Patent Document 11] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[Non-Patent Document 12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[Non-Patent Document 13] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
[Non-Patent Document 14] J. A. Boyan. Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002.
[Non-Patent Document 15] R. B. Schinazi. Classical and Spatial Stochastic Processes. Birkhauser, 1999.
[Non-Patent Document 16] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22:283-290, 1996.
[Non-Patent Document 17] P. Young. Recursive Estimation and Time-series Analysis. Springer-Verlag, 1984.
[Non-Patent Document 18] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Two methods for estimating the gradient of the (stationary) state distribution already exist, but they differ from the method of the present invention and have the following problems. The first, called the "likelihood ratio gradient method" or "score function method" [Non-Patent Document 7, Non-Patent Document 8], can be applied only to regenerative processes [Non-Patent Document 1]. The second, disclosed in Non-Patent Document 9, is not a direct estimation; it is carried out through estimation of the state distribution with density propagation. These methods therefore require knowledge of which state the agent is in, whereas the method of the present invention, described later, only needs to observe a feature vector of the state, which may contain noise.

The present invention has been made to solve the problems described above. Its object is to provide a controller, a control method, or a control program that uses a policy gradient learning method that resolves the trade-off in which reducing the bias in the policy parameters increases the variance.

Another object of the present invention is to provide a controller, a control method, or a control program that uses a versatile policy gradient learning method.

To achieve these objects, the controller of the present invention is a controller that, when the time evolution of a target system is described as a forward Markov decision process, learns a stochastically expressed policy, which is a control law for the state of the system, by reinforcement learning with a policy gradient method based on observation of state quantities of the system. The controller comprises: control signal generation means for generating, based on the policy, a control signal for controlling the system; state quantity detection means for observing the state quantities of the system; reward value acquisition means for acquiring a reward value that depends on the state and the control signal through a predetermined relationship; policy gradient estimation means for estimating the gradient of the policy, where the policy is defined by policy parameters that specify the stochastic policy, by estimating, from the state quantity and the control signal at each time step, the log stationary distribution derivative, that is, the partial derivative with respect to the policy parameters of the logarithm of the stationary distribution of the system state; and policy update means for updating the policy, based on the reward value and the estimation result of the policy gradient estimation means, by changing the policy parameters slightly in the direction of the average reward derivative estimated using the log stationary distribution derivative.

Preferably, the policy gradient estimation means estimates the log stationary distribution derivative by TD learning on the backward Markov chain, under the constraint that the expected value of the log stationary distribution derivative over the forward Markov chain is 0. This constraint is derived from the condition that the sum of the state distribution over all states is constant.

Preferably, the TD learning estimates the log stationary distribution derivative as follows. Let δ be the difference between (a) the sum of the one-step-earlier observation of the partial derivative of the logarithm of the policy in the backward Markov chain and the one-step-earlier log stationary distribution derivative, and (b) the log stationary distribution derivative of the current state. The estimate minimizes the sum of i) the expected value of the square of δ over the forward Markov chain and ii) the square of the expected value of the log stationary distribution derivative over the forward Markov chain.
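Written out, the constraint and the objective described in the two preceding paragraphs can be sketched as follows. The symbols d_θ(x) for the stationary state distribution and π(u | x; θ) for the policy, and the reading of "the square" as a squared vector norm, are notational assumptions made here; the source states the objective only in words.

```latex
% Constraint: the stationary distribution sums to one, so its log-derivative
% has zero mean under the forward chain.
\sum_{x} d_\theta(x) = 1
\;\Longrightarrow\;
\mathbb{E}\left[\nabla_\theta \ln d_\theta(x)\right]
= \sum_{x} \nabla_\theta d_\theta(x) = 0 .

% TD error along the backward chain (subscript -1: one time step earlier).
\delta(x) = \nabla_\theta \ln \pi(u_{-1} \mid x_{-1};\theta)
          + \nabla_\theta \ln d_\theta(x_{-1})
          - \nabla_\theta \ln d_\theta(x) .

% Objective minimized to estimate the log stationary distribution derivative.
\min \; \mathbb{E}\left[\lVert \delta(x) \rVert^{2}\right]
      + \left\lVert \mathbb{E}\left[\nabla_\theta \ln d_\theta(x)\right] \right\rVert^{2} .
```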

According to another aspect of the present invention, a control method is provided that, when the time evolution of a target system is described as a forward Markov decision process, learns a stochastically expressed policy, which is a control law for the state of the system, by reinforcement learning with a policy gradient method based on observation of state quantities of the system. The method comprises: a control signal generation step of generating, based on the policy, a control signal for controlling the system; a state quantity detection step of observing the state quantities of the system; a reward value acquisition step of acquiring a reward value that depends on the state and the control signal through a predetermined relationship; a policy gradient estimation step of estimating the gradient of the policy, where the policy is defined by policy parameters that specify the stochastic policy, by estimating, from the state quantity and the control signal at each time step, the log stationary distribution derivative, that is, the partial derivative with respect to the policy parameters of the logarithm of the stationary distribution of the system state; and a policy update step of updating the policy, based on the reward value and the result of the policy gradient estimation, by updating the policy parameters in the direction of the gradient of the average reward expressed in terms of the log stationary distribution derivative.

According to still another aspect of the present invention, a program is provided that causes a computer to execute a control method that, when the time evolution of a target system is described as a forward Markov decision process, learns a stochastically expressed policy, which is a control law for the state of the system, by reinforcement learning with a policy gradient method based on observation of state quantities of the system. The program causes the computer to execute a control method including: a control signal generation step of generating, based on the policy, a control signal for controlling the system; a state quantity detection step of observing the state quantities of the system; a reward value acquisition step of acquiring a reward value that depends on the state and the control signal through a predetermined relationship; a policy gradient estimation step of estimating the gradient of the policy, where the policy is defined by policy parameters that specify the stochastic expression, by estimating, from the state quantity and the control signal at each time step, the log stationary distribution derivative, that is, the partial derivative with respect to the policy parameters of the logarithm of the stationary distribution of the system state; and a policy update step of updating the policy, based on the reward value and the estimation result of the policy gradient estimation step, by updating the policy parameters in the direction of the gradient of the average reward expressed in terms of the log stationary distribution derivative.

The description below is organized as follows. (1. Overview of the present invention) describes the overall configuration of the present invention. (2. Premise: Policy gradient reinforcement learning) reviews the conventional PG method and outlines the purpose of estimating the LSDG in the present invention. (3. Estimation of the partial derivative of the logarithm of the stationary distribution) describes the LSDG(λ) algorithm, a least-squares TD method based on the backward Markov chain. (4. Policy update by LSDG estimation) describes the LSDG(λ)-PG algorithm, which uses LSDG(λ) and does not use a discount rate γ. (5. Results of numerical calculation) shows results on a simple Markov decision process (MDP) to confirm the performance of the proposed method.

(1. Overview of the present invention)
As described later, the present invention proposes a new policy gradient framework in which the LSDG, the gradient of the stationary distribution, is derived by the backward Markov chain method and a TD (temporal difference) learning algorithm. In this framework, the partial derivative of the average reward does not depend on the discount rate γ, and setting γ = 0 removes the need to learn a value function.

Embodiments of the present invention are described below with reference to the drawings.
As will become clear from the following description, the present invention has a wide range of applications to control problems for robots, plants, mobile machines (trains, automobiles), and the like.

In the following, however, a particularly simple automatic robot control problem is used as a concrete application example of the present invention, and the numerical results show a comparison on an even simpler model. The present invention is not limited to such applications. More generally, it can be applied to the control of a target system whose time evolution is complex. Examples include large plants (blast furnaces, nuclear power plants), multi-link robots (humanoid robots), nonholonomic systems (space stations), and the flow of people on subway platforms. All of these are difficult to control with classical control methods and are important control targets.

(1. System configuration of the present invention)
FIG. 1 is a conceptual diagram showing an example of a system 1000 using a controller to which the control method and control program of the present invention are applied.

Referring to FIG. 1, the system 1000 includes a controlled device 200 to be controlled and a computer 100 that supplies a control signal to the controlled device 200.

Referring to FIG. 1, the computer 100 includes: a computer main body 102 equipped with a CD-ROM drive 108 for reading information from a CD-ROM (Compact Disc Read-Only Memory) and an FD drive 106 for reading and writing information on a flexible disk (hereinafter, FD) 116; a display 104 connected to the computer main body 102 as a display device; and a keyboard 110 and a mouse 112, also connected to the computer main body 102, as input devices.

FIG. 2 is a block diagram showing the configuration of the computer 100.
As shown in FIG. 2, the computer main body 102 of the computer 100 includes, in addition to the CD-ROM drive 108 and the FD drive 106, the following components, each connected to a bus BS: a CPU (Central Processing Unit) 120; a memory 122 including ROM (Read Only Memory) and RAM (Random Access Memory); a direct access memory device such as a hard disk 124; and a communication interface 128 for exchanging data with the controlled device 200. A CD-ROM 118 is loaded into the CD-ROM drive 108, and an FD 116 is loaded into the FD drive 106.

The controlled device 200 supplies the computer 100 with information on parameters (state quantities) that indicate the state of the controlled device 200, such as the position, velocity, acceleration, angle, and angular velocity of its movable parts. The computer 100 in turn supplies the controlled device 200 with control information for controlling these state quantities, in the form of a control signal.

The CD-ROM 118 may be replaced by any other medium that can record information such as a program to be installed on the computer main body, for example a DVD-ROM (Digital Versatile Disc) or a memory card. In that case, the computer main body 102 is provided with a drive device that can read the medium.

The main part of the controller of the present invention consists of computer hardware and software executed by the CPU 120. Such software is generally distributed on a storage medium such as the CD-ROM 118 or the FD 116, read from the medium by the CD-ROM drive 108 or the FD drive 106, and stored temporarily on the hard disk 124. Alternatively, when the device is connected to a network, the software is first copied from a server on the network to the hard disk 124. It is then read from the hard disk 124 into the RAM in the memory 122 and executed by the CPU 120. With a network connection, the software may also be loaded directly into the RAM and executed without being stored on the hard disk 124.

The computer hardware shown in FIGS. 1 and 2 and its operating principle are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 116, the CD-ROM 118, or the hard disk 124.

As a general trend, various program modules are commonly provided as part of a computer's operating system, and an application program proceeds by calling these modules in a predetermined sequence when needed. In that case, the software that implements the controller does not itself contain such modules, and the controller is realized only when the software cooperates with the operating system on the computer. However, as long as a common platform is used, there is no need to distribute software that includes such modules. The software that does not include them, and the recording medium on which it is recorded (or the data signal, when the software is distributed over a network), can be regarded as constituting the embodiment.

[General description of the control method]
The theoretical basis of the configuration of the present invention is described first.

(Configuration of the controller)
(2. Premise: Policy gradient reinforcement learning)
In the following, we consider a Markov decision process (MDP). The controlled object (the environment of the controller) is characterized by a state transition probability (the probability that the state becomes x_{t+1} when action (control) u_t is executed in state x_t at time t) and a reward function r_{t+1} = r(x_t, u_t) (the same discussion applies when r_{t+1} = r(x_{t+1}, x_t, u_t)). The reward function is assumed to be determined in advance according to the control objective for the controlled object. A mapping from a state input x ∈ X to an action output u ∈ U is called a policy and is expressed stochastically as described below. The policy is specified by a parameter θ.

This section reviews conventional PGRL algorithms and presents the main idea of the new algorithm of the present invention. A discrete-time MDP with a finite set of states X ∋ x and a finite set of actions U ∋ u is defined by the following state transition probability p and reward function r_{+1}.

Here, for simplicity of notation, x_{+1} is the next state reached from state x by action u, and r_{+1} is the immediate reward observed at x_{+1} [Non-Patent Document 10, Non-Patent Document 11]. x_{+k} and u_{+k} are the state and action k time steps after state x, and a subscript of -k denotes the opposite direction, k time steps before. Decisions in the MDP are made according to the following stochastic policy π, parameterized by θ ∈ R^d.

The policy π is assumed to be differentiable with respect to θ for all x ∈ X and u ∈ U.
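As an illustration of a stochastic policy that is differentiable in θ, the sketch below uses a tabular softmax parameterization. This parameterization, and the names `softmax_policy` and `grad_log_policy`, are assumptions chosen for the example; the source does not specify a particular policy form here.

```python
import math

def softmax_policy(theta, x):
    """Return pi(u | x; theta) for every action u as a list of probabilities.

    theta[x][u] is a hypothetical per-state, per-action preference.
    """
    prefs = theta[x]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_policy(theta, x, u):
    """Gradient of ln pi(u | x; theta) with respect to theta (same shape as theta).

    For a tabular softmax, d ln pi(u|x) / d theta[x][a] = 1[a == u] - pi(a|x);
    the derivative with respect to the rows of other states is zero.
    """
    probs = softmax_policy(theta, x)
    grad = [[0.0] * len(row) for row in theta]
    for a, p in enumerate(probs):
        grad[x][a] = (1.0 if a == u else 0.0) - p
    return grad

theta = [[0.5, -0.2], [0.0, 1.0]]  # 2 states x 2 actions, arbitrary values
probs = softmax_policy(theta, 0)
```

The log-derivative computed by `grad_log_policy` is the quantity that policy gradient methods weight by rewards or values; its expectation under the policy is zero for every state.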

In addition, the following assumption is made.
(Assumption 1)
The Markov chain M(θ) with the following state transition probability p and stochastic policy π is ergodic (irreducible and aperiodic) for all policy parameters.

Therefore, the following unique stationary distribution exists.

This stationary distribution is independent of the initial state and satisfies the following equation.
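A minimal sketch of how such a stationary distribution can be computed for a small, fully known chain is given below. The two-state MDP, the fixed policy table, and the function names are hypothetical; the sketch assumes the balance condition d(y) = Σ_x Σ_u d(x) π(u | x) p(y | x, u) and uses power iteration, which converges for an ergodic chain.

```python
def state_transition_matrix(p, pi):
    """P[x][y] = sum_u pi[x][u] * p[x][u][y]: the state chain induced by the policy."""
    n = len(p)
    return [[sum(pi[x][u] * p[x][u][y] for u in range(len(pi[x])))
             for y in range(n)] for x in range(n)]

def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Power iteration d <- d P, starting from the uniform distribution."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(d[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(new, d)) < tol:
            return new
        d = new
    return d

# Hypothetical 2-state, 2-action MDP: p[x][u][y] = Pr(next state = y | x, u)
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
pi = [[0.6, 0.4], [0.3, 0.7]]  # pi[x][u] = Pr(u | x)
P = state_transition_matrix(p, pi)
d = stationary_distribution(P)
```

Because the chain is ergodic, the result does not depend on the starting vector; any other initial distribution converges to the same `d`.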

Here, the following equation holds.

The objective of PGRL is to find the policy parameter θ^* that maximizes the average of the immediate rewards, called the "average reward", defined below.

Under Assumption 1, the average reward is independent of the initial state x and can be shown to equal the following expression [Non-Patent Document 10]:
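For a reward of the form r(x, u), the average reward under the stationary distribution is the expectation R(θ) = Σ_x Σ_u d(x) π(u | x) r(x, u). The sketch below evaluates this sum for a small table; the numbers and the function name `average_reward` are hypothetical, and the stationary distribution is simply given rather than derived from a transition model.

```python
def average_reward(d, pi, r):
    """R = sum_x sum_u d[x] * pi[x][u] * r[x][u]."""
    return sum(d[x] * pi[x][u] * r[x][u]
               for x in range(len(d)) for u in range(len(pi[x])))

d = [0.25, 0.75]               # hypothetical stationary distribution over 2 states
pi = [[0.5, 0.5], [0.2, 0.8]]  # pi[x][u] = Pr(u | x)
r = [[1.0, 0.0], [0.0, 2.0]]   # r[x][u] = immediate reward
R = average_reward(d, pi, r)
```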

A policy gradient RL algorithm updates the policy parameter θ in the direction of the gradient of the average reward R(θ) with respect to θ, shown in the following equation.

This gradient is often referred to simply as the policy gradient (PG). The policy gradient is given as follows.
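The equation itself did not survive extraction. Differentiating the average reward R(θ) = Σ_x Σ_u d_θ(x) π(u | x; θ) r(x, u) term by term gives the form below, which is consistent with the later statement that equation (3) is used together with the LSDG. The notation d_θ and π(u | x; θ) is an assumption made for this sketch.

```latex
\nabla_\theta R(\theta)
= \sum_{x \in X} \sum_{u \in U} d_\theta(x)\, \pi(u \mid x;\theta)\,
  \bigl\{ \nabla_\theta \ln d_\theta(x) + \nabla_\theta \ln \pi(u \mid x;\theta) \bigr\}\, r(x,u)
\tag{3}
```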

以下の式に示される定常分布の対数の偏微分の導出は簡単ではない。 Deriving the partial derivative of the logarithm of the steady distribution shown in the following equation is not easy.

そこで、従来のＰＧアルゴリズム［非特許文献１，非特許文献２］は、ＰＧのもう一つの表現を利用している。 Therefore, the conventional PG algorithm [Non-Patent Document 1, Non-Patent Document 2] uses another expression of PG.

ここで、割引率γ∈［０，１）で、それぞれ、行動価値関数Ｑと状態価値関数Ｖとは以下のように表される［非特許文献１１］。 Here, the action value function Q and the state value function V are expressed as follows, respectively, with a discount rate γ∈ [0, 1) [Non-patent Document 11].

式（４）の第２項の寄与は、γが１に近づくにつれて小さくなるので［非特許文献１］、従来のアルゴリズム［非特許文献１，非特許文献２］は、γ〜１とすることで、第１項のみからＰＧを近似している。このような省略による偏りは、割引率γを1に近づければ小さくなるが、推定の分散は多くなってしまう。 Because the contribution of the second term of Equation (4) becomes smaller as γ approaches 1 [Non-Patent Document 1], the conventional algorithms [Non-Patent Document 1, Non-Patent Document 2] approximate the PG from the first term alone by setting γ ≈ 1. The bias caused by this omission decreases as the discount rate γ approaches 1, but the variance of the estimate increases.
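One decomposition that matches this description can be sketched as follows. It is a reconstruction under assumed notation (d_θ for the stationary distribution, Q^γ and V^γ for the discounted action value and state value functions), not a copy of the original Equation (4), whose image is not reproduced here.

```latex
\nabla_\theta R(\theta)
  = \sum_{x \in X} \sum_{u \in U}
      d_\theta(x)\, \pi(u \mid x; \theta)\,
      \nabla_\theta \ln \pi(u \mid x; \theta)\, Q^{\gamma}(x, u)
  \;+\; (1 - \gamma) \sum_{x \in X}
      d_\theta(x)\, \nabla_\theta \ln d_\theta(x)\, V^{\gamma}(x)
```

In this form the factor (1 − γ) in front of the second term is the factor the text refers to when it says that term's contribution shrinks as γ approaches 1.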

ここで、本発明では、もう１つのアプローチを提案する。そこでは、以下の式の定常分布の対数の偏微分（ＬＳＤＧ）を推定し、ＰＧを導出するために式（３）を用いる。 The present invention proposes a different approach: the partial derivative of the logarithm of the stationary distribution (LSDG), given by the following expression, is estimated, and Equation (3) is then used to derive the PG.

著しい特徴は、価値関数を学習する必要がなく、したがって、そのアルゴリズムは、割引率γの選択において、偏りと分散のトレードオフと関係がないことである。
（３．定常分布の対数の偏微分の推定）
以下では、最小二乗法に基づくＬＳＤＧ推定アルゴリズム、ＬＳＤＧ（λ）を提案する。この目的のために、エルゴート的なマルコフ連鎖Ｍ（θ）の逆過程を定式化し、ＬＳＤＧは、ＴＤ法［非特許文献１２，非特許文献１３，非特許文献１４］の枠組みで推定できることを示す。
（３．１順方向および逆方向マルコフ連鎖の性質）
ベイズの理論を用いれば、現在の状態から過去の状態および行動の対への逆方向の確率は、以下の式で表される。 A notable feature is that no value function needs to be learned, so the algorithm is free of the bias-variance trade-off involved in choosing the discount rate γ.
(3. Estimation of partial derivative of logarithm of stationary distribution)
In the following, we propose LSDG(λ), an LSDG estimation algorithm based on the least-squares method. For this purpose, we formulate the backward process of the ergodic Markov chain M(θ) and show that the LSDG can be estimated in the framework of the TD method [Non-Patent Document 12, Non-Patent Document 13, Non-Patent Document 14].
(3.1 Properties of forward and backward Markov chains)
Using Bayes' theorem, the backward probability from the current state to the preceding state-action pair is expressed by the following equation.

以下の事後確率ｑは事前分布ｐに依存する。 The following posterior probability q depends on the prior distribution p.

以下のように、事前分布ｐが定常分布ｄと方策πに従うとき、事後分布ｑは、定常逆方向確率と呼ばれ、下付添え字Ｂ（θ）が加えられる。 As will be described below, when the prior distribution p follows the steady distribution d and the policy π, the posterior distribution q is called a steady reverse probability, and a subscript B (θ) is added.

マルコフ連鎖―Ｍ（θ）とＢ（θ）の両方―は、以下の２つの定理において記述されるように密接に関連している。
（定理１） The two Markov chains, M(θ) and B(θ), are closely related, as described in the following two theorems.
(Theorem 1)

（証明）
式（５）の両辺に以下の定常分布をかける。 (Proof)
Multiply both sides of Equation (5) by the following stationary distribution.

すると、全ての可能な行動ｕ_-1∈Ｕについて総和をとると、以下の式が得られる： Then, summing over all possible actions u_{−1} ∈ U gives the following equation:

そして、式（７）の両辺を可能な状態ｘ∈Ｘについて総和をとると、以下の式が成り立つ。 Summing both sides of Equation (7) over all possible states x ∈ X then yields the following equation.

このことは、以下の２点を成立させる．（ｉ）Ｂ（θ）は、Ｍ（θ）と同一の定常分布を有すること、（ｉｉ）Ｂ（θ）はＭ（θ）と同じ既約な性質を有すること。 This establishes the following two points: (i) B(θ) has the same stationary distribution as M(θ), and (ii) B(θ) has the same irreducibility property as M(θ).
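Point (i) can be checked numerically on a small example. The sketch below builds the state-to-state backward transition matrix from a forward matrix P by Bayes' theorem, with the action already marginalized out, and the check that follows confirms that the backward chain keeps the same stationary distribution. The helper names and the 3-state matrix are illustrative assumptions, not part of the patent.

```python
import numpy as np

def stationary_distribution(P):
    """Return d with d @ P = d and d.sum() = 1 for a row-stochastic, ergodic P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def backward_chain(P):
    """Backward matrix B[x, x_prev] = d(x_prev) * P[x_prev, x] / d(x)."""
    d = stationary_distribution(P)
    # (P * d[:, None])[x_prev, x] is the joint probability of (x_prev, x).
    joint = P * d[:, None]
    return joint.T / d[:, None], d

# Hypothetical forward chain M(theta) with the policy already folded in.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])
B, d = backward_chain(P)
```

Each row of `B` is a probability distribution over the preceding state, and `d @ B` reproduces `d`.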

このことは、（ｉｉｉ）Ｂ（θ）がＭ（θ）と同じ非周期的な性質をもっていることを示唆する。式（６）は、（ｉ）−（ｉｉｉ）により直接証明される［非特許文献１５］。
（定理２） This suggests that (iii) B (θ) has the same aperiodic nature as M (θ). Equation (6) is directly proved by (i)-(iii) [Non-Patent Document 15].
(Theorem 2)

（証明）
マルコフ連鎖の特性と式（５）を代入することにより、以下の関係が得られる。 (Proof)
Using the Markov property and substituting Equation (5), the following relationship is obtained.

このことは、有限のＫの場合において式（８）が成立することを証明する。定理１から以下の式が導かれるので、式（８）のＫ→∞の極限の場合も成立することが、すぐさま証明される。 This proves that the equation (8) holds in the case of finite K. Since the following equation is derived from Theorem 1, it is immediately proved that the K → ∞ limit of equation (8) holds.

定理１および定理２は、これらが、定常分布に収束する状態分布の下で、順方向マルコフ連鎖Ｍ（θ）からのサンプルは、そのまま、逆方向マルコフ連鎖Ｂ（θ）に関する推定に使用できることになるので、重要である。そして、これらは、後に利用されうるものである。
（３．２逆方向から順方向のマルコフ連鎖のＬＳＤＧのためのＴＤ（Temporal Difference）学習法）
ＬＳＤＧは式（５）を用いて、以下のように分解される。 Theorems 1 and 2 are important because they mean that, under a state distribution that has converged to the stationary distribution, samples from the forward Markov chain M(θ) can be used as they are for estimation on the backward Markov chain B(θ). Both theorems are used later.
(3.2 TD (Temporal Difference) learning for the LSDG from the backward to the forward Markov chain)
LSDG is decomposed as follows using equation (5).

式（１０）は、状態ｘのＬＳＤＧは、以下の式で表される方策の対数の偏微分の状態ｘから逆方向マルコフ連鎖Ｂ（θ）の無限区間の集積であることを暗示している。 Equation (10) implies that the LSDG of state x is the infinite-horizon accumulation, along the backward Markov chain B(θ) starting from state x, of the partial derivative of the logarithm of the policy given by the following expression.

式（９）および（１０）から、ＬＳＤＧは、Ｍ（θ）よりもむしろ逆方向マルコフ連鎖Ｂ（θ）についての、以下のような逆方向ＴＤ δに関するＴＤ学習[非特許文献１２]により推定されうる。 From Equations (9) and (10), the LSDG can be estimated by TD learning [Non-Patent Document 12] on the backward Markov chain B(θ), rather than on M(θ), using the following backward TD δ.

ここで、最初の２つの項は、Ｂ（θ）における方策の対数の偏微分の１ステップ前の実際の観測値と１ステップ前のＬＳＤＧであり、現在の状態のＬＳＤＧが最後の項である。 Here, the first two terms are the actual observation, one step back in B(θ), of the partial derivative of the logarithm of the policy and the LSDG one step back; the last term is the LSDG of the current state.
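The relation behind this backward TD can be sketched as follows. Differentiating the stationarity condition, which expresses the stationary probability of x as a sum over preceding state-action pairs, with respect to θ and dividing by the stationary probability of x gives a one-step recursion on the backward chain, and δ is the residual of that recursion. The notation (d_θ for the stationary distribution, E_{B(θ)} for the expectation on the backward chain) is assumed, since the original equation images are not reproduced here.

```latex
\nabla_\theta \ln d_\theta(x)
  = \mathbb{E}_{B(\theta)}\!\bigl[
      \nabla_\theta \ln \pi(u_{-1} \mid x_{-1}; \theta)
      + \nabla_\theta \ln d_\theta(x_{-1})
      \,\bigm|\, x \bigr],
\qquad
\delta(x)
  = \nabla_\theta \ln \pi(u_{-1} \mid x_{-1}; \theta)
    + \nabla_\theta \ln d_\theta(x_{-1})
    - \nabla_\theta \ln d_\theta(x)
```

With the true LSDG, the conditional expectation of δ(x) on the backward chain is zero for every state x, which is what makes TD-style estimation possible.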

適格度減衰率λ∈［０，１］と逆方向追跡時間ステップＫ∈Ｎを用いて、式（１０）は、以下のように一般化される。 Using the eligibility decay rate λε [0,1] and the backward tracking time step KεN, equation (10) is generalized as follows:

上記のような設定でなくとも、大きなλやＫを用いたならば、このような最小化は従来の価値関数に対するＴＤ（λ）学習の場合のように、非マルコフ効果に対しては、より敏感ではない［非特許文献１６］。 Even when the above conditions do not hold, if large values of λ and K are used, this minimization is less sensitive to non-Markov effects, as in TD(λ) learning for conventional value functions [Non-Patent Document 16].

理論的な仮定と現実への適用との間のギャップを埋めるために、以下の２つのうちのいずれかの戦略をとる必要がある。（ｉ）λ〜１ならば、Ｋは、あまり大きな整数に設定しない、（ｉｉ））Ｋ〜ｔならば、λは、１に設定しない、ここで、ｔは、現実の順方向マルコフ連鎖の現在のタイムステップである。
（３．３ＬＳＤＧ推定アルゴリズム：制限付き逆方向ＴＤ（λ）の最小二乗法） To bridge the gap between the theoretical assumptions and practical application, one of the following two strategies must be adopted: (i) if λ ≈ 1, K is not set to a very large integer; (ii) if K ≈ t, λ is not set to 1. Here, t is the current time step of the actual forward Markov chain.
(3.3 LSDG estimation algorithm: least-squares method for constrained backward TD(λ))

この３．３では、最小二乗法に基づく［非特許文献１７，非特許文献１３，非特許文献１４］、ＬＳＤＧ推定アルゴリズム、ＬＳＤＧ（λ）を提案する。これは、同時に、平均二乗を減少させるとともに、制約条件を満足することを達成しようとするものである。 Section 3.3 proposes LSDG(λ), an LSDG estimation algorithm based on the least-squares method [Non-Patent Document 17, Non-Patent Document 13, Non-Patent Document 14]. It aims to reduce the mean squared error while simultaneously satisfying the constraint.

したがって、最小化すべき対象となる関数は、以下の式（１３）となる。 Therefore, the function to be minimized is the following expression (13).

ここで、右辺の第２項は、式（１２）の制約条件のためのものである。したがって、式（１３）の偏微分は、以下のようになる。 Here, the second term on the right-hand side is for the constraint condition of Expression (12). Therefore, the partial differentiation of equation (13) is as follows.

一般的なＲＬ問題では、このような相関が存在するので、このような偏りを除くために、操作変数法（instrumental variable method）を適用する［非特許文献１７，非特許文献１３］。 In a general RL problem, such a correlation exists, and thus an instrumental variable method is applied to eliminate such a bias [Non-patent Document 17, Non-Patent Document 13].

以下では、現実のマルコフ連鎖Ｍ（θ）における時間ステップｔの状態を示すために、ノーテーションをｘ_ｔに変更する。提案するＬＳＤＧ推定アルゴリズム、ＬＳＤＧ（λ）は、適格性減衰率λ∈［０，１）の下で、逆方向にさかのぼる時間ステップＫを現在の状態ｘ_ｔのタイムステップｔと同じにする。すなわち、以下が成り立つ。 In the following, the notation is changed to x_t to denote the state at time step t in the actual Markov chain M(θ). The proposed LSDG estimation algorithm, LSDG(λ), uses an eligibility decay rate λ ∈ [0, 1) and sets the number of backward time steps K equal to the time step t of the current state x_t. That is, the following holds.

図３は、ＬＳＤＧ（λ）を求める手順をアルゴリズム１として示す図である。 FIG. 3 is a diagram showing a procedure for obtaining LSDG (λ) as algorithm 1.

また、図４は、アルゴリズム１を示すフローチャートである。
図４を参照して、まず、ステップ１００において、処理の前提として、以下の設定がなされる。 FIG. 4 is a flowchart showing the algorithm 1.
Referring to FIG. 4, first, in step S100, the following settings are made as a precondition of the processing.

続いて、初期化処理として、以下の処理が行われる（ステップＳ１０２）。 Subsequently, the following processing is performed as initialization processing (step S102).

時間ステップｔがｔ＝０とされ（ステップＳ１０４）、以下の処理が、ｔ＝０からｔ＝Ｔ−１まで繰り返される（ステップＳ１０６〜Ｓ１１６）。 The time step t is set to t = 0 (step S104), and the following processing is repeated from t = 0 to t = T−1 (steps S106 to S116).

まず、ステップＳ１０６においてｔ＝０であれば、初期状態が観測され（ステップＳ１０８）、続いて、以下の設定が行われる（Ｓ１１０）。 First, if t = 0 in step S106, the initial state is observed (step S108), and then the following settings are made (S110).

一方、ステップＳ１０６において、ｔが０でなければ、以下の処理が行われる（ステップＳ１１２）。 On the other hand, if t is not 0 in step S106, the following processing is performed (step S112).

ステップＳ１１０またはＳ１１２に続いて、以下の計算が行われる（ステップＳ１１４）。 Following step S110 or S112, the following calculation is performed (step S114).

ステップＳ１１６にて、ｔがＴよりも小さければ処理はステップＳ１０６に復帰し、ｔがＴ以上であれば、処理はステップＳ１１８に移行して、以下の計算を行う。 In step S116, if t is smaller than T, the process returns to step S106. If t is equal to or greater than T, the process proceeds to step S118, and the following calculation is performed.

続いて、以下の計算によりＬＳＤＧの推定値を得る。 Subsequently, an estimated value of LSDG is obtained by the following calculation.
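Because the equations of Algorithm 1 are not reproduced in this text, the following Python sketch is not the patent's LSDG(λ). It is a simplified least-squares estimator for the λ = 0 case, derived here from the backward TD δ of Section 3.2 and the zero-mean constraint on the LSDG. Several details are assumptions made only for illustration: the sigmoid policy form, the rule that a failed move leaves the state unchanged, the one-hot basis, and all function names.

```python
import numpy as np

def simulate(theta, features, p_move, T, rng):
    """Roll out the torus chain; return visited states and grad ln pi per step."""
    m = features.shape[0]
    p_right = 1.0 / (1.0 + np.exp(-features @ theta))  # assumed pi(R | x)
    x, states, grads = 0, [0], []
    for _ in range(T):
        go_right = rng.random() < p_right[x]
        # Sigmoid policy: grad ln pi(u | x) = (1[u = R] - pi(R | x)) * phi(x).
        grads.append((float(go_right) - p_right[x]) * features[x])
        if rng.random() < p_move:  # the move succeeds; otherwise stay (assumed)
            x = (x + 1) % m if go_right else (x - 1) % m
        states.append(x)
    return np.array(states), np.array(grads)

def lsdg_least_squares(states, grads, features):
    """Return W such that features[x] @ W approximates the LSDG at state x."""
    Phi = features[states]            # shape (T + 1, e)
    cur, prev = Phi[1:], Phi[:-1]
    T = len(grads)
    A = cur.T @ (cur - prev) / T      # sample mean of phi_t (phi_t - phi_{t-1})^T
    b = cur.T @ grads / T             # sample mean of phi_t g_{t-1}^T
    c = cur.mean(axis=0)              # sample mean of phi_t, for the constraint
    # The c c^T term enforces a zero-mean LSDG and makes the system solvable.
    return np.linalg.solve(A + np.outer(c, c), b)

rng = np.random.default_rng(0)
features = np.eye(3)                  # one-hot basis for m = 3 states
theta = np.array([0.5, -0.3, 0.2])
states, grads = simulate(theta, features, p_move=0.8, T=100_000, rng=rng)
lsdg_hat = features @ lsdg_least_squares(states, grads, features)
```

Row x of `lsdg_hat` is the estimated partial derivative of the logarithm of the stationary probability of state x with respect to each component of θ. Only forward samples are used, which is possible because of the relationship between M(θ) and B(θ) stated in Theorems 1 and 2.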

（４. ＬＳＤＧ推定による方策の更新）
ここでは、上述したＬＳＤＧ推定に基づくＰＧＲＬアルゴリズムを定義する。 (4. Policy update by LSDG estimation)
Here, a PGRL algorithm based on the LSDG estimation described above is defined.

方策パラメータは、適切なステップサイズαの確率的勾配法により更新される。 The policy parameters are updated by a probabilistic gradient method with an appropriate step size α.

ここで、：＝は、右辺を左辺に代入することを示す。
図５は、ＬＳＤＧ（λ）-ＰＧを、ＬＳＤＧ（λ）を利用した、ＰＧアルゴリズムについての最も簡単な実現法の１つとして、アルゴリズム２として示す図である。ここで、減衰率パラメータβ∈［０，１）は、古いθの値により与えられる過去の推定を捨てていくために導入されている。 Here, := denotes assigning the right-hand side to the left-hand side.
FIG. 5 is a diagram showing LSDG (λ) -PG as algorithm 2 as one of the simplest implementations of the PG algorithm using LSDG (λ). Here, the attenuation factor parameter β∈ [0, 1) is introduced in order to discard the past estimation given by the old value of θ.
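The update can be illustrated with a sample-average form of Equation (3): the gradient of the average reward is approximated by averaging each reward times the sum of the log-policy gradient and the LSDG at the visited state, and θ is then moved a step of size α in that direction. The sketch below is a batch illustration under that assumption, not the patent's Algorithm 2 (it omits the decay rate β), and the function name and toy numbers are hypothetical.

```python
import numpy as np

def policy_gradient_step(theta, rewards, grad_log_pi, lsdg, alpha):
    """theta := theta + alpha * mean_t r_{t+1} * (grad ln pi_t + LSDG(x_t))."""
    rewards = np.asarray(rewards, dtype=float)
    terms = np.asarray(grad_log_pi, dtype=float) + np.asarray(lsdg, dtype=float)
    pg = np.mean(rewards[:, None] * terms, axis=0)  # estimate of grad R(theta)
    return theta + alpha * pg

# Toy batch of three time steps with a two-dimensional policy parameter.
theta = np.zeros(2)
rewards = [1.0, 0.0, 2.0]
grad_log_pi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lsdg = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
theta_new = policy_gradient_step(theta, rewards, grad_log_pi, lsdg, alpha=0.3)
```

In practice the `lsdg` rows would come from an LSDG estimator evaluated at the visited states rather than being given directly.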

関数近似器によるＬＳＤＧ推定は他の重要な内容を構成する。すなわち、特に、連続状態問題において、基底関数φ（ｘ）をいかにして設定するか、ということである。 LSDG estimation with a function approximator raises another important issue: how to choose the basis function φ(x), particularly in continuous-state problems.

したがって、以下のような定理が有用である。
（定理３） Therefore, the following theorem is useful.
(Theorem 3)

（証明）
以下の式により証明される。 (Proof)
This is proved by the following equation.

図６は、図５に対応するフローチャートである。
図６を参照して、まず、ステップ２００において、処理の前提として、以下の設定がなされる。 FIG. 6 is a flowchart corresponding to FIG.
Referring to FIG. 6, first, in step S200, the following settings are made as a precondition of the processing.

続いて、初期化処理として、以下の処理が行われる（ステップＳ２０２）。 Subsequently, the following processing is performed as initialization processing (step S202).

時間ステップｔがｔ＝０とされ（ステップＳ２０４）、以下の処理が、ｔ＝０からｔ＝Ｔ−１まで繰り返される（ステップＳ２０６〜Ｓ２１６）。 The time step t is set to t = 0 (step S204), and the following processing is repeated from t = 0 to t = T-1 (steps S206 to S216).

まず、ステップＳ２０６においてｔ＝０であれば、初期状態が観測され（ステップＳ２０８）、続いて、以下の設定が行われる（Ｓ２１０）。 First, if t = 0 in step S206, the initial state is observed (step S208), and then the following settings are made (S210).

一方、ステップＳ２０６において、ｔが０でなければ、以下の処理が行われる（ステップＳ２１２）。 On the other hand, if t is not 0 in step S206, the following processing is performed (step S212).

ステップＳ２１０またはＳ２１２に続いて、以下の計算が行われる（ステップＳ２１４）。 Following step S210 or S212, the following calculation is performed (step S214).

ステップＳ２１６にて、ｔがＴよりも小さければ処理はステップＳ２０６に復帰し、ｔがＴ以上であれば、処理はステップＳ２２０に移行して、以下の計算を行うことで、方策がアップデートされる。 In step S216, if t is smaller than T, the process returns to step S206. If t is equal to or greater than T, the process proceeds to step S220, where the policy is updated by performing the following calculation.

図７は、本発明の制御器の構成の概念図である。
本発明の制御器は、行動、すなわち、制御信号を制御対象に与える処理を行って、制御対象の状態量を観測器（たとえば、位置センサ、角度センサ、加速度センサ、角加速度センサなど）で観測し、この観測結果により「定常分布の対数の偏微分」（ＬＳＤＧ）を推定し、方策パラメータを更新し、これにより方策を更新する。そして、更新された方策により、さらに、制御対象が制御される。
（５．数値計算の結果）
有限のグリッドの組Ｘ＝｛１，…、ｍ｝と２つの可能な行動Ｕ＝｛Ｌ，Ｒ｝（左（Ｌ）または右（Ｒ）への１グリッド分の運動）を有している「１次元トーラス状グリッド空間」において、われわれが提案したアルゴリズムのパフォーマンスを検証した。これは、典型的なｍ状態ＭＤＰタスクであり、状態の遷移確率は以下のように与えられる。 FIG. 7 is a conceptual diagram of the configuration of the controller of the present invention.
The controller of the present invention performs an action, that is, applies a control signal to the controlled object, observes the state quantities of the controlled object with an observer (for example, a position sensor, an angle sensor, an acceleration sensor, or an angular acceleration sensor), estimates the "partial derivative of the logarithm of the stationary distribution" (LSDG) from the observations, updates the policy parameter, and thereby updates the policy. The controlled object is then controlled by the updated policy.
(5. Results of numerical calculations)
We verified the performance of the proposed algorithm on a "one-dimensional torus grid world" with a finite set of grid cells X = {1, …, m} and two possible actions U = {L, R} (a move of one cell to the left (L) or to the right (R)). This is a typical m-state MDP task, and its state transition probabilities are given as follows.

ここで、ｘ＝０およびｘ＝ｍ（ｘ＝１およびｘ＝ｍ＋１）とは、同じ状態であり、ｐ_i∈［０，１］（ｉ＝１，…，ｍ）は、タスクに依存する定数である。われわれの数値計算では、確率的方策は、以下のようなシグモイダル関数で表される： Here, x = 0 and x = m (x = 1 and x = m + 1) are the same state, and p _i ∈ [0, 1] (i = 1,..., M) depends on the task. It is a constant. In our numerical calculations, the stochastic strategy is represented by the following sigmoidal function:

ここで、状態特徴ベクトルφ（１），…，φ（ｍ）∈Ｒ^ｍの全ての要素は、定常正規分布Ｎ（０，１^２）からシミュレーションごとに独立に取り出された。これは、確率的方策のパラメータ化がいかにして、われわれのアルゴリズムのパフォーマンスに影響を与えるかを検証するためであった。状態特徴ベクトルφ（ｘ）は、ＬＳＤＧ推定のための基底関数としても使用された。各シミュレーションは、１０^５タイムステップ以上からなる１つのエピソードを実行した。 Here, all elements of the state feature vectors φ(1), …, φ(m) ∈ R^m were drawn independently for each simulation from the normal distribution N(0, 1^2). This was done to examine how the parameterization of the stochastic policy affects the performance of our algorithm. The state feature vector φ(x) was also used as the basis function for LSDG estimation. Each simulation ran a single episode of 10^5 or more time steps.

まず、われわれは、問題設定や方策パラメータθに関わりなく、ＬＳＤＧ（λ）がどれぐらい正確に以下の「定常分布の対数の偏微分」（ＬＳＤＧ）を推定しているのかを検証した。 First, we verified how accurately LSDG (λ) estimates the following “logarithmic partial derivative of stationary distribution” (LSDG) regardless of the problem setting and policy parameter θ.

この目的を達成するために、タスク依存の定数ｐ₁，…，ｐ_mは、区間［０．７，１］の均一な分布から独立に選ばれ、各シミュレーションでは固定された。方策パラメータθは、正規分布Ｎ（０，０．５²）に従ってランダムに選択され、各シミュレーションでは固定された。 To achieve this goal, task-dependent constants p ₁ ,..., P _m were chosen independently from the uniform distribution in the interval [0.7, 1] and were fixed in each simulation. The policy parameter θ was randomly selected according to the normal distribution N (0, 0.5 ² ) and fixed in each simulation.

図８は、ｍ＝３のときに、推定されたＬＳＤＧの典型的な時間経過を示す図である。ここでは、９の異なった色が、ＬＳＤＧの異なった要素を示している。実線は、ＬＳＤＧ（０）により推定された値を示し、点線は、ＬＳＤＧの解析的な解を示している。 FIG. 8 shows a typical time course of the estimated LSDG when m = 3. Here, nine different colors indicate different elements of the LSDG. A solid line indicates a value estimated by LSDG (0), and a dotted line indicates an analytical solution of LSDG.

図８に示すように、ＬＳＤＧ（０）による推定は、ｍ＝３の場合は約１００回のタイムステップで解析的な解に収束している。 As shown in FIG. 8, the estimation by LSDG (0) converges to an analytical solution in about 100 time steps when m = 3.

７状態タスクを用いて、適格性減衰率λの影響を調べた。異なった設定についての平均的なパフォーマンスを評価するために、以下で定義される「相対誤差」基準を採用した。 A 7-state task was used to investigate the effect of the eligibility decay rate λ. In order to evaluate the average performance for different settings, the “relative error” criterion defined below was adopted.

ここで、Ω^＊は定理３において定義された最適パラメータであり、解析的に計算された。図９および図１０は、λ＝０，０．３，０．９および１について、相対誤差の２００シミュレーションについての平均の時間推移を示している。これら２つの図の間の相違は、ただ、特徴ベクトルφ（ｘ）の要素の数だけである。 Here, Ω ^* is the optimum parameter defined in Theorem 3, and was calculated analytically. FIGS. 9 and 10 show the average time course for 200 simulations of relative error for λ = 0, 0.3, 0.9 and 1. FIG. The only difference between these two figures is the number of elements of the feature vector φ (x).

図９において使用される特徴ベクトルφ（ｘ）∈Ｒ^７は、適切なものであり、異なった状態を区別するのに十分であった一方、図１０において使用される特徴ベクトルφ（ｘ）∈Ｒ^６は、適切でない。これらの結果は、理論的な予想と合致するものである。つまり、もしも基底関数が適切であれば（図９）、λ＝１以外の任意の値にλを設定できるのに対して、そうでないときは、λ＝１以外の大きな値にλを設定する必要がある（図１０）。 The feature vector φ(x) ∈ R^7 used in FIG. 9 is appropriate and sufficient to distinguish the different states, whereas the feature vector φ(x) ∈ R^6 used in FIG. 10 is not. These results agree with the theoretical expectations: if the basis function is appropriate (FIG. 9), λ can be set to any value other than 1, whereas otherwise λ must be set to a large value other than 1 (FIG. 10).

最後に、ＬＳＤＧ（λ）―ＰＧを他の従来からのＰＧ法とを、３状態タスクにおいて比較した。ここで、状態遷移確率は、全てのｉ∈｛１，２，３｝についてｐ_i＝１に設定された。 Finally, LSDG (λ) -PG was compared with other conventional PG methods in a three-state task. Here, the state transition probability is set to p _i = 1 for all iε {1, 2, 3}.

図１１は、このタスクにおける報酬の設定を示す図である。ここでは、２種類の報酬がある。定数“ｒ＝（±）２”と変数“ｒ＝（±）ｓ”である。変数ｓは、各シミュレーションにおいて、区間［０．８，１）での均一分布からランダムに設定された。報酬ｓは、最適方策を見いだすためのγの最小値を以下のように規定していることに注意されたい： FIG. 11 is a diagram showing setting of rewards in this task. Here, there are two types of rewards. The constant “r = (±) 2” and the variable “r = (±) s”. The variable s was randomly set from the uniform distribution in the interval [0.8, 1) in each simulation. Note that the reward s defines the minimum value of γ to find the optimal strategy as follows:

したがって、γの設定は重要であり、このタスクにおいては困難である。従来のＰＧ法のパフォーマンスのベースラインとして、２つのアルゴリズムを採用した：ＧＰＯＭＤＰ［非特許文献１］とKondaのactor-critic法とである［非特許文献５］。 Therefore, setting γ is important and difficult for this task. Two algorithms were adopted as the baseline of the performance of the conventional PG method: GPOMDP [Non-Patent Document 1] and Konda's actor-critic method [Non-Patent Document 5].

図１２は、１００回のシミュレーションについての３つの方法の平均のパフォーマンスを示す図である。エラーバーは１００回についての標準偏差を示す。ここでは、パフォーマンスは、３Ｒ（θ）／（２−２ｓ）により評価された。すなわち、平均の報酬がその上限値である（２−２ｓ）／３により正規化されている。結果はＬＳＤＧ（λ）−ＰＧ法は他の方法よりもパフォーマンスが優れていることを示している。 FIG. 12 shows the average performance of the three methods for 100 simulations. Error bars indicate standard deviation for 100 times. Here, the performance was evaluated by 3R (θ) / (2-2s). That is, the average reward is normalized by the upper limit value (2-2s) / 3. The results show that the LSDG (λ) -PG method is superior to other methods.

以上説明したように、本発明では、現実の順方向および逆方向のマルコフ連鎖は密接に関連しており、定理において共通の性質を有することを利用している。これらを用いて、定常分布の対数の偏微分（ＬＳＤＧ）を推定するアルゴリズムとしてＬＳＤＧ（λ）を、ＬＳＤＧ推定を用いたＰＧアルゴリズムとしてＬＳＤＧ（λ）―ＰＧを提案した。実験結果はＬＳＤＧ（λ）は、適格性減衰率λ∈［０，１）で動作することができ、かつ、ＬＳＤＧ（λ）−ＰＧは、割引率γとは独立に学習をすることができる。 As described above, the present invention exploits the fact that the actual forward and backward Markov chains are closely related and share the common properties stated in the theorems. On this basis, LSDG(λ) was proposed as an algorithm for estimating the partial derivative of the logarithm of the stationary distribution (LSDG), and LSDG(λ)-PG was proposed as a PG algorithm that uses the LSDG estimate. The experimental results show that LSDG(λ) works with an eligibility decay rate λ ∈ [0, 1) and that LSDG(λ)-PG can learn independently of the discount rate γ.

今回開示された実施の形態はすべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は上記した説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 The embodiment disclosed this time should be considered as illustrative in all points and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.

本発明の制御方法および制御プログラムが適用される制御器を用いたシステム１０００の一例を示す概念図である。 FIG. 1 is a conceptual diagram showing an example of a system 1000 that uses a controller to which the control method and control program of the present invention are applied.
コンピュータ１００の構成をブロック図形式で示す図である。 FIG. 2 is a block diagram showing the configuration of a computer 100.
ＬＳＤＧ（λ）を求める手順をアルゴリズム１として示す図である。 FIG. 3 is a diagram showing the procedure for obtaining LSDG(λ) as Algorithm 1.
アルゴリズム１を示すフローチャートである。 FIG. 4 is a flowchart showing Algorithm 1.
ＬＳＤＧ（λ）-ＰＧを示す図である。 FIG. 5 is a diagram showing LSDG(λ)-PG.
図５に対応するフローチャートである。 FIG. 6 is a flowchart corresponding to FIG. 5.
本発明の制御器の構成の概念図である。 FIG. 7 is a conceptual diagram of the configuration of the controller of the present invention.
ｍ＝３のときに、推定されたＬＳＤＧの典型的な時間経過を示す図である。 FIG. 8 is a diagram showing a typical time course of the estimated LSDG when m = 3.
ｍ＝７のときに、十分な特徴ベクトルを用いて推定されたＬＳＤＧの相対誤差の時間経過を示す図である。 FIG. 9 is a diagram showing the time course of the relative error of the LSDG estimated with sufficient feature vectors when m = 7.
ｍ＝７のときに、不十分な特徴ベクトルを用いて推定されたＬＳＤＧの相対誤差の時間経過を示す図である。 FIG. 10 is a diagram showing the time course of the relative error of the LSDG estimated with insufficient feature vectors when m = 7.
タスクにおける報酬の設定を示す図である。 FIG. 11 is a diagram showing the reward setting in the task.
１００回のシミュレーションについての３つの方法の平均のパフォーマンスを示す図である。 FIG. 12 is a diagram showing the average performance of the three methods over 100 simulations.

Explanation of symbols

１００コンピュータ、１０２コンピュータ本体、１０４ディスプレイ、１０６ＦＤドライブ、１０８ＣＤ−ＲＯＭドライブ、１１０キーボード、１１２マウス、１１８ＣＤ−ＲＯＭ、１２０ＣＰＵ、１２２メモリ、１２４ハードディスク、１２８通信インタフェース、２００被制御装置、１０００システム。 100 computer, 102 computer main body, 104 display, 106 FD drive, 108 CD-ROM drive, 110 keyboard, 112 mouse, 118 CD-ROM, 120 CPU, 122 memory, 124 hard disk, 128 communication interface, 200 controlled device, 1000 system.

Claims

A controller that, when the time evolution of a target system is described as a forward Markov decision process, learns by reinforcement learning using a policy gradient method, through observation of state quantities of the system, a policy expressed stochastically as a control law for the state of the system, the controller comprising:
control signal generating means for generating a control signal for controlling the system based on the policy;
state quantity detection means for observing the state quantity of the system;
reward value acquisition means for acquiring a reward value that depends on the state and the control signal in a predetermined relationship;
policy gradient estimation means for estimating the gradient of the policy, where the policy is defined by a policy parameter that specifies the stochastic policy, by estimating, based on the state quantity and the control signal at each time step, a logarithmic stationary distribution partial derivative, which is the partial derivative with respect to the policy parameter of the logarithm of the stationary distribution of the state distribution of the system; and
policy updating means for updating the policy, based on the reward value and the estimation result of the policy gradient estimation means, by slightly changing the policy parameter in the direction of the partial derivative of the average reward estimated using the logarithmic stationary distribution partial derivative.

The controller according to claim 1, wherein the policy gradient estimation means estimates the logarithmic stationary distribution partial derivative by TD learning for a backward Markov chain under a constraint, derived from the condition that the sum of the state distribution over the states is constant, that the expected value of the logarithmic stationary distribution partial derivative over the forward Markov chain is 0.

The controller according to claim 2, wherein, in the TD learning, the logarithmic stationary distribution partial derivative is estimated by minimizing the sum of i) the expected value over the forward Markov chain of the square of δ, where δ is the difference between, on the one hand, the sum of the observed value one step back in the backward Markov chain of the partial derivative of the logarithm of the policy and the logarithmic stationary distribution partial derivative one step back and, on the other hand, the logarithmic stationary distribution partial derivative of the current state, and ii) the square of the expected value over the forward Markov chain of the logarithmic stationary distribution partial derivative.

A control method in which, when the time evolution of a target system is described as a forward Markov decision process, a policy expressed stochastically as a control law for the state of the system is learned by reinforcement learning using a policy gradient method through observation of state quantities of the system, the control method comprising:
a control signal generating step of generating a control signal for controlling the system based on the policy;
a state quantity detection step of observing the state quantity of the system;
a reward value acquiring step of acquiring a reward value that depends on the state and the control signal in a predetermined relationship;
a policy gradient estimation step of estimating the gradient of the policy, where the policy is defined by a policy parameter that specifies the stochastic policy, by estimating, based on the state quantity and the control signal at each time step, a logarithmic stationary distribution partial derivative, which is the partial derivative with respect to the policy parameter of the logarithm of the stationary distribution of the state distribution of the system; and
a policy updating step of updating the policy, based on the reward value and the estimation result of the policy gradient estimation step, by updating the policy parameter in the direction of the gradient of the average reward expressed on the basis of the logarithmic stationary distribution partial derivative.

The control method according to claim 4, wherein the policy gradient estimation step includes a step of estimating the logarithmic stationary distribution partial derivative by TD learning for a backward Markov chain under a constraint, derived from the condition that the sum of the state distribution over the states is constant, that the expected value of the logarithmic stationary distribution partial derivative over the forward Markov chain is 0.

The control method according to claim 5, wherein, in the TD learning, the logarithmic stationary distribution partial derivative is estimated by minimizing the sum of i) the expected value over the forward Markov chain of the square of δ, where δ is the difference between, on the one hand, the sum of the observed value one step back in the backward Markov chain of the partial derivative of the logarithm of the policy and the logarithmic stationary distribution partial derivative one step back and, on the other hand, the logarithmic stationary distribution partial derivative of the current state, and ii) the square of the expected value over the forward Markov chain of the logarithmic stationary distribution partial derivative.

A control program for causing a computer to execute a control method in which, when the time evolution of a target system is described as a forward Markov decision process, a policy expressed stochastically as a control law for the state of the system is learned by reinforcement learning using a policy gradient method through observation of state quantities of the system, the control method comprising:
a control signal generating step of generating a control signal for controlling the system based on the policy;
a state quantity detection step of observing the state quantity of the system;
a reward value acquiring step of acquiring a reward value that depends on the state and the control signal in a predetermined relationship;
a policy gradient estimation step of estimating the gradient of the policy, where the policy is defined by a policy parameter that specifies the stochastic policy, by estimating, based on the state quantity and the control signal at each time step, a logarithmic stationary distribution partial derivative, which is the partial derivative with respect to the policy parameter of the logarithm of the stationary distribution of the state distribution of the system; and
a policy updating step of updating the policy, based on the reward value and the estimation result of the policy gradient estimation step, by updating the policy parameter in the direction of the gradient of the average reward expressed on the basis of the logarithmic stationary distribution partial derivative.

The control program according to claim 7, wherein the policy gradient estimation step includes a step of estimating the logarithmic stationary distribution partial derivative by TD learning for a backward Markov chain under a constraint, derived from the condition that the sum of the state distribution over the states is constant, that the expected value of the logarithmic stationary distribution partial derivative over the forward Markov chain is 0.

The control program according to claim 8, wherein, in the TD learning, the logarithmic stationary distribution partial derivative is estimated by minimizing the sum of i) the expected value over the forward Markov chain of the square of δ, where δ is the difference between, on the one hand, the sum of the observed value one step back in the backward Markov chain of the partial derivative of the logarithm of the policy and the logarithmic stationary distribution partial derivative one step back and, on the other hand, the logarithmic stationary distribution partial derivative of the current state, and ii) the square of the expected value over the forward Markov chain of the logarithmic stationary distribution partial derivative.