DOI: 10.5555/3045118.3045319
Article

Trust Region Policy Optimization

Published: 06 July 2015

Abstract

In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
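The abstract stays at a high level; as a brief technical gloss, here is a sketch of the constrained surrogate-objective problem each TRPO iteration solves, in the paper's notation, with theta_old the current policy parameters and delta the trust-region size:

\[
\begin{aligned}
\max_{\theta}\;\; & \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right] \\
\text{subject to}\;\; & \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \big) \right] \le \delta,
\end{aligned}
\]

where \(A_{\theta_{\mathrm{old}}}\) is an advantage estimate and \(\rho_{\theta_{\mathrm{old}}}\) the state-visitation distribution under the old policy. In practice the paper solves this approximately, using a conjugate-gradient step on a quadratic model of the KL constraint followed by a line search that enforces both the constraint and surrogate improvement.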

Information

Published In

ICML'15: Proceedings of the 32nd International Conference on Machine Learning - Volume 37
July 2015
2558 pages

Publisher

JMLR.org

Cited By

  • (2024) Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 964-972. DOI: 10.5555/3635637.3662951. Online publication date: 6-May-2024.
  • (2024) Adaptive Primal-Dual Method for Safe Reinforcement Learning. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 326-334. DOI: 10.5555/3635637.3662881. Online publication date: 6-May-2024.
  • (2024) A Novel Tree-Based Method for Interpretable Reinforcement Learning. ACM Transactions on Knowledge Discovery from Data, 18(9), pp. 1-22. DOI: 10.1145/3695464. Online publication date: 9-Sep-2024.
  • (2024) Optimizing Irrigation Efficiency using Deep Reinforcement Learning in the Field. ACM Transactions on Sensor Networks, 20(4), pp. 1-34. DOI: 10.1145/3662182. Online publication date: 8-Jul-2024.
  • (2024) Security and Privacy Issues in Deep Reinforcement Learning: Threats and Countermeasures. ACM Computing Surveys, 56(6), pp. 1-39. DOI: 10.1145/3640312. Online publication date: 12-Jan-2024.
  • (2023) Accelerating motion planning via optimal transport. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 78453-78482. DOI: 10.5555/3666122.3669552. Online publication date: 10-Dec-2023.
  • (2023) Model-free posterior sampling via learning rate randomization. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 73719-73774. DOI: 10.5555/3666122.3669347. Online publication date: 10-Dec-2023.
  • (2023) Uncertainty-aware instance reweighting for off-policy learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 73691-73718. DOI: 10.5555/3666122.3669346. Online publication date: 10-Dec-2023.
  • (2023) Reinforcement learning with fast and forgetful memory. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 72008-72029. DOI: 10.5555/3666122.3669272. Online publication date: 10-Dec-2023.
  • (2023) Iterative reachability estimation for safe reinforcement learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 69764-69797. DOI: 10.5555/3666122.3669180. Online publication date: 10-Dec-2023.
