DOI: 10.5555/3045118.3045319
Article

Trust Region Policy Optimization

Published: 06 July 2015

Abstract

In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
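The abstract stays at a high level; as a brief technical gloss, here is a sketch of the constrained surrogate-objective problem each TRPO iteration solves, in the paper's notation, with theta_old the current policy parameters and delta the trust-region size:

\[
\begin{aligned}
\max_{\theta}\;\; & \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right] \\
\text{subject to}\;\; & \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \big) \right] \le \delta,
\end{aligned}
\]

where \(A_{\theta_{\mathrm{old}}}\) is an advantage estimate and \(\rho_{\theta_{\mathrm{old}}}\) the state-visitation distribution under the old policy. In practice the paper solves this approximately, using a conjugate-gradient step on a quadratic model of the KL constraint followed by a line search that enforces both the constraint and surrogate improvement.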

Information

Published In

ICML'15: Proceedings of the 32nd International Conference on Machine Learning - Volume 37
July 2015
2558 pages

Publisher

JMLR.org

Cited By

  • (2024) Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 964-972. DOI: 10.5555/3635637.3662951. Online publication date: 6-May-2024.
  • (2024) Adaptive Primal-Dual Method for Safe Reinforcement Learning. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 326-334. DOI: 10.5555/3635637.3662881. Online publication date: 6-May-2024.
  • (2024) A Novel Tree-Based Method for Interpretable Reinforcement Learning. ACM Transactions on Knowledge Discovery from Data, 18(9), pp. 1-22. DOI: 10.1145/3695464. Online publication date: 9-Sep-2024.
  • (2024) Optimizing Irrigation Efficiency using Deep Reinforcement Learning in the Field. ACM Transactions on Sensor Networks, 20(4), pp. 1-34. DOI: 10.1145/3662182. Online publication date: 8-Jul-2024.
  • (2024) Security and Privacy Issues in Deep Reinforcement Learning: Threats and Countermeasures. ACM Computing Surveys, 56(6), pp. 1-39. DOI: 10.1145/3640312. Online publication date: 12-Jan-2024.
  • (2023) Accelerating motion planning via optimal transport. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 78453-78482. DOI: 10.5555/3666122.3669552. Online publication date: 10-Dec-2023.
  • (2023) Model-free posterior sampling via learning rate randomization. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 73719-73774. DOI: 10.5555/3666122.3669347. Online publication date: 10-Dec-2023.
  • (2023) Uncertainty-aware instance reweighting for off-policy learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 73691-73718. DOI: 10.5555/3666122.3669346. Online publication date: 10-Dec-2023.
  • (2023) Reinforcement learning with fast and forgetful memory. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 72008-72029. DOI: 10.5555/3666122.3669272. Online publication date: 10-Dec-2023.
  • (2023) Iterative reachability estimation for safe reinforcement learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 69764-69797. DOI: 10.5555/3666122.3669180. Online publication date: 10-Dec-2023.
