Improving Policy Gradient by Exploring Under-appreciated Rewards

Nachum, Ofir; Norouzi, Mohammad; Schuurmans, Dale

arXiv:1611.09321v3 (cs)

[Submitted on 28 Nov 2016 (v1), last revised 15 Mar 2017 (this version, v3)]

Abstract: This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring small modifications to an implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.
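
The under-appreciation criterion described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper's implementation: the function name `underappreciation_weights`, the temperature `tau` that puts rewards on the scale of log-probabilities, and the softmax normalization over a batch of sampled sequences are all assumptions made for illustration. The sketch only shows the idea that a sequence gets more weight when its reward is large relative to its log-probability under the current policy.

```python
import math


def underappreciation_weights(log_probs, rewards, tau=0.1):
    """Return normalized weights over sampled action sequences.

    Hypothetical scoring rule (an assumption, not the paper's code):
    score = reward / tau - log_prob, so a sequence scores high when the
    policy's log-probability under-estimates its reward.
    """
    scores = [r / tau - lp for lp, r in zip(log_probs, rewards)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three sampled sequences: the policy assigns the rewarded one low probability.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
rewards = [0.0, 0.0, 1.0]
weights = underappreciation_weights(log_probs, rewards, tau=0.5)
```

In a REINFORCE-style update, weights like these could scale each sampled sequence's log-likelihood gradient, which would push probability mass toward sequences whose reward the policy currently under-estimates. How the paper combines this term with the standard expected-reward gradient is not specified in the abstract.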

Comments:	Published as a conference paper at ICLR 2017
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:1611.09321 [cs.LG]
	(or arXiv:1611.09321v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1611.09321

Submission history

From: Ofir Nachum
[v1] Mon, 28 Nov 2016 20:15:55 UTC (391 KB)
[v2] Wed, 25 Jan 2017 22:35:03 UTC (992 KB)
[v3] Wed, 15 Mar 2017 22:55:17 UTC (995 KB)
