RHER: (Relay-HER)--A revolutionary variant of HER!

The official code for the paper “Relay Hindsight Experience Replay: Self-Guided Continual Reinforcement Learning for Sequential Object Manipulation Tasks with Sparse Rewards”

Natter:

This paper has been submitted to Knowledge-Based Systems. Hoping for a good and quick result~

Otherwise, my PhD cycle will be extended again~

Once again, the agent (me) cannot immediately influence the achieved goal~

Yesterday, I reviewed Reincarnating RL (https://agarwl.github.io/reincarnating_rl/) and found that Jump-Start RL (JSRL) has a state-distribution problem when using the guide-policy, while our Self-Guided Exploration Strategy (SGES) does not. Because JSRL runs the guide-policy for a fixed prefix of the trajectory and then switches to the learning-policy, this combination naturally has the state-distribution problem~

Our SGES mixes the guide-policy and the learning-policy with the same probability at each step, so that they have the same state distribution~

1. Abstract:

Learning with sparse rewards remains a challenging problem in reinforcement learning (RL). Especially for sequential object manipulation tasks, the RL agent always receives negative rewards until completing all of the sub-tasks, which results in low exploration efficiency. To tackle the sample inefficiency of sparse-reward sequential object manipulation tasks, we propose a novel self-guided continual RL framework, named Relay Hindsight Experience Replay (RHER). RHER decomposes the sequential task into several sub-tasks with increasing complexity and ensures that the simplest sub-task can be learned quickly by applying HER. Meanwhile, a multi-goal & multi-task network is designed to learn all sub-tasks simultaneously. In addition, a Self-Guided Exploration Strategy (SGES) is proposed to accelerate exploration. With SGES, the already learned sub-task policy will guide the agent to the states that are helpful for learning a more complex sub-task with HER. Therefore, RHER can learn sparse-reward sequential tasks efficiently stage by stage. The proposed RHER trains the agent in an end-to-end manner and is highly adaptable to various manipulation tasks with sparse rewards. The experimental results demonstrate the superiority and high efficiency of RHER on a variety of single-object and multi-object manipulation tasks (e.g., ObstaclePush, DrawerBox, TStack, etc.). We perform a real robot experiment in which the agent learns to accomplish a contact-rich push task from scratch. The results show that the success rate of the proposed RHER method reaches 10/10 within only 250 episodes.

2. Contributions:

(1) For common complex sequential object manipulation tasks with sparse rewards, this paper develops an elegant and sample-efficient self-guided continual RL framework, RHER.

(2) To achieve self-guided exploration, we propose a multi-goal & multi-task network to learn multiple sub-tasks with different complexity simultaneously.

(3) The proposed RHER method is more sample-efficient than vanilla HER and other state-of-the-art methods, which is validated on the standard manipulation tasks from OpenAI Gym. Further, to validate the versatility of RHER, we design eight sequential object manipulation tasks, including five complex multi-object tasks, which are available in this library. The results show that the proposed RHER method consistently outperforms vanilla HER in terms of sample efficiency and performance.

(4) The proposed RHER learns a contact-rich task on a physical robot from scratch within 250 episodes in the real world.

I have released all the code for the single-object tasks. If this paper is accepted, I will immediately release the code for the multi-object tasks, together with a PyTorch version.


Although the mainstream research topics are soft robots and deformable objects, my work provides a more efficient RL scheme for the RL-Robotics community.

RHER is efficient and concise enough to serve as a new benchmark for manipulation tasks with sparse rewards.

3. Suitable tasks:

Complex sequential object manipulation tasks, in which both objects (Num <= 3) and goals are within the workspace of the robot.

RHER_multi_obj

Fig1. Multi-object tasks graphs.

Fig_multi_obj

Fig2. Learning curve of multi-object tasks.

Unsuitable tasks: striking tasks such as Slide and Tennis.


4. Motivation:

HER works for simple reach tasks, but suffers from low sample efficiency on manipulation tasks.

Each epoch means 19 * 2 * 50 = 1900 episodes!

Reported in 'Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research'

I found an implicit problem with HER:

5. HER introduces an implicit non-negative sparse reward problem for manipulation tasks

HER has an implicit non-negative sparse reward problem caused by identical achieved goals!

HER_INNR

Fig. 3. Illustration of the difference between HER and RHER. (a) The Identical Non-Negative Rewards (INNR) problem with HER. (b) The proposed RHER solves the INNR problem with the Self-Guided Exploration Strategy (SGES). (c) The surprising comparison results of RHER and HER on FetchPush (if our code were not open source, the result might seem a bit outrageous~ Today, I read about the NeurIPS corpus-indexer controversy and rethought our results. There should be no bug in my project, because the efficiency on the real robot is also really high~).
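To make the INNR problem concrete, here is a minimal sketch (not code from this repo) of what HER relabeling does to an episode in which the gripper never touches the object, using a Fetch-style sparse reward:

    import numpy as np

    # Fetch-style sparse reward: 0 if the achieved goal is within a small
    # threshold of the desired goal, -1 otherwise.
    def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
        return 0.0 if np.linalg.norm(achieved_goal - desired_goal) < threshold else -1.0

    # Toy episode: the object never moves, so the achieved goal (the object
    # position) is identical at every step.
    object_pos = np.array([1.3, 0.7, 0.42])
    achieved_goals = [object_pos.copy() for _ in range(50)]
    desired_goal = np.array([1.4, 0.9, 0.42])

    original = [sparse_reward(ag, desired_goal) for ag in achieved_goals]

    # HER replaces the desired goal with a future achieved goal. Because every
    # achieved goal is identical, every relabeled transition gets reward 0, so
    # the episode looks "successful" although the policy never moved the object.
    relabeled_goal = achieved_goals[-1]
    relabeled = [sparse_reward(ag, relabeled_goal) for ag in achieved_goals]
    print(set(original), set(relabeled))  # {-1.0} {0.0}

These uniformly non-negative relabeled rewards carry no information about how to move the object; this is the INNR problem that SGES avoids by first guiding the gripper to the object.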

6. A diagram of RHER:

RHER_overall

Fig4. A diagram of RHER; the key components are shown in the yellow rectangles. This framework achieves self-guided exploration for a sequential task.

6.1 A. Task Decomposition and Rearrangement

RHER_task

Fig5. Sequential task decomposition and rearrangement.
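As a rough illustration of the decomposition for a push task (assuming the standard Fetch observation layout, where the gripper position comes first and the object position next; the names below are illustrative, not this repo's API), stage 0 is a reach sub-task whose goal is the object position, and stage 1 is the push sub-task whose goal is the original task goal:

    # Illustrative goals for a decomposed push task: each stage has its own
    # achieved goal and desired goal.
    def stage_goals(obs, desired_goal):
        gripper_pos = obs[:3]   # what the reach stage can directly control
        object_pos = obs[3:6]   # what the push stage has to move
        return {
            "reach": {"achieved": gripper_pos, "desired": object_pos},
            "push": {"achieved": object_pos, "desired": desired_goal},
        }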

6.2 B. Multi-goal & Multi-task RL Model.

RHER_goal_encoding

Fig6. Multi-goal & Multi-task RL Model.
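A minimal sketch of the shared-network input, assuming the actor/critic receives the observation, the current sub-goal, and a one-hot sub-task index concatenated into one vector (the exact encoding in the paper may differ):

    import numpy as np

    def encode_input(obs, sub_goal, task_id, num_tasks):
        # The one-hot sub-task index lets a single network serve all sub-tasks.
        one_hot = np.zeros(num_tasks, dtype=np.float32)
        one_hot[task_id] = 1.0
        return np.concatenate([obs, sub_goal, one_hot])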

6.3 C. Maximize the Use of All Data by HER.

  1. In the RHER framework, a policy can be updated not only with its own exploration data, but also with data collected by the other policies and relabeled by HER (a minimal sketch follows this list).

  2. Coincidentally, for continual RL, the agent also needs to generate non-negative samples with HER.
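A minimal sketch of the cross-policy relabeling idea, assuming each sub-task provides its own achieved-goal extractor (the names are illustrative, not this repo's API):

    import numpy as np

    def relabel_for_task(transition, future_obs, achieved_goal_fn, threshold=0.05):
        # A transition collected by ANY sub-task policy is reused for another
        # sub-task by substituting one of that sub-task's future achieved goals
        # and recomputing the sparse reward.
        new_goal = achieved_goal_fn(future_obs)
        achieved = achieved_goal_fn(transition["next_obs"])
        reward = 0.0 if np.linalg.norm(achieved - new_goal) < threshold else -1.0
        return {**transition, "goal": new_goal, "reward": reward}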

6.4 D. Self-Guided Exploration Strategy (SGES)

Like students doing scientific research, who are guided by advisers and other researchers until they are ready to explore a new field on their own.

RHER-SGES

Fig7. Illustration of the Self-Guided Exploration Strategy (SGES) in a toy push task. The black solid curve represents the actual trajectory with SGES.
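A minimal sketch of the SGES action selection, assuming per-step mixing of the already-learned guide-policy and the learning-policy with equal probability (as mentioned in the Natter above; the exact mixing schedule in the paper may differ):

    import numpy as np

    def sges_action(obs, guide_policy, learning_policy, rng=np.random):
        # The guide-policy (e.g. the learned reach policy) drags the agent to
        # useful states; the learning-policy (e.g. push) explores from them.
        if rng.uniform() < 0.5:
            return guide_policy(obs)
        return learning_policy(obs)

Because both policies act with the same probability throughout the episode, the collected data mix both state distributions, instead of switching from one policy to the other at a fixed point as in JSRL.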

6.5 E. Relay Policy Learning.

RHER_relay

Fig8. A diagram of relay policy learning for a task with 3 stages. By using HER and SGES, RHER can solve the whole sequential task stage by stage in a sample-efficient manner.
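A minimal rollout sketch for a 3-stage task, reusing sges_action from the sketch in Section 6.4 and assuming each stage has its own policy and a stage-completion test stage_done (all names are illustrative, not this repo's exact logic):

    def relay_rollout(env, policies, stage_done, learning_stage, horizon=50):
        # policies: one policy per stage, e.g. [reach, push, stack].
        obs = env.reset()
        trajectory = []
        for _ in range(horizon):
            # The first uncompleted stage decides who acts.
            stage = next((k for k in range(len(policies)) if not stage_done(k, obs)),
                         len(policies) - 1)
            if stage < learning_stage:
                action = policies[stage](obs)  # already-learned stages act as guides
            else:
                # The stage being learned mixes its own policy with the previous
                # stage's learned policy via SGES.
                action = sges_action(obs, policies[max(stage - 1, 0)], policies[stage])
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
        return trajectory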

7. Other interesting motivation:

  1. Don't be overambitious; the agent should pay more attention to the goals it can change by itself.
  2. One step at a time, gradually approaching the distant goal.
  3. Standing on the shoulders of giants, we can avoid many detours, just like in scientific research.

8. Some interesting experiments for which there was no space in the article:

  1. Why learn a reach policy alone, instead of directly designing a simpler P-controller?

a) I really did run a comparison experiment (a minimal P-controller sketch is given at the end of this section)~ In manipulation tasks without obstacles, the performance of the P-controller is not much different from that of RHER, and in some tasks it is even faster, because it can also reach the object quickly.

But the P-controller is much worse than RHER in tasks with obstacles, because RHER has the ability to adapt to the environment.

b) As for tasks with multiple blocks, especially DPush, it is difficult to design a base controller that can push object1 to the specified position and then reach the vicinity of object2, but RHER can handle it.
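For reference, a minimal sketch of the kind of P-controller compared in a), assuming the action is a Cartesian end-effector displacement clipped to [-1, 1] with a gripper-opening dimension appended (the gain is illustrative):

    import numpy as np

    def p_controller_action(gripper_pos, object_pos, gain=5.0):
        # Proportional control: step toward the object, clipped to the action range.
        delta = np.clip(gain * (object_pos - gripper_pos), -1.0, 1.0)
        return np.concatenate([delta, [0.0]])  # keep the gripper opening fixed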

9. Training Videos:

9.1 Training process for stack.

RHER.mp4
Stack_convergence.mp4

9.2 Training process for DrawerBox.

RHER.mp4
DrawerBox.mp4

9.3 Training process for Real World Task.

RHER.mp4
Relay-HER-10m.mp4

9.4 Testing process of TPush and TStack with a success rate of about 80%.

RHER.mp4
TObj_08.mp4

10. Reproduce:

Baselines

Our code is based on OpenAI Baselines, and our environments are based on OpenAI Gym.

OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms.

These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.

Prerequisites

Baselines requires python3 (>=3.5) with the development headers. You'll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows

Ubuntu

sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev

Virtual environment

conda create -n rher python=3.6

Tensorflow versions

The master branch supports Tensorflow 1.14.

Installation

  • Clone the repo and cd into it:

    git clone https://github.com/kaixindelele/RHER.git
    cd RHER
  • If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases, you may use

    conda install tensorflow-gpu==1.14 # if you have a CUDA-compatible gpu and proper drivers

    and

    pip install -r requirement.txt

MuJoCo (200)

Some of the baselines examples use the MuJoCo (Multi-Joint dynamics with Contact) physics simulator, which is proprietary and requires binaries and a license (a license can be obtained from mujoco-free-license).

MuJoCo-py (2.0.2.1)

Instructions on setting up MuJoCo can be found at mujoco-py (2.0.2.1).

Training models

Run in a terminal:

bash run_rher_push.sh

or run in PyCharm:

python -m baselines.run_rher_np1.py
