
[Feature Request] Interaction Drivers/Runners #381

Open
jmribeiro opened this issue Jun 21, 2019 · 10 comments

Labels
enhancement New feature or request

Comments

@jmribeiro commented Jun 21, 2019

Currently, the way to train an agent is to 1) instantiate the environment, 2) instantiate the agent, passing the environment to its constructor, and 3) call the learn method.

Some agent frameworks have started implementing execution drivers, i.e., objects responsible for running the interaction between the agent and the environment.

I would love to see such a feature in stable-baselines, given that it would greatly simplify the pipeline for testing a new agent and comparing it against existing ones.

Current code:

env = gym.make("CartPole-v0")
agent = DQN("MlpPolicy", env, ...)
agent.learn(...)

What if there were driver objects such that the execution would go something like this:

driver = ExampleDriver(max_timesteps=10000, max_episodes=200)
metrics = [TotalTimesteps(), AverageTimestepReward(), AverageEpisodeReward(), ...]
driver.run(agent, env, metrics)
for metric in metrics:
    print(f"{metric.name}: {metric.result()}")

Example

import math


class ExampleDriver(BaseDriver):
    """
    Runs until one of the conditions is met: max_timesteps or max_episodes.
    """
    def __init__(self, agent, environment, max_timesteps=math.inf, max_episodes=math.inf, observers=None):
        super(ExampleDriver, self).__init__(agent, environment, observers)
        self._timesteps = max_timesteps
        self._episodes = max_episodes

    def run(self):
        self._environment.reset()
        done = False
        while not done:
            self.step()
            done = self.total_episodes >= self._episodes or self.total_steps >= self._timesteps

And a base class:

from abc import ABC, abstractmethod
from collections import namedtuple

Timestep = namedtuple("Timestep", "t state action reward next_state is_terminal info")


class BaseDriver(ABC):

    def __init__(self, agent, environment, observers):
        """
        :param agent: The agent to interact with the environment
        :param environment: The environment
        :param observers: The observers
        """
        self._agent = agent
        self._environment = environment
        self._observers = observers or []

        self._total_steps = 0
        self._total_episodes = 0

    @property
    def total_steps(self):
        return self._total_steps

    @property
    def total_episodes(self):
        return self._total_episodes

    @abstractmethod
    def run(self):
        raise NotImplementedError()

    def step(self):
        # Assumes the environment exposes its current state and that
        # environment.step() returns a Timestep namedtuple.
        state = self._environment.state
        action = self._agent.action(state)
        timestep = self._environment.step(action)
        for observer in self._observers:
            observer(timestep)
        # The agent is updated once per timestep, outside the observer loop.
        self._agent.reinforcement(timestep)
        self._total_steps += 1
        if timestep.is_terminal:
            self._total_episodes += 1
        return timestep

    def episode(self):
        self._environment.reset()
        is_terminal = False
        trajectory = [self._environment.state]
        while not is_terminal:
            timestep = self.step()
            trajectory.append(timestep)
        return trajectory

I would love to contribute such features.
Let me know what you think.

@araffin (Collaborator) commented Jun 22, 2019

Hello,

If I understand correctly, the main advantage of a Driver is that it makes it easy to have the same logs/metrics for every algo? Or do you see other advantages?

I feel this would require too many changes in the codebase, because some algos use n steps of interaction with the environment per update, others only 1, and it gets even more complicated when you have environments running in parallel.

I would rather go for callbacks if it is only about metrics (see #348).
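
For reference, a rough sketch of the callback route as it stands today, assuming the current convention of a callback(locals_, globals_) function passed to learn; the counter tracked here is only a stand-in for a real metric:

import gym
from stable_baselines import DQN


def metrics_callback(locals_, globals_):
    # The keys available in locals_ differ from one algorithm to another,
    # which is exactly the non-uniformity a Driver would avoid.
    # Here we only count how often the callback fires.
    metrics_callback.n_calls += 1
    return True  # returning False stops training


metrics_callback.n_calls = 0

model = DQN("MlpPolicy", gym.make("CartPole-v0"))
model.learn(total_timesteps=10000, callback=metrics_callback)
print("callback invocations:", metrics_callback.n_calls)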

@jmribeiro (Author)

OK.
Let's say I implement an awesome new state-of-the-art algorithm with a certain goal, for example zero-shot transfer learning.
I want to show that it is better than DQN (let's assume), so I train both of them from scratch:

dqn = DQN("MlpPolicy", gym.make("CartPole-v0"))
dqn.learn(10000)

my_agent = AwesomeNewAgent(gym.make("CartPole-v0"))
my_agent.learn(10000)

Now I want to show that my agent outperforms the DQN on MountainCar, without any further training for either of them.

Is there a way to do something like the following with callbacks?

dqn_results = dqn.run(gym.make("MountainCar-v0"), timesteps=100)
my_agent_results = my_agent.run(gym.make("MountainCar-v0"), timesteps=100)

# Plot both and show that mine is better

@araffin (Collaborator) commented Jun 22, 2019

So, you want to compare them after training only? (i.e. on the final performance)
Is there a complete description somewhere of where the drivers can be used?

Callbacks are mainly made for doing things during training.

@jmribeiro (Author) commented Jun 22, 2019

Without drivers, a really simple solution would be to do the same thing you are already doing in the Zoo's enjoy script.

Something like adding an enjoy or evaluate method to BaseRLModel which does the same as learn, but without updating the network's parameters, and with the option of rendering while acting.

How does that sound?
I can try to set up an example using DQN if you want.
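
To make the idea concrete, a rough sketch of what such a helper could look like, written here as a standalone function (on BaseRLModel the model argument would simply be self; the name and signature are hypothetical):

def enjoy(model, env, total_timesteps=1000, render=True):
    """
    Run the trained policy without any parameter updates.

    :param model: (BaseRLModel) a trained model exposing predict()
    :param env: (gym.Env) the environment to act in
    :param total_timesteps: (int) number of steps to act for
    :param render: (bool) whether to render the env while acting
    """
    obs = env.reset()
    for _ in range(total_timesteps):
        action, _ = model.predict(obs)
        obs, _, done, _ = env.step(action)
        if render:
            env.render()
        if done:
            obs = env.reset()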

@araffin (Collaborator) commented Jun 22, 2019

Something like adding a enjoy or evaluate method to the BaseRLModel

sounds more plausible. I'm just afraid of all the possible problems that may come from the environment (which is also the reason for the enjoy script in the zoo), especially the wrappers.

To give two examples, keeping it simple, with one env (with multiple envs it gets even worse):

  • when using VecNormalize, the test env statistics must be frozen and synced with those of the training env (see the sketch below)
  • when evaluating on Atari games, because of the wrappers, the concept of an episode may differ (e.g. for Breakout, does an episode correspond to 1 or 3 lives?)
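
For the VecNormalize point, a minimal sketch of the kind of syncing that is needed (copying obs_rms directly is just one way to do it, not an agreed-upon API):

from copy import deepcopy

import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

train_env = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v0")]))
# ... training happens here and updates train_env's running statistics ...

eval_env = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v0")]),
                        training=False,     # freeze the running statistics
                        norm_reward=False)  # report unnormalized rewards
# sync the observation statistics learned during training
eval_env.obs_rms = deepcopy(train_env.obs_rms)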

But I think it could be a good idea anyway. For instance, we have a lot of redundant code in the tests; I made a function that may be used as a base:

def model_predict(model, env, n_steps, additional_check=None):
    """
    Test helper
    :param model: (rl model)
    :param env: (gym.Env)
    :param n_steps: (int)
    :param additional_check: (callable)
    """
    obs = env.reset()
    for _ in range(n_steps):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        if additional_check is not None:
            additional_check(obs, action, reward, done)
        if done:
            obs = env.reset()

I need the point of view of the other maintainers: @hill-a, @erniejunior, @AdamGleave?

@jmribeiro (Author)

Awesome, let's wait for their response.

@AdamGleave (Collaborator)

I'd be reluctant to change e.g. BaseRLModel, which seems core to the codebase and is best kept simple. However, it is very common to need to perform rollouts for evaluation with particular termination conditions and evaluation metrics. Making a new class that people can choose to use (but do not need to depend on) could be useful.

@araffin (Collaborator) commented Jun 24, 2019

I agree with @AdamGleave; in fact, the evaluate method is not specific to the agent.

I think adding an evaluation file in the common folder, with both the evaluate method and some metrics, could be a good idea, no?
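
For concreteness, a minimal sketch of what such a common/evaluation.py could contain (the module path, function names and return values are assumptions, not an agreed API):

# common/evaluation.py (hypothetical module)
import numpy as np


def evaluate_policy(model, env, n_eval_episodes=10, deterministic=True):
    """
    Run the policy for n_eval_episodes episodes without learning.

    :param model: (BaseRLModel) a trained model exposing predict()
    :param env: (gym.Env) the evaluation environment
    :param n_eval_episodes: (int) number of episodes to run
    :param deterministic: (bool) whether to use deterministic actions
    :return: ([float]) the undiscounted return of each episode
    """
    episode_returns = []
    for _ in range(n_eval_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, reward, done, _ = env.step(action)
            episode_return += reward
        episode_returns.append(episode_return)
    return episode_returns


def mean_episode_return(episode_returns):
    """Example metric computed on top of the raw returns."""
    return np.mean(episode_returns), np.std(episode_returns)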

@ernestum (Collaborator) commented Jul 5, 2019

After training has finished, I roll out a large number of trajectories/episodes to estimate the policy's performance. For this I keep writing custom code, which is not desirable since it means extra work and an extra risk of introducing bugs, which are especially painful in evaluation code (maybe your policy is very good but you miss that because there is a bug in your evaluation code). So an interaction runner as you describe it is an awesome idea!

However, to get a meaningful estimate of your policy performance you often need a very large number of trajectories. Often larger than you might think. E.g. if you want to estimate the failure rate of a policy which is about 0.1% and you only use 1000 episodes to compute your failure rate, with a probability of about 37% you will not observe a single failure in your 1000 test episodes, thereby greatly overestimating the reliability of a policy. In this particular case you need at least about 5000 test episodes.
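
As a quick sanity check on these numbers (pure arithmetic, no stable-baselines involved):

# Probability of observing zero failures in n independent episodes
# when the true per-episode failure rate is p: (1 - p) ** n
p = 0.001

print((1 - p) ** 1000)  # ~0.368, i.e. a ~37% chance of seeing no failure at all
print((1 - p) ** 5000)  # ~0.0067, i.e. a >99% chance of seeing at least one failure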

So it is clear that you might need a large number of test episodes, depending on your problem domain. This comes with three problems:

  1. It takes more time. This means you might want to parallelize the execution of your environments AND of your policy. The latter is easier because TensorFlow does that for you; the former means dealing with Python's multiprocessing package.
  2. It takes more memory and might exceed your RAM if you record full state information for that many trajectories. One solution would be to compute one trajectory, analyze it and store summary statistics, then continue with the next trajectory. However, this is in conflict with parallel execution of environments. Therefore my best bet would be to compute the trajectories in batches that are efficient to compute but do not exceed the RAM. Since you do not want to keep batches in mind when writing your analysis code, I propose an interaction runner that acts as a generator, yielding one episode at a time but computing them in larger batches (see the sketch after this list).
  3. You need an efficient way of doing data analysis on the episodes. This means it must be memory-efficient and fast to query. Pandas can help, but I found structured numpy arrays far more powerful, especially if your state has a hierarchical structure.
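
A minimal sketch of the generator idea from point 2 (the names and exact bookkeeping are assumptions; the batching over parallel envs relies on a VecEnv-style environment that auto-resets finished sub-envs):

def rollout_episodes(model, vec_env, n_episodes, deterministic=True):
    """
    Step a vectorized env in batches, but yield finished episodes one at a
    time so analysis code never holds more than one episode in memory.

    :param model: (BaseRLModel) a trained model exposing predict()
    :param vec_env: (VecEnv) n_envs parallel copies of the environment
    :param n_episodes: (int) total number of episodes to yield
    :param deterministic: (bool) whether to use deterministic actions
    """
    n_envs = vec_env.num_envs
    obs = vec_env.reset()
    trajectories = [[] for _ in range(n_envs)]  # one partial trajectory per sub-env
    yielded = 0
    while yielded < n_episodes:
        actions, _ = model.predict(obs, deterministic=deterministic)
        next_obs, rewards, dones, infos = vec_env.step(actions)
        for i in range(n_envs):
            trajectories[i].append((obs[i], actions[i], rewards[i], dones[i]))
            if dones[i]:
                yield trajectories[i]  # the finished sub-env is auto-reset by the VecEnv
                trajectories[i] = []
                yielded += 1
                if yielded >= n_episodes:
                    return
        obs = next_obs

Summary statistics can then be accumulated per yielded episode without ever materializing the full batch of trajectories.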

See here for an example of how I implemented most of the above (except for parallel environment execution) for a specific environment and with PPO2 in mind. Maybe you can generalize from it. Sorry for mixing up the terms episode and trajectory here.

@AdamGleave (Collaborator)

Agree with @erniejunior's comments. Just to chime in with another use case for this code: for imitation learning algorithms, it is common to need to perform rollouts of a policy while training the reward model. It'd be nice if the interaction driver/runner also easily supported this use case.

I'm sure we do this in the GAIL algorithm inside Stable Baselines. Here is another example of rollout for AIRL & GAIL.
