
[RLlib] Issue 18280: A3C/IMPALA multi-agent not working. #19100

Conversation

@sven1977 (Contributor) commented Oct 5, 2021

Issue 18280: A3C/IMPALA multi-agent not working.

  • Fix IMPALA's MultiGPULearnerThread to also work with multi-agent.
  • Fix RolloutWorker's compute_gradients/apply_gradients methods to work with TFPolicy's per-policy session in multi-agent case.

Why are these changes needed?

Issue #18280

Related issue number

Closes #18280

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 requested a review from avnishn October 5, 2021 12:04
@sven1977 added the "tests-ok" label (The tagger certifies test failures are unrelated and assumes personal liability.) Oct 6, 2021
@@ -14,7 +15,23 @@ def check_support_multiagent(alg, config):
lambda _: MultiAgentMountainCar({"num_agents": 2}))
register_env("multi_agent_cartpole",
lambda _: MultiAgentCartPole({"num_agents": 2}))
config["log_level"] = "ERROR"

sven1977 (Contributor, Author):

Before this PR, we would use the default multi-agent setup, which doesn't cover the case of having two policies. The default case is simply one policy ("default_policy") to which all agents map.
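For illustration only (not code from this PR), a two-policy setup roughly looks like the following sketch; the policy IDs and the exact config schema are assumptions and depend on the Ray version:

# Hedged sketch of a 2-policy multi-agent config, as opposed to the default
# single "default_policy". Policy IDs "p0"/"p1" are made up; RLlib around this
# time also accepted a plain set of policy IDs, with spaces inferred from the env.
config = {
    "multiagent": {
        "policies": {"p0", "p1"},
        # Map agent 0 to "p0" and all other agents to "p1". The optional keyword
        # args keep the lambda compatible with both the old (agent_id-only) and
        # newer policy_mapping_fn call signatures.
        "policy_mapping_fn": (
            lambda agent_id, episode=None, worker=None, **kwargs:
                "p0" if agent_id == 0 else "p1"
        ),
    },
}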

@@ -104,18 +103,14 @@ def __init__(

self.train_batch_size = train_batch_size

# TODO: (sven) Allow multi-GPU to work for multi-agent as well.
self.policy = self.local_worker.policy_map[DEFAULT_POLICY_ID]
self.policy_map = self.local_worker.policy_map
sven1977 (Contributor, Author):

The changes in this file make it possible to run learner threads (e.g. IMPALA's) with multi-agent setups as well. Previously, the mostly obsolete "simple_optimizer" had to be used to make this work.
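To make the per-policy dispatch concrete, here is a tiny self-contained toy (not the actual MultiGPULearnerThread code): instead of assuming a single DEFAULT_POLICY_ID, the learner splits the incoming multi-agent batch by policy ID and lets each policy learn on its own slice.

# Toy illustration only; DummyPolicy and the plain dicts stand in for RLlib's
# Policy objects and MultiAgentBatch.policy_batches.
class DummyPolicy:
    def __init__(self, pid):
        self.pid = pid

    def learn_on_batch(self, batch):
        return {"policy_id": self.pid, "num_samples": len(batch)}

policy_map = {"p0": DummyPolicy("p0"), "p1": DummyPolicy("p1")}
policy_batches = {"p0": [1, 2, 3], "p1": [4, 5]}  # per-policy sample slices

# One learner result per policy instead of a single hard-coded policy's result.
results = {
    pid: policy_map[pid].learn_on_batch(batch)
    for pid, batch in policy_batches.items()
    if pid in policy_map
}
print(results)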

pid: self.policy_map[pid]._build_apply_gradients(
builder, grad)
for pid, grad in grads.items()
if self.policy_config.get("framework") == "tf":
sven1977 (Contributor, Author):

This was buggy code: each policy now has its own tf graph (instead of the worker/trainer carrying a single graph), so we have to loop over the individual policies here.
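A hedged sketch of the resulting pattern (paraphrased, not the PR's exact code; the TFRunBuilder import path and the get_session() call reflect RLlib around this time and should be treated as assumptions):

from ray.rllib.utils.tf_run_builder import TFRunBuilder  # import path is an assumption

def apply_gradients_per_policy(policy_map, grads):
    # Each TFPolicy owns its own graph/session, so build one TFRunBuilder per
    # policy instead of one shared, worker-level session.
    builders, outputs = {}, {}
    for pid, grad in grads.items():
        policy = policy_map[pid]
        builders[pid] = TFRunBuilder(policy.get_session(), "apply_gradients")
        outputs[pid] = policy._build_apply_gradients(builders[pid], grad)
    # Resolve the queued ops on each policy's own session.
    return {pid: builders[pid].get(op) for pid, op in outputs.items()}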

@@ -860,14 +860,15 @@ def compute_gradients(
summarize(samples)))
if isinstance(samples, MultiAgentBatch):
grad_out, info_out = {}, {}
if self.tf_sess is not None:
builder = TFRunBuilder(self.tf_sess, "compute_gradients")
if self.policy_config.get("framework") == "tf":
sven1977 (Contributor, Author):

This was buggy code: each policy now has its own tf graph (instead of the worker/trainer carrying a single graph), so we have to loop over the individual policies here.
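And a similarly hedged sketch for the compute side (paraphrased, not the PR's exact code): gradients are computed per policy, each against that policy's own batch and graph.

def compute_gradients_per_policy(policy_map, samples):
    # `samples` is assumed to be a MultiAgentBatch-like object whose
    # .policy_batches dict maps policy IDs to their sample slices.
    grad_out, info_out = {}, {}
    for pid, batch in samples.policy_batches.items():
        # Policy.compute_gradients() returns a (gradients, info-dict) tuple.
        grad_out[pid], info_out[pid] = policy_map[pid].compute_gradients(batch)
    return grad_out, info_out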

@avnishn (Member) left a comment:

This all looks fine to me -- I have some questions that I left as inline review comments. If you could PTAL and answer them that would help me a lot.

builders = {}
outputs = {}
for pid, grad in grads.items():
if pid not in self.policies_to_train:
avnishn (Member):

Is there a reason that we would ever compute grads on a non-trainable policy?

avnishn (Member):

Couldn't we just not compute grads on it in the first place?

sven1977 (Contributor, Author):

Good point. We don't call learn_on_batch/learn_on_loaded_batch on non-trainable policies.
But what I think could happen is that the grads were computed while some policy A was still trainable, and we are now trying to apply them to A, which may have become non-trainable in the meantime due to a user-induced change to the policy_map (and to the worker's list of trainable policies).
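A tiny toy illustration of that race (policy names are made up): grads were computed while "policy_A" was still trainable, the user then shrank the trainable set, and the apply step simply skips the now-stale entry.

policies_to_train = {"policy_A", "policy_B"}
grads = {"policy_A": [0.1], "policy_B": [0.2]}   # computed while A was still trainable

policies_to_train = {"policy_B"}                 # user-induced change in the meantime

# The guard discussed above: silently drop grads for policies that are no
# longer trainable, instead of applying them (or crashing).
applied = {pid: g for pid, g in grads.items() if pid in policies_to_train}
print(applied)  # only policy_B's gradients are applied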

@@ -509,6 +509,8 @@ def check_train_results(train_results):
f"train_results['infos']['learner'] ({learner_info})!"

for pid, policy_stats in learner_info.items():
if pid == "batch_count":
avnishn (Member):

Can you explain this addition? Why would a policy ID ever be the string "batch_count"? Is this a hack/workaround?

sven1977 (Contributor, Author):

Sorry, yeah, this is a hack. I'm not sure where this key comes from, but it is currently present in the learner_info dict (alongside the policy IDs), so I didn't want to touch it here. We can fix this in a follow-up PR; no additional harm is done by this PR ;)

# TODO: (sven) Allow multi-GPU to work for multi-agent as well.
self.policy = self.local_worker.policy_map[DEFAULT_POLICY_ID]
self.policy_map = self.local_worker.policy_map
self.devices = next(iter(self.policy_map.values())).devices
avnishn (Member):

Does this line unblock the following comment that you had deleted?

# TODO: (sven) Allow multi-GPU to work for multi-agent as well.

sven1977 (Contributor, Author):

Correct, the whole PR makes multi-agent setups actually runnable by the MultiGPULearnerThread (which was not possible before).
Whether multi-agent (with more than one policy) should use multi-GPU is a separate question; before this fix, multi-agent did not run through this code path at all, even in the case where only a single policy was used.
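For reference, a hedged end-to-end sketch of the kind of setup this PR makes runnable. The import paths follow the RLlib API of roughly this era, and the assumption that num_gpus > 1 routes training through the MultiGPULearnerThread is mine, not a statement from the PR:

import ray
from ray.rllib.agents.impala import ImpalaTrainer  # trainer path circa Ray 1.7 (assumption)
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune.registry import register_env

register_env("multi_agent_cartpole",
             lambda _: MultiAgentCartPole({"num_agents": 2}))

ray.init()
trainer = ImpalaTrainer(
    env="multi_agent_cartpole",
    config={
        "framework": "tf",
        "num_gpus": 2,  # assumed to select the multi-GPU learner thread
        "multiagent": {
            "policies": {"p0", "p1"},
            "policy_mapping_fn": (
                lambda agent_id, episode=None, worker=None, **kwargs:
                    "p0" if agent_id == 0 else "p1"
            ),
        },
    },
)
print(trainer.train())  # multi-agent setups like this did not run before the fix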

@sven1977 merged commit c3e3fc7 into ray-project:master Oct 7, 2021
@sven1977 deleted the issue_18280_a3c_and_impala_multi_agent_not_working branch October 10, 2024 13:36
Development

Successfully merging this pull request may close these issues.

[rllib] A3C error: AttributeError: 'RolloutWorker' object has no attribute 'tf_sess'