[RLlib] Issue 18280: A3C/IMPALA multi-agent not working. #19100
Conversation
@@ -14,7 +15,23 @@ def check_support_multiagent(alg, config):
                 lambda _: MultiAgentMountainCar({"num_agents": 2}))
    register_env("multi_agent_cartpole",
                 lambda _: MultiAgentCartPole({"num_agents": 2}))
    config["log_level"] = "ERROR"
Before this PR, we would use the default multiagent setup, which doesn't cover the case of having 2 policies. The default case is simply one policy ("default_policy"), with all agents mapping to it.
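For reference, here is a minimal sketch of such a 2-policy setup in RLlib's dict config. The policy names `p0`/`p1` and the exact `policy_mapping_fn` signature are illustrative, not taken from this PR:

```python
from ray.tune.registry import register_env
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole

register_env("multi_agent_cartpole",
             lambda _: MultiAgentCartPole({"num_agents": 2}))

config = {
    "env": "multi_agent_cartpole",
    "multiagent": {
        # Two distinct policies instead of the single "default_policy".
        "policies": {"p0", "p1"},
        # Map agent i -> policy "p{i}"; the extra args (episode, worker)
        # vary across RLlib versions, hence *args/**kwargs.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: f"p{agent_id}",
    },
}
```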
@@ -104,18 +103,14 @@ def __init__(
        self.train_batch_size = train_batch_size

        # TODO: (sven) Allow multi-GPU to work for multi-agent as well.
        self.policy = self.local_worker.policy_map[DEFAULT_POLICY_ID]
        self.policy_map = self.local_worker.policy_map
The new changes in this file make it possible to run learner threads (e.g. IMPALA) with multi-agent setups as well. Before, the somewhat obsolete "simple_optimizer" had to be used to make this work.
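To illustrate: holding the whole `policy_map` lets a learner thread update each policy from its own sub-batch of a `MultiAgentBatch`. A simplified sketch of that per-policy loop (`learn_on_multi_agent_batch` is a hypothetical helper, not the actual `LearnerThread` code):

```python
def learn_on_multi_agent_batch(policy_map, batch):
    # `batch.policy_batches` maps policy IDs to per-policy
    # SampleBatches; update each policy from its own sub-batch.
    results = {}
    for pid, sub_batch in batch.policy_batches.items():
        results[pid] = policy_map[pid].learn_on_batch(sub_batch)
    return results
```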
            pid: self.policy_map[pid]._build_apply_gradients(
                builder, grad)
            for pid, grad in grads.items()
        if self.policy_config.get("framework") == "tf":
This was buggy code: Each policy nowadays has its own tf-graph (instead of the workers/trainer carrying the graph), so we have to loop through each policy here.
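A sketch of the corrected pattern (assuming `TFPolicy.get_session()` and RLlib's `TFRunBuilder`; `apply_gradients_per_policy` is a hypothetical helper, not the exact diff):

```python
from ray.rllib.utils.tf_run_builder import TFRunBuilder

def apply_gradients_per_policy(policy_map, grads):
    # Each policy now owns its own tf-graph/session, so the run
    # builder must be created per policy, not once per worker.
    outputs = {}
    for pid, grad in grads.items():
        policy = policy_map[pid]
        builder = TFRunBuilder(policy.get_session(), "apply_gradients")
        outputs[pid] = policy._build_apply_gradients(builder, grad)
    return outputs
```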
@@ -860,14 +860,15 @@ def compute_gradients(
                summarize(samples)))
        if isinstance(samples, MultiAgentBatch):
            grad_out, info_out = {}, {}
            if self.tf_sess is not None:
                builder = TFRunBuilder(self.tf_sess, "compute_gradients")
            if self.policy_config.get("framework") == "tf":
Same fix as above: each policy nowadays has its own tf-graph (instead of the workers/trainer carrying the graph), so we have to loop through each policy here as well.
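And the analogous sketch for the gradient-computation side (`compute_gradients_per_policy` is hypothetical; `Policy.compute_gradients()` returns a `(grads, info)` tuple):

```python
from ray.rllib.policy.sample_batch import MultiAgentBatch

def compute_gradients_per_policy(policy_map, samples):
    # With per-policy graphs, gradients are computed by looping
    # over each policy's sub-batch of the MultiAgentBatch.
    grad_out, info_out = {}, {}
    if isinstance(samples, MultiAgentBatch):
        for pid, batch in samples.policy_batches.items():
            grad_out[pid], info_out[pid] = \
                policy_map[pid].compute_gradients(batch)
    return grad_out, info_out
```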
This all looks fine to me -- I have some questions that I left as inline review comments. If you could PTAL and answer them that would help me a lot.
            builders = {}
            outputs = {}
            for pid, grad in grads.items():
                if pid not in self.policies_to_train:
Is there a reason we would ever compute grads on a non-trainable policy?
Couldn't we just not compute grads on it in the first place?
Good point. The thing is, we don't call `learn_on_batch`/`learn_on_loaded_batch` on non-trainable policies. But what I think could happen is that the grads were calculated while some policy A was still trainable, and we are now trying to apply them to A, which may have become untrainable in the meantime due to a user-induced change in the `policy_map` (and in the worker's `trainable_policies` list).
@@ -509,6 +509,8 @@ def check_train_results(train_results):
        f"train_results['infos']['learner'] ({learner_info})!"

    for pid, policy_stats in learner_info.items():
        if pid == "batch_count":
Can you explain this addition? Why would a policy ID ever be the string "batch_count"? Is this a hack/workaround?
Sorry, yeah, this is a hack. Not sure where this key comes from, but it is currently in this learner_info dict (alongside the policy IDs), so I didn't want to touch it here. We can fix this in a follow-up PR. No additional harm done by this PR ;)
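For clarity, a sketch of the check in question; the per-policy validation body is elided:

```python
for pid, policy_stats in learner_info.items():
    if pid == "batch_count":
        # Not a policy ID: a stray key currently present in
        # learner_info alongside the policy IDs (to be cleaned
        # up in a follow-up PR).
        continue
    # ... per-policy stat checks go here ...
```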
        # TODO: (sven) Allow multi-GPU to work for multi-agent as well.
        self.policy = self.local_worker.policy_map[DEFAULT_POLICY_ID]
        self.policy_map = self.local_worker.policy_map
        self.devices = next(iter(self.policy_map.values())).devices
Does this line unblock the following comment that you had deleted?
# TODO: (sven) Allow multi-GPU to work for multi-agent as well.
Correct, the whole PR makes multi-agent actually "runnable" by the MultiGPULearnerThread (which was not possible before).
Whether multi-agent (with >1 policies) should use multi-GPU is a different question; before this fix, multi-agent would not even run here in the case where only a single policy is used.
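A hypothetical repro for this case (standard RLlib/IMPALA config keys; that `num_gpus > 1` alone routes learning through the `MultiGPULearnerThread` is an assumption):

```python
import ray
from ray import tune

ray.init()
tune.run(
    "IMPALA",
    stop={"timesteps_total": 50000},
    config={
        "env": "multi_agent_cartpole",  # registered as above
        "framework": "tf",
        "num_workers": 1,
        # Multiple GPUs exercise the MultiGPULearnerThread, which
        # previously failed for multi-agent (even with one policy).
        "num_gpus": 2,
    },
)
```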
Issue 18280: A3C/IMPALA multi-agent not working.
Why are these changes needed?
Issue #18280
Related issue number
Closes #18280
Checks
I've run scripts/format.sh to lint the changes in this PR.