How to support multi-agent reinforcement learning #121
This issue tracks the design of the multi-agent reinforcement learning implementation.
After some pilot study, I find there are three paradigms of multi-agent reinforcement learning: (1) simultaneous move, (2) cyclic move, and (3) conditional move.
The problem is how to transform these paradigms into the following standard RL procedure:
For 1 (simultaneous move), the solution is simple: we can just add an agent dimension to the observations, actions, and rewards. For 2 & 3 (cyclic move & conditional move), an elegant solution is: by constructing a new state that also encodes which agent is to move, the game reduces to the standard single-agent procedure.
Just be careful that, usually, the legal action set varies with the state, so it is more convenient to denote the legal action set as a function of the state.
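The state-construction idea above can be sketched in a few lines. This is a hypothetical illustration (the class and method names are not from Tianshou): a cyclic-move two-player game is exposed through a standard single-agent `reset()`/`step()` interface by bundling the board, the id of the agent to move, and a per-state legal-action mask into the observation.

```python
import numpy as np

class TurnBasedWrapper:
    """Hypothetical sketch: wrap a cyclic-move two-player game
    (e.g. Tic-Tac-Toe) in a standard single-agent interface.
    The observation carries the board, the agent to move, and a
    mask of legal actions, since the legal set varies with state.
    """

    def __init__(self):
        self.board = np.zeros(9, dtype=np.int8)  # 0 empty, 1 / -1 players
        self.current = 0                         # index of the agent to move

    def _obs(self):
        return {
            "agent_id": self.current,
            "obs": self.board.copy(),
            "mask": self.board == 0,  # legal actions depend on the state
        }

    def reset(self):
        self.board[:] = 0
        self.current = 0
        return self._obs()

    def step(self, action):
        assert self.board[action] == 0, "illegal move"
        self.board[action] = 1 if self.current == 0 else -1
        self.current = 1 - self.current  # cyclic move: alternate turns
        done = not (self.board == 0).any()
        reward = 0.0  # terminal win/loss rewards omitted in this sketch
        return self._obs(), reward, done, {}
```

With this wrapping, a single policy can drive both turns and the multi-agent game looks like an ordinary MDP to the training loop.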
I don't think there is a need for multi-agent reinforcement learning in the short term. To me the priority is to improve the current functionality. There are major flaws in the current implementation, especially regarding the efficiency of distributed sampling, which are far more critical to handle than adding new features. The core has to be robust before building on top of it. Yet, of course, it is still interesting to discuss the implementation of future features!
This feature does not change any of the current code and is also compatible with the 2d-buffer. I think it is independent of what you said.
Of course it is! The workforce being limited, someone working on a specific feature necessarily affects / slows down the development of the others. That's the point of prioritizing the development of some features with respect to others: it is often not because of interdependencies, but rather because of limited workforce. But obviously, in an open-source project, anyone is free to work on the features they want to.
I gave an example of playing Tic-Tac-Toe in the test case, without modifying the core code. The next step seems unclear to me. I do not know how people in MARL typically train their models, especially how they sample experience. Some possible ideas may be:
Which one is commonly used? Or both? Or are there other paradigms? This should be clarified by some experts in the MARL area.
Issue #136 is a great discussion on multi-agent RL with simultaneous moves. The conclusion is that it can be handled without modifying the core of Tianshou: just inherit some class and re-implement one function in a few lines, depending on the specific scenario.
I am not an expert in MARL and I have just discovered Tianshou. With that said, here are some thoughts based on the papers I have been reading recently. There are many workflows for MARL with different training (centralized/decentralized), execution (centralized/decentralized), type of agents (homogeneous/heterogeneous), task setting (cooperative/competitive/mixed), types of reward (individual/global), etc. Here are some common MARL types:
The problem with 1 is that it tends not to converge and might require managing different networks in training and execution (heterogeneous agents). Types 2 and 3 are strong trends in MARL. Type 2 can use the algorithms already implemented in Tianshou with small changes to the policy or collector. Type 3 would require implementing specific algorithms (e.g., QMIX, MADDPG, or COMA), like in RLlib. I think type 4 can be implemented using Tianshou with no change, but in practice it is hard to scale, as the joint action space grows exponentially in the number of agents. Therefore, type 2 might be a good start. If there is a large interest in making Tianshou a MARL library, maybe it is worth developing 3.
Thank you for your comment and the overall description of MARL algorithms. It seems MARL has many variants and it is difficult to support them all at once. We have to support MARL step by step.
Some updates:
You guys might want to take a look at this: https://github.com/PettingZoo-Team/PettingZoo |
I noticed that the cheat sheet for RL states that Tianshou supports simultaneous move (case 1):
**Question:** Is it possible to use the MultiAgentPolicyManager to work with a varying number of agents?
Last year (#136) I had to "tweak" the policy and net classes by adding a dimension of n_agents to the batch.
Sorry about that, that is actually beyond the current scope. I haven't come up with a good design choice for this kind of requirement... |
Isn't that just the case of adding a dimension of n_agent to the actions and observations, like I did last year, and adapting the operations in the policy to that? When you reset the environment, it will set a new number of agents for that episode.
hmm yep, you're right |
Assuming that I can fix the number of agents for a certain period of training, does the class MultiAgentPolicyManager support parameter sharing (single policy for multiple agents) and simultaneous actions? If not, are there other classes that can support it? |
Can you pass the same reference into MAPM, i.e., MultiAgentPolicyManager([policy_1, policy_1, policy_2])? I think that would be fine to some extent.
That might work, but I think it will train the same network with separate batches, right?
Exactly. That should be a tiny issue, but I think it would be fine for the agents to learn, though it is a little bit inefficient.
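The shared-reference idea discussed here can be illustrated without Tianshou at all. In the sketch below, `ToyPolicy` is a hypothetical stand-in for a real policy object; the point is only that passing the same object twice (as in MultiAgentPolicyManager([policy_1, policy_1, policy_2])) means both agents' updates hit the same parameters.

```python
import numpy as np

class ToyPolicy:
    """Hypothetical stand-in for a policy: a single weight vector."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def learn(self, grad):
        self.w -= 0.1 * grad  # one SGD step on this policy's parameters

shared = ToyPolicy(3)
other = ToyPolicy(3)

# Analogue of MultiAgentPolicyManager([policy_1, policy_1, policy_2]):
# agents 0 and 1 share parameters, agent 2 has its own.
manager = [shared, shared, other]

assert manager[0] is manager[1]   # same object => shared parameters
manager[0].learn(np.ones(3))      # an update from agent 0's batch...
assert np.allclose(manager[1].w, -0.1)  # ...is immediately visible to agent 1
assert np.allclose(manager[2].w, 0.0)   # agent 2 is untouched
```

As noted in the thread, the shared policy is still trained with each agent's batch separately, which is slightly inefficient but does not prevent learning.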
Thanks for the quick reply, @Trinkle23897 . |
I don't quite understand. Do you mean you use different algorithms in the same MAPM? The current implementation only accepts a list of on-policy algorithms or a list of off-policy algorithms. It would require no code changes if you use something like …
No. I am using a single policy. What I mean is that last year I changed the PPO policy and the neural network to deal with parameter sharing and simultaneous actions. I changed the code of Tianshou to manage the additional dimension in my setting. The environment sends observations, rewards, etc. with the number of agents as the first dimension (n_agents, ?). These changes were made in …
So, my point is that it would be tricky to do that for multiple algorithms. I am curious whether I could address these changes only by customizing my own multi-agent policy manager (similar to MultiAgentPolicyManager, but with one policy and the changes mentioned above done directly in its methods).
Yeah, you can definitely do that. In MAPM the only thing you need to do is re-organize the Batch into buffer-style data (reshape or flatten so that the first dim is n_agent * bsz; also pay attention to the done flag), and then it would be the same as single-agent PPO.
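The re-organization described above can be sketched with plain NumPy. The shapes here are illustrative assumptions, not Tianshou's actual Batch layout: the agent dimension is folded into the batch dimension so the data looks like ordinary single-agent input, and the per-step done flag is broadcast across agents.

```python
import numpy as np

bsz, n_agent, obs_dim = 4, 3, 5

# Illustrative multi-agent batch: one transition stores every agent's data.
obs = np.random.randn(bsz, n_agent, obs_dim)
act = np.random.randint(0, 2, size=(bsz, n_agent))
rew = np.random.randn(bsz, n_agent)
done = np.array([False, False, True, False])  # one flag per environment step

# Re-organize into buffer-style data: the first dim becomes n_agent * bsz,
# so the result can be fed to an unmodified single-agent algorithm.
flat_obs = obs.reshape(bsz * n_agent, obs_dim)
flat_act = act.reshape(bsz * n_agent)
flat_rew = rew.reshape(bsz * n_agent)

# Pay attention to the done flag: each agent's trajectory ends when the
# environment episode ends, so repeat it across the agent dimension
# (matching the row order produced by the reshape above).
flat_done = np.repeat(done, n_agent)

assert flat_obs.shape == (12, 5)
assert flat_done.sum() == n_agent  # the terminal step ends all 3 agent streams
```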
Could I offer a thought? You might want to consider using the PettingZoo parallel API (it has two APIs). The parallel API isn't significantly different from what I understand your 1.0 one to be, and this way you'd be using a standard instead of a custom one. A bunch of third-party libraries for PettingZoo already exist (e.g., p-veloso above has one), and RLlib and Stable Baselines interface with it, as do several more minor RL libraries.
That's pretty cool! I'm wondering if you have any interest in integrating this standard library into tianshou. |
Sure we can do that! It'll probably take a few weeks though |
@Trinkle23897, For now, I gave up on my previous approach (manually changing the shape of the batch), because it would require tweaking some of the algorithms that rely on the order of the observations (e.g., GAE, multi-step value prediction, etc.). I am trying the approach that you cited above, but there is a problem:
I went through the above discussion again. So if the environment produces all agents' steps simultaneously and uses only one policy, there's no need to use/follow MAPM. Instead, treat this environment as a normal single-agent environment, e.g.,
and then use the normal collector/buffer workflow (a transition in the buffer stores several agents' obs/act/rew/... at that timestep). One thing you need to do is customize your network to squash the second dimension (num_agent) into the first dimension (batch size). Since the reward is an array instead of a scalar, you should pass … Please let me know if there's anything unclear (or maybe I misunderstood some parts lol)
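The "squash num_agent into the batch dimension" step can be sketched as follows. This is a toy linear "network" in NumPy, purely illustrative: the forward pass folds the (bsz, n_agent, obs_dim) input into the batch dimension, applies the shared parameters once, and restores the per-agent structure on the way out.

```python
import numpy as np

def forward(obs, weight):
    """Sketch of a network forward pass for a simultaneous-move setting.

    obs has shape (bsz, n_agent, obs_dim). The agent dimension is squashed
    into the batch dimension, a toy linear layer is applied, and the
    per-agent outputs are restored afterwards.
    """
    bsz, n_agent, obs_dim = obs.shape
    flat = obs.reshape(bsz * n_agent, obs_dim)  # squash num_agent into batch
    logits = flat @ weight                      # (bsz * n_agent, n_action)
    return logits.reshape(bsz, n_agent, -1)     # restore per-agent outputs

obs = np.random.randn(4, 3, 5)   # 4 env steps, 3 agents, obs_dim 5
weight = np.random.randn(5, 2)   # toy shared parameters, 2 actions
out = forward(obs, weight)
assert out.shape == (4, 3, 2)
```

Because all agents share the same parameters, this is exactly parameter sharing: one network evaluates every agent in a single batched call.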
Yes. In a previous version of tianshou I tried to fix that with the following changes: NETWORK
But that is not enough, because the single agent algorithms also assume that shape, so I have to change the batch or change them directly, such as PPO:
As I mentioned in another post, the problems with this approach are:
That is why I was looking for a more general approach outside of the policies. I tried your idea of repeating the same policy in the policy manager for each agent, which is nice because each policy will only deal with a batch of one of the agents separately. However, that
in BaseVectorEnv:
Yep, that's what I was doing previously.
Yeah, I thought about this approach last night. Let's say you have 4 envs and each env needs 5 agents, so you need a … So this comes to one natural way: construct another vector env that inherits the existing one,
therefore you can execute only num_env envs but get num_env x num_agent results at each step, without modifying the agent's code and without using MAPM.
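A minimal sketch of that vector-env idea, with entirely hypothetical class names (this does not reproduce Tianshou's BaseVectorEnv API): run num_env multi-agent environments internally, but expose num_env * num_agent flat single-agent slots to the collector.

```python
import numpy as np

class FlattenedVectorEnv:
    """Hypothetical sketch: execute num_env multi-agent environments but
    present num_env * num_agent single-agent result slots per step, so an
    unmodified single-agent training loop sees one flat batch of agents.
    Each wrapped env is assumed to return per-agent arrays of shape
    (n_agent, obs_dim) and (n_agent,) rewards.
    """

    def __init__(self, envs, n_agent):
        self.envs = envs
        self.n_agent = n_agent

    def reset(self):
        obs = np.stack([env.reset() for env in self.envs])  # (num_env, n_agent, obs_dim)
        return obs.reshape(len(self.envs) * self.n_agent, -1)

    def step(self, actions):
        # actions arrive flat; regroup them per environment
        actions = np.asarray(actions).reshape(len(self.envs), self.n_agent)
        obs, rew, done = [], [], []
        for env, act in zip(self.envs, actions):
            o, r, d, _ = env.step(act)
            obs.append(o)
            rew.append(r)
            done.append(np.full(self.n_agent, d))  # episode end applies to all agents
        obs = np.stack(obs).reshape(len(self.envs) * self.n_agent, -1)
        return obs, np.concatenate(rew), np.concatenate(done), {}
```

With 4 envs of 5 agents each, `reset()` and `step()` return 20 rows, matching the "execute num_env envs but get num_env x num_agent results" description above.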
I spent the last day trying the different approaches. I have just solved the "policy approach" for DQN in the latest version of Tianshou, but I would definitely prefer a higher-level modification that can work with all the original policies. I think your suggestion is similar to what SuperSuit does ... but I had no idea how to do that in Tianshou. Thanks for the suggestion. Just for clarification, according to your current idea, would I still need to change other parts, such as the forward pass of the neural network?
None of them I think. |
It works! Thanks again. |
@p-veloso I just saw this, but while the SuperSuit example here: https://github.com/PettingZoo-Team/SuperSuit#parallel-environment-vectorization is for Stable Baselines, all it does is translate the parallel environment into a vector environment. Since Tianshou supports vector environments out of the box for all algorithms, you should just be able to use SuperSuit's environment vectorization rather than your own custom code. If there is some reason it doesn't work out of the box, feel free to raise an issue with SuperSuit asking for support for Tianshou.