This is a summary of my solution for the 2020-2021 "Rock, Paper, Scissors" competition hosted on Kaggle. Participants had to write bots that competed against each other in the well-known RPS game. Each matchup consisted of 1000 rounds, making this an excellent task for machine learning/statistics/AI.
When you first hear of Rock, Paper, Scissors, you may wonder why it makes an interesting game for a computer competition. After all, there is simply no way to statistically beat the Nash equilibrium strategy (playing completely randomly). However, this doesn't mean the challenge is meaningless, because the majority of opponents your bot may face aren't playing purely randomly. Therefore, the goal is to win against these deterministic agents while not losing to them at the same time (playing deterministically is a double-edged sword). I'll try to give a short, high-level overview of the approach I chose for this task: an ensemble strategy involving multiple meta-layers.
In my code, a policy/strategy/action selector is simply a mapping from the current game history to a probability distribution over the three possible actions. A policy that selects rock all the time would yield [1, 0, 0], for example. There are multiple ways to create new policies from existing ones, of which I use the following:
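For illustration, here is a minimal sketch of what such a policy could look like in code. The `Policy` base class, the 0 = rock / 1 = paper / 2 = scissors encoding, the history format and the stochastic frequency example are assumptions made for this sketch, not the exact interface of my submission:

```python
import numpy as np

class Policy:
    """A policy maps the current game history to a distribution over the three actions."""
    def probabilities(self, my_history, opp_history):
        raise NotImplementedError

class AlwaysRock(Policy):
    """The example from above: always play rock."""
    def probabilities(self, my_history, opp_history):
        return np.array([1.0, 0.0, 0.0])

class FrequencyPolicy(Policy):
    """Play the counter of each opponent action with probability proportional to its frequency."""
    def probabilities(self, my_history, opp_history):
        if not opp_history:
            return np.full(3, 1 / 3)
        counts = np.bincount(opp_history, minlength=3).astype(float)
        return np.roll(counts / counts.sum(), 1)  # action (a + 1) % 3 beats action a
```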
This transformation takes the probabilities from a policy and selects the action with the highest probability, e.g. [0.6, 0.2, 0.2] becomes [1, 0, 0], giving us a deterministic version of the original policy.
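Building on the `Policy` sketch above, the strict transformation could look like this:

```python
class StrictPolicy(Policy):
    """Deterministic version of a policy: put all probability mass on its most likely action."""
    def __init__(self, policy):
        self.policy = policy

    def probabilities(self, my_history, opp_history):
        probs = self.policy.probabilities(my_history, opp_history)
        strict = np.zeros(3)
        strict[int(np.argmax(probs))] = 1.0
        return strict
```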
"When my opponent thinks that I think he thinks I think..." Dan Egnor already implemented this idea in his Iocaine Powder bot. One takes a policy and shifts the probabilities by one to produce a version that would counter the current one. [0.2, 0.5, 0.3] becomes [0.3, 0.2, 0.5]. One can derive a double incremented policy by shifting the incremented version.
This is an attempt to construct a policy that counters a given policy. The counter policy is constructed by assuming the opponent uses the original policy (one has to swap one's own actions and the opponent's actions in the game history) and shifting the resulting probability distribution by one.
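A rough sketch of the idea, with the role swap expressed simply by exchanging the two history arguments (the details of the history flip in my actual submission may differ):

```python
class CounterPolicy(Policy):
    """Assume the opponent runs `policy` and play the distribution that counters it."""
    def __init__(self, policy):
        self.policy = policy

    def probabilities(self, my_history, opp_history):
        # Evaluate the policy from the opponent's point of view (own/opponent actions flipped).
        opponent_probs = self.policy.probabilities(opp_history, my_history)
        # Shifting by one turns "what the opponent will play" into "what beats it".
        return np.roll(opponent_probs, 1)
```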
As you can see, it is relatively easy to derive a whole set of policies from a single stochastic policy. By combining the above transformations, one can get up to twelve policies out of a single one.
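Put together, the derivation could look roughly like this (the exact set of combinations in my submission may differ, but it illustrates how twelve variants arise from one base policy). These transformations get applied to the following base policies:

```python
def expand(policy):
    """Derive up to 2 * 3 * 2 = 12 variants from a single stochastic base policy."""
    variants = []
    for base in (policy, CounterPolicy(policy)):          # plain and counter version
        for shift in (0, 1, 2):                           # plain, incremented, double-incremented
            shifted = base if shift == 0 else IncrementedPolicy(base, shift)
            variants.append(shifted)                      # stochastic variant
            variants.append(StrictPolicy(shifted))        # strict variant
    return variants
```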
- Random
- Frequency
- Copy last action
- Transition matrix
- Transition tensor
- Random forest model
- Two adapted stochastic versions of RFind
- Some fixed sequences (pi, e, de Bruijn)
- Seed searcher by @taaha-khan
- RPS Geometry by @superant
- Anti-Geometry bot by @robga
- Greenberg by Andrzej Nagorko, translated into Python by Travis Erdman
- Iocaine Powder by Dan Egnor, translated into Python by David Bau
and the following RPS Contest bots, which I could integrate into my ensemble using a method by Daniel Lu (@dllu):
- testing please ignore by Daniel Lu
- centrifugal bumblepuppy 4 by Daniel Lu
- IO2_fightinguuu by sdfsdf
- dllu1 by Daniel Lu
- RPS_Meta_Fix by TeleZ
- Are you a lucker? by sdfsdf
- bayes14 by pyfex
I tried many ways of selecting the best policy for a move, including weighted agent voting, Thompson sampling, a beta distribution, a Dirichlet distribution, LSTMs, CNNs and max wins, but one outperformed them all: argmax applied to the decayed performance scores of the policies (which I didn't even consider until one week before the competition end date because it was so simple). Every step, the performance of a policy is updated like this: `score = probabilities[winning_action] - probabilities[losing_action]`. My final version of the policy selector simply takes the three policies with the highest decayed performance scores and combines their probabilities using the weights [0.7, 0.2, 0.1]. I really don't know why this performs so well, but it does.
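As a sketch of this step: the decay value and the data layout (each policy's proposed distribution as a NumPy array) are placeholder assumptions; only the score formula and the [0.7, 0.2, 0.1] weights come from the description above.

```python
import numpy as np

DECAY = 0.95                 # hypothetical decay factor (long- vs. short-term success)
TOP_WEIGHTS = [0.7, 0.2, 0.1]

def update_scores(scores, last_probs, opp_action):
    """Decay the old scores and add each policy's score for the last round."""
    winning_action = (opp_action + 1) % 3   # action that would have beaten the opponent
    losing_action = (opp_action + 2) % 3    # action that would have lost to the opponent
    for i, probs in enumerate(last_probs):
        scores[i] = DECAY * scores[i] + probs[winning_action] - probs[losing_action]
    return scores

def select_distribution(scores, current_probs):
    """Blend the three best-scoring policies' current distributions with fixed weights."""
    best = np.argsort(scores)[::-1][:3]
    mixed = sum(w * current_probs[i] for w, i in zip(TOP_WEIGHTS, best))
    return mixed / mixed.sum()
```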
I added a meta-layer by creating multiple policy selectors with different parameters for calculating the performance score. The parameters are the decay (used to differentiate between long- and short-term success), the drop probability (the probability of resetting a policy's score when it loses; this didn't perform too well in combination with a decay < 0.99) and the zero clip (whether the score should be clipped at zero). The agent then chooses the best-performing parameter configuration (measured by a decayed configuration performance score with additional hyperparameters) using argmax on the configuration weights. These weights are computed by sampling from a Dirichlet distribution parameterized by the configuration performance scores. At first, I tried combining the probabilities of the configurations weighted by their respective performance, but this led to way too much noise. Only configurations that yield a decisive result at this step get considered, however. If the result is too indecisive (the sum of the weights of all configurations yielding decisive probabilities is too small), my agent selects a uniformly random action instead of choosing from the distribution it got from the best parameter configuration.
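A rough sketch of this configuration layer; the parameter grid, the epsilon clamp on the Dirichlet parameters and the decisiveness threshold are illustrative assumptions, not my actual values:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical parameter grid; each entry corresponds to one policy selector.
configs = [
    {"decay": d, "drop_probability": p, "zero_clip": c}
    for d in (0.9, 0.97, 1.0)
    for p in (0.0, 0.2)
    for c in (False, True)
]
config_scores = np.zeros(len(configs))   # decayed performance score per configuration

def choose_configuration(config_scores, decisive, threshold=0.5):
    """Sample Dirichlet weights from the config scores and pick the heaviest decisive config.

    `decisive` is a boolean NumPy mask marking configurations that produced a decisive
    distribution this step; if they carry too little total weight, return None so the
    caller can fall back to a uniformly random action.
    """
    alphas = np.maximum(config_scores, 1e-3)       # Dirichlet parameters must be positive
    weights = rng.dirichlet(alphas)
    if weights[decisive].sum() < threshold:
        return None
    weights = np.where(decisive, weights, 0.0)
    return int(np.argmax(weights))
```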
This is the last meta-layer in my agent and maybe the one with the biggest impact. Choosing a random action is a great defensive tool, as there is simply no way to beat it. An additional decayed performance score gets computed for the agent; at this level, I no longer use a probability distribution for the score calculation but the action the agent actually selected. The agent chooses a random action with a certain probability in the following cases (a rough sketch follows the list):
- The probability of winning the match by playing randomly for the remaining rounds is >= 92%
- The agent performance score is below a certain threshold (I tried playing the counter of the counter of the action selected by the agent with a certain probability, but this didn't perform too well.)
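A minimal sketch of this fallback logic; the fallback probability and the score threshold are placeholders, only the 92% figure comes from the rules above:

```python
import random

SAFE_WIN_PROBABILITY = 0.92
SCORE_THRESHOLD = -1.0      # hypothetical cutoff for the agent performance score
FALLBACK_PROBABILITY = 0.8  # hypothetical probability of actually overruling the agent

def choose_action(agent_action, agent_score, random_win_probability):
    """With a certain probability, overrule the ensemble with a random action in the cases above."""
    defensive = (random_win_probability >= SAFE_WIN_PROBABILITY
                 or agent_score < SCORE_THRESHOLD)
    if defensive and random.random() < FALLBACK_PROBABILITY:
        return random.randrange(3)
    return agent_action
```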
So that's it: essentially just a strange cascade of scores and meta-layers operating on top of an ensemble of other agents. Interestingly, this architecture performs way better than previous ones with more meta-layers, more policies and more sophisticated methods for combining weighted probability distributions. This could be because a bot essentially degenerates to random beyond a certain degree of complexity.

The stochastic base policies, which are obviously strongly correlated with their strict versions, are almost never used by the ensemble. However, removing them resulted in a major decline in leaderboard performance. Removing almost all policies (except Greenberg and the Geometry bot by @superant) while keeping their counter policies results in a very strong agent, even outperforming the original version on the leaderboard. My interpretation is that this reduces noise that would distract the agent from cracking the opponent's strategy, and that minimal logic (as already demonstrated by @superant) plus counter policies suffices.

In my local tests, using online-learned LSTMs as policy scoring functions vastly outperformed the approach described here (by up to 80 points), particularly because the agents with LSTMs adapt extremely quickly to policy changes. On the leaderboard, however, this strategy performs worse. I'm convinced there is much room for further improvement using neural networks with a good architecture and maybe a combination of offline and online learning.
Submitting almost all agents only once and using only a fraction of my submissions doesn't seem to be a good idea in a competition where a few matches against random agents can completely destroy an agent's score.
RPS has been an amazing competition and I'm sure I learned a lot from it. What I will remember most, however, is the incredible collaboration and helpfulness in the discussions section of this competition. Thanks to everyone who made this challenge the great experience it was. Of course, also special thanks to Kaggle for hosting this competition and providing excellent support.