A simplified, highly flexible, commented and (hopefully) easy-to-understand implementation of self-play based reinforcement learning, following the AlphaGo Zero paper (Silver et al.). It is designed to be easy to adapt to any two-player turn-based adversarial game and any deep learning framework of your choice. A sample implementation is provided for the game of Othello in PyTorch, Keras, TensorFlow and Chainer. An accompanying tutorial can be found here.
Coach.py contains the core training loop, and MCTS.py performs the Monte Carlo Tree Search. The parameters for self-play can be specified in main.py.
To start training a model for Connect4:
python main.py
Choose your framework and game in main.py.
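For reference, the self-play configuration in main.py typically looks something like the sketch below; the parameter names here are illustrative assumptions, not necessarily the ones used in the file:

```python
# Hypothetical sketch of the self-play configuration in main.py;
# parameter names are assumptions, check main.py for the actual ones.
args = {
    'numIters': 1000,        # self-play + training iterations
    'numEps': 100,           # self-play games per iteration
    'numMCTSSims': 25,       # MCTS simulations per move
    'cpuct': 1.0,            # exploration constant in the PUCT formula
    'tempThreshold': 15,     # moves before the sampling temperature drops to 0
    'arenaCompare': 40,      # games used to evaluate a newly trained network
    'updateThreshold': 0.6,  # win fraction needed to accept the new network
}
```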
The connect4 folder contains the game-specific code:
Connect4Game.py:
Defines the game-level interface: small functions for getting board states and valid action states, and a function for detecting when the game has ended.
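A minimal sketch of what such a game interface might look like; the method names below are assumptions based on the description above, not taken verbatim from the file:

```python
import numpy as np

class Connect4Game:
    """Hypothetical sketch of the game interface; method names are assumptions."""

    def get_init_board(self):
        # a 6x7 empty board: 0 = empty, +1 / -1 = the two players
        return np.zeros((6, 7), dtype=int)

    def get_action_size(self):
        # one action per column a piece can be dropped into
        return 7

    def get_game_ended(self, board, player):
        # returns 1 if `player` has won, -1 if the opponent has won,
        # a small nonzero value for a draw, and 0 while the game is running
        raise NotImplementedError  # win detection is delegated to Connect4Logic.py
```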
Connect4Logic.py:
Implements the board itself: setting up the board, computing valid moves, executing a move for a given action, and checking whether a state is a win.
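A condensed sketch of this kind of board logic, with an illustrative Board class (names and the exact win check are assumptions; only the horizontal case is shown):

```python
import numpy as np

class Board:
    """Illustrative sketch of the board logic; names here are assumptions."""

    def __init__(self, height=6, width=7, win_length=4):
        self.height, self.width, self.win_length = height, width, win_length
        self.pieces = np.zeros((height, width), dtype=int)  # 0 = empty cell

    def get_valid_moves(self):
        # a column is playable while its top cell is still empty
        return self.pieces[0] == 0

    def add_stone(self, column, player):
        # the stone falls to the lowest empty row in the chosen column
        empty_rows = np.flatnonzero(self.pieces[:, column] == 0)
        self.pieces[empty_rows[-1], column] = player

    def is_win(self, player):
        # horizontal runs only; vertical and diagonal checks are analogous
        for row in range(self.height):
            for col in range(self.width - self.win_length + 1):
                if np.all(self.pieces[row, col:col + self.win_length] == player):
                    return True
        return False
```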
Connect4Players.py:
Defines player classes, each containing a play function that decides how that player chooses an action. The players defined here are a random player, a human player and a one-step-lookahead player.
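For instance, a random player just samples uniformly among the valid moves; a minimal sketch, assuming a game object that exposes a boolean valid-move mask:

```python
import numpy as np

class RandomPlayer:
    """Sketch of a random player; the game interface assumed here is illustrative."""

    def __init__(self, game):
        self.game = game

    def play(self, board):
        # pick uniformly among the currently valid actions
        valids = self.game.get_valid_moves(board)  # assumed: boolean mask over actions
        return int(np.random.choice(np.flatnonzero(valids)))
```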
MCTS.py:
Uses Monte Carlo Tree Search to generate an improved policy during self-play.
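During the search, each node typically selects the action maximizing the PUCT score from the AlphaGo Zero paper; a minimal sketch of that selection step (variable names are illustrative):

```python
import math

def select_action(Q, N, P, cpuct):
    """Pick the action maximizing the PUCT score
    Q(s,a) + cpuct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).

    Q, N, P map actions to mean value, visit count and network prior;
    the bookkeeping in the real MCTS.py may differ."""
    total_visits = sum(N.values())
    return max(P, key=lambda a: Q[a] + cpuct * P[a]
               * math.sqrt(total_visits) / (1 + N[a]))
```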
Coach.py:
Contains the learning loop for the neural network. The neural network itself is defined inside the connect4/tensorflow folder.
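The loop follows the AlphaGo Zero recipe: generate games by self-play, train a candidate network on them, and accept the candidate only if it beats the current best in the arena. An illustrative outline (the callables passed in are hypothetical stand-ins; the real Coach.py API differs):

```python
def learn(self_play_episode, train, arena, num_iters=100, num_eps=100,
          update_threshold=0.6):
    """Illustrative outline of the self-play training loop; not the real
    Coach.py API. The three callables are hypothetical stand-ins:
      self_play_episode() -> list of (board, policy, outcome) examples
      train(examples)     -> a candidate network trained on the examples
      arena(new, old)     -> (wins, losses, draws) for the candidate
    """
    best_net = None  # stands in for the initial network
    for _ in range(num_iters):
        examples = []
        for _ in range(num_eps):
            examples += self_play_episode()  # generate data by self-play
        candidate = train(examples)          # train a candidate network
        wins, losses, _ = arena(candidate, best_net)
        # accept the candidate only if it wins often enough in the arena
        if wins / max(wins + losses, 1) >= update_threshold:
            best_net = candidate
    return best_net
```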
Arena.py:
Contains the methods that run a tournament between two agents.
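A minimal sketch of such a tournament loop, assuming a hypothetical play_single_game helper that returns the outcome of one game:

```python
def play_match(play_single_game, player1, player2, num_games=40):
    """Illustrative tournament loop; the real Arena.py API may differ.

    play_single_game(first, second) -> 1 if `first` wins, -1 if `second`
    wins, 0 for a draw (a hypothetical helper, passed in here)."""
    wins = losses = draws = 0
    for i in range(num_games):
        # alternate which player moves first to remove first-move bias
        if i % 2 == 0:
            result = play_single_game(player1, player2)
        else:
            result = -play_single_game(player2, player1)
        if result == 1:
            wins += 1
        elif result == -1:
            losses += 1
        else:
            draws += 1
    return wins, losses, draws
```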
In the multi-armed bandit setting, the regret of KL-UCB is lower than that of UCB. This can be exploited for action selection from a state, analogous to KL-UCB arm selection in a multi-armed bandit. The loss comparison below shows self-play using MCTS with KL-UCB versus MCTS with UCB.
A tournament between the two shows 53 wins, 39 losses and 8 draws for KL-UCB.
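For Bernoulli rewards, the KL-UCB index of an action with empirical mean p and n pulls at time t is the largest q satisfying n * KL(p || q) <= log(t) + c * log(log(t)), usually found by binary search. A sketch of that standard rule (not necessarily how it is adapted inside the tree search here):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q with pulls * KL(mean || q) <= log(t) + c * log(log(t)),
    found by binary search. The textbook KL-UCB rule, shown for
    illustration; the repository's implementation may differ."""
    bound = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# at each node, pick the child with the largest index, e.g.
# best = max(actions, key=lambda a: kl_ucb_index(Q[a], N[a], t))
```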
Thompson sampling (posterior sampling) has likewise been shown to achieve lower regret than UCB. The loss comparison between Thompson sampling and UCB is shown below.
In theory Thompson sampling should also have lower regret than KL-UCB, but in this experiment the two perform about the same.
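For comparison, Bernoulli Thompson sampling keeps a Beta posterior per action and, at each step, plays the action whose posterior sample is largest; a textbook sketch (the adaptation to tree search used here may differ):

```python
import numpy as np

def thompson_select(successes, failures):
    """Pick the action with the largest sample from its Beta posterior.

    successes[a] and failures[a] are arrays counting observed wins and
    losses through action a; Beta(1, 1) priors are assumed. A textbook
    Bernoulli Thompson sampling step, shown for illustration only."""
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

# e.g. thompson_select(np.array([5, 2, 0]), np.array([3, 4, 1]))
```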