Multi-armed Bandit

The multi-armed bandit (MAB) is a simple, easy-to-implement toy problem for understanding the basic concepts of reinforcement learning. The code here solves the MAB problem with two action-selection strategies, epsilon-greedy and UCB (Upper Confidence Bound), and uses the sample average of the rewards observed for each arm as that arm's Q value.
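The idea can be sketched in a few lines of Python. This is a minimal illustration and not the code in `bandit.py` or `action_select.py`: the names `GaussianBandit`, `epsilon_greedy` and `run`, the unit reward variance, and the example means are all assumptions made for the sketch.

```python
import random


class GaussianBandit:
    """Hypothetical k-armed bandit: arm i pays a reward drawn from N(means[i], 1)."""

    def __init__(self, means, rng):
        self.means = list(means)
        self.rng = rng

    def pull(self, arm):
        return self.rng.gauss(self.means[arm], 1.0)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so early pulls are not biased toward arm 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run(means, steps=2000, epsilon=0.1, seed=0):
    """Train for `steps` pulls and return (estimated Q values, pull counts)."""
    rng = random.Random(seed)
    bandit = GaussianBandit(means, rng)
    q_values = [0.0] * len(means)
    counts = [0] * len(means)
    for _ in range(steps):
        arm = epsilon_greedy(q_values, epsilon, rng)
        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incremental sample average: Q <- Q + (reward - Q) / n
        q_values[arm] += (reward - q_values[arm]) / counts[arm]
    return q_values, counts


if __name__ == "__main__":
    # Illustrative means only; they are not taken from the experiments here.
    q, n = run([0.2, 0.8, 1.5, 0.5, 1.0])
    print("estimated Q:", [round(v, 2) for v in q])
    print("pull counts:", n)
```

The incremental update gives the same value as averaging all rewards seen for an arm, but it does not need to store the reward history.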

Preliminary results

Epsilon-greedy action selection

5-armed stationary Gaussian bandit, trained for 2000 epochs.

Estimated Q values compared with the true reward means.

Different values of epsilon for epsilon-greedy.

Using UCB for action selection.
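For reference, a common form of UCB selection picks the arm that maximises Q(a) + c * sqrt(ln t / N(a)). The sketch below assumes that form; the function name `ucb_select`, the default `c=2.0`, and the rule of trying unpulled arms first are illustrative choices and may differ from what `action_select.py` does.

```python
import math


def ucb_select(q_values, counts, t, c=2.0):
    """Return the arm maximising Q(a) + c * sqrt(ln(t) / N(a)).

    t is the 1-based index of the current step and c is an assumed
    exploration constant. Arms that were never pulled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep being revisited until their estimates are trustworthy.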

Solving the non-stationary (unstable) Gaussian bandit with a fixed step size alpha, which gives more weight to recent rewards.
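Why a fixed alpha favours recent rewards can be seen by unrolling the update. The notation below is an assumption made for this note: Q_n is the estimate before the n-th reward R_n of a given arm, Q_1 is the initial estimate, and 0 < alpha <= 1.

```latex
\begin{aligned}
Q_{n+1} &= Q_n + \alpha \, (R_n - Q_n) \\
        &= \alpha R_n + (1 - \alpha) \, Q_n \\
        &= (1 - \alpha)^n \, Q_1 + \sum_{i=1}^{n} \alpha \, (1 - \alpha)^{n-i} \, R_i
\end{aligned}
```

The weight on R_i decays geometrically with its age n - i, whereas the 1/n step size weights every past reward equally.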

The non-stationary Gaussian bandit is shown below: the mean reward of every arm changes over time.

Comparison of the fixed step size alpha and the usual sample-average (1/n) step size on the non-stationary MAB problem.
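The difference between the two step sizes shows up even on a single arm. The sketch below is not the experiment from this repository: it uses a made-up, noise-free reward sequence whose value jumps from 0 to 1 halfway through, and the helper names `track`, `sample_average` and `constant` are hypothetical.

```python
def track(rewards, step_size):
    """Apply Q <- Q + step_size(n) * (r - Q) over rewards, starting from Q = 0."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += step_size(n) * (r - q)
    return q


def sample_average(n):
    """Step size 1/n: every reward seen so far gets equal weight."""
    return 1.0 / n


def constant(alpha):
    """Fixed step size alpha: recent rewards get more weight."""
    return lambda n: alpha


# Illustrative sequence: the reward is 0 for 100 steps, then 1 for 100 steps.
rewards = [0.0] * 100 + [1.0] * 100
q_avg = track(rewards, sample_average)
q_alpha = track(rewards, constant(0.1))
print(f"1/n step: {q_avg:.3f}, alpha=0.1 step: {q_alpha:.3f}")
```

On this sequence the 1/n estimate ends at the overall average of 0.5, while the fixed-alpha estimate ends close to the current value of 1.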
