Multi-armed bandit algorithms are fundamental to recommendation systems, online advertising, clinical trials, online resource allocation, and online search. This repository collects Python implementations of several of them: ETC, E-Greedy, Elimination, UCB, Exp3, LinearUCB, and Thompson Sampling.
Compared with general reinforcement learning settings, bandit problems make three key simplifying assumptions:
- The reward observed corresponds only to the action taken (bandit feedback).
- Taking an action does not alter the state of the environment.
- Taking an action does not restrict future actions.
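The bandit-feedback assumption above can be made concrete with a minimal sketch. The snippet below is illustrative only (it is not the repository's own implementation): an epsilon-greedy learner on hypothetical Bernoulli arms, where on each round only the pulled arm's reward is revealed.

```python
import random

def epsilon_greedy(true_means, horizon, epsilon=0.1, seed=0):
    """Illustrative epsilon-greedy on Bernoulli arms.

    `true_means`, `horizon`, and `epsilon` are assumed parameters for
    this sketch; the repository's implementation may differ.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k        # number of pulls per arm
    estimates = [0.0] * k   # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore uniformly at random
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        # Bandit feedback: only the chosen arm's reward is observed.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward
```

With enough rounds, the estimate for the best arm concentrates around its true mean and most pulls go to that arm.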
Furthermore, these algorithms differ in whether they require advance knowledge of the time horizon or the suboptimality gaps, and in how their regret bounds scale.
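As one example of this distinction, UCB needs neither the horizon nor the gaps in advance, whereas ETC typically needs the horizon (and the gaps, to tune its exploration length). Below is a hedged sketch of UCB1 on hypothetical Bernoulli arms, again illustrative rather than the repository's code:

```python
import math
import random

def ucb1(true_means, horizon, seed=0):
    """Illustrative UCB1 on Bernoulli arms.

    Requires no prior knowledge of the horizon or the gaps: the
    confidence radius sqrt(2 * ln t / n_a) shrinks as arm `a` is
    pulled more often. `true_means` is an assumed toy environment.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda a: estimates[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates
```

Because suboptimal arms' confidence bounds shrink as they are sampled, the pull counts concentrate on the best arm over time.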
For detailed background on each algorithm, see Bandit Algorithms (2020) by Lattimore & Szepesvári.