Contextual Bandits

The multi-armed bandit (MAB) problem provides a simplified RL setting in which the agent learns to act in a single, fixed situation: both the context (observation/state) and the arms (actions/items to select) stay the same across rounds. The contextual bandit problem extends MAB: at each round the agent has access not only to a set of bandit arms/actions but also to a context (state) associated with that round. The context changes from round to round but is not affected by the action the agent takes. The agent's objective is to maximize cumulative reward by learning how the context relates to the rewards of the arms, balancing the trade-off between exploration and exploitation.

Contextual bandit algorithms typically consist of an action-value model (Q model) and an exploration strategy (e.g., epsilon-greedy, LinUCB, or Thompson Sampling).
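To make the two components concrete, here is a minimal, self-contained sketch of such a loop in Python: a per-arm online ridge-regression Q model paired with an epsilon-greedy exploration strategy. The environment, dimensions, and hyperparameters are all illustrative, not part of RLlib's implementation.

import numpy as np

rng = np.random.default_rng(0)
n_arms, context_dim, n_rounds, epsilon = 5, 8, 2000, 0.1

# Illustrative environment: each arm's expected reward is linear in the context.
true_weights = rng.normal(size=(n_arms, context_dim))

# Q model: per-arm online ridge regression (A_a = X^T X + I, b_a = X^T r).
A = np.stack([np.eye(context_dim) for _ in range(n_arms)])
b = np.zeros((n_arms, context_dim))

total_reward = 0.0
for t in range(n_rounds):
    x = rng.normal(size=context_dim)  # fresh context each round
    theta = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    q_values = theta @ x              # estimated reward of each arm in this context

    # Exploration strategy: epsilon-greedy over the Q estimates.
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(q_values))

    reward = true_weights[arm] @ x + rng.normal(scale=0.1)
    total_reward += reward

    # Update the chosen arm's sufficient statistics.
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

print(f"average reward over {n_rounds} rounds: {total_reward / n_rounds:.3f}")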

RLlib supports the following online contextual bandit algorithms, named after the exploration strategies that they employ:

Linear Upper Confidence Bound (BanditLinUCB)

paper

LinUCB assumes a linear dependency between the expected reward of an action and its context. It estimates the Q value of each action using ridge regression. It constructs a confidence region around the weights of the linear regression model and uses this confidence ellipsoid to estimate the uncertainty of action values.
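As a rough sketch of the resulting scoring rule, assuming the disjoint per-arm linear model from the paper (the function name, alpha, and array shapes here are illustrative, not RLlib's API):

import numpy as np

def linucb_scores(contexts, A_inv, b, alpha=1.0):
    """Upper-confidence-bound score for each arm.

    contexts: (n_arms, d) per-arm feature vectors x_a
    A_inv:    (n_arms, d, d) inverses of A_a = X_a^T X_a + I (ridge regression)
    b:        (n_arms, d) accumulated X_a^T r_a
    """
    scores = np.empty(len(contexts))
    for a, x in enumerate(contexts):
        theta = A_inv[a] @ b[a]                    # ridge-regression weight estimate
        mean = theta @ x                           # expected reward of arm a
        width = alpha * np.sqrt(x @ A_inv[a] @ x)  # confidence-ellipsoid width
        scores[a] = mean + width                   # optimism in the face of uncertainty
    return scores

The agent then plays np.argmax(scores): an arm is chosen either because its estimated reward is high or because its estimate is still uncertain.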

Linear Thompson Sampling (BanditLinTS)

paper

Like LinUCB, LinTS also assumes a linear dependency between the expected reward of an action and its context and uses online ridge regression to estimate the Q values of actions given the context. It assumes a Gaussian prior on the weights and a Gaussian likelihood function. To decide which action to take, the agent samples weights for each arm from the posterior distributions and plays the arm whose sampled weights predict the highest reward.
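A hedged sketch of that sampling step, with the same illustrative naming as above; A_inv and b are the per-arm ridge-regression statistics, and v scales the posterior covariance:

import numpy as np

def lints_action(x, A_inv, b, v=1.0, rng=None):
    """Pick an arm by Thompson Sampling from a Gaussian posterior per arm.

    x:     (d,) context shared by all arms this round
    A_inv: (n_arms, d, d) posterior covariance factors from online ridge regression
    b:     (n_arms, d) accumulated context-times-reward sums
    """
    rng = rng or np.random.default_rng()
    sampled_rewards = []
    for a in range(len(b)):
        mu = A_inv[a] @ b[a]                                  # posterior mean of weights
        theta = rng.multivariate_normal(mu, v**2 * A_inv[a])  # sample weights from posterior
        sampled_rewards.append(theta @ x)                     # reward predicted by the sample
    return int(np.argmax(sampled_rewards))                    # play the best sampled arm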

Installation

conda create -n rllib-bandit python=3.10
conda activate rllib-bandit
pip install -r requirements.txt
pip install -e '.[development]'

Usage
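A minimal training sketch, assuming the package exposes BanditLinUCBConfig from rllib_bandit.bandit (following the usual rllib_contrib layout) and using one of RLlib's example bandit environments; adjust the import paths to match the actual package:

from ray.rllib.examples.env.bandit_envs_discrete import WheelBanditEnv
from rllib_bandit.bandit import BanditLinUCBConfig  # assumed rllib_contrib-style import

# Configure and build the LinUCB bandit algorithm on an example bandit env.
config = (
    BanditLinUCBConfig()
    .environment(env=WheelBanditEnv)
    .framework("torch")
)
algo = config.build()

for _ in range(10):
    result = algo.train()
    print(result["episode_reward_mean"])
algo.stop()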