Chapter 2: Upper Confidence Bound (UCB)
Demo: (animation)
From the above demo, we can see that the ML agent gives preference to the arms it has explored less. This is because arms with less exploration have a higher UCB. Since the UCB is folded into the reward, those arms end up with higher overall rewards.
Press `[F5]` to restart the demo.
In Chapter 1, we saw that the ML agent may overlook a potentially best arm due to an unlucky start and prematurely conclude that the arm isn't worth exploiting. How do we let the agent know that some arms actually have good potential and should be exploited even when their average reward at the time isn't impressive?
Luckily, we can apply Hoeffding's Inequality to derive a confidence interval. Essentially, we establish the probability that the expected value of the next reward is bounded by some interval (whose half-width is also called the radius) after seeing a number of rewards from that arm.
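As a sketch of one standard route to this radius (the exact constants may differ from the original derivation): let $\bar{x}_i$ be the average of the $n_i$ rewards observed so far from arm $i$, each bounded in $[u, v]$ with $\beta = (v-u)^2$, and let $\mu_i$ be the arm's true mean reward, i.e. the expected value of its next reward. Hoeffding's Inequality bounds the chance that the true mean exceeds the observed average by more than a radius $r_i$:

$$P\big(\mu_i \ge \bar{x}_i + r_i\big) \le \exp\!\left(-\frac{2 n_i r_i^2}{\beta}\right)$$

Choosing $r_i = \sqrt{\alpha \beta \ln T / n_i}$ drives this failure probability down to $T^{-2\alpha}$, which becomes negligible as $T$ grows.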
Back to the interval: it has an upper bound and a lower bound. We're interested in the upper bound, since this tells us the potential of the next reward. Let $\bar{x}_i$ be the observed average reward of arm $i$ and $r_i$ be the radius,

$$r_i = \sqrt{\frac{\alpha \beta \ln T}{n_i}}$$

where $n_i$ is the number of times arm $i$ has been pulled, and $T$ is the total number of pulls across all arms so far.

With the above, the upper confidence bound (UCB) of the next reward is thus:

$$\mathrm{UCB}_i = \bar{x}_i + r_i$$

Having the UCB, instead of using the observed average reward $\bar{x}_i$ on its own, the ML agent judges each arm by the more optimistic value $\bar{x}_i + r_i$.
The UCB system takes a few parameters to construct the radius $r_i$:
- $\alpha$: It controls how likely a future reward is to escape the bound; it should be chosen so that this failure probability is very small. In many implementations, we set $\alpha=2$.
- $\beta$: It is a scalar related to the reward range. If rewards are within a range $[u,v]$, then $\beta = (v-u)^2$. For the reward range of $[0,1]$, $\beta=1$.
- $T$: Ideally, this quantity should be large to ensure a low probability of failure. We often set it to the number of rounds played so far, so the UCB radius keeps increasing as the rounds progress, more so for insufficiently explored arms, forcing the ML agent to pick those arms to reduce their UCB radius again (see the numeric sketch after this list).
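To make the radius concrete, here is a small numeric sketch; the round count and pull counts are made up for illustration. The arm that has been pulled only 50 times receives a much larger radius, and therefore a larger boost to its reward, than the arm pulled 800 times.

```python
import math

# Illustrative values only: alpha = 2, beta = 1 for rewards in [0, 1],
# and T = 1000 rounds played so far.
alpha, beta, T = 2, 1.0, 1000

def radius(n_pulls):
    '''UCB radius r_i = sqrt(alpha * beta * ln(T) / n_i).'''
    return math.sqrt(alpha * beta * math.log(T) / n_pulls)

for n in (50, 800):
    print(f'n_i = {n:3d}  radius = {radius(n):.3f}')
# n_i =  50  radius = 0.526
# n_i = 800  radius = 0.131
```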
Assuming our rewards are in the range $[0,1]$, we have $\beta = 1$. The implementation is shown below, where the class `UCB1` extends the `MAB` class from Chapter 1:
```python
import math

class UCB1(MAB):  # it extends class MAB from Chapter 1 to implement UCB
    def __init__(self, beta=1.0):
        '''Constructor.'''
        super().__init__()
        self.beta = beta               # (v - u)^2 for rewards in [u, v]; 1.0 for [0, 1]
        self.overall_total_count = 0   # T: total number of pulls across all arms
        self.ucb = 0                   # last computed UCB radius

    def update_reward(self, arm, reward):
        '''Use this method to update the algorithm which `arm` has been
        selected and what `reward` has been observed from the environment.'''
        if arm not in self.total_rewards:
            self.total_rewards[arm] = 0
            self.total_count[arm] = 0
        self.total_count[arm] += 1
        self.overall_total_count += 1
        # Radius r_i = sqrt(alpha * beta * ln(T) / n_i), with alpha = 2 folded into the constant.
        self.ucb = math.sqrt(2 * self.beta * math.log(self.overall_total_count)
                             / self.total_count[arm])
        ucb_reward = reward + self.ucb   # give the observed reward the benefit of the doubt
        self.total_rewards[arm] += ucb_reward
        self.average_reward[arm] = self.total_rewards[arm] / self.total_count[arm]

    def get_last_ucb(self):
        return self.ucb
```
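The class above leaves arm selection to the `MAB` base class from Chapter 1, which isn't reproduced here. The sketch below shows one way it could be driven, using the same epsilon-greedy strategy and user count reported in the statistics further down; the `MAB` stand-in, the click probabilities (only the best, 40%, matches the demo's theoretical best), and the selection loop are all assumptions made for illustration.

```python
import random

class MAB:
    '''Minimal stand-in for the Chapter 1 base class (hypothetical; the real
    class may differ). Define this, or the real MAB, before UCB1 above.'''
    def __init__(self):
        self.total_rewards = {}    # arm -> accumulated (UCB-adjusted) reward
        self.total_count = {}      # arm -> number of times the arm was pulled
        self.average_reward = {}   # arm -> UCB-adjusted average used for ranking

# Hypothetical ad environment: true click probabilities, made up for illustration.
CLICK_PROB = {'toys': 0.25, 'cars': 0.30, 'sports': 0.40, 'holidays': 0.35, 'foods': 0.28}

random.seed(0)
agent = UCB1(beta=1.0)             # the class defined above
epsilon, n_users, clicks = 0.15, 2000, 0

for _ in range(n_users):
    unseen = [ad for ad in CLICK_PROB if ad not in agent.total_count]
    if unseen or random.random() < epsilon:
        ad = random.choice(unseen or list(CLICK_PROB))                 # explore
    else:
        ad = max(agent.average_reward, key=agent.average_reward.get)   # exploit
    reward = 1 if random.random() < CLICK_PROB[ad] else 0              # user clicks or not
    clicks += reward
    agent.update_reward(ad, reward)

print(f'Click rate = {clicks / n_users:.2%}')
```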
The following shows some statistics of the learning. As can be seen, the average rewards for all arms are quite similar. Based on the environment, we know that some ads have a low theoretical click rate. But since we use the UCB as part of the reward, they are given the benefit of the doubt with a higher UCB radius, and hence their reward is artificially improved. The UCB radius is shown under the `UCB radius` column in the animation.
```
Testing UCB MAB

Ad        Average  UCB     Ad shown
type      reward   radius  to users
--------------------------------
> toys       0.55   0.31   [==] 96
> cars       0.61   0.18   [=========] 365
> sports     0.61   0.12  *[=======================] 941
> holidays   0.61   0.17   [==========] 419
> foods      0.60   0.24   [====] 174

Click rate = 0.35
Regret = 114.35

Strategy: Epsilon Greedy, epsilon = 0.15
Number of users = 2000
Number of clicks = 709
Click rate = 35.45%
Theoretical best click rate = 40.00%
```
Unlike the simple MAB, where learning can be highly influenced by short-term bias in the environment, UCB can self-correct this bias by offsetting its effect with the UCB radius. Whereas the simple MAB needs an exploration rate that is not too low to avoid being misled by that short-term bias, UCB can be operated with a low exploration rate. When all arms are sufficiently explored and all UCB radii are equally low, the best arm is revealed.
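To see this self-correction in isolation, here is a minimal sketch with two made-up Bernoulli arms. It uses the textbook UCB1 selection rule, picking the arm with the largest $\bar{x}_i + r_i$ at each round, rather than folding the radius into the stored reward as the class above does; the self-correcting behaviour is the same. Even with no epsilon exploration at all, the weaker arm keeps being revisited until its radius shrinks, after which the better arm dominates.

```python
import math
import random

random.seed(1)
means = {'A': 0.3, 'B': 0.5}            # hypothetical true reward probabilities
counts = {arm: 0 for arm in means}
sums = {arm: 0.0 for arm in means}

def ucb(arm, t):
    '''Observed average plus radius sqrt(2 * ln(t) / n); unseen arms get infinity.'''
    if counts[arm] == 0:
        return float('inf')
    return sums[arm] / counts[arm] + math.sqrt(2 * math.log(t) / counts[arm])

for t in range(1, 2001):
    arm = max(means, key=lambda a: ucb(a, t))   # greedy on the UCB, no epsilon at all
    reward = 1 if random.random() < means[arm] else 0
    counts[arm] += 1
    sums[arm] += reward

print(counts)   # both arms get tried, and 'B' ends up pulled far more often
```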
In the previous chapter, we introduced the MAB problem and demonstrated its operation using a primitive MAB. This chapter discusses the classical UCB, which aims to avoid missing potentially good arms due to short-term bias in the environment.
In both techniques, the ML agent exploits the best arm by choosing the arm with the highest average reward. This is a hard decision and can get the agent stuck at a local maximum. In the next chapter, we shall look at Boltzmann Exploration, also called softmax exploration. Rather than focusing on the best arm, the technique uses the softmax function to decide which arm to pick. We shall see how the ML agent makes decisions using the softmax function.