Active Learning and Thompson Sampling

Active learning is an interesting field in machine learning where the data is actively collected rather than passively observed.  That is, the learning algorithm has some influence over the data that is collected and used for learning.  The best example of this is Google AdWords, which uses active learning to serve ads to people in order to maximize click-through rate, profit, or some other objective function.  In this setting, the learning algorithm used by Google can choose which Ad to show me then it observes whether I clicked it or ignored it.  The learning algorithm will then update it's beliefs about the world based on this feedback so it can serve more relevant ads in the future.

Since the learning algorithm doesn't know which types of ads are most relevant a priori, it must explore the space of ads to determine which ones produce the highest click-through rate.  However, it also needs to exploit it's current knowledge to maximize click-through rate.  Thus, a good learning algorithm must balance the need for exploration and exploitation, which is one of the central problems of active learning.  Another dimension of this challenge is generalization: how can feedback about one individual or advertisement provide information about other individuals and advertisements?  While I won't be addressing the problem here, it is something that needs careful consideration when tackling real-world active learning problems.

There have been many heuristics proposed in the literature to balance exploration and exploitation, such as the epsilon-greedy strategy and the upper confidence bound strategy.  In this blog post, I'm going to introduce Thompson sampling, which is a particularly elegant application of Bayesian statistics that effectively balances exploration and exploitation.

Consider the n-armed bandit problem: there are n slot machines, and when the arm of a particular slot machine is pulled, a random real-valued reward is observed.  In general, each slot machine might have a different reward distribution, but we'll assume that the distribution is stationary (it doesn't change over time).  The agent wants to choose the slot machine as to maximize her long-term expected return.

In Thompson sampling, we assume that the reward distribution for arm $a$ is parameterized by an unknown parameter (or a set of parameters) $\theta_a$.  Together, these paremeters form a parameter vector $\theta$, and that we have a prior belief about possible values of $\theta$ in the form of a probability distribution $P(\theta)$.  As the agent interacts with the environment, she observes rewards for the actions she chooses: $D = (a_1, r_1), (a_2, r_2), \dots$.  She uses these observations to govern her next action.   Specifically, she updates her beliefs about $\theta$ by calculating the posterior distribution $P (\theta | D) = P(D | \theta) P (\theta)$.  In Bayesian statistics, we treat unknowns (such as $\theta$) as random variables, and $P(\theta | D)$ captures our beliefs about how likely each value of $\theta$ is based on how well it explains the evidence $D$ (combined with our prior beliefs $P(\theta)$.)  As more data comes in , her beliefs about $\theta$ strengthen.

If we knew the true values of the parameters $\theta$, we would exploit that information by always pulling the arm that maximizes the expected reward.  However, since the parameters are unknown, we have to balance exploration with exploitation instead, and Thompson sampling does that.

Thompson sampling is a randomized algorithm that chooses an action based on the evidence $D$.
The key idea of Thompson sampling is to choose action $a$ with probability equal to the probability that that action is optimal (we don't know which action is optimal and we are treating $\theta$ as a random variable).  An action is optimal if it maximizes the expected reward.  Formally, we select action $a$ with probability
$$\int \mathbb{1} [ a = argmax_{a'} \mathbb{E}(r | a', \theta)] P(\theta | D) \, d \theta$$ '
where $\mathbb{1} [ \cdot ]$ is the indicator function (1 if the argument is true, 0 otherwise) and the integral is over all possible parameter vectors.  In general, this integral is difficult to evaluate, so sampling from this distribution directly would also be difficult.  However, we can still sample from the distribution indirectly if it is easy to sample from the posterior distribution $P (\theta | D)$.  To do that, we sample $\hat{\theta} \sim P(\theta | D)$ then we compute the optimal action assuming $\hat{\theta}$ are the true parameters by finding the arm that maximizes the expected reward.  It is not immediately obvious that the action chosen by this procedure comes from the same distribution as the originally specified distribution, so you should think about it carefully and make sure you understand why that is the case.

To see Thompson sampling in action for the n-armed bandit problem, I created a nice animation.  In this environment, there are 4 slot machines that produce random rewards according to a Bernoilli distribution ($1$ with probability $p$, $0$ otherwise), with parameters $0.2, 0.4, 0.6, 0.8$.  These parameters are unknown to the agent, and as such Thompson sampling assumes that each parameter value is equally likely: $\theta_a \sim U[0,1]$.   The agent interacts with the environment, she makes observations and updates her beliefs based on the evidence.  If she pulled arm $a$ $n$ times and observed $k$ successes, then she would compute the posterior distribution for $\theta_a$ as follows:
$$P(\theta_a | D) = \frac{P(D | \theta_a) P(\theta_a)}{P(D)}$$
The first component, $P(D | \theta_a)$ is the probability of observing $k$ successes out of $n$ trials with success probability $\theta_a$.  This is just the binomial distribution $P(D | \theta_a) = \binom{n}{k} \theta_a^k (1-\theta_a)^{n-k}$.  The second component, $P(\theta_a)$ is the prior distribution over $\theta_a$, which we assumed to be uniform, so $P(\theta_a) = 1$ for $0 \leq \theta_a \leq 1$.  The final component, $P(D)$ is the probability of observing the data, which we can compute using the law of total probability:
$$P(D) = \int_0^1 P(D | \theta_a^{'}) P(\theta_a^{'}) d \theta_a^{'} = \int_0^1 \binom{n}{k} \theta_{a}^{'k} (1-\theta_a^{'})^{n-k} 1 d \theta_a^{'}$$
Putting it all together, we find that
$$P(\theta_a | D) = \frac{\theta_a^k (1-\theta_a)^{n-k}}{\int_0^1 \theta_a^{' k} (1-\theta_a^{'})^{n-k} d \theta_a^{'}}$$
This happens to be the beta distribution, so $P(\theta_a | D) = Beta(k+1, n-k+1)$.  In general, the beta distribution has a peak at $\theta_a = \frac{k}{n}$, which is also intuitively the most likely value of $\theta_a$.  If $0 < k < n$, then the distribution is also $0$ at $\theta_a = 0$ and $\theta_a = 1$ because it would have been impossible to observe both successes and failures if the parameter to the Bernoulli we're $0$ or $1$.  Additionally, as $n$ becomes larger, the width of the distribution shrinks and the height increases, which intuitively follows from the fact that we have more evidence to justify our beliefs about $\theta_a$.

The animation below shows the algorithm in action on this problem.  Each curve is the posterior distribution $P(\theta_a | D)$ for arm $a$ , and the distributions change with $D$.  At each time step, there is one point on the curve that is explicitly plotted - that point is the value randomly sampled from the distribution.  The agent then chooses the arm with the highest expected reward (assuming the sampled parameters were the true parameters).  In this case the arm associated with the largest sample is obviously the best arm to choose.  After selecting an action, the agent observes a reward, either 0 or 1, but it is not explicitly shown in this figure.  It can be inferred based looking at the direction that the posterior shifts: a shift to the right indicates a reward of $1$ and a shift to the left indicates a reward of $0$.

Thompson sampling is a cool algorithm that is very effective at managing the exploration-exploitation trade off and finding the optimal arm very quickly.  Notice in the animation that the first arm ($\theta_1 = 0.2$) is only pulled a few times compared to the other arms.  Thus, the range of reasonable values of $\theta_1$ according to the posterior distribution is large compared to the other arms, but that arm is still not pulled because the probability that it is the best arm is very small.  On the other hand, the arms that have a better chance of being optimal are pulled more often.  Eventually, the best arm will be pulled almost every time (when the width of the posteriors get sufficiently small), although the animation cuts of before that behavior can be observed.

Thompson sampling is conceptually very simple and it's a beautiful application of Bayesian reasoning in my opinion.  If you would like to view the source code for that animation, I've made it available here.