I'm quoting the introduction from the lecture notes of the online learning class I've been attending this semester, since it gives an intuitive idea of both *online* and *active learning* in just a few lines:

> In most classical problems, one considers settings where all data are available beforehand. But this is not always the case, and in many applications the data become available gradually - this is the online learning setting. Sometimes the learner even has an impact on how the data are collected - this is a specific case of the online learning setting, which is called active learning.

In the stochastic online learning setting we are faced with the following problem:

We are given \( K \) data sources, each of which outputs data randomly in an i.i.d. fashion. We will refer to these sources as “arms”. At each time \( t \), an agent chooses one of the arms \( k_t \in \{1, \ldots, K\} \) it wants to observe. After choosing \( k_t \), each arm produces an output

\begin{align*}
X_{1, t} &\sim \nu_1 \\
X_{2, t} &\sim \nu_2 \\
&\vdots \\
X_{K, t} &\sim \nu_K
\end{align*}

and the agent receives the reward \( X_{k_t,t} \sim \nu_{k_t} \) of the chosen arm. At the end of the game, at time \( n \), the performance of the agent is measured by the cumulative reward

$$
L_n = \sum_{t=1}^n X_{k_t,t} \tag{1}\label{reward}
$$

Our objective is to construct a strategy \( (k_t)_t \) that maximizes the expected reward \( \mathbb{E}(L_n) \).
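As a concrete illustration, here is a minimal sketch of this interaction loop in Python. The Bernoulli arms and the naive uniformly random strategy are my own choices for the example, not part of the setting above:

```python
import random

def play_bandit(mus, n, choose_arm, seed=0):
    """Simulate n rounds of the stochastic bandit game; returns L_n."""
    rng = random.Random(seed)
    history = []                     # (chosen arm, observed reward) pairs
    total = 0.0
    for _ in range(n):
        k = choose_arm(history)      # the strategy may use past observations
        x = 1.0 if rng.random() < mus[k] else 0.0   # Bernoulli(mu_k) draw
        history.append((k, x))
        total += x
    return total

mus = [0.3, 0.5, 0.8]                # hypothetical arm means
uniform = lambda history: random.randrange(len(mus))
L_n = play_bandit(mus, n=1000, choose_arm=uniform)
print(L_n)                           # roughly n * average(mus), i.e. about 533
```

A good strategy would of course use `history` to steer the choices towards the arm with the largest mean instead of ignoring it.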

We have

$$
\mathbb{E}(L_n) = \sum_{k=1}^K \mu_k \, \mathbb{E}\left[ T_{k,n}\right] \tag{2}\label{expected_reward}
$$

where

$$
\mu_k = \mathbb{E}_{\nu_k}\left[X_{k,t}\right] \tag{3}\label{avg}
$$

and \( T_{k,n} \) denotes the number of times arm \( k \) has been sampled up to time \( n \). Put intuitively: the expected reward is the mean of arm \( k \) times the expected number of times we have sampled arm \( k \), summed over all arms. The following computation shows this:

\begin{align*}
\mathbb{E}(L_n) &\overset{\eqref{reward}}{=} \mathbb{E}\left[\sum_{t=1}^n X_{k_t, t} \right] \\
&= \mathbb{E}\left[\sum_{t=1}^n \sum_{k=1}^K X_{k, t} \mathbb{1}_{\{k_t = k\}}\right] \\
&= \sum_{t=1}^n \sum_{k=1}^K \mathbb{E}\left[ X_{k, t} \mathbb{1}_{\{k_t = k\}}\right]
\end{align*}

Here \( \mathbb{1}_{\{k_t = k\}} \) denotes the indicator function. Since \( k_t \) depends only on the previous observations \( X_{\cdot, 1}, \ldots, X_{\cdot, t-1} \), it is independent of \( X_{k,t} \), so the expectation of the product factorizes into the product of the expectations:

\begin{align*}
&= \sum_{t=1}^n \sum_{k=1}^K \mathbb{E}\left[ X_{k, t}\right] \mathbb{E}\left[\mathbb{1}_{\{k_t = k\}}\right] \\
&\overset{\eqref{avg}}{=} \sum_{t=1}^n \sum_{k=1}^K \mu_{k} \mathbb{E}\left[\mathbb{1}_{\{k_t = k\}}\right] \\
&= \sum_{k=1}^K \mu_{k} \sum_{t=1}^n \mathbb{E}\left[\mathbb{1}_{\{k_t = k\}}\right] \\
&= \sum_{k=1}^K \mu_{k} \mathbb{E}\left[\color{royalblue}{\underbrace{\color{black}{ \sum_{t=1}^n \mathbb{1}_{\{k_t = k\}} }}_{=T_{k,n}}} \right] \\
&= \sum_{k=1}^K \mu_k \mathbb{E}\left[T_{k,n}\right]
\tag*{$\blacksquare$}
\end{align*}
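We can also check identity (2) empirically with a quick Monte Carlo simulation. The Bernoulli arms and the simple follow-the-leader strategy below are my own choices for illustration; the point is only that the decomposition holds for an adaptive strategy:

```python
import random

mus = [0.2, 0.5, 0.9]        # hypothetical Bernoulli arm means
n, runs = 200, 2000
rng = random.Random(1)

avg_L = 0.0                  # Monte Carlo estimate of E[L_n]
avg_T = [0.0] * len(mus)     # Monte Carlo estimates of E[T_{k,n}]
for _ in range(runs):
    counts = [0] * len(mus)
    sums = [0.0] * len(mus)
    L = 0.0
    for t in range(n):
        if t < len(mus):     # initialisation: try every arm once
            k = t
        else:                # then follow the empirical leader
            k = max(range(len(mus)), key=lambda i: sums[i] / counts[i])
        x = 1.0 if rng.random() < mus[k] else 0.0   # draw from nu_k
        counts[k] += 1
        sums[k] += x
        L += x
    avg_L += L / runs
    for k in range(len(mus)):
        avg_T[k] += counts[k] / runs

decomposition = sum(mu * T for mu, T in zip(mus, avg_T))
print(avg_L, decomposition)  # the two numbers agree up to Monte Carlo error
```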

Analogously, we can now introduce an equivalent formulation of our objective:

Find a strategy \( (k_t)_t \) that minimizes the expected regret.

That is, we compare the performance of our strategy with that of an optimal strategy which always plays the best arm; in the second step we use that the pull counts sum to the horizon, \( \sum_{k=1}^K T_{k,n} = n \):

\begin{align*}
\mathbb{E}\left[R_n\right] &= n \max_k{\mu_k} - \mathbb{E}\left[L_n\right] \\
&= \sum_{k=1}^K \mathbb{E}\left[T_{k,n} \right] \max_{k'}{\mu_{k'}} - \mathbb{E}\left[L_n\right]\\
&\overset{\eqref{expected_reward}}{=} \sum_{k=1}^K \mathbb{E}\left[T_{k,n} \right] \max_{k'}{\mu_{k'}} - \sum_{k=1}^K \mu_k \mathbb{E}\left[T_{k,n}\right] \\
&= \sum_{k=1}^K \mathbb{E}\left[T_{k,n}\right]
\color{royalblue}{\underbrace{\color{black}{
\left( \max_{k'}{\mu_{k'}} - \mu_k \right)
}}_{\Delta_k}} \\
&= \sum_{k=1}^K \mathbb{E}\left[T_{k,n}\right] \Delta_k
\end{align*}

where \( \color{royalblue}{\Delta_k}\) denotes the gap between the mean of the optimal arm and the mean of arm \( k \).
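As a small worked example of the gap decomposition, take some hypothetical arm means and made-up expected pull counts (neither comes from any particular strategy):

```python
# Expected regret via the gap decomposition: E[R_n] = sum_k Delta_k * E[T_{k,n}]
mus = [0.2, 0.5, 0.9]                  # hypothetical arm means
deltas = [max(mus) - m for m in mus]   # gaps Delta_k: 0.7, 0.4, 0.0

# made-up expected pull counts E[T_{k,n}] for some strategy with horizon n = 100
exp_T = [30.0, 30.0, 40.0]
assert sum(exp_T) == 100.0             # the counts must sum to the horizon n

regret = sum(d * T for d, T in zip(deltas, exp_T))
print(regret)  # 0.7*30 + 0.4*30 + 0.0*40 = 33.0 (up to float rounding)
```

Note that the optimal arm contributes nothing to the regret (its gap is zero); only the pulls of suboptimal arms are penalized, each weighted by its gap.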

In the next posts we will propose some algorithms and analyse their performance via upper bounds on the regret, using Hoeffding's inequality and union bounds.
