Stochastic Online Learning: Reward and Regret

Introduction

I'm citing the introduction given in my lecture notes of the online learning class I've been attending this semester, since it gives an intuitive idea of both online and active learning in only a few lines:

In most classical problems, one considers settings where all data are available before hand. But this is not always the case and in many applications, the data become available gradually - this is the online learning setting. Sometimes, the learner does even have an impact on how the data is collected - this is a speciﬁc case of the online learning setting, which is called active learning.

Prof. Dr. Alexandra Carpentier

The problem

In the stochastic online learning setting we are faced with the following problem:

We are given K data sources where each source outputs data randomly in an i.i.d. fashion. We will refer to these sources as “arms”. At each time $ t $, an agent chooses one of the arms $ k_t \in \{1, \ldots, K\} $ it wants to observe. After choosing $ k_t $ the agent observes the output of all arms

\begin{align*} X_{1, t} &\sim \nu_1 \\ X_{2, t} &\sim \nu_2 \\ &\vdots \\ X_{k, t} &\sim \nu_k \\ \end{align*}

and receives the reward $ X_{k_t,t} \sim \nu_{k_t} $. At the end of the game at time $ n $ the performance of the agent is measured by the cumulative reward

$$ L_n = \sum_t X_{k_t,t} \tag{1}\label{reward} $$

Our objective is to construct an algorithm $ (k_t)_t $ that maximizes the expected gain $ \mathbb{ E }(L_n) $.

Lemma

We have

$$ \mathbb{ E }(L_n) = \sum_\limits{k=1}^K \mu_k \mathbb{ E }\left[ T_{k,n}\right] \tag{2}\label{expected_reward} $$

where

$$ \mu_k = \mathbb{ E }_{\nu_k}(X_k,t) \tag{3}\label{avg} $$

and $ T_{k,n}$ denotes the number of times arm k has been sampled at time n. Intuitively spoken: the expected reward is the mean of arm k times the average number of time we have sampled arm k, summed up over all possible arms.

Proof

\begin{align*} \mathbb{ E }(L_n) &\overset{\eqref{reward}}{=} \mathbb{ E }\left[\sum_\limits{t=1}^n X_{k_t, t} \right] \\ &= \mathbb{ E }\left[\sum_\limits{t=1}^n \sum_\limits{k=1}^K X_{k, t} \mathbb{ 1}_{\{k_t = k\}}\right] \\ &=\sum_\limits{t=1}^n \sum_\limits{k=1}^K \mathbb{ E }\left[ X_{k, t} \mathbb{ 1}_{\{k_t = k\}}\right] \\ \end{align*}

$ \mathbb{ 1}_{\{k_t = k\}} $ denotes the Indicator function here. Since $ k_t $ depends only on (previous) $ X_{\cdot, 1}, \ldots, X_{\cdot, t-1}$ the expectation of the product can be written as the product of expectations

\begin{align*} &=\sum_\limits{t=1}^n \sum_\limits{k=1}^K \mathbb{ E }\left[ X_{k, t}\right] \mathbb{ E }\left[\mathbb{ 1}_{\{k_t = k\}}\right] \\ &\overset{\eqref{avg}}{=}\sum_\limits{t=1}^n \sum_\limits{k=1}^K \mu_{k} \mathbb{ E }\left[\mathbb{ 1}_{\{k_t = k\}}\right] \\ &= \sum_\limits{k=1}^K \mu_{k} \sum_\limits{t=1}^n \mathbb{ E }\left[\mathbb{ 1}_{\{k_t = k\}}\right] \\ &= \sum_\limits{k=1}^K \mu_{k} \mathbb{ E }\left[\color{royalblue}{\underbrace{\color{black}{ \sum_\limits{t=1}^n \mathbb{ 1}_{\{k_t = k\}} }}_{=T_{k,n}}} \right] \\ &= \sum_\limits{k=1}^K \mu_k \mathbb{ E }\left[T_{k,n}\right] \tag*{$\blacksquare$} \end{align*}

New formulation of the objective: expected regret

Similar to above, we can now introduce a new formulation of our objective:

Find a strategy $ (k_t)_t $ that minimizes the expected regret.

That is, we compare the performance of our strategy with that of an optimal strategy which always plays the best arm:

\begin{align*} \mathbb{E}\left[R_n\right] &= N \max_k{\mu_k} - \mathbb{ E }\left[L_n\right] \\ &= \sum_\limits{k=1}^K \mathbb{ E }\left[T_{k,n} \right] \max_k{\mu_k} - \mathbb{ E }\left[L_n\right]\\ &\overset{\eqref{expected_reward}}{=} \sum_\limits{k=1}^K \mathbb{ E }\left[T_{k,n} \right] \max_k{\mu_k} - \sum_\limits{k=1}^K \mu_k \mathbb{ E }\left[T_{k,n}\right] \\ &= \sum_\limits{k=1}^K \mathbb{ E }\left[T_{k,n}\right] \color{royalblue}{\underbrace{\color{black}{ \left( \max_{k'}{\mu_{k'}} - \mu_k \right) }}_{\Delta_k}} \\ &= \sum_\limits{k=1}^K \mathbb{ E }\left[T_{k,n}\right] \Delta_k \end{align*}

where $ \color{royalblue}{\Delta_k}$ denotes the deviation of the expectation of arm $ k$ from the expectation of the optimal arm.

Over the next posts we will propose some algorithms and analyse their performance via upper bounds, using Hoeffding's inequality and union bounds.

Getagged mit:
English Active Learning

Reward and Regret in the Stochastic Online Learning Setting

A mathematical formulation

Introduction

The problem

Lemma

Proof

New formulation of the objective: expected regret