For deriving upper bounds, both in the stochastic online learning setting and in multi-armed bandit problems, we will use two classical tools: Hoeffding's inequality and the union bound.

In this post we will prove a corollary that allows us to control the deviation of the empirical mean from the true mean simultaneously for all \( t \leq n \) and all \( K \) sources.

That is, for each \( k \leq K \) we will look at i.i.d. random variables \( X_{k,1}, \ldots, X_{k,n} \) and bound the probability of the following event

$$
\bigg\{ \bigcap\limits_{k \leq K} \bigcap\limits_{t \leq n} \Big| \frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| \leq c \bigg\} \tag{1}\label{intro}
$$

for a threshold \( c \) (which will depend on \( t \)), to be specified at the end of this post.

Let \( X_1, \ldots, X_n \) be i.i.d. random variables such that their distribution has support in \( [0,1] \), and mean \( \mu \). It holds that

$$
\mathbb{ P }\bigg( \Big| \frac{1}{n} \sum\limits_{i=1}^n X_i - \mu \Big| > \epsilon \bigg) \leq 2\exp\left(-2n\epsilon^2\right)
$$

Equivalently, it holds with probability at least \( 1-\delta \) that

$$
\Big| \frac{1}{n} \sum\limits_{i=1}^n X_i - \mu \Big| \leq \sqrt{\frac{\log(2/\delta)}{2n}}.
$$
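As a quick sanity check, the following sketch estimates how often the empirical mean actually leaves the Hoeffding band. It assumes NumPy and Bernoulli samples with mean \( \mu = 0.5 \); the sample sizes and \( \delta \) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 100, 0.05, 20_000
# Hoeffding radius: with probability >= 1 - delta the empirical
# mean stays within this distance of the true mean.
bound = np.sqrt(np.log(2 / delta) / (2 * n))

# Bernoulli(0.5) samples: support in [0, 1], true mean mu = 0.5
samples = rng.random((trials, n)) < 0.5
deviations = np.abs(samples.mean(axis=1) - 0.5)

# Fraction of trials where the empirical mean leaves the band;
# by Hoeffding's inequality this should be at most delta.
violation_rate = (deviations > bound).mean()
print(violation_rate, "<=", delta)
```

Since Hoeffding's inequality is not tight for any particular distribution, the observed violation rate is typically well below \( \delta \).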

Let \( \xi_1, \ldots, \xi_n \) be events with probabilities \( 1 - \delta_1, \ldots, 1 - \delta_n \), respectively. It holds that

$$
\mathbb{ P }\left(\bigcap\limits_{i}\xi_i \right) \geq 1 - \sum\limits_i \delta_i.
$$

A proof of Hoeffding's inequality can be found, e.g., on Wikipedia. The union bound can be proven by passing to the complement of the event and using the sub-additivity of the probability measure.
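Spelled out, the complement argument is a one-liner: writing \( \xi_i^c \) for the complement of \( \xi_i \), so that \( \mathbb{ P }(\xi_i^c) \leq \delta_i \), we get

$$
\mathbb{ P }\bigg(\bigcap\limits_{i}\xi_i\bigg) = 1 - \mathbb{ P }\bigg(\bigcup\limits_{i}\xi_i^c\bigg) \geq 1 - \sum\limits_i \mathbb{ P }(\xi_i^c) \geq 1 - \sum\limits_i \delta_i,
$$

where the first inequality is exactly the sub-additivity of \( \mathbb{ P } \).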

Hoeffding's inequality gives us an upper bound on the probability that the empirical mean deviates from the expected value by more than a certain amount. Note that this holds for an arbitrary but fixed \( n \). The following corollary provides an upper bound that holds uniformly over **all** \( t \leq n \).

For each \( k \leq K \), let \( X_{k,1}, \ldots, X_{k,n} \) be i.i.d. random variables whose distribution \( \nu_k \) has support in \( [0,1] \) and mean \( \mu_k \). It holds that

$$\mathbb{ P }\bigg(\exists k \leq K, \exists t\leq n, \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \frac{\epsilon}{\sqrt{t}}\bigg) \leq 2nK \exp\left(-2\epsilon^2\right)$$

By Hoeffding's inequality, applied with deviation \( \epsilon/\sqrt{t} \), we know that for any fixed \( k \leq K \) and \( t \leq n \)

\begin{equation}
\mathbb{ P }\bigg(\Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \frac{\epsilon}{\sqrt{t}} \bigg) \leq 2 \exp\left(-2t\Big(\frac{\epsilon}{\sqrt{t}}\Big)^2\right) = 2 \exp\left(-2\epsilon^2\right) \tag{2}\label{bound}
\end{equation}

With the sub-additivity of the probability measure we get

\begin{align*}
&\mathbb{ P }\bigg(\exists k \leq K, \exists t\leq n, \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \frac{\epsilon}{\sqrt{t}} \bigg) \\
&=
\mathbb{ P }\bigg(\bigcup\limits_{k \leq K} \bigcup\limits_{t \leq n} \bigg\{ \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \frac{\epsilon}{\sqrt{t}} \bigg\} \bigg) \\
&\leq \sum\limits_{k\leq K}\sum\limits_{t\leq n} \mathbb{ P }\bigg( \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \frac{\epsilon}{\sqrt{t}} \bigg) \\
&\overset{\eqref{bound}}{\leq} \sum\limits_{k\leq K}\sum\limits_{t\leq n} 2 \exp\left(-2 \epsilon^2 \right)\\
&= 2nK \exp\left(-2 \epsilon^2 \right)
\tag*{$\blacksquare$}
\end{align*}

Let \( \epsilon = \sqrt{\frac{\log(2nK/\delta)}{2}} \), so that \( 2nK \exp(-2\epsilon^2) = 2nK \cdot \frac{\delta}{2nK} = \delta \). It follows that

$$
\mathbb{ P }\bigg(\exists k \leq K, \exists t\leq n, \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| > \sqrt{\frac{\log(2nK/\delta)}{2t}} \bigg) \leq \delta.
$$

So for the complement we get

$$
\mathbb{ P }\bigg(\color{royalblue}{\forall k \leq K}, \color{royalblue}{\forall t\leq n}, \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| \color{royalblue}{\leq} \sqrt{\frac{\log(2nK/\delta)}{2t}} \bigg) \color{royalblue}{\geq 1- \delta} .
$$
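This uniform guarantee can also be checked empirically. The sketch below simulates \( K \) Bernoulli sources with hypothetical means (NumPy assumed, all parameter values chosen for illustration) and counts how often *any* source at *any* time \( t \leq n \) leaves its confidence band; by the corollary this fraction should be at most \( \delta \):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, delta, trials = 3, 50, 0.05, 5_000
mus = np.array([0.2, 0.5, 0.8])  # hypothetical source means

t = np.arange(1, n + 1)
# Per-t confidence radius sqrt(log(2nK/delta) / (2t))
radius = np.sqrt(np.log(2 * n * K / delta) / (2 * t))

failures = 0
for _ in range(trials):
    # Bernoulli rewards, one row per source, shape (K, n)
    X = rng.random((K, n)) < mus[:, None]
    running_means = np.cumsum(X, axis=1) / t
    deviations = np.abs(running_means - mus[:, None])
    # The good event fails if ANY source at ANY time leaves its band.
    if (deviations > radius).any():
        failures += 1

failure_rate = failures / trials
print(failure_rate, "<=", delta)
```

Because the union bound ignores the strong correlation between the running means at nearby times, the observed failure rate is typically far below \( \delta \).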

Using the set notation once more, we get

$$
\mathbb{ P }\bigg(\color{royalblue}{\bigcap\limits_{k \leq K}} \color{royalblue}{\bigcap\limits_{t \leq n}} \Big|\frac{1}{t} \sum\limits_{i\leq t} X_{k,i} - \mu_k \Big| \leq \frac{\epsilon}{\sqrt{t}} \bigg) \geq 1- \delta
$$

and we finally have the bound on the probability of event \( \eqref{intro} \) from the beginning of this post, with \( c = \sqrt{\frac{\log(2nK/\delta)}{2t}} \).

Over the next few posts we will use the above technique to prove upper bounds for the *Follow the Leader* and *UCB* algorithms.
