We are given the following REINFORCE policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \left(\sum\limits_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \right) \left(\sum\limits_{t=0}^T r(s_t, a_t)\right) \right], \tag{1}\label{gradient}$$

where $$\tau$$ denotes a trajectory of states and actions $$s_0, a_0, \ldots, s_T, a_T$$ with horizon length $$T$$, distributed according to $$p_\theta(\tau)$$; $$r(s_t, a_t)$$ is the reward function and $$\pi_\theta$$ the policy.

You might have heard that longer trajectories, i.e. a larger $$T$$, increase the variance of the policy gradient. However, you may never have seen a mathematical example of that claim. Neither had I, until I read Peters & Schaal's paper Reinforcement learning of motor skills with policy gradients.

## A Simple Example

Peters & Schaal provide an example which demonstrates the influence of the horizon $$T$$ and the return $$\sum_{t=0}^T r(s_t, a_t)$$ (which we will abbreviate as $$R(\tau)$$) on the variance of the gradient estimate.

First, note that an unbiased estimate of the above expectation is given by the sample mean, which in practice means we collect a set of trajectories $$\mathcal{D} = \{ \tau_i \}_{i=1,\ldots, N}$$ by letting the policy $$\pi_\theta$$ run in the environment, and estimate $$\nabla_\theta J(\theta)$$ with

$$\widehat{\nabla_\theta J(\theta)} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) R(\tau) \tag{2}\label{gradest} .$$
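To make the estimator in $$\eqref{gradest}$$ concrete, here is a minimal sketch for a toy, state-independent Gaussian policy $$\pi_\theta(a) = \mathcal{N}(\theta, 1)$$, for which $$\nabla_\theta \log \pi_\theta(a) = a - \theta$$. The setup (policy, reward function, all parameter values) is an illustrative assumption, not part of the original example:

```python
import numpy as np

def estimate_gradient(theta, reward_fn, T, N, rng):
    """Monte Carlo estimate of eq. (2): average over N trajectories of
    (sum of score terms) * (total return R(tau)).

    Toy setup: state-independent Gaussian policy N(theta, 1), so the
    score is grad_theta log pi(a) = a - theta.
    """
    grads = []
    for _ in range(N):
        actions = rng.normal(theta, 1.0, size=T + 1)  # a_0, ..., a_T
        score = np.sum(actions - theta)               # sum of score terms
        ret = np.sum(reward_fn(actions))              # R(tau)
        grads.append(score * ret)
    return float(np.mean(grads))

rng = np.random.default_rng(0)
# Hypothetical reward -(a - 1)^2; the true gradient here is -2(T+1)(theta - 1).
g = estimate_gradient(theta=0.5, reward_fn=lambda a: -(a - 1.0) ** 2,
                      T=5, N=10_000, rng=rng)
```

Even in this one-dimensional toy problem, a large number of trajectories is needed before the estimate settles near the true gradient, which is exactly the variance problem discussed next.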

Now consider a constant reward $$c$$ at every time step, that is

\begin{equation} r(s_t, a_t) = c,\quad \textrm{for }t = 0, \ldots, T \textrm{ and } c \in \mathbb{R}. \tag{3}\label{def:reward} \end{equation}

Then the variance of the gradient estimate in $$\eqref{gradest}$$ is

\begin{align*} \mathbb{V}\left[ \widehat{\nabla_\theta J(\theta)}\right] &= \mathbb{V}\left[\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) R(\tau) \right] \\ &= \mathbb{V}\left[\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \left(\sum\limits_{t'=0}^T r(s_{t'}, a_{t'})\right) \right]\\ &\stackrel{\eqref{def:reward}}{=} \mathbb{V}\left[\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \left(\sum\limits_{t'=0}^T c\right) \right]\\ &= \mathbb{V}\left[\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \, (T+1)c \right]\\ &= \frac{(T+1)^2 c^2}{|\mathcal{D}|^2} \mathbb{V}\left[\sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \right] \end{align*}

and it can be seen that both the horizon length $$T$$ and the reward magnitude $$c$$ enter the variance quadratically through the prefactor $$(T+1)^2 c^2$$. Note that the remaining variance term also grows with $$T$$, since it is a variance of a sum over $$T+1$$ score terms per trajectory, so the overall scaling in $$T$$ is even worse.
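The scaling can be checked numerically. The following sketch uses the same assumed toy setup as before, a state-independent Gaussian policy $$\mathcal{N}(0, 1)$$ with constant per-step reward $$c$$, so the estimator reduces to $$\frac{(T+1)c}{N} \sum_\tau \sum_t a_t$$ and its standard deviation can be measured by repeating the estimation many times:

```python
import numpy as np

def estimator_std(T, c=1.0, n_traj=10, n_repeats=2000, seed=0):
    """Empirical std of the gradient estimate for the constant-reward
    example, under an assumed N(0, 1) policy (score of a_t is just a_t)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_repeats):
        # per-trajectory score: sum of T+1 standard-normal score terms
        scores = rng.normal(size=(n_traj, T + 1)).sum(axis=1)
        estimates.append((T + 1) * c * scores.mean())
    return float(np.std(estimates))

s_short = estimator_std(T=9)   # horizon 10
s_long = estimator_std(T=19)   # horizon 20
```

Analytically, the std is $$(T+1)c\sqrt{(T+1)/N}$$, so doubling the horizon should multiply it by roughly $$2\sqrt{2} \approx 2.83$$: a factor $$2$$ from the prefactor and $$\sqrt{2}$$ from the variance of the longer score sum.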

## Conclusion

In practice, noisy gradient estimates will, in the best case, delay the convergence of the reinforcement learning process; in the worst case, they ruin it.

Therefore, several methods have been developed to address the issue of the high variance gradient estimate. From our above calculations, we can already see two such options:

1. shorter trajectories,
2. smaller rewards.

In fact, option 2 is realized by using a discount factor $$0 \leq \gamma \leq 1$$ in the return. Taking shorter trajectories, however, is not always feasible and depends heavily on the RL problem at hand. The good news is that the policy gradient can be rewritten so that, at a given time step $$t$$, only future rewards are considered when calculating the return. This discards all past reward terms and effectively shortens the return horizon. The result is sometimes referred to as the reward-to-go policy gradient (e.g. in S. Levine's Deep RL course at Berkeley):

\begin{align*} \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum\limits_{t=0}^T \nabla_\theta \log(\pi_\theta(a_t | s_t)) \sum\limits_{t'=t}^T r(s_{t'}, a_{t'}) \right]. \end{align*}

Even though this equation might look different from $$\eqref{gradient}$$, it is in fact exactly equivalent. A proof of the equivalence can be found here.

An intuitive interpretation of the rewritten gradient is that an agent should reinforce its actions only based on their consequences and, therefore, rewards before taking an action should not contribute to the evaluation of how good that action was.
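In code, the reward-to-go terms $$\sum_{t'=t}^T r(s_{t'}, a_{t'})$$ can be computed for all $$t$$ in a single backward pass. The helper below is an assumed utility, not from the post; it also accepts a discount factor $$\gamma$$, combining both variance-reduction options discussed above:

```python
def reward_to_go(rewards, gamma=1.0):
    """Return the (discounted) reward-to-go for every time step:
    rtg[t] = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed in one backward pass over the reward sequence."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(reward_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```

Note that for $$\gamma = 1$$ the first entry is the full return $$R(\tau)$$, while later entries drop the past rewards, exactly as in the rewritten gradient.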
