In practice, noisy gradient estimates will, in the best case, delay the convergence of the reinforcement learning process; in the worst case, they ruin it entirely.
Several methods have therefore been developed to reduce the high variance of the gradient estimate. From the calculations above, we can already identify two such options:
1. shorter trajectories,
2. smaller rewards.
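To get a feel for the second option before formalizing it, here is a minimal Python sketch (not part of the original derivation; the reward distribution, the horizon, and the helper `sampled_return` are made-up assumptions) that compares the spread of sampled returns with and without discounting:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_return(rewards, gamma):
    """Discounted sum of per-step rewards for a single trajectory (toy helper)."""
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * rewards)

# Simulate returns of many trajectories with made-up noisy per-step rewards.
horizon, n_trajectories = 100, 1000
returns_undiscounted = []
returns_discounted = []
for _ in range(n_trajectories):
    rewards = rng.normal(loc=1.0, scale=1.0, size=horizon)  # toy reward samples
    returns_undiscounted.append(sampled_return(rewards, gamma=1.0))
    returns_discounted.append(sampled_return(rewards, gamma=0.99))

# Discounting shrinks the later rewards and with them the spread of the return.
print(np.var(returns_undiscounted))  # roughly horizon * scale**2 = 100
print(np.var(returns_discounted))    # noticeably smaller
```

Shrinking the later rewards with \( \gamma < 1 \) reduces the spread of the sampled returns and thereby the variance of the gradient estimate.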
As a matter of fact, option 2 is realized by using a discount factor \( 0 \leq \gamma \leq 1 \) in the return. Taking shorter trajectories, on the other hand, is not always feasible and depends heavily on the RL problem at hand. The good news is that the policy gradient can be rewritten such that, for a given time step \( t \), only future rewards enter the return. This discards all past reward terms and effectively shortens the return horizon. The result is sometimes referred to as the reward-to-go policy gradient (e.g. in S. Levine's DeepRL course at Berkeley):