Policy gradient is still unbiased after subtracting a state-dependent baseline

This is a cross-post of a proof I posted on ai.stackexchange.com. For an introduction, have a look at:

https://ai.stackexchange.com/a/8086/18040

We want to show:

$$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0$$

Proof

Using the law of iterated expectations one has:

$$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$$

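writing the expectations as integrals and moving the gradient inside (by linearity, and treating the state distribution $p(s_t)$ as independent of $\theta$ in this step), you get

$$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta \big( b(s_t) \pi_\theta(a_t | s_t) \big) da_t \right)ds_t =$$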

pulling $\nabla_\theta$ (by linearity, assuming differentiation and integration can be interchanged) and $b(s_t)$ (which does not depend on $a_t$) out of the inner integral:

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t= $$

$\pi_\theta(a_t | s_t)$ is a (conditional) probability density function, so integrating over all $a_t$ for a given fixed state $s_t$ equals $1$:

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 \, ds_t$$

Since $\nabla_\theta 1 = 0$, every term of the sum vanishes, which concludes the proof.
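The key step, $\nabla_\theta \int_{a_t} \pi_\theta(a_t | s_t) da_t = 0$, can be checked numerically. The following is a minimal sketch, not part of the original proof. It assumes a finite action set and a softmax policy whose parameters are the logits of one fixed state; the names `softmax`, `grad_pi`, `baseline_gradient`, `theta` and `b_s` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Softmax policy pi_theta(a | s) over a finite action set (assumed form)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_pi(logits):
    """Jacobian J[a][k] = d pi(a) / d theta_k of the softmax policy."""
    pi = softmax(logits)
    n = len(pi)
    return [
        [pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for k in range(n)]
        for a in range(n)
    ]


def baseline_gradient(logits, b):
    """b(s) * grad_theta sum_a pi_theta(a | s), one entry per parameter.

    Because pi * grad(log pi) = grad(pi), this is also the expectation of
    grad(log pi) * b(s) under the policy for the fixed state s.
    """
    jac = grad_pi(logits)
    n = len(logits)
    return [b * sum(jac[a][k] for a in range(n)) for k in range(n)]


theta = [0.3, -1.2, 2.0, 0.5]  # hypothetical logits for one fixed state
b_s = 7.5                      # hypothetical baseline value b(s)

grad = baseline_gradient(theta, b_s)
print(grad)
```

Each printed entry should be zero up to floating-point round-off, whatever value is chosen for `b_s`, because the action probabilities sum to $1$ for every `theta`.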
