This is a cross-post of a proof I posted on ai.stackexchange.com. For an introduction, have a look at:

https://ai.stackexchange.com/a/8086/18040

We want to show:

$$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0$$

## Proof

Using the law of iterated expectations one has:

$$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$$

Writing the expectations as integrals and moving the gradient inside (linearity; note that $$p(s_t)$$ is treated here as not depending on $$\theta$$), you get

$$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta \left( b(s_t) \pi_\theta(a_t | s_t) \right) da_t \right)ds_t =$$

Moving $$\nabla_\theta$$ (due to linearity) and $$b(s_t)$$ (which does not depend on $$a_t$$) from the inner integral to the outer one:

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t=$$

$$\pi_\theta(a_t | s_t)$$ is a (conditional) probability density function, so its integral over all $$a_t$$ for a given fixed state $$s_t$$ equals $$1$$:

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 \, ds_t = 0$$

The last equality holds because $$\nabla_\theta 1 = 0$$, which concludes the proof.
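As a quick numerical sanity check of the key step, here is a minimal sketch for the discrete-action case, where the inner integral becomes a sum over actions. The linear-softmax policy, the parameter shapes, and the function names are my own assumptions for illustration and are not part of the proof. The script estimates the gradient of $$\sum_{a} b(s) \pi_\theta(a | s)$$ with respect to $$\theta$$ by central finite differences, which should vanish up to rounding error.

```python
import math
import random


def softmax_policy(theta, state):
    """pi_theta(. | state) for a hypothetical linear-softmax policy.

    theta holds one weight row per action; state is a feature vector.
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inner_expectation(theta, state, baseline):
    """E_{a ~ pi_theta(.|s)}[b(s)] = sum_a b(s) * pi_theta(a|s)."""
    return sum(baseline * p for p in softmax_policy(theta, state))


def finite_difference_gradient(theta, state, baseline, eps=1e-5):
    """Central finite-difference gradient of inner_expectation w.r.t. theta."""
    grad = [[0.0] * len(row) for row in theta]
    for k, row in enumerate(theta):
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            plus = inner_expectation(theta, state, baseline)
            row[j] = original - eps
            minus = inner_expectation(theta, state, baseline)
            row[j] = original  # restore the parameter
            grad[k][j] = (plus - minus) / (2 * eps)
    return grad


random.seed(0)
theta = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # 4 actions, 3 features
state = [random.gauss(0, 1) for _ in range(3)]
baseline = 2.5  # arbitrary value of b(s) for this fixed state

grad = finite_difference_gradient(theta, state, baseline)
max_abs_grad = max(abs(g) for row in grad for g in row)
print(max_abs_grad)  # expected to be at the level of floating-point noise
```

The check only covers a single fixed state, which mirrors the inner integral of the proof; the outer integral over $$s_t$$ and the sum over $$t$$ then just add up zeros.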
