This is just a cross-post of a proof I posted elsewhere. For an introduction, have a look at:

We want to show that the baseline term which appears when deriving the policy gradient vanishes in expectation:

$$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0$$

Here \(p(s_t, a_t)\) is the state–action distribution induced by the current policy \(\pi_\theta\), and the gradient is understood to act only through the policy's dependence on \(\theta\); the state distribution \(p(s_t)\) is held fixed.


Using the law of iterated expectations one has:

$$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$$

Writing the expectations as integrals and moving the gradient inside (it acts only on the inner integral, since \(p(s_t)\) is held fixed, and differentiation is linear), we get

$$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta \left[ b(s_t)\, \pi_\theta(a_t | s_t) \right] da_t \right)ds_t =$$

Pulling \(b(s_t)\) out of the inner integral (it depends on neither \(a_t\) nor \(\theta\)) and exchanging \(\nabla_\theta\) with the inner integral (again by linearity, assuming differentiation and integration may be swapped):

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t= $$

Since \(\pi_\theta(a_t | s_t)\) is a (conditional) probability density function, it integrates to \(1\) over all \(a_t\) for any fixed state \(s_t\):

$$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 ds_t = $$

Now \(\nabla_\theta 1 = 0\), which concludes the proof.
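As a quick numerical sanity check (not part of the original proof), the inner term \(\int_{a_t} b(s_t)\,\nabla_\theta \pi_\theta(a_t | s_t)\, da_t\) can be evaluated exactly for a softmax policy over a handful of discrete actions. All concrete values below (number of actions, logits, baseline value) are made up for illustration:

```python
import numpy as np

# Toy setup (assumed, not from the post): a softmax policy over 4 discrete
# actions with logits theta, for one fixed state s.
# For a softmax, grad_theta pi(a) = pi(a) * (onehot(a) - pi).
rng = np.random.default_rng(0)
theta = rng.normal(size=4)                     # made-up policy logits
pi = np.exp(theta) / np.exp(theta).sum()       # action probabilities
b = 3.7                                        # made-up baseline value b(s)

# Exact inner term: sum_a b * grad_theta pi(a)  (the integral over a_t)
grad_pi = pi[:, None] * (np.eye(4) - pi)       # row a holds grad_theta pi(a)
inner = b * grad_pi.sum(axis=0)
print(np.abs(inner).max())                     # ~0 up to float rounding

# Monte Carlo version of the same quantity, E_a[b * grad_theta log pi(a)]
actions = rng.choice(4, size=200_000, p=pi)
score = np.eye(4)[actions] - pi                # grad_theta log pi(a) per sample
mc = (b * score).mean(axis=0)
print(np.abs(mc).max())                        # small; shrinks with more samples
```

The exact sum is zero because \(\sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0\), which is precisely the step in the proof above; the Monte Carlo estimate only approaches zero at the usual \(1/\sqrt{N}\) rate, which is why baselines reduce variance but not bias.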
