This is actually just a cross-post from a proof I posted on ai.stackexchange.com. For an introduction have a look at:
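We want to show that a baseline \(b(s_t)\) depending only on the state contributes nothing in expectation:
\[
\mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t | s_t)\, b(s_t)\right] = 0
\]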
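Using the law of iterated expectations one has:
\[
\mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t | s_t)\, b(s_t)\right] = \mathbb{E}_{s_t}\left[\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)}\left[\nabla_\theta \log \pi_\theta(a_t | s_t)\, b(s_t) \mid s_t\right]\right]
\]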
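Written with integrals, where \(p(s_t)\) denotes the density of \(s_t\), and using \(\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta\), you get
\[
\int p(s_t) \int \pi_\theta(a_t | s_t)\, \nabla_\theta \log \pi_\theta(a_t | s_t)\, b(s_t)\, da_t\, ds_t = \int p(s_t) \int \nabla_\theta \pi_\theta(a_t | s_t)\, b(s_t)\, da_t\, ds_t
\]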
Moving \(\nabla_\theta\) (by linearity) and \(b(s_t)\) (which does not depend on \(a_t\)) from the inner integral to the outer one:
\[
\int p(s_t)\, b(s_t)\, \nabla_\theta \int \pi_\theta(a_t | s_t)\, da_t\, ds_t
\]
\(\pi_\theta(a_t | s_t)\) is a (conditional) probability density function, so integrating over all \(a_t\) for a given fixed state \(s_t\) equals \(1\):
\[
\int p(s_t)\, b(s_t)\, \nabla_\theta 1\, ds_t
\]
Now \(\nabla_\theta 1 = 0\), so the integrand vanishes and the whole expression equals \(0\), which concludes the proof.
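As a quick sanity check, the inner step of the argument can be verified numerically for a discrete action space, where the integral over \(a_t\) becomes a sum. The sketch below is only an illustration under assumptions that are not part of the proof: a softmax policy whose parameters \(\theta\) are the logits themselves, a fixed state, and an arbitrary constant standing in for \(b(s_t)\). The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_term(logits, baseline):
    """Compute sum_a pi(a) * grad_theta log pi(a) * baseline.

    Assumes a softmax policy parameterised directly by its logits,
    for which d/d theta_k log pi(a) = 1[k == a] - pi(k).
    """
    pi = softmax(logits)
    n = len(pi)
    term = [0.0] * n
    for a in range(n):
        for k in range(n):
            grad_log = (1.0 if k == a else 0.0) - pi[k]
            term[k] += pi[a] * grad_log * baseline
    return term


# Arbitrary logits and baseline value, chosen only for illustration.
term = baseline_term([0.3, -1.2, 2.0, 0.5], baseline=3.7)
print(term)
```

Every component of the printed vector should be zero up to floating-point rounding, whatever baseline value is used, because the probabilities sum to \(1\) for any \(\theta\).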