TLDR: Some quick notes about policy gradient reinforcement learning algorithms.
Setup/Notation
We usually think of reinforcement learning in terms of a Markov decision process: our world consists of a set of states \(s \in S\) and possible actions \(a \in A\). At each time step, our agent observes the current state \(s_t\) and then uses some policy \(\pi(a |s)\) to choose an action \(a_t\). The world then updates according to some (usually unknown) transition probability \(\Pr[s_{t+1} | s_t, a_t]\) to a new state, and we receive a reward \(\gamma^t R(s_t, a_t, s_{t+1})\), where \(0 < \gamma \le 1\) is some discount factor.
Let \(\tau\) denote some arbitrary sequence of \((s_t, a_t)\) pairs. Then we say our objective is to maximize our expected reward:
\[J_\pi = \mathbb{E}_{\tau} \big[ R(\tau) \big] \]
Where we use \(R(\tau)\) to denote our total reward \(\sum_t \gamma^t R_t\), using \(R_t\) as shorthand for \(R(s_t, a_t, s_{t+1})\). Our expectation is over the distribution over trajectories induced by our current policy (and the unknown but fixed environmental state update rule).
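To make the notation concrete, here is a tiny Python sketch (with a made-up list of rewards) computing the discounted return \(R(\tau)\) for a single trajectory:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R(tau) = sum_t gamma^t * R_t for one trajectory's list of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Made-up three-step trajectory with rewards 1, 0, 2:
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0 + 0.81 * 2 = 2.62
```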
Policy Gradient
Policy gradient algorithms are an approach that tries to solve this problem by directly optimizing over our policy \( \pi\). Suppose we parametrize our policy with some parameters \(\theta\) to get \(\pi(a | s, \theta)\). Then, we can calculate the gradient of our expected reward as
\[\begin{aligned} \nabla_\theta J_\pi &= \nabla_\theta \mathbb{E}_\tau \big[ R(\tau) \big] \\ &= \int R(\tau) \nabla_\theta p(\tau) d\tau = \int R(\tau) p(\tau) \nabla_\theta \ln p(\tau) d\tau \\ &= \mathbb{E}_{\tau} \big[ R(\tau) \nabla_\theta \ln p(\tau) \big] \\ &= \mathbb{E}_{\tau} \big[ R(\tau) \nabla_\theta \sum_t [\ln \pi(a_t | s_t, \theta) + \ln \Pr(s_{t+1} | s_t, a_t)] \big] \\ &= \mathbb{E}_{\tau} \big[ R(\tau) \sum_t \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \\ \end{aligned} \]
Intuitively, this means we should update our parameters according to the gradient of the log-probability of all of the actions our policy selects, weighted by the reward we see. In other words, if we have a trajectory that gives us a large reward, we should upweight the probability of generating that trajectory.
This formulation is nice because we can estimate this “policy gradient” easily by a simple Monte Carlo procedure, where we just draw multiple sample trajectories (i.e. run our agent through the environment/simulation) and look at the empirical mean of \(R(\tau) \sum_t \nabla_\theta \ln \pi(a_t | s_t, \theta)\).
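As a rough sketch of what this looks like in code (using PyTorch on a made-up 2-state, 2-action toy MDP, where the chosen action picks the next state and landing in state 1 pays reward 1), the Monte Carlo estimate is just an average of \(R(\tau) \sum_t \nabla_\theta \ln \pi(a_t | s_t, \theta)\) over sampled trajectories:

```python
import torch

# Toy setup (made up for illustration): 2 states, 2 actions, tabular softmax policy.
n_states, n_actions, gamma, horizon = 2, 2, 0.9, 10
theta = torch.zeros(n_states, n_actions, requires_grad=True)

def rollout():
    """Sample one trajectory; return (R(tau), sum_t log pi(a_t | s_t, theta))."""
    s, ret, logp = 0, 0.0, 0.0
    for t in range(horizon):
        dist = torch.distributions.Categorical(logits=theta[s])
        a = dist.sample()
        logp = logp + dist.log_prob(a)
        s = int(a.item())                # made-up dynamics: the action picks the next state
        ret += gamma**t * float(s == 1)  # made-up reward: 1 whenever we land in state 1
    return ret, logp

# Monte Carlo estimate: average R(tau) * sum_t log pi(a_t | s_t, theta) over N trajectories,
# then differentiate that average with respect to theta.
N = 64
surrogate = sum(ret * logp for ret, logp in (rollout() for _ in range(N))) / N
surrogate.backward()
print(theta.grad)  # an estimate of grad_theta J_pi
```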
Unfortunately, this estimator has high variance. We can do better by noticing that, for any state- and time-dependent baseline function \(b(s_t, t)\):
\[\begin{aligned} \mathbb{E}_{\tau} \big[ b(s_t, t) \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] &= \mathbb{E}_{s_t, a_t} \big[ b(s_t, t) \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \\ &= \mathbb{E}_{s_t} \big[ b(s_t, t) \mathbb{E}_{a_t}[\nabla_\theta \ln \pi(a_t | s_t, \theta)] \big] \\ &= \mathbb{E}_{s_t} \big[ b(s_t, t) \int \pi(a_t | s_t, \theta) \nabla_\theta \ln \pi(a_t | s_t, \theta) da_t \big] \\ &= \mathbb{E}_{s_t} \big[ b(s_t, t) \int \nabla_\theta \pi(a_t | s_t, \theta) da_t \big] \\ &= \mathbb{E}_{s_t} \big[ b(s_t, t) \nabla_\theta \int \pi(a_t | s_t, \theta) da_t \big] \\ &= 0 \end{aligned} \]
The last equality follows because \(\pi(a_t | s_t, \theta)\) is a probability distribution, and so its integral must be 1. Thus we can subtract some state- and time-dependent baseline from our reward estimate \(R(\tau)\) without biasing our gradient estimate. This observation leads us to a number of different improvements, starting with REINFORCE below.
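As a quick numerical sanity check of this identity, here is a small sketch (a made-up two-action softmax policy at a single fixed state, with an arbitrary baseline value): the Monte Carlo average of \(b \, \nabla_\theta \ln \pi(a | s, \theta)\) over sampled actions should come out close to zero.

```python
import torch

# Made-up two-action softmax policy at one fixed state, and an arbitrary baseline b.
theta = torch.tensor([0.3, -1.2], requires_grad=True)  # logits for this state
b = 5.0

dist = torch.distributions.Categorical(logits=theta)
actions = dist.sample((100_000,))
# The gradient of this surrogate is the Monte Carlo average of b * grad_theta log pi(a).
surrogate = (b * dist.log_prob(actions)).mean()
surrogate.backward()
print(theta.grad)  # approximately [0, 0]
```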
REINFORCE
Applying the same zero-expectation argument with the baseline \(b_t = \sum_{t' < t} \gamma^{t'} R_{t'}\) (the reward accumulated before time \(t\), which does not depend on the action \(a_t\) once we condition on the history), we get:
\[\begin{aligned} \nabla_\theta J_\pi &= \mathbb{E}_{\tau} \big[ R(\tau) \sum_t \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \\ &= \mathbb{E}_{\tau} \big[ \sum_{t'} \gamma^{t'} R_{t'} \cdot \sum_t \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \\ &= \mathbb{E}_{\tau} \big[ \sum_t \big( \sum_{t' \ge t} \gamma^{t'} R_{t'} \cdot \nabla_\theta \ln \pi(a_t | s_t, \theta) \big) \big] \end{aligned} \]
This new formulation basically says: “we should upweight the probability of taking some action based on the reward we get”. Unlike our earlier formulation though, we only take into account rewards that could possibly have been generated by that action, not rewards that happened before that action and so could not have been caused by it.
If we estimate this expectation using only a single sample and then apply stochastic gradient ascent, we end up with the REINFORCE learning algorithm.
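Here is a minimal sketch of what that looks like on the same made-up toy MDP as above: one sampled trajectory per update, reward-to-go weighting, and plain SGD on the negated surrogate objective.

```python
import torch

# Same made-up toy MDP as before: 2 states, 2 actions, the chosen action picks the
# next state, reward 1 whenever we land in state 1.
n_states, n_actions, gamma, horizon = 2, 2, 0.9, 10
theta = torch.zeros(n_states, n_actions, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)

for episode in range(500):
    # Sample a single trajectory under the current policy.
    s, logps, rewards = 0, [], []
    for t in range(horizon):
        dist = torch.distributions.Categorical(logits=theta[s])
        a = dist.sample()
        logps.append(dist.log_prob(a))
        s = int(a.item())
        rewards.append(float(s == 1))
    # Reward-to-go: sum_{t' >= t} gamma^{t'} R_{t'}, computed backwards in time.
    rtg, acc = [0.0] * horizon, 0.0
    for t in reversed(range(horizon)):
        acc = gamma**t * rewards[t] + acc
        rtg[t] = acc
    # Single-sample REINFORCE update: gradient ascent, so minimize the negated surrogate.
    loss = -sum(g * lp for g, lp in zip(rtg, logps))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(theta, dim=-1))  # the policy should now strongly prefer action 1
```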
Interlude: Q-Functions and Value Functions
Another way to conceptualize the policy gradient is with Q-functions and value functions. We define the value function as the expected future reward we get from being in some state \(s\) at time \(t\):
\[V_\pi(s, t) = \mathbb{E}_{\tau | s_t = s} \big[ \sum_{t' \ge t} \gamma^{t'} R_{t'} \big] \]
Q-functions are similar, except that they represent the expected future reward of taking action \(a\) in state \(s\) at time \(t\):
\[Q_\pi(s, a, t) = \mathbb{E}_{\tau | s_t = s, a_t = a} \big[ \sum_{t' \ge t} \gamma^{t'} R_{t'} \big] \]
I’ve written both functions as time-dependent here to simplify things with the discount factors and to account for any boundary effects in our sum over time steps (e.g. if our trajectory only runs for a fixed number of time steps), although it is common to have RL problems where both can be expressed as time-independent functions (e.g. a game of chess, where our reward is winning/losing at the end of the game).
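One way to get a feel for these definitions: both can be estimated by averaging reward-to-go over sampled trajectories. Here is a minimal sketch, where `sample_trajectory` is a hypothetical stand-in for running the current policy through the environment and returning equal-length lists of states, actions, and rewards:

```python
def mc_value_estimate(sample_trajectory, s, t, gamma=0.9, n_rollouts=1000):
    """Monte Carlo estimate of V_pi(s, t): average sum_{t' >= t} gamma^t' * R_t'
    over sampled trajectories that happen to be in state s at time t.
    `sample_trajectory` is a hypothetical callable returning (states, actions, rewards)."""
    returns = []
    for _ in range(n_rollouts):
        states, actions, rewards = sample_trajectory()
        if states[t] == s:
            returns.append(sum(gamma**tp * r for tp, r in enumerate(rewards) if tp >= t))
    return sum(returns) / len(returns) if returns else float("nan")
```

Conditioning additionally on \(a_t = a\) (i.e. also checking `actions[t] == a`) gives the analogous Monte Carlo estimate of \(Q_\pi(s, a, t)\).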
Actor-Critic
Using these functions, we can refine our gradient estimation:
\[\begin{aligned} \nabla_\theta J_\pi &= \mathbb{E}_{\tau} \big[ \sum_t \big( \sum_{t' \ge t} \gamma^{t'} R_{t'} \cdot \nabla_\theta \ln \pi(a_t | s_t, \theta) \big) \big] \\ &= \mathbb{E}_{\tau} \big[ \sum_t \big( [\gamma^t R_t + \sum_{t' > t} \gamma^{t'} R_{t'}] \cdot \nabla_\theta \ln \pi(a_t | s_t, \theta) \big) \big] \\ &= \sum_t \mathbb{E}_{s_t, a_t} \big[ \mathbb{E}_{s_{t+1} | s_t, a_t} \big[ \gamma^t R(s_t, a_t, s_{t+1}) + V_\pi(s_{t+1}, t+1) \big] \cdot \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \\ &= \sum_t \mathbb{E}_{s_t, a_t} \big[ Q_\pi(s_t, a_t, t) \nabla_\theta \ln \pi(a_t | s_t, \theta) \big] \end{aligned} \]
Where in the last step, we express the Q-function in terms of the value function:
\[\begin{aligned} Q_\pi(s, a, t) &= \mathbb{E}_{\tau | s_t = s, a_t = a} \big[ \sum_{t' \ge t} \gamma^{t'} R_{t'} \big] \\ &= \mathbb{E}_{s_{t+1} | s_t = s, a_t = a} \big[ \gamma^t R(s_t, a_t, s_{t+1}) \big] + \mathbb{E}_{\tau | s_t = s, a_t = a} \big[ \sum_{t' > t} \gamma^{t'} R_{t'} \big] \\ &= \mathbb{E}_{s_{t+1} | s_t = s, a_t = a} \big[ \gamma^t R(s_t, a_t, s_{t+1}) \big] + \mathbb{E}_{s_{t+1} | s_t = s, a_t = a} \big[ V_\pi(s_{t + 1}, t + 1) \big] \\ &= \mathbb{E}_{s_{t+1} | s_t = s, a_t = a} \big[ \gamma^t R(s_t, a_t, s_{t+1}) + V_\pi(s_{t + 1}, t + 1) \big] \end{aligned} \]
To reduce the variance of this estimate, we can use the value function as our state-and-time-dependent baseline.
\[\nabla_\theta J_\pi = \mathbb{E}_{\tau} \big[ \sum_t \big( [Q_\pi(s_t, a_t, t) - V_\pi(s_t, t)] \cdot \nabla_\theta \ln \pi(a_t | s_t, \theta) \big) \big] \]
This formulation tries to tackle the credit-assignment issue: some actions give high reward just because they are taken from an advantageous state. In some sense, then, this reward is “because” of the state (and thus the previous actions that got us to this state), not because of the current action choice. By subtracting away \(V_\pi(s, t)\), we only upweight actions that do better than what we’d expect from just the current state.
Of course, the problem here is that usually the “advantage” \(Q_\pi(s, a, t) - V_\pi(s, t)\) is unknown! The solution is to substitute the advantage with some empirical estimator, which will itself be parametrized by some learnable parameters. Although this might introduce bias (a learned estimator is generally not unbiased), with a strong function approximator we can hopefully reduce our variance enough for the tradeoff to be worth it.
This leads naturally to the “actor-critic” reinforcement learning framework, which has two parts: an actor (the policy \(\pi(a_t | s_t, \theta) \)) and a critic that estimates \( Q_\pi(s, a, t) - V_\pi(s, t)\). The details vary based on the algorithm, but this framework works by drawing samples and using them to alternately update the actor and the critic.
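To make this concrete, here is a minimal sketch of one possible instantiation (not any particular published algorithm) on the same made-up toy MDP: a tabular softmax actor and a tabular time-dependent critic \(V(s, t)\), with the advantage approximated by the one-step temporal-difference error \(\gamma^t R_t + V(s_{t+1}, t+1) - V(s_t, t)\).

```python
import torch

# Same made-up toy MDP as before. Tabular actor (softmax logits) and tabular critic V(s, t).
n_states, n_actions, gamma, horizon = 2, 2, 0.9, 10
actor_logits = torch.zeros(n_states, n_actions, requires_grad=True)
critic_v = torch.zeros(n_states, horizon + 1, requires_grad=True)  # V(s, t); last column stays 0
actor_opt = torch.optim.SGD([actor_logits], lr=0.1)
critic_opt = torch.optim.SGD([critic_v], lr=0.1)

for episode in range(500):
    s = 0
    for t in range(horizon):
        dist = torch.distributions.Categorical(logits=actor_logits[s])
        a = dist.sample()
        s_next = int(a.item())          # made-up dynamics
        r = float(s_next == 1)          # made-up reward
        # One-step target gamma^t * R_t + V(s_{t+1}, t+1), matching the time-dependent
        # discounting convention used above; the TD error stands in for the advantage.
        with torch.no_grad():
            target = gamma**t * r + critic_v[s_next, t + 1]
        advantage = target - critic_v[s, t]
        # Critic update: regress V(s, t) toward the one-step target.
        critic_opt.zero_grad()
        (advantage ** 2).backward()
        critic_opt.step()
        # Actor update: gradient ascent on advantage-weighted log-probability.
        actor_opt.zero_grad()
        (-advantage.detach() * dist.log_prob(a)).backward()
        actor_opt.step()
        s = s_next

print(torch.softmax(actor_logits, dim=-1))  # the policy should come to prefer action 1
```

Here the critic only models \(V\) and the advantage is approximated by a sampled TD error; other actor-critic variants instead learn \(Q\) directly or use multi-step advantage estimates.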