Policy Gradients
Optimize a parameterized policy directly — the REINFORCE idea, the log-likelihood trick, and how it contrasts with value-based RL.
Q-learning and TD learn values and read off a policy by acting greedily. Policy gradient methods skip the middleman: they parameterize the policy $\pi_\theta$ directly and adjust its parameters $\theta$ by gradient ascent on expected return. The core intuition of REINFORCE is simple: push up the probability of actions that led to high return, push down the rest.
The visualizer below has one decision with four actions and a softmax policy. Each bar is the current probability $\pi_\theta(a)$ of one action; dashed lines mark each action’s true mean reward. Press Learn and watch probability mass flow toward the best action (a2) as returns are observed.
A parameterized policy
A common choice is the softmax over learnable preferences (logits) $\theta_{s,a}$:
$$\pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{b} e^{\theta_{s,b}}}$$
This always yields a valid probability distribution, is differentiable in $\theta$, and naturally captures stochastic policies, which matters because sometimes the optimal behavior is to randomize. In deep RL the logits are the output of a neural network instead of a plain table.
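As a minimal sketch of the tabular case for a single state, the snippet below turns a list of logits into probabilities and samples an action from them. The names `softmax`, `sample_action`, and the four-entry table `theta` are illustrative assumptions rather than anything defined in this article; only the Python standard library is used.

```python
import math
import random


def softmax(logits):
    """Map a list of real-valued logits to a probability distribution."""
    m = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


theta = [0.0, 0.0, 0.0, 0.0]  # hypothetical logit table: one state, four actions
probs = softmax(theta)        # equal logits give a uniform policy
action = sample_action(probs)
```

Because the policy is a distribution rather than an argmax, every action keeps a nonzero probability, so exploration is built into the policy itself.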
The objective and its gradient
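We maximize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[G_0]$, where $G_t$ denotes the return from step $t$ onward. The policy gradient theorem gives a gradient we can estimate from sampled episodes, the “log-likelihood trick”:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$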
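For intuition, the trick can be derived in a few lines for the single-decision case. This is a sketch under assumptions not stated above: there is only one state (so the state is dropped from the notation), action $a$ has expected reward $r(a)$, and the objective is $J(\theta) = \sum_a \pi_\theta(a)\, r(a)$. The only identity needed is $\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)$.

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \sum_a \nabla_\theta \pi_\theta(a)\, r(a) \\
&= \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, r(a) \\
&= \mathbb{E}_{a \sim \pi_\theta}\!\left[ r(a)\, \nabla_\theta \log \pi_\theta(a) \right]
\end{aligned}
$$

The last line is an expectation under the policy itself, so it can be estimated by sampling an action, observing a return, and weighting the log-probability gradient by that return.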
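REINFORCE turns this into a stochastic ascent step. After sampling action $a_t$ and observing return $G_t$, it updates the parameters with step size $\alpha$:
$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Each update scales the log-probability gradient by the return: good (positive-return) outcomes increase $\pi_\theta(a_t \mid s_t)$, bad (negative-return) ones decrease it. For the softmax, $\partial \log \pi_\theta(a_t \mid s_t) / \partial \theta_{s_t, b} = \mathbb{1}[b = a_t] - \pi_\theta(b \mid s_t)$.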
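The update can be sketched for a single state with a table of logits. The function names (`grad_log_softmax`, `reinforce_step`), the step size `alpha=0.1`, and the example action and return are assumptions chosen for illustration; this is not the code behind the visualizer.

```python
import math


def softmax(logits):
    """Map a list of real-valued logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(probs, action):
    """Gradient of log pi(action) w.r.t. each logit b: 1[b == action] - pi(b)."""
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]


def reinforce_step(theta, action, ret, alpha=0.1):
    """One ascent step: theta + alpha * return * grad log pi(action)."""
    probs = softmax(theta)
    grad = grad_log_softmax(probs, action)
    return [t + alpha * ret * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0, 0.0, 0.0]                        # hypothetical uniform start
theta_up = reinforce_step(theta, action=2, ret=1.0)     # logit of action 2 rises, the others fall
theta_down = reinforce_step(theta, action=2, ret=-1.0)  # a negative return pushes it down instead
```

Note that the gradient components sum to zero, so a step only moves probability mass between actions: raising one logit lowers the others by a matching total amount.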
Baselines reduce variance
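Raw returns make these estimates noisy. Subtracting a baseline $b$ that does not depend on the sampled action (such as the average return) leaves the gradient unbiased while typically reducing its variance substantially:
$$\theta \leftarrow \theta + \alpha \, (G_t - b) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Now only actions that beat the baseline get reinforced: the quantity $G_t - b$ is an estimate of the advantage. The visualizer uses a running-average baseline.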
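Putting the pieces together, here is a hedged sketch of REINFORCE with a running-average baseline on a four-action, single-decision problem. Everything specific in it is an assumption made for illustration: the mean rewards in `TRUE_MEANS` (with index 2 as the best action), the Gaussian reward noise, the step size, the number of steps, and the seed. It is not the visualizer's actual implementation.

```python
import math
import random


def softmax(logits):
    """Map a list of real-valued logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(true_means, steps=3000, alpha=0.1, noise=0.5, seed=0):
    """Run REINFORCE with a running-average baseline; return final probs and baseline."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)  # one logit per action, uniform start
    baseline = 0.0                   # running average of the returns seen so far
    for n in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        ret = rng.gauss(true_means[a], noise)  # noisy return for the sampled action
        advantage = ret - baseline             # how much this return beat the average
        for b in range(len(theta)):
            grad = (1.0 if b == a else 0.0) - probs[b]  # d log pi(a) / d theta_b
            theta[b] += alpha * advantage * grad
        baseline += (ret - baseline) / n       # incremental mean, updated after the step
    return softmax(theta), baseline


TRUE_MEANS = [0.0, 0.5, 2.0, 0.5]  # hypothetical mean rewards; action 2 is best
probs, baseline = train(TRUE_MEANS)
```

The baseline is computed from earlier returns only and is updated after the parameter step, so it does not depend on the action just sampled. That independence is what keeps the gradient estimate unbiased.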
Policy-based vs value-based
- Value-based (Q-learning): learn $Q(s, a)$, act greedily. Great for discrete actions; the policy is implicit and deterministic.
- Policy-based (REINFORCE): learn $\pi_\theta(a \mid s)$ directly. Handles continuous action spaces and stochastic policies, but gradient estimates are high-variance.
- Actor-critic combines both: a policy (actor) guided by a learned value function (critic), which serves as the baseline. It underlies many widely used deep RL algorithms.
Takeaways
- Policy gradients optimize a parameterized policy directly, not via values.
- REINFORCE updates $\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, reinforcing high-return actions.
- Subtracting a baseline such as the average return cuts variance without adding bias; the resulting difference estimates the advantage.
- Policy methods shine for continuous/stochastic actions; actor-critic blends policy and value.
References
- Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 13 (Policy Gradient Methods).
- Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning, 1992.