
Policy Gradients

Optimize a parameterized policy directly — the REINFORCE idea, the log-likelihood trick, and how it contrasts with value-based RL.

Q-learning and other TD methods learn values and derive a policy by acting greedily. Policy gradient methods skip the middleman: they parameterize the policy $\pi_\theta(a \mid s)$ directly and adjust $\theta$ by gradient ascent on expected return. The core intuition of REINFORCE is simple: push up the probability of actions that led to high return, and push down the rest.

The visualizer below has one decision with four actions and a softmax policy. Each bar is the current probability $\pi_\theta(a)$; dashed lines mark each action’s true mean reward. Press Learn and watch probability mass flow toward the best action (a2) as returns are observed.

[Interactive visualizer: softmax policy π(a) over 4 actions, run for 300 episodes. Bars show the policy probability π(a); dashed lines show each action's true mean reward. REINFORCE pushes probability toward actions whose return beats the baseline, so the policy concentrates on action 2, which has the best mean reward.]

A parameterized policy

A common choice is the softmax over learnable preferences (logits) $\theta_a$:

$$\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_{b} e^{\theta_b}}$$

This always yields a valid probability distribution, is differentiable in $\theta$, and naturally captures stochastic policies, since sometimes the optimal behavior is to randomize. In deep RL the logits are the output of a neural network instead of a plain table.
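As a minimal sketch of the tabular case, the softmax policy and action sampling might look like the following in plain Python. The function names `softmax` and `sample_action` are illustrative, not taken from the visualizer, and subtracting the maximum logit is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math
import random

def softmax(theta):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(theta)  # shift by the max logit to avoid overflow in exp
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

theta = [0.0, 0.0, 0.0, 0.0]  # equal preferences give a uniform policy
probs = softmax(theta)
action = sample_action(probs)
```

With all logits equal, every action has probability 0.25; raising one logit shifts probability mass toward that action without ever making the others impossible.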

The objective and its gradient

We maximize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[\,G\,]$. The policy gradient theorem gives a gradient we can estimate from sampled episodes, the “log-likelihood trick”:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\, G \,\nabla_\theta \log \pi_\theta(a) \,\right]$$
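For the single-decision setting used here, the identity can be derived in a few lines. The sketch below assumes that the expected return of each action, written $\mathbb{E}[G \mid a]$, does not depend on the policy parameters, and it uses the identity $\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\,\nabla_\theta \log \pi_\theta(a)$:

```latex
\begin{aligned}
J(\theta) &= \sum_{a} \pi_\theta(a)\, \mathbb{E}[G \mid a] \\
\nabla_\theta J(\theta)
  &= \sum_{a} \nabla_\theta \pi_\theta(a)\, \mathbb{E}[G \mid a] \\
  &= \sum_{a} \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, \mathbb{E}[G \mid a] \\
  &= \mathbb{E}_{\pi_\theta}\!\left[\, G \,\nabla_\theta \log \pi_\theta(a) \,\right]
\end{aligned}
```

The last line is an expectation under the policy itself, which is why it can be estimated by sampling actions from the policy and observing their returns.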

REINFORCE turns this into a stochastic ascent step. After sampling action $a$ and observing return $G$:

$$\theta \leftarrow \theta + \alpha\, G\, \nabla_\theta \log \pi_\theta(a)$$

Each update scales the log-probability gradient by the return: a positive return increases $\log \pi_\theta(a)$, a negative return decreases it, and larger magnitudes move it further. For the softmax, $\nabla_{\theta_i} \log \pi_\theta(a) = \mathbf{1}[i = a] - \pi_\theta(i)$.
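A single update for the tabular softmax policy could be sketched as follows. The helper names `grad_log_softmax` and `reinforce_step` and the values of `alpha` and `G` are assumptions chosen for illustration.

```python
import math

def softmax(theta):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(theta)  # shift by the max logit to avoid overflow in exp
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, action):
    """Gradient of log pi(action) w.r.t. each logit: 1[i == a] - pi(i)."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_step(theta, action, G, alpha):
    """Return new logits after one REINFORCE update (no baseline)."""
    probs = softmax(theta)
    grad = grad_log_softmax(probs, action)
    return [t + alpha * G * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0, 0.0]
# Hypothetical sample: action 2 was taken and returned G = 1.0.
new_theta = reinforce_step(theta, action=2, G=1.0, alpha=0.1)
# new_theta is approximately [-0.025, -0.025, 0.075, -0.025]
```

The gradient components sum to zero, so raising the logit of the sampled action lowers the others by a matching total amount.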

Baselines reduce variance

Raw returns make these estimates noisy. Subtracting a baseline $b$ that does not depend on the sampled action (such as the average return) leaves the gradient unbiased, because $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a)] = 0$, and typically reduces its variance substantially:

$$\theta \leftarrow \theta + \alpha\,(G - b)\,\nabla_\theta \log \pi_\theta(a)$$

Now only actions that beat the baseline get reinforced; the quantity $G - b$ is an estimate of the advantage. The visualizer uses a running-average baseline.
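Putting the pieces together, a training loop with a running-average baseline might look like this sketch. The mean rewards, Gaussian noise level, learning rate, and episode count are hypothetical values picked for illustration; the visualizer's actual settings are not given in the text.

```python
import math
import random

def softmax(theta):
    """Map a list of logits to probabilities that sum to 1."""
    m = max(theta)  # shift by the max logit to avoid overflow in exp
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def run_reinforce(mean_rewards, episodes=3000, alpha=0.2, noise=0.1, seed=0):
    """REINFORCE on a one-step problem with a running-average baseline."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    baseline = 0.0
    for n in range(1, episodes + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        G = rng.gauss(mean_rewards[a], noise)  # noisy return (assumed model)
        advantage = G - baseline               # uses returns seen so far only
        for i, p in enumerate(probs):
            grad_i = (1.0 if i == a else 0.0) - p
            theta[i] += alpha * advantage * grad_i
        baseline += (G - baseline) / n         # incremental mean of returns
    return softmax(theta), baseline

MEAN_REWARDS = [0.2, 0.5, 0.9, 0.4]  # hypothetical; action 2 is best
final_probs, final_baseline = run_reinforce(MEAN_REWARDS)
```

The baseline is updated only after the advantage is computed, so it never depends on the action it is being compared against. Under these assumed settings the policy should end up placing most of its probability on action 2.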

Policy-based vs value-based

  • Value-based (Q-learning): learn $Q$, act greedily. Great for discrete actions; the policy is implicit and deterministic.
  • Policy-based (REINFORCE): learn $\pi_\theta$ directly. Handles continuous action spaces and stochastic policies, but gradient estimates are high-variance.
  • Actor-critic combines both: a policy (actor) is guided by a learned value function (critic) that serves as the baseline. This design underlies many modern deep RL algorithms.

Takeaways

  • Policy gradients optimize a parameterized policy $\pi_\theta$ directly, not via values.
  • REINFORCE updates $\theta \mathrel{+}= \alpha\, G\, \nabla_\theta \log \pi_\theta(a)$, reinforcing high-return actions.
  • Subtracting a baseline such as the average return cuts variance; $G - b$ then estimates the advantage.
  • Policy methods shine for continuous/stochastic actions; actor-critic blends policy and value.
