cs.thefarshad
medium

Temporal-Difference Learning

Learn value estimates from incomplete episodes — Monte Carlo returns vs TD(0) bootstrapping and the TD error.

How should an agent estimate the value of a state — the expected return from there on? Two ideas compete. Monte Carlo (MC) waits until an episode ends and learns from the actual return. Temporal-difference (TD) learning updates after every single step, using its own next estimate as a stand-in for the rest. TD is the key idea behind Q-learning and most modern RL.

The visualizer below is the classic five-state random walk: start in the middle (C), step left or right with equal probability, exit right for a reward of $+1$ or left for $0$. The dashed lines mark each state’s true value. Toggle TD(0) vs Monte Carlo and watch the estimates climb toward truth.

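[Interactive visualizer: five-state random walk, TD(0) vs Monte Carlo. Dashed lines mark the true values (1/6 … 5/6). The readout shows the episode count (0 of 120) and the RMSE against the true values (0.236 at episode 0). TD(0) updates each state every step from the next state's estimate (bootstrapping); Monte Carlo waits for the episode's actual return.]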
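The true values and the starting RMSE quoted by the visualizer can be reproduced in a few lines. This is a minimal sketch, not the visualizer's own code: the names `true_values` and `rmse` are made up for illustration, and the initial estimate of 0.5 for every state is an assumption (it is the value that yields an RMSE of 0.236).

```python
import math

def true_values(n_states=5, max_sweeps=10_000, tol=1e-12):
    """Solve V(i) = 0.5 * V(i-1) + 0.5 * V(i+1) by repeated in-place sweeps.

    Index 0 stands for the left exit (worth 0) and index n_states + 1 for the
    right exit (worth 1, the +1 reward folded into the boundary since gamma = 1).
    """
    v = [0.0] * (n_states + 2)
    v[-1] = 1.0
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for i in range(1, n_states + 1):
            new = 0.5 * v[i - 1] + 0.5 * v[i + 1]
            biggest_change = max(biggest_change, abs(new - v[i]))
            v[i] = new
        if biggest_change < tol:
            break
    return v[1:-1]  # values of A..E only

def rmse(estimates, targets):
    """Root-mean-square error between two equally long lists."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(targets)
    )

TRUE_V = true_values()                    # approximately 1/6, 2/6, ..., 5/6
initial_rmse = rmse([0.5] * 5, TRUE_V)    # assumed starting estimate: 0.5 everywhere
print([round(v, 3) for v in TRUE_V], round(initial_rmse, 3))
```

The linear pattern falls out of the symmetry of the walk: each state's value is the average of its two neighbours, so the values climb in equal steps between the two exits.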

Monte Carlo: learn from the return

MC waits for the full episode, computes the return $G_t$ (here simply the final reward, since $\gamma = 1$ and every step before the exit pays 0), then nudges each visited state’s value toward it:

$$V(s) \leftarrow V(s) + \alpha\,\bigl[\, G_t - V(s) \,\bigr]$$

The target $G_t$ is an unbiased sample of the true value, but you must reach a terminal state before learning anything, and returns are noisy (high variance).
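The Monte Carlo update can be sketched on the random walk as follows. This is an illustrative sketch under assumptions, not the visualizer's implementation: `run_episode` and `mc_update` are hypothetical names, the update is applied on every visit to a state (the text does not say whether first-visit or every-visit is used), and the starting estimate of 0.5 and step size of 0.1 are arbitrary choices.

```python
import random

STATES = ["A", "B", "C", "D", "E"]

def run_episode(rng):
    """Walk from C until the agent exits; return (visited states, return G)."""
    i = 2  # index of C
    visited = []
    while 0 <= i < len(STATES):
        visited.append(STATES[i])
        i += rng.choice((-1, 1))          # step left or right with equal probability
    G = 1.0 if i == len(STATES) else 0.0  # +1 for the right exit, 0 for the left
    return visited, G

def mc_update(V, visited, G, alpha):
    """Constant-alpha, every-visit MC: nudge each visited state toward G."""
    for s in visited:
        V[s] += alpha * (G - V[s])

# Hand-picked episode C -> D -> E -> right exit, so the result is easy to check.
V = {s: 0.5 for s in STATES}
mc_update(V, ["C", "D", "E"], G=1.0, alpha=0.1)
print(V)  # C, D and E move from 0.5 toward 1; A and B are untouched
```

Note that nothing can be updated until `run_episode` has returned: the whole trajectory and its outcome are needed before the first call to `mc_update`.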

TD(0): bootstrap from the next estimate

TD doesn’t wait. After one transition $s \to s'$ with reward $r$, it updates using its current estimate $V(s')$ as a proxy for the unseen rest of the episode:

$$V(s) \leftarrow V(s) + \alpha\,\bigl[\, \underbrace{r + \gamma\,V(s') - V(s)}_{\text{TD error }\delta} \,\bigr]$$

The bracketed quantity is the TD error $\delta$: the gap between the one-step prediction $r + \gamma V(s')$ and the old estimate $V(s)$. Replacing the real return with an estimate is called bootstrapping: learning a guess from a guess.
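A single TD(0) step is small enough to trace by hand. The sketch below assumes a hypothetical helper `td0_update`, represents "the walk has exited" with `None`, treats terminal states as worth 0, and starts every estimate at 0.5 with a step size of 0.1; none of these choices come from the text.

```python
STATES = ["A", "B", "C", "D", "E"]

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    v_next = 0.0 if s_next is None else V[s_next]  # terminal states are worth 0
    delta = r + gamma * v_next - V[s]              # TD error
    V[s] += alpha * delta
    return delta

V = {s: 0.5 for s in STATES}

d1 = td0_update(V, "C", 0.0, "D")   # delta = 0 + 0.5 - 0.5 = 0: nothing to learn yet
d2 = td0_update(V, "E", 1.0, None)  # delta = 1 + 0 - 0.5 = 0.5: V["E"] rises to 0.55
d3 = td0_update(V, "D", 0.0, "E")   # delta = 0 + 0.55 - 0.5 = 0.05: V["D"] rises to 0.505
print(d1, d2, d3, V)
```

The third call shows bootstrapping at work: D never saw a reward, yet its value rises because its neighbour's estimate did. Information about the +1 exit propagates backward one state at a time.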

The trade-off

| | Monte Carlo | TD(0) |
|---|---|---|
| Updates | end of episode | every step |
| Target | actual return $G_t$ | $r + \gamma V(s')$ |
| Bias / variance | unbiased, high variance | biased, low variance |
| Needs episodes to end? | yes | no |
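The bias row can be made precise. Writing $v_\pi$ for the true value function and $S_t$ for the state at time $t$ (symbols not used elsewhere in this article), the two targets have expectations

$$
\mathbb{E}\bigl[\, G_t \mid S_t = s \,\bigr] = v_\pi(s),
\qquad
\mathbb{E}\bigl[\, r + \gamma V(s') \mid s \,\bigr] = v_\pi(s) + \gamma\,\mathbb{E}\bigl[\, V(s') - v_\pi(s') \mid s \,\bigr].
$$

The second identity follows from the Bellman equation $v_\pi(s) = \mathbb{E}[\, r + \gamma v_\pi(s') \mid s \,]$. The extra term is the bias of the TD target: it comes entirely from the error in the current estimate $V(s')$ and vanishes as $V$ approaches $v_\pi$. The variance gap comes from the other direction: the TD target depends on one random transition, while $G_t$ depends on every transition until the episode ends.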

In the walk, TD(0) typically settles faster and steadier because each update is small and low-variance, while MC jolts states by the full episode outcome. TD also works in continuing tasks that never terminate — a decisive practical advantage.
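One way to check this claim is to run both learners on the same walk and compare their RMSE after 120 episodes, averaged over several runs. The sketch below is an assumption-laden experiment, not the visualizer's code: the function names, the step sizes (0.1 for TD, 0.02 for MC), the 50 runs, the seed, and the starting estimate of 0.5 are all illustrative choices, and the outcome depends on them.

```python
import math
import random

N = 5                                            # states A..E as indices 0..4
TRUE_V = [(i + 1) / (N + 1) for i in range(N)]   # 1/6, 2/6, ..., 5/6

def rmse(V):
    return math.sqrt(sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / N)

def episode(rng):
    """Return a list of (state, reward, next_state); next_state is None on exit."""
    s, steps = N // 2, []
    while s is not None:
        nxt = s + rng.choice((-1, 1))
        if nxt < 0:                              # left exit, reward 0
            steps.append((s, 0.0, None))
            s = None
        elif nxt >= N:                           # right exit, reward +1
            steps.append((s, 1.0, None))
            s = None
        else:
            steps.append((s, 0.0, nxt))
            s = nxt
    return steps

def train(method, alpha, episodes, rng):
    """Train from V = 0.5 everywhere and return the final RMSE (gamma = 1)."""
    V = [0.5] * N
    for _episode in range(episodes):
        steps = episode(rng)
        if method == "td":
            # The walk does not depend on V, so replaying the steps in order
            # gives the same result as updating online during the episode.
            for s, r, nxt in steps:
                v_next = 0.0 if nxt is None else V[nxt]
                V[s] += alpha * (r + v_next - V[s])
        else:
            G = steps[-1][1]                     # return = final reward
            for s, _r, _nxt in steps:            # every-visit MC
                V[s] += alpha * (G - V[s])
    return rmse(V)

def average_rmse(method, alpha, episodes=120, runs=50, seed=0):
    rng = random.Random(seed)
    return sum(train(method, alpha, episodes, rng) for _ in range(runs)) / runs

td_rmse = average_rmse("td", alpha=0.1)
mc_rmse = average_rmse("mc", alpha=0.02)
print(f"TD(0): {td_rmse:.3f}   MC: {mc_rmse:.3f}")
```

Both errors should end well below the starting value of about 0.236. Which method comes out ahead depends heavily on the step sizes, so a fair comparison sweeps `alpha` for each method rather than relying on a single pair.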

Takeaways

  • Value estimation can learn from full returns (MC) or from each step (TD).
  • TD(0) bootstraps: it updates toward $r + \gamma V(s')$ using its own estimate.
  • The TD error $\delta = r + \gamma V(s') - V(s)$ drives the update.
  • TD trades a little bias for much lower variance and learns online, without waiting for episodes to end.
