Temporal-Difference Learning
Learn value estimates from incomplete episodes — Monte Carlo returns vs TD(0) bootstrapping and the TD error.
How should an agent estimate the value of a state — the expected return from there on? Two ideas compete. Monte Carlo (MC) waits until an episode ends and learns from the actual return. Temporal-difference (TD) learning updates after every single step, using its own next estimate as a stand-in for the rest. TD is the key idea behind Q-learning and most modern RL.
The visualizer below is the classic five-state random walk: start in the middle (C), step left or right with equal probability, and exit off the right end for a reward of $+1$ or off the left end for a reward of $0$. The dashed lines mark each state’s true value, which is $1/6, 2/6, \dots, 5/6$ for states A through E. Toggle TD(0) vs Monte Carlo and watch the estimates climb toward truth.
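To make the environment concrete, here is a minimal Python sketch of the walk. It assumes the textbook setup (reward 1 for exiting right, 0 for exiting left, 0 on every other step); the names `run_episode`, `STATES`, and `TRUE_VALUES` are illustrative and are not taken from the visualizer's code.

```python
import random

STATES = ["A", "B", "C", "D", "E"]          # non-terminal states, left to right
TRUE_VALUES = [i / 6 for i in range(1, 6)]  # 1/6 ... 5/6 under the assumed rewards


def run_episode(rng):
    """Return a list of (state_index, reward, next_state_index) transitions.

    next_state_index is None on the step that leaves the walk.
    """
    s = 2  # start in the middle state, C
    transitions = []
    while True:
        s_next = s + rng.choice((-1, 1))  # step left or right with equal probability
        if s_next < 0:                    # exited left: assumed reward 0
            transitions.append((s, 0.0, None))
            return transitions
        if s_next > 4:                    # exited right: assumed reward 1
            transitions.append((s, 1.0, None))
            return transitions
        transitions.append((s, 0.0, s_next))
        s = s_next


episode = run_episode(random.Random(0))
print([(STATES[s], r) for s, r, _ in episode])
```

Passing in a seeded `random.Random` keeps runs reproducible, which helps when comparing learning methods on identical episodes.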
Monte Carlo: learn from the return
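MC waits for the full episode, computes the return $G_t$ (here the final reward, since $\gamma = 1$ and every other reward is zero), then nudges each visited state’s value toward it with step size $\alpha$:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$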
The target is an unbiased sample of the true value, but you must reach a terminal state before learning anything, and returns are noisy (high variance).
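As a rough illustration, the Monte Carlo rule can be sketched in a few lines of Python. This is the constant-step-size, every-visit variant; the function name `mc_update`, the step size of 0.1, and the example episode are assumptions made for the sketch.

```python
def mc_update(values, visited_states, final_return, alpha=0.1):
    """Constant-alpha, every-visit Monte Carlo update, applied in place.

    Every visited state is moved a fraction alpha of the way toward the
    return observed at the end of the episode.
    """
    for s in visited_states:
        values[s] += alpha * (final_return - values[s])
    return values


values = [0.5] * 5  # initial estimates for states A..E
# Hypothetical episode: C -> D -> E -> exit right, so the return is 1.
mc_update(values, visited_states=[2, 3, 4], final_return=1.0)
print(values)  # C, D and E move from 0.5 toward 1; A and B are unchanged
```

Nothing can be updated until `final_return` is known, which is exactly the wait that Monte Carlo imposes.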
TD(0): bootstrap from the next estimate
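TD doesn’t wait. After one transition $S_t \to S_{t+1}$ with reward $R_{t+1}$, it updates using its current estimate of $V(S_{t+1})$ as a proxy for the unseen rest of the episode:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

The bracketed quantity is the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$: the gap between the one-step prediction $R_{t+1} + \gamma V(S_{t+1})$ and the old estimate $V(S_t)$. Replacing the real return with an estimate is called bootstrapping: learning a guess from a guess.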
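A single TD(0) step might look like the following Python sketch. The function name `td0_update`, the default step size, and the convention that `s_next=None` marks a terminal transition (with terminal value 0) are assumptions made for illustration.

```python
def td0_update(values, s, reward, s_next, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update in place and return the TD error."""
    v_next = 0.0 if s_next is None else values[s_next]  # terminal states are worth 0
    td_error = reward + gamma * v_next - values[s]      # one-step target minus old estimate
    values[s] += alpha * td_error
    return td_error


values = [0.5] * 5  # initial estimates for states A..E
delta_mid = td0_update(values, s=2, reward=0.0, s_next=3)      # C -> D
delta_exit = td0_update(values, s=4, reward=1.0, s_next=None)  # E -> exit right
print(delta_mid, delta_exit, values)
```

With every estimate initialised to 0.5, the interior step C to D produces a TD error of zero, so nothing changes there. Information first enters at the exits and then spreads inward as neighbouring estimates move.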
The trade-off
| | Monte Carlo | TD(0) |
|---|---|---|
| Updates | end of episode | every step |
| Target | actual return $G_t$ | $R_{t+1} + \gamma V(S_{t+1})$ |
| Bias / variance | unbiased, high variance | biased, low variance |
| Needs episodes to end? | yes | no |
In the walk, TD(0) typically settles faster and steadier because each update is small and low-variance, while MC jolts states by the full episode outcome. TD also works in continuing tasks that never terminate — a decisive practical advantage.
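One way to check this claim yourself is to run both methods on the walk and compare their root-mean-square error against the true values. The sketch below assumes the textbook rewards (1 for exiting right, 0 otherwise), $\gamma = 1$, initial estimates of 0.5, and step sizes of 0.1 for TD(0) and 0.02 for MC. All names and settings are illustrative, and the outcome depends on the step sizes and seeds chosen.

```python
import math
import random

TRUE_VALUES = [i / 6 for i in range(1, 6)]  # states A..E under the assumed rewards


def run_episode(rng):
    """Return (visited_states, final_reward) for one walk starting at C."""
    s, visited = 2, []
    while 0 <= s <= 4:
        visited.append(s)
        s += rng.choice((-1, 1))
    return visited, (1.0 if s > 4 else 0.0)


def rms_error(values):
    """Root-mean-square gap between the estimates and the true values."""
    return math.sqrt(sum((v - t) ** 2 for v, t in zip(values, TRUE_VALUES)) / 5)


def train(method, episodes, alpha, seed):
    rng = random.Random(seed)
    values = [0.5] * 5
    for _ in range(episodes):
        visited, reward = run_episode(rng)
        if method == "mc":
            # Every-visit MC: move each visited state toward the episode's return.
            for s in visited:
                values[s] += alpha * (reward - values[s])
        else:
            # TD(0) with gamma = 1. Replaying the steps in order gives the
            # same result as updating online during the walk.
            for i, s in enumerate(visited):
                last = i == len(visited) - 1
                r = reward if last else 0.0
                v_next = 0.0 if last else values[visited[i + 1]]
                values[s] += alpha * (r + v_next - values[s])
    return rms_error(values)


def average_error(method, alpha, runs=20, episodes=100):
    """Average the final RMS error over several seeded runs."""
    return sum(train(method, episodes, alpha, seed) for seed in range(runs)) / runs


td_err = average_error("td", alpha=0.1)
mc_err = average_error("mc", alpha=0.02)
print(f"TD(0) RMS error: {td_err:.3f}   MC RMS error: {mc_err:.3f}")
```

Varying `alpha` and `episodes` shows how sensitive each method is to its step size; averaging over seeds smooths out the luck of any single run.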
Takeaways
- Value estimation can learn from full returns (MC) or from each step (TD).
- TD(0) bootstraps: it updates toward $R_{t+1} + \gamma V(S_{t+1})$ using its own estimate.
- The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ drives the update.
- TD trades a little bias for much lower variance and learns online, without waiting for episodes to end.
References
- Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 6 (Temporal-Difference Learning).
- Sutton, Learning to Predict by the Methods of Temporal Differences, Machine Learning, 1988.