Q-Learning
Model-free control — learn a Q-table from experience with the temporal-difference update and ε-greedy exploration.
In the reinforcement learning lesson, value iteration assumed we knew the environment’s rules. Q-learning drops that assumption: the agent learns purely from experience — by acting, observing rewards, and updating estimates. It is the canonical model-free control algorithm.
Press Learn below. The agent starts at S with no knowledge; each cell shows its current best action-value $\max_a Q(s,a)$ and the greedy arrow. Watch a coherent policy crystallize from rewards alone. Drag $\varepsilon$ to change how much it explores, or shuffle for a fresh random seed.
The Q-table
Instead of a value per state, Q-learning stores an action-value $Q(s,a)$: the expected discounted return from taking action $a$ in state $s$ and acting greedily afterward. For a small world this is just a table with one row per state and one column per action. The optimal action-values $Q^*$ satisfy the Bellman optimality equation:
$$Q^*(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \,\right]$$
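The table described above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own implementation: the three-state, two-action world, the numbers in it, and the helper names `greedy_action` and `state_value` are all hypothetical.

```python
# Hypothetical 3-state, 2-action world; the values are made up for illustration.
ACTIONS = ["left", "right"]

# One row per state, one column per action.
q_table = [
    [0.0, 0.5],  # state 0
    [0.2, 0.9],  # state 1
    [0.0, 0.0],  # state 2 (nothing learned here yet)
]

def greedy_action(q_table, state):
    """Index of the highest-valued action in this state (first one wins ties)."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])

def state_value(q_table, state):
    """Value of the state under the greedy policy: the max over actions of Q(s, a)."""
    return max(q_table[state])

best = greedy_action(q_table, 1)
print(ACTIONS[best], state_value(q_table, 1))
```

Reading off the greedy arrow for a cell is just an argmax over that state's row, and the number shown in the cell is the corresponding max.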
The update
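After each transition $(s, a, r, s')$ we nudge $Q(s,a)$ toward the bootstrapped target $r + \gamma \max_{a'} Q(s',a')$:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]$$
The bracketed quantity is the TD error $\delta$. The learning rate $\alpha$ (here 0.5) controls step size; the discount $\gamma$ (0.9) weights future reward. Crucially, the target uses $\max_{a'} Q(s',a')$, so Q-learning learns the value of the greedy policy even while behaving more randomly, which makes it an off-policy method.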
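As a rough sketch, one TD update on a list-of-lists Q-table might look like the following. The function name `td_update` and the `done` flag (which zeroes the bootstrap term at a terminal state) are assumptions added for illustration; the defaults reuse the lesson's learning rate of 0.5 and discount of 0.9.

```python
def td_update(q_table, s, a, r, s_next, alpha=0.5, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Assumption: a terminal next state contributes no future value.
    best_next = 0.0 if done else max(q_table[s_next])
    target = r + gamma * best_next          # bootstrapped target
    td_error = target - q_table[s][a]       # the bracketed quantity
    q_table[s][a] += alpha * td_error       # move part of the way to the target
    return td_error

# Worked example with made-up numbers: Q(s0, a1) = 0, r = 1, max_a' Q(s1, a') = 2.
q = [[0.0, 0.0], [1.0, 2.0]]
delta = td_update(q, s=0, a=1, r=1.0, s_next=1)
print(delta, q[0][1])
```

By hand: the target is 1 + 0.9 · 2 = 2.8, the TD error is 2.8 − 0 = 2.8, and the new estimate is 0 + 0.5 · 2.8 = 1.4 (up to floating-point rounding).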
ε-greedy exploration
If the agent always took its current best action it might never discover a better route: this is the exploration vs exploitation trade-off. ε-greedy picks a random action with probability $\varepsilon$ and the greedy one otherwise:
a = random action with probability ε
a = argmaxₐ Q(s, a) otherwise
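The rule above translates almost directly into Python. This is a minimal sketch; the function name `epsilon_greedy` and the optional `rng` parameter (any object with `random()` and `randrange()`, such as a seeded `random.Random`) are illustrative choices, not part of the lesson.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table.

    With probability epsilon, return a uniformly random action;
    otherwise return the greedy action (first one wins ties).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

row = [0.1, 0.7, 0.3]
print(epsilon_greedy(row, epsilon=0.0))                      # always greedy
print(epsilon_greedy(row, epsilon=1.0, rng=random.Random(0)))  # always random
```

Note that the random branch can also land on the greedy action, so the greedy action is chosen slightly more often than 1 − ε of the time.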
Set $\varepsilon = 0$ in the visualizer and the agent often gets stuck; a little exploration lets it find the goal and propagate value back. Under standard conditions (every state-action pair visited infinitely often, learning rate $\alpha$ decaying appropriately) Q-learning provably converges to $Q^*$.
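Putting the pieces together, here is a small end-to-end sketch. It does not reproduce the visualizer's grid: the environment is a hypothetical five-state corridor (start at state 0, reward 1 for reaching state 4), chosen only because it is short enough to check by hand. The random tie-breaking in `choose` and the fixed learning rate are also assumptions of this sketch; a fixed rate is fine here because the toy environment is deterministic.

```python
import random

N_STATES, GOAL = 5, 4      # states 0..4, goal at the right end
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic corridor: reward 1 for reaching the goal, 0 otherwise."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def choose(q_row, epsilon, rng):
    """Epsilon-greedy with random tie-breaking, so an all-zero table still explores."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose(q[state], epsilon, rng)
            nxt, reward, done = step(state, action)
            # Terminal transitions bootstrap from 0; otherwise from the greedy value.
            best_next = 0.0 if done else max(q[nxt])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
    return q

q = q_learning()
policy = ["left" if row[LEFT] > row[RIGHT] else "right" for row in q[:GOAL]]
print(policy)  # → ['right', 'right', 'right', 'right']
```

In this corridor the learned values for moving right approach 0.9^3, 0.9^2, 0.9, and 1 for states 0 through 3, which is value propagating back from the goal one step at a time.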
Takeaways
- Q-learning learns action-values from experience, with no model of the environment.
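- Each step applies the TD update $Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]$.
- It is off-policy: the target evaluates the greedy policy regardless of how the agent actually behaves.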
- ε-greedy balances exploring new actions against exploiting known-good ones.
References
- Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 6 (Temporal-Difference Learning).
- Watkins & Dayan, Q-learning, Machine Learning, 1992.