Multi-Armed Bandits
The purest exploration-vs-exploitation problem — ε-greedy, UCB, estimated action values, and regret.
A multi-armed bandit is reinforcement learning stripped to its core: there is no state to navigate, just slot machines (the “arms”), each paying a random reward from an unknown distribution. Pull arms one at a time to maximize total reward. With no transitions or planning, what remains is the central dilemma — exploration vs exploitation.
Press Pull below. Each bar is the learner’s estimated mean for an arm; the dashed lines are the unknown true means. The lower chart tracks cumulative regret. Switch between ε-greedy and UCB and watch how fast each homes in on the best arm (arm 3).
Estimated action values
The learner keeps a running estimate of each arm’s mean — just the average reward seen from that arm so far:
where is how many times arm has been pulled. By the law of large numbers these estimates converge to the true means — but only for arms you actually pull, which is exactly why exploration matters.
ε-greedy
The simplest strategy: most of the time exploit the current best arm, but with probability pick a random arm to keep learning.
a = random arm with probability ε
a = argmaxₐ Qₜ(a) otherwise
It works, but it keeps exploring arms it already knows are bad. A larger explores faster yet wastes more pulls; the trade-off shows up directly in the regret curve.
UCB: optimism under uncertainty
Upper Confidence Bound explores smarter. It adds a bonus to arms it is uncertain about — ones pulled few times — and picks the highest optimistic estimate:
The bonus shrinks as grows, so UCB naturally stops over-exploring well-known arms and focuses on the genuinely promising ones — usually beating ε-greedy in the visualizer.
Regret
We measure performance by regret: how much reward was lost versus always playing the best arm . After pulls,
Good strategies achieve regret growing only as — the curve flattens as the learner stops making mistakes.
Takeaways
- A bandit is RL with no states: pull arms to maximize reward.
- Action-value estimates are running averages and converge only for arms you pull.
- ε-greedy explores randomly; UCB explores by optimism, favoring uncertain arms.
- Regret measures lost reward; the best strategies grow it only logarithmically.
References
- Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 2 (Multi-armed Bandits).
- Auer, Cesa-Bianchi & Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 2002.