cs.thefarshad
medium

Multi-Armed Bandits

The purest exploration-vs-exploitation problem — ε-greedy, UCB, estimated action values, and regret.

A multi-armed bandit is reinforcement learning stripped to its core: there is no state to navigate, just kk slot machines (the “arms”), each paying a random reward from an unknown distribution. Pull arms one at a time to maximize total reward. With no transitions or planning, what remains is the central dilemma — exploration vs exploitation.

Press Pull below. Each bar is the learner’s estimated mean for an arm; the dashed lines are the unknown true means. The lower chart tracks cumulative regret. Switch between ε-greedy and UCB and watch how fast each homes in on the best arm (arm 3).

cumulative regret
0/400 pulls
regret = 0.00 · arm 3 is best (true mean 0.70). Bars are estimates; dashed lines are the unknown true means — estimates of well-pulled arms converge.

Estimated action values

The learner keeps a running estimate Qt(a)Q_t(a) of each arm’s mean — just the average reward seen from that arm so far:

Qt(a)=1Nt(a)i=1tri1[ai=a]Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t}\, r_i \,\mathbf{1}[a_i = a]

where Nt(a)N_t(a) is how many times arm aa has been pulled. By the law of large numbers these estimates converge to the true means — but only for arms you actually pull, which is exactly why exploration matters.

ε-greedy

The simplest strategy: most of the time exploit the current best arm, but with probability ε\varepsilon pick a random arm to keep learning.

a = random arm           with probability ε
a = argmaxₐ Qₜ(a)        otherwise

It works, but it keeps exploring arms it already knows are bad. A larger ε\varepsilon explores faster yet wastes more pulls; the trade-off shows up directly in the regret curve.

UCB: optimism under uncertainty

Upper Confidence Bound explores smarter. It adds a bonus to arms it is uncertain about — ones pulled few times — and picks the highest optimistic estimate:

at=argmaxa[Qt(a)+clntNt(a)]a_t = \arg\max_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]

The bonus shrinks as Nt(a)N_t(a) grows, so UCB naturally stops over-exploring well-known arms and focuses on the genuinely promising ones — usually beating ε-greedy in the visualizer.

Regret

We measure performance by regret: how much reward was lost versus always playing the best arm aa^* . After TT pulls,

Regret(T)=t=1T(μμat)\text{Regret}(T) = \sum_{t=1}^{T} \bigl(\, \mu^* - \mu_{a_t} \,\bigr)

Good strategies achieve regret growing only as O(logT)O(\log T) — the curve flattens as the learner stops making mistakes.

Takeaways

  • A bandit is RL with no states: pull kk arms to maximize reward.
  • Action-value estimates Qt(a)Q_t(a) are running averages and converge only for arms you pull.
  • ε-greedy explores randomly; UCB explores by optimism, favoring uncertain arms.
  • Regret measures lost reward; the best strategies grow it only logarithmically.

References