
Multi-armed Bandits

Exploration vs Exploitation

Overview

In reinforcement learning, the agent is faced repeatedly with a choice among different options, or actions. After each choice it receives a numerical reward drawn from a stationary or nonstationary probability distribution that depends on the action it selected. Its objective is to maximize the expected total reward over some time period.
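
To make the setup concrete, here is a minimal sketch of this interaction loop, assuming three arms with Gaussian rewards (the names true_means, Q, and N are illustrative, not part of the demo): pulling arms repeatedly and keeping a running sample average of each arm's reward is enough to estimate the expected values, which is what the first experiment below asks you to do by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary testbed: each arm pays a Gaussian reward
# whose true mean is unknown to the agent.
true_means = rng.normal(0.0, 1.0, size=3)

Q = np.zeros(3)   # sample-average estimate of each arm's value
N = np.zeros(3)   # number of times each arm was pulled

for t in range(1000):
    a = rng.integers(3)                  # explore uniformly at random
    r = rng.normal(true_means[a], 1.0)   # reward ~ N(true mean, 1)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # incremental mean update

print("estimates:", Q.round(2), "true:", true_means.round(2))
```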

First experiment: find best distribution

In this experiment, you are presented with three slot machines. Click on the lever or arm to run the machine and collect the reward.
The legend shows the winning combinations and the prizes.

Try to find which slot machine has the best expected value by running different trials.

You can click the “Run” button to make 10K trials.
The “Show” button reveals the theoretical expected value of each machine.

The “Reset” button resets the values and reshuffles the reward distributions.

Second experiment: Comparison

To assess the different policies, or strategies, we consider 10 slot machines, each with rewards drawn from a normal distribution with variance 1. The mean of each distribution can be set using the sliders below.

Use these sliders to adjust the means of the normal distributions, and use the parameters on the left to set the number of iterations and the properties of several methods, such as Epsilon-Greedy, Decaying Epsilon-Greedy, and Upper-Confidence-Bound (UCB).
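
The demo's own implementation is not shown on this page; the following is a minimal sketch of how these three selection rules could work on the 10-arm Gaussian testbed described above. The values of eps and c, and the 1/t decay schedule, are illustrative assumptions, not the demo's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
K, STEPS = 10, 2000
true_means = rng.normal(0.0, 1.0, size=K)   # set via the sliders in the demo

def run(select):
    """Play STEPS rounds with the given selection rule; return average reward."""
    Q, N, avg = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, STEPS + 1):
        a = select(Q, N, t)
        r = rng.normal(true_means[a], 1.0)   # reward ~ N(mean, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample average
        avg += (r - avg) / t                 # running average reward
    return avg

eps = 0.1
epsilon_greedy = lambda Q, N, t: (
    rng.integers(K) if rng.random() < eps else int(np.argmax(Q)))

# Decaying epsilon: exploration probability shrinks as 1/t (assumed schedule).
decaying_eps = lambda Q, N, t: (
    rng.integers(K) if rng.random() < 1.0 / t else int(np.argmax(Q)))

c = 2.0
def ucb(Q, N, t):
    if (N == 0).any():                       # pull each arm once first
        return int(np.argmin(N))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

for name, pi in [("eps-greedy", epsilon_greedy),
                 ("decaying eps", decaying_eps),
                 ("UCB", ucb)]:
    print(f"{name}: average reward {run(pi):.3f}")
```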

The nonstationary option shifts the mean of each normal distribution by 0.5 in either direction. This happens at every 20% of the total iterations.
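
A sketch of how this drift could be injected into the loop of the previous snippet; shifting each arm independently in a random direction is an assumption about the demo's behavior.

```python
# Fragment for run() in the sketch above: every 20% of the iterations,
# shift each arm's true mean by 0.5 in a random direction
# (assumption: the demo may pick directions differently).
if t % (STEPS // 5) == 0:
    true_means[:] = true_means + rng.choice([-0.5, 0.5], size=K)
```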

Use the “Run” button below to execute the test and observe the results on the graph.
The graph shows the average reward obtained by each method.

The Case-1 and Case-2 buttons load preconfigured settings that showcase interesting scenarios.

Credits: the initial design of the slot machine is borrowed from svenfinger on codepen.
