
CartPole

Introduction to Deep Reinforcement Learning

Overview

CartPole is a classic example for training and demonstrating deep reinforcement learning.
The goal is to move the cart so that the pole does not tilt more than 30° from the vertical, while also keeping the cart from reaching either edge of the track.
A win is achieved when the total reward reaches 300.

In this lab unit, the user can either play directly using the arrow keys, or configure and train the AI to attain the objective.

IMPORTANT: the computations run in the browser; using inappropriate parameter values may cause the browser to freeze.

See the instructions below for more info.

This CartPole uses the DQN algorithm without a CNN, in order to minimize the resources needed and the computation time.
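The page does not show the network itself, but a dense-only Q-network of the kind described could be built with TensorFlow.js roughly as sketched below. This is an illustration under assumptions, not the lab's actual code: the state size, the two-action layout (left, right), and all names are assumed.

  import * as tf from '@tensorflow/tfjs';

  // Hypothetical dense-only Q-network (no CNN): maps a state vector
  // to one Q-value per action.
  function buildQNetwork(
    stateSize: number,     // assumed size of the state encoding, e.g. 4
    numActions: number,    // e.g. 2: move left, move right
    hiddenLayers: number,  // the "Hidden Layers" parameter described below
    nodesPerLayer: number, // the "Nodes per layer" parameter
    learningRate: number   // the "Learning rate" parameter
  ): tf.Sequential {
    const model = tf.sequential();
    model.add(tf.layers.dense({
      inputShape: [stateSize], units: nodesPerLayer, activation: 'relu',
    }));
    for (let i = 1; i < hiddenLayers; i++) {
      model.add(tf.layers.dense({ units: nodesPerLayer, activation: 'relu' }));
    }
    // Linear output layer: one Q-value estimate per action.
    model.add(tf.layers.dense({ units: numActions }));
    model.compile({ optimizer: tf.train.adam(learningRate), loss: 'meanSquaredError' });
    return model;
  }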

Mechanics

The game does not use a true physics engine to simulate the fall of the pole.
However, this does not change the mechanics of the game or its aim: keeping the pole in a near-vertical position (less than 30° from the vertical).
The acceleration parameter controls how quickly the pole speeds up as it falls.
A value of 0 means no acceleration, so the pole falls at a constant speed, while a value of 10 means maximum acceleration.
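The lab's actual update rule is not shown; as a purely hypothetical sketch of such a simplified (non-physics) fall, the pole's angular speed could simply grow with the acceleration setting:

  // Illustrative only: all names and constants are assumptions,
  // not taken from the lab's source.
  const BASE_FALL_SPEED = 0.5; // degrees per tick when acceleration = 0 (assumed)

  function stepPoleAngle(
    angle: number,       // current tilt in degrees from the vertical
    extraSpeed: number,  // speed accumulated so far from acceleration
    acceleration: number // the 0..10 acceleration parameter
  ): { angle: number; extraSpeed: number } {
    const direction = angle >= 0 ? 1 : -1;    // keep falling toward the lean
    const gain = (acceleration / 10) * 0.05;  // 0 -> constant speed, 10 -> fastest
    const newExtraSpeed = extraSpeed + gain;
    const newAngle = angle + direction * (BASE_FALL_SPEED + newExtraSpeed);
    // |newAngle| > 30 would end the round, per the rule above.
    return { angle: newAngle, extraSpeed: newExtraSpeed };
  }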

User Play

The user can play the game manually by pressing the Start button and then using the left and right arrow keys to move the cart.

AI Training/Playing

To train the AI, the user should configure the parameters above the game board.
It is also advisable to set the acceleration to 10 so that the algorithm adapts to the maximum speed.

The following is a description of the parameters:

  • Iterations: the number of iterations that the algorithm does during the training
  • Hidden Layers: the number of hidden layers in the DQN network
  • Nodes per layer: the number of nodes in each of the hidden layers
  • Initial epsilon: the epsilon value that the algorithm starts with in the epsilon-greedy strategy
  • Final epsilon: the final epsilon value (must be less than Initial epsilon) that the algorithm should reach after the number of iterations specified in Decay period
  • Decay period: the number of iterations needed to go from Initial epsilon to Final epsilon (see the epsilon sketch after this list)
  • Sync Frequency: the number of iterations before the DQN updates its Target network, which is a secondary network used to train the main network
  • Gamma: the discount factor
  • Learning rate: the learning rate used in the DQN
  • Exp. memory: the size of the experience replay memory that stores the main network's experience as a series of (state, action, reward, next state) records
  • Sampling size: the size of the sample drawn from Exp. memory; it must be less than the size of Exp. memory (see the replay sketch after this list)
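For concreteness, Initial epsilon, Final epsilon, and Decay period describe a standard linear epsilon decay, and epsilon itself drives epsilon-greedy action selection. A generic sketch (not the lab's code):

  // Epsilon decays linearly from initialEpsilon to finalEpsilon over
  // decayPeriod iterations, then stays at finalEpsilon.
  function epsilonAt(
    iteration: number,
    initialEpsilon: number,
    finalEpsilon: number,
    decayPeriod: number
  ): number {
    const progress = Math.min(iteration / decayPeriod, 1);
    return initialEpsilon + progress * (finalEpsilon - initialEpsilon);
  }

  // Epsilon-greedy: with probability epsilon take a random action
  // (explore); otherwise take the highest-valued action (exploit).
  function chooseAction(qValues: number[], epsilon: number): number {
    if (Math.random() < epsilon) {
      return Math.floor(Math.random() * qValues.length);
    }
    return qValues.indexOf(Math.max(...qValues));
  }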
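Likewise, Exp. memory, Sampling size, Gamma, and Sync Frequency map onto the standard DQN training-loop pieces sketched below (again a generic illustration with assumed names):

  interface Transition {
    state: number[];
    action: number;
    reward: number;
    nextState: number[];
    done: boolean; // true when this step ended the round
  }

  // Fixed-capacity experience replay memory ("Exp. memory").
  class ReplayMemory {
    private buffer: Transition[] = [];
    constructor(private capacity: number) {}

    push(t: Transition): void {
      if (this.buffer.length >= this.capacity) this.buffer.shift(); // drop oldest
      this.buffer.push(t);
    }

    // Draw a random batch of "Sampling size" transitions
    // (with replacement, for simplicity).
    sample(batchSize: number): Transition[] {
      const batch: Transition[] = [];
      for (let i = 0; i < batchSize; i++) {
        batch.push(this.buffer[Math.floor(Math.random() * this.buffer.length)]);
      }
      return batch;
    }
  }

  // Bellman target for one sampled transition, using the target
  // network's Q-values for the next state and the discount factor Gamma:
  //   y = reward                                     if done
  //   y = reward + gamma * max_a Qtarget(nextState)  otherwise
  function tdTarget(t: Transition, targetQ: number[], gamma: number): number {
    return t.done ? t.reward : t.reward + gamma * Math.max(...targetQ);
  }

  // Every "Sync Frequency" iterations the target network is overwritten
  // with the main network's weights, e.g. with TensorFlow.js:
  //   if (iteration % syncFrequency === 0) {
  //     targetModel.setWeights(mainModel.getWeights());
  //   }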

When ready, press the ‘Train AI’ button to start training.
For reasonable parameter values, the training time can be between 5 and 10 minutes.

After training, the user can press the ‘Play AI’ button to let the AI play the game and test its ability to win.

The ‘Play AI Demo’ button runs a ready-made model, giving the user a real experience of a working configuration. That configuration is not shown to the user.
