An Introduction to Reinforcement Learning with Keras and Gym
In this article we'll cover the basics of Reinforcement Learning (RL) and write an agent in Keras that tries to beat the built-in opponent in OpenAI Gym's Atari 2600 game Pong.
Ideas/TODO:
- Add a manual play cell at the start, using a relatively unknown game, so readers can watch themselves learning a new game; this would emphasize the point about how we humans tend to learn in the 'RL Basics' section.
1. The Setting
Typically in RL we have an environment (here Atari 2600 Pong) that an 'agent' experiences. In our case the agent 'sees' the environment by being fed raw pixels as input, but in general this could be any other kind of input (e.g. sensor readings). Having observed the environment, the agent then acts in it, thereby changing both the environment and its own future inputs.
In our Pong setting the actions the agent can perform are discrete. Pixel input and a small discrete set of outputs? That sounds a lot like a standard image classification problem - and from the network's point of view it is: we map an image to one of a handful of classes (actions).
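To make this concrete, here is a minimal sketch of the environment we'll be working with. It assumes the classic `gym` API and the `"Pong-v0"` environment id; the exact id and printed output may differ depending on your gym version and whether the Atari ROMs are installed.

```python
import gym

# Create the Pong environment ("Pong-v0" in classic gym;
# newer gym/gymnasium versions use ids like "ALE/Pong-v5").
env = gym.make("Pong-v0")

# The observation is the raw screen: a 210x160 RGB image.
print(env.observation_space)   # Box(0, 255, (210, 160, 3), uint8)

# The actions are discrete: a small set of joystick commands.
print(env.action_space)        # Discrete(6)
print(env.unwrapped.get_action_meanings())
# ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
```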
2. Reinforcement Learning Basics
If there were a training dataset of recorded expert-level Pong games, we could simply train the agent on it in the usual supervised-learning fashion.
However, in RL we try to get by without such extensive and often hard-to-gather datasets. That is also the more general approach to learning, if you think about how we humans tend to learn things. When you start playing a game like Pong, you most likely just start playing and figure things out as you go. Without any prior knowledge of how the game works or what the goals are, you will probably not be very good in the beginning, but the score tells you whether what you did was any good, and you can adapt your behavior based on this score (or reward). This reward is also what we'll use to adapt our agent's behavior. Such a reward is usually quite easy to obtain (or even to construct) for many tasks.
The problem with such rewards is that we only see them occasionally - in Pong only when a point is scored, and in many games only once the whole game is over. The reward - which plays a role analogous to the label in the supervised learning setting - is therefore sparse (we don't get a reward judging every single frame of the game) and also time-delayed. This time delay is a problem we have to solve, and we'll see that it brings some related problems with it once we take a closer look at the training process.
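To see this sparsity in practice, here is a rough sketch of one episode played with random actions (again assuming the classic gym API, where `step` returns `(observation, reward, done, info)`):

```python
import gym

env = gym.make("Pong-v0")
observation = env.reset()

done = False
rewards = []
while not done:
    # A "random agent": sample one of the discrete actions.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    rewards.append(reward)

# Most frames yield reward 0; only when a point is scored do we see +1 or -1.
# The episode ends once one side has reached 21 points.
print(len(rewards))                       # thousands of frames
print(sum(1 for r in rewards if r != 0))  # only a handful of non-zero rewards
print(sum(rewards))                       # final score difference (agent - opponent)
```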