unsupervised feature extraction for reinforcement learning · 2017-05-12 · unsupervised feature...
Faculteit Wetenschappen en Bio-ingenieurswetenschappen
Vakgroep Computerwetenschappen
Unsupervised Feature Extraction for
Reinforcement Learning
Proefschrift ingediend met het oog op het behalen van de graad van
Master of Science in de Ingenieurswetenschappen: Computerwetenschappen
Yoni Pervolarakis
Promotor: Prof. Dr. Peter Vrancx
Prof. Dr. Ann Nowe
Juni 2016
Faculty of Science and Bio-Engineering Sciences
Department of Computer Science
Unsupervised Feature Extraction for
Reinforcement Learning
Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in de Ingenieurswetenschappen: Computerwetenschappen
Yoni Pervolarakis
Promotor: Prof. Dr. Peter Vrancx
Prof. Dr. Ann Nowe
June 2016
Abstract
When working with high-dimensional data, chances are that most of the features
are not important for a specific problem. Different possibilities exist to
eliminate those features and potentially find better ones: feature extraction
transforms the original input features into a new, lower-dimensional feature
set, while feature selection keeps only the features that are more important
than the others. Both can be done in a supervised or an unsupervised manner.
In this thesis, we investigate whether autoencoders can be used as an
unsupervised feature extraction method on data that is not necessarily
interpretable. The data consists of RAM states, which are a black box to us
since we cannot interpret them. The autoencoders receive a high-dimensional
feature set and transform it into a lower-dimensional one. These new features
are then tested in a Reinforcement Learning environment, where an agent uses
them and tries to learn from them. The results are compared to a manual
feature selection method and to using no feature selection at all.
Acknowledgements
First and foremost I would like to thank Prof. Dr. Peter Vrancx for helping
me find a subject I am passionate about, for taking the time for weekly updates
and for all his suggestions and numerous conversations on how this subject
could be tackled.
Secondly, I would like to thank Prof. Dr. Ann Nowe for piquing my interest in
the Artificial Intelligence master's programme when I took her course in my
first year at the Vrije Universiteit Brussel.
Finally, I would like to thank my mother for supporting me in pursuing my
studies at university level and my girlfriend for her endless support.
Contents
1 Introduction
1.1 Research Question
2 Machine Learning
2.1 Supervised learning
2.1.1 Classification
2.1.2 Regression
2.2 Unsupervised learning
2.3 Underfitting and overfitting
2.4 Bias - Variance
2.5 Ensemble methods
2.5.1 Bagging
2.5.2 Boosting
2.6 Curse of dimensionality
2.7 Evaluating models
2.7.1 Cross validation
3 Artificial Neural Networks
3.1 Perceptrons
3.2 Training perceptrons
3.3 Multilayer perceptron
3.4 Activation functions
3.4.1 Sigmoid
3.4.2 Hyperbolic tangent
3.4.3 Rectified Linear Unit
3.4.4 Which is better?
3.5 Tips and tricks
3.6 Backpropagation
3.7 Autoencoders
3.8 Conclusion
4 Reinforcement Learning
4.1 The setting
4.2 Rewards
4.3 Markov Decision Process
4.4 Value functions
4.5 Action Selection
4.6 Incrementing Q-values
4.7 Monte Carlo & Dynamic Programming
4.8 Temporal Difference
4.8.1 Q-Learning
4.8.2 SARSA
4.9 Eligibility traces
4.10 Function approximation
5 Experiments and results
5.1 ALE
5.2 Space Invaders
5.3 Reconstruction
5.4 Flow of experiments
5.5 Manual features and basic RAM
5.6 Difference between bits and bytes
5.7 Comparing different activation functions
5.8 Initializing Q-values
5.9 Pretraining and extracting other layers
5.10 Combination of RAM and layer
5.11 Visualizing high dimensional data
6 Conclusions
6.1 Future work
Appendices
A Extended graphs and tables
7 Bibliography
List of Figures
1 Architecture of data processing
2 Example of a decision tree
3 Classification
4 Regression
5 Data of two features
6 k-means clustering
7 Unsupervised learning: reduction of dimensions
7a MNIST example of the number 2
7b MNIST reduction of dimensions
8 Difference between under and overfitting
9 Dartboard analogy from (Sammut & Webb, 2011)
10 Bias Variance trade-off
11 Random Forest
12 Searching in different dimensions
12a 1D space
12b 2D space
12c 3D space
13 Example of a perceptron
14 Bitwise operations
14a AND operator
14b OR operator
14c XOR operator
15 XOR with decision boundaries by learnt MLP
16 Multilayer perceptron
17 Other activation functions: linear and step function
18 Sigmoid activation function
19 Hyperbolic tangent activation function
20 ReLU activation function
21 Example of an autoencoder
22 A Skinner's Box from (Skinner, 1938)
23 Agent Environment setting
24 Another view of the agent environment setting
25 Mountain car; image from (RL-Library, n.d.)
26 Pole Balancing; image from (Anji, n.d.)
27 Maze world
28 Eligibility trace; image from (Sutton & Barto, 1998)
29 Replacing traces; image from (Sutton & Barto, 1998)
30 Coarse coding; image from (Sutton & Barto, 1998)
31 The difference between RAM and Frames
31a RAM
31b Frames
32 Space Invaders screen
33 MSE of autoencoder with 128 bits input
34 MSE Autoencoder from 1024 bits input
35 Difference RAM and RAM with AND
36 Autoencoders on 128 bytes
37 Autoencoders on 1024 bytes
38 Q = −1
39 Q = 1
40 Extraction of a layer other than the bottleneck
41 Pretraining with extraction of layer 512
42 Pretraining with extraction to a hidden layer of 4 nodes
43 Pretraining with extraction of layer 512 with dropout
44 Pretraining with extraction of layer 512 with dropout
45 Combining the original layer with the encoded version
46 t-SNE
47 Linear activation function on an autoencoder
48 Sigmoid activation function on an autoencoder
49 ReLU activation function on an autoencoder
50 Pretraining with extraction of layer 512
51 Combining the original layer with the encoded version
List of Tables
1 Classification of animals
2 Predicting the price of a house
3 V∗(s)
4 π∗(s)
5 Gridworld Example
6 Comparing different activation functions
7 P-values of the Mann-Whitney U test
8 The difference between setting different Q-values
9 Training to a specific layer and extracting a chosen layer
List of Algorithms
1 Q-Learning
2 SARSA
3 SARSA(λ)
4 Q-Learning(λ)
Chapter 1 Introduction
Artificial Intelligence (AI) is a field of computer science that covers a wide
range of topics such as Machine Learning, Reinforcement Learning and a newly
rising topic, Deep Learning. AI is far more a part of daily life now than it
was two decades ago. Take for example a robotic vacuum cleaner that knows when
to clean the house, when to return to the charging station and even where to
pick up again after recharging. More than ten years ago, vacuum cleaner robots
were not seen as AI because they simply performed random walks: if a random
walk through a house lasts long enough, the whole house is eventually cleaned.
With the new algorithms available, the robot can map the house to vacuum
efficiently and work out a detour if an object is suddenly in the way. The
only way to gather all this data is to perceive as many features as possible.
Another example is the new generation of smart thermostats, such as the Nest
thermostat from Google-owned Nest Labs or the ATAG ONE thermostat. These smart
thermostats know when the house is empty and when the owners go to work and
come back. By learning the behaviour of the owners, the thermostat adapts
automatically so that the heating is turned up just before the owners come
home and turned down after they leave for work or go to sleep. This can
ultimately have a great impact on energy consumption.
All these new domestic devices make daily life easier and seem to behave
naturally. Under the hood there is often a complicated AI that uses many
features: measurements of sensory inputs such as velocity, battery usage, an
IR detector or a thermometer. These features can be very specific and
comprehensive, and can in general consist of thousands or millions of
different inputs. Not all of them are equally important; this depends on the
task that must be completed.
Going back to the example of the robotic vacuum cleaner, features like the
texture of the floor will have an impact on the duration of the task:
vacuuming a carpet is harder than vacuuming a concrete floor. Features like
the temperature outside will have little to no impact. It is therefore of the
utmost importance to select only the features that matter for the task at
hand. For simple tasks with few features manual feature selection is feasible,
but when millions of features are in play it is not. DNA microarrays, for
example, store an enormous number of features. Manually selecting which
features are important for some task is a horrendous job, not only because the
person who selects these features needs knowledge about the task and the
features, but also because features can seem unimportant in isolation yet have
a strong influence on the result when combined. Feature extraction is one of
the key concerns in Machine Learning. Many problems may arise when using many
features, such as the curse of dimensionality, overfitting and a longer
training time with a much larger chance of getting stuck in local minima. When
using too many features there is also a possibility that many of them are
redundant and contribute nothing to, for example, a classification. Many
feature selection or extraction methods rely on supervision. There are
different feature selection methods such as entropy, which is sometimes used
in Decision Trees, correlation techniques that find which features are highly
correlated and thus useful for a certain task, and dimension reduction
techniques such as PCA, which is a linear transformation of the data. All of
these techniques are linear or need some form of supervision to set them up.
One supervised feature extraction technique is template matching, where the
similarities, or equivalently the dissimilarities, between the input data and
the labelled data are measured and used for classification. This method is
often used in Optical Character Recognition (OCR) software (Trier, Jain, &
Taxt, 1996). Researchers also combined deep learning with ImageNet, an online
public database with more than 14 million manually labelled images in roughly
21,000 categories, for classification. By using a deep convolutional neural
network they were able to classify those images; the network consisted of more
than 650,000 neurons with 60 million parameters (Krizhevsky, Sutskever, &
Hinton, 2012). These neural networks perform supervised feature extraction
because their layers can learn abstractions of the raw inputs, for example
from pixels to edges to objects. Feature learning is also used in regression
(Geoffrey E. Hinton & Salakhutdinov, 2008), where unlabelled data is used to
learn a good covariance kernel. Autoencoders (Section 3.7) can be used to find
features by reducing the dimensionality and extracting those compressed
features (Geoffrey E Hinton & Salakhutdinov, 2006; Ng, 2011).
Solving problems is what keeps AI an interesting field. Researchers in the
field of AI have a particular interest in games because they represent
problems that provide a challenging search space but still have a clear set of
rules, and because AI performance can be directly compared to human
performance. The program Chinook (Schaeffer et al., 1992) was one of the first
AI programs to play checkers at championship level and has beaten expert
players by using heuristics and search trees. The engine Deep Blue is a chess
AI that combines databases of game data with parallel search (Campbell, Hoane,
& Hsu, 2002) and has defeated the reigning world chess champion. Another
example is TD-Gammon, which reached the level of the best backgammon players
by using neural networks and TD(λ). It achieved this by repeatedly playing
against itself (Tesauro, 1994) and thereby training itself.
One example that shows how popular Artificial Intelligence, and Deep Learning
in particular, has become is Go. Google DeepMind has succeeded in defeating
one of the world's top Go players (Silver et al., 2016). Go is a board game
with relatively simple rules: players take turns placing white or black stones
on the board. Nevertheless, Go is one of the hardest games for an AI to learn,
because there are more possible board positions than there are atoms in the
observable universe. Traditional AI algorithms build trees of all possible
moves and settings and look for the move that gives the agent the highest
chance of winning. Because of the number of choices and options in Go this is
simply not feasible. Google DeepMind first trained neural networks on recorded
moves from top Go players and tried to predict them, and then used
Reinforcement Learning to let these neural networks play against themselves
and learn new moves. Finally, they used Monte Carlo tree search to estimate
the value of a state instead of browsing through the whole tree.
Google DeepMind also created DQN, which combines deep neural networks with
reinforcement learning and experience replay (Mnih et al., 2015) and has
succeeded in beating human players on different Atari 2600 games. Deep
Learning is not only popular for classification and regression tasks but also
in the field of Natural Language Processing, where a deep network can return
tags, semantic roles and even semantic similarity given a sentence (Collobert
& Weston, 2008). In this thesis we will consider the problem of applying
machine learning methods to computer games, by using autoencoders as a feature
extraction method.
1.1 Research Question
In this thesis we will develop automatic feature extraction methods that can
be used in combination with Reinforcement Learning (RL). This is an important
problem as the performance of an RL agent is strongly dependent on the
representation used for learning. Selecting good features is challenging as it
requires knowledge of the problem domain and the task to be solved. This
thesis will investigate the use of unsupervised learning methods to replace
manual feature selection. A current example is the blackbox challenge¹, where
the contestants receive a dataset that humans cannot interpret. At every time
step the agent perceives a new state and a set of actions it can take. The
outcomes of these actions can be stochastic and rewards may be delayed. The
challenge was designed in such a way that contestants do not know how to
interpret the data, so they cannot perform manual feature selection. The data
is a black box.
We will consider the problem of learning to play Atari games using the RAM
game state as input. As humans we cannot interpret the RAM state, so the step
of manual feature selection is skipped and replaced by unsupervised feature
extraction via autoencoders. Figure 1 shows the usual case when dealing with
too many dimensions. The idea is to replace the middle box, manual feature
selection, by an unsupervised feature extraction method. The autoencoders will
be trained with different settings and different levels of dimension
reduction. The new features will then be used by an RL agent that learns with
SARSA(λ) and plays Space Invaders on an Atari 2600 emulator. By using the game
we can see how well these new features perform in comparison with manual
feature selection.
¹ http://blackboxchallenge.com
Figure 1: Current infeasible setting when dealing with too many dimensions
This thesis first covers the background of Machine Learning (Chapter 2),
Artificial Neural Networks (Chapter 3) and Reinforcement Learning (Chapter 4).
These chapters are followed by the experiments (Chapter 5), and the last
chapter contains the final conclusions together with some possibilities for
future work (Chapter 6).
Chapter 2 Machine Learning
The term Machine Learning is broad and covers many subfields. Giving a single
definition is difficult and many different definitions exist. In this thesis
the definition of Tom Mitchell will be adopted. He describes Machine Learning
as:
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E. (Mitchell, 1997)
Applying this definition to this thesis gives a better understanding. This
thesis researches the unsupervised training of autoencoders and, through it,
unsupervised feature extraction. The new features are then used to train an
agent with Reinforcement Learning. The research question can therefore be
divided into two parts.
• The unsupervised autoencoder training (Section 3.7), where the task T is to
learn a form of compression and thereby feature extraction. The experience E
consists of the RAM or frame states received from our gameplay, and the
performance P is the Mean Square Error (MSE, Section 2.1.2), which determines
how good the reconstruction of the RAM or frame states is and thus how good
an autoencoder is.
• Reinforcement Learning (Chapter 4) of Atari games, where the task T is to
learn to play a game and obtain a score as high as possible. The experience E
consists of the interactions with the game and the results that come from
them. The performance P is measured by the score itself and the total reward.
Machine Learning is highly relevant for many current problems, for example
detecting and classifying cancer (Cruz & Wishart, 2006), self-driving cars
(Google, n.d.) and speech recognition such as Siri from Apple.
There are three major settings for learning some task T: supervised learning,
unsupervised learning and sequential decision making.
2.1 Supervised learning
Supervised learning is the task of receiving some input data X together with
output data, or labels, y and creating a function y = f(x) that maps the input
values to the output values. There are two kinds of supervised learning:
classification and regression. Classification assigns examples to a small,
discrete number of groups, for example the species of an animal. In regression
problems, on the contrary, the number of possible outputs can be very large or
even continuous.
Supervised learning searches for a function h(x), also known as the
hypothesis, that returns an estimated output value given the data x. An
example is the linear hypothesis:
$$h(\vec{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
The linear hypothesis has parameters θ that can be optimized through learning.
It returns a value $h(\vec{x})$ that can be compared to our labelled data, y.
By using different techniques, which will be explained later on, the θ-values
can be tweaked so that $h(\vec{x})$ comes as close as possible to y.
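As an illustration of the linear hypothesis, the following is a minimal
sketch assuming NumPy is available. The function name `linear_hypothesis` and
the parameter values are hypothetical and chosen only for this example.

```python
import numpy as np


def linear_hypothesis(theta, x):
    """Return h(x) = theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    theta = np.asarray(theta, dtype=float)
    x = np.asarray(x, dtype=float)
    # theta[0] is the intercept; the remaining entries weight the features.
    return float(theta[0] + theta[1:] @ x)


# Hypothetical parameters and input, chosen only for illustration.
theta = [1.0, 2.0, -0.5]
x = [3.0, 4.0]
print(linear_hypothesis(theta, x))  # 1 + 2*3 - 0.5*4 = 5.0
```

Learning then amounts to changing `theta` so that the returned value moves
closer to the label y of each example.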
Below we will discuss two classes of supervised learning problems:
classification and regression.
2.1.1 Classification
A classification problem is a problem where the data is classified, or
labelled, into different classes. Take for example input data consisting of
features of animals (Table 1): the number of feet, the color and whether the
animal has wings or not. The classification is then the species of the animal,
in this case a dog, a duck or a spider.
      Feet  Color  Wings  y
x1    4     Brown  No     Dog
x2    2     White  Yes    Duck
x3    8     Black  No     Spider
...   ...   ...    ...    ...
x100  8     Brown  No     ?
Table 1: Classification of animals
The classifier will try to determine a decision boundary between the dogs,
ducks and spiders. Take the last example in Table 1, an animal with 8 feet, a
brown color and no wings. Since there is no label, the classifier must
determine what animal x100 is. For a human it is clear that, with only three
possible animals, the unknown animal must be a spider, since the only animal
with 8 feet is a spider. The classifier, however, cannot determine this so
easily.
Figure 2: Example of a decision tree
One example of a supervised learning method is the decision tree. A decision
tree consists of nodes that each ask a question. The answer leads either to
another question or to a leaf, and a leaf represents the classification of an
example. Much depends on which question is asked first: asking about the most
informative feature first has the most potential to generate a shorter and
more precise decision tree. The most informative feature can be found by
using, for example, entropy and information gain. Figure 2 shows an example
of a decision tree for the input data of Table 1. This tree could be shorter
by removing the color question after the answer yes to the wings question,
because in our example an animal with wings is automatically a duck.
Different adaptations of decision trees exist that optimize trees by, for
example, pruning (Quinlan, 1987).
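The choice of the first question can be sketched with entropy and information
gain. The code below is a minimal illustration, not the procedure used in this
thesis: the six animals are a hypothetical dataset in the spirit of Table 1,
and the helper names `entropy` and `information_gain` are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical examples in the spirit of Table 1.
rows = [
    {"feet": 4, "color": "brown", "wings": "no"},
    {"feet": 4, "color": "white", "wings": "no"},
    {"feet": 2, "color": "white", "wings": "yes"},
    {"feet": 2, "color": "brown", "wings": "yes"},
    {"feet": 8, "color": "black", "wings": "no"},
    {"feet": 8, "color": "brown", "wings": "no"},
]
labels = ["dog", "dog", "duck", "duck", "spider", "spider"]

gains = {f: information_gain(rows, labels, f) for f in ("feet", "color", "wings")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

On this toy data the number of feet separates the three classes completely,
so it has the highest information gain and would be asked first.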
Figure 3 shows a classification with two features x1 and x2. The red points
belong to a certain Class 1 and the blue points to Class 2. The classifier
tries to find a decision boundary in the input space where all data points, or
at least as many points as possible, belong to the correct class. In an ideal
situation the decision boundary can separate the classes exactly. But in real
world data, this would be highly unlikely since data is often noisy and/or cor-
rupted. The classifier needs to find a way where the cost of misclassification
is the lowest.
[Plot: x1 (horizontal axis) against x2 (vertical axis); legend: decision
boundary, Class 1, Class 2]
Figure 3: Classification of 2 features into 2 classes, separated by a decision
boundary
2.1.2 Regression
In regression problems the data cannot be divided into classes; instead there
is a continuous target value. Take for example the prediction of house prices
from features like the number of bedrooms, kitchens, gardens and garages
(Table 2). The output cannot be expressed as a class label, and predicting the
output of x100 is not that simple.
      Bedroom  Kitchen  Garden  Garage  Bathroom  y
x1    1        1        0       1       1         € 153.314
x2    3        1        1       2       2         € 317.135
x3    6        2        1       3       4         € 683.562
...   ...      ...      ...     ...     ...       ...
x100  2        1        1       0       1         € ?
Table 2: Predicting the price of a house
The question then remains how a regression model predicts values. A linear
model fits a line through the data points, which in the example above are the
numbers of rooms of each type. This line is called the regression line.
Figure 4 shows an example where the blue dots are the input data and the green
line is a regression line. Multiple regression lines are possible but not all
of them are equally good. A well-known simple linear regression function is
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \dots, n$, where n is
the number of data entries. The ε-value, or disturbance term, represents the
noise in the data values.
Supervised learning tries to create a function h(x) that predicts y as well as
possible by optimizing the parameter values. To get an idea of how good a
model is, a cost function is needed. A commonly used cost function for
regression is the Mean Square Error, or MSE.
$$\mathrm{MSE} = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y(x_i) \right)^2$$
The MSE takes the difference between the predicted output h(x) and the true
value y. This difference is squared so that its sign does not matter. The
additional factor 2 in the denominator cancels out when differentiating, which
will be used in the neural networks. The lower the error, the better the
hypothesis fits the data.
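The following minimal sketch, assuming NumPy, shows the MSE with the factor
1/(2m) and how its gradient can be used to fit a regression line with
gradient descent. The toy data, the learning rate and the function names are
assumptions made for this example only.

```python
import numpy as np


def mse(theta, X, y):
    """Cost 1/(2m) * sum((h(x_i) - y_i)^2) for a linear hypothesis."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))


def mse_gradient(theta, X, y):
    """Gradient of the cost; the factor 2 from the square cancels the 1/2."""
    return X.T @ (X @ theta - y) / len(y)


# Toy data on the line y = 1 + 2x (an assumption for this example).
x = np.linspace(0.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])  # first column multiplies theta_0
y = 1.0 + 2.0 * x

theta = np.zeros(2)
initial_cost = mse(theta, X, y)
learning_rate = 0.01
for _ in range(5000):
    theta -= learning_rate * mse_gradient(theta, X, y)

print(theta, initial_cost, mse(theta, X, y))
```

The fitted parameters approach the intercept and slope of the toy line, and
the cost drops accordingly.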
[Plot: inputs X (horizontal axis) against output y (vertical axis); legend:
input-output vector, prediction]
Figure 4: A regression line between the input values X and the output values y
2.2 Unsupervised learning
Unlike supervised learning, unsupervised learning has no target output y but
only input data X (Figure 5). Because there is no target output, it is the job
of an unsupervised learning model to find a relationship or structure in the
input data. This structure can be used to group data or even to reduce the
number of dimensions.
One example of finding structure in data and grouping it is k-means
clustering (Figure 6), where k clusters are formed. Each cluster has a mean,
also called a centroid. First, k random centroids are placed among the data
points. In each iteration every data point is assigned to the closest
centroid. When all data points are assigned, each centroid is recalculated as
the mean of its points and moved. This is repeated until the centroids no
longer move.
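The iteration described above can be sketched as follows, assuming NumPy.
This is an illustration rather than the implementation behind Figure 6: the
function name `k_means`, the choice of k random data points as initial
centroids and the generated two-blob data are all assumptions.

```python
import numpy as np


def k_means(points, initial_centroids, max_iterations=100):
    """Cluster `points` (n x d) starting from `initial_centroids` (k x d)."""
    points = np.asarray(points, dtype=float)
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(max_iterations):
        # Assign every point to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; an empty cluster
        # keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[assignment == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical data: two groups of points around (0, 0) and (10, 10).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])
k = 2
initial = points[rng.choice(len(points), size=k, replace=False)]
centroids, assignment = k_means(points, initial)
print(centroids)
```

Because the result depends on the random starting centroids, k-means is often
run several times and the best clustering is kept.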
[Plot: x1 (horizontal axis) against x2 (vertical axis)]
Figure 5: Data of two features
Another example of unsupervised learning is dimension reduction with
autoencoders (Section 3.7). Autoencoders are a form of artificial neural
network whose target output is equal to their input. By doing so, the
autoencoder learns the identity function, and in its internal representation
it learns to compress the data. MNIST is a database of handwritten digits in
their raw feature form. Each digit can be converted to a 28x28 image and thus
has 784 pixels or dimensions (Figure 7a). Autoencoders can be used to go from
784 dimensions to 2 dimensions and thereby perform dimension reduction. Each
point shown in Figure 7b is the image of a digit like the one in Figure 7a.
These points were reduced from 784 to 2 dimensions and the colors indicate the
class the digit belongs to. It can be seen that the compression maps the same
digits close to each other.
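To make the idea of compressing through a bottleneck concrete, the sketch
below trains a very small autoencoder with NumPy. It is only an illustration
under strong simplifying assumptions: the network is linear (no activation
function), the data is a hypothetical 6-dimensional set generated from 2
hidden factors instead of MNIST, and all names and settings are made up for
this example.

```python
import numpy as np


def train_linear_autoencoder(X, code_size, rng, learning_rate=0.01, steps=2000):
    """Train encoder and decoder weights to reconstruct X from a small code."""
    n_samples, n_features = X.shape
    W_enc = 0.1 * rng.standard_normal((n_features, code_size))
    W_dec = 0.1 * rng.standard_normal((code_size, n_features))
    losses = []
    for _ in range(steps):
        code = X @ W_enc                 # compressed features
        error = code @ W_dec - X         # reconstruction minus input
        losses.append(np.sum(error ** 2) / (2 * n_samples))
        # Gradients of the loss with respect to both weight matrices.
        grad_dec = code.T @ error / n_samples
        grad_enc = X.T @ (error @ W_dec.T) / n_samples
        W_enc -= learning_rate * grad_enc
        W_dec -= learning_rate * grad_dec
    return W_enc, W_dec, losses


# Hypothetical data: 6 observed features that repeat 2 hidden factors.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((200, 2))
X = np.tile(hidden, (1, 3))              # columns: z1, z2, z1, z2, z1, z2

W_enc, W_dec, losses = train_linear_autoencoder(X, code_size=2, rng=rng)
features = X @ W_enc                     # the extracted 2-dimensional features
print(losses[0], losses[-1], features.shape)
```

The target of the network is its own input, so no labels are needed; the
2-dimensional `features` are what would be passed on to a later learning step.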
[Plot: x1 (horizontal axis) against x2 (vertical axis); legend: Cluster 1,
Cluster 2, Cluster 3, Cluster 4, Centroids]
Figure 6: k-means clustering
[Panels: (a) MNIST example of the number 2; (b) MNIST reduction of dimensions]
Figure 7: Unsupervised learning: reduction of dimensions
2.3 Underfitting and overfitting
There are different ways to build models and not every model is good. Take for
example some data points generated by an underlying function that we do not
know. These data points consist of one input feature x and the output y with
some random noise, y = f(x) + ε. A model can then be created to predict this
underlying function. Different models, obtained with linear regression, are
plotted in Figure 8. The blue dots denote the samples, the green line
represents the unknown underlying function and the blue line is the
hypothesis h(x) of our model. The first figure shows a model that is
underfitting: it cannot represent the underlying function at all. The model,
a polynomial of degree 1 or a straight line, is too simple to represent the
underlying function. The second figure shows a model that has learnt the true
underlying function, although without knowing the underlying function it
remains a hypothesis. In this case it is a polynomial of degree 4. The last
figure shows a model that is overfitting: it uses a polynomial of degree 15
and models every training sample too closely. If this model then tries to
predict unseen data it will fail, because it does not generalize over the
dataset but tries to fit it perfectly. Note that neither underfitting nor
overfitting is good.
A good way to test whether a model is good or bad is to divide the data into a
training set and a test set. The model is trained on the training set, and once
training is done it predicts the outputs for the test set. How much the
predicted output differs from the true output of the test set gives a good
indication of the quality of the model. A common measure is the Mean Square
Error: the smaller the error, the better the model fits the data. There is a
difference between the training error and the test error. The training error is
measured while the model is being trained: the model receives an input value,
predicts the output and, if the prediction is wrong, the model is adapted. The
test error is measured when training is done: a new set of data is presented to
the model, the model predicts the outputs and the error expresses how far off
these predictions are.
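The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data (a noisy line y = 3x + 2), the 80/20 split and the helper names `train_test_split`, `fit_line` and `mean_squared_error` are hypothetical and are not the setup used for Figure 8.

```python
import random


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(samples):
    """Least-squares fit of y = a * x + b for a list of (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


# Hypothetical data: y = 3x + 2 plus a little Gaussian noise.
rng = random.Random(1)
data = [(i / 50, 3 * (i / 50) + 2 + rng.gauss(0, 0.1)) for i in range(50)]

train, test = train_test_split(data)
a, b = fit_line(train)  # the model only ever sees the training set

train_error = mean_squared_error([y for _, y in train], [a * x + b for x, _ in train])
test_error = mean_squared_error([y for _, y in test], [a * x + b for x, _ in test])
print("training MSE:", train_error, "test MSE:", test_error)
```

Because the model here has the same form as the function that generated the data, both errors stay small; a model that is too simple or too flexible would show up as a high error on one or both sets.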
Figure 8 presents different models for an unknown underlying function, which
is drawn as the green line. The values predicted by the trained model are drawn
as the blue line, so for any point of a test set the blue line shows which
outcome the model will give. The samples on which the model has been trained
are the blue dots. It can be seen that the first model has a high training
error as well as a high test error. A polynomial of degree 1 cannot represent
the underlying function and can certainly not represent a new test set. The
next model, a polynomial of degree 4, has a very low training and test error.
It fits the training data, and the test set will be predicted fairly well
because the function of the model closely matches the true underlying function.
The last model, a polynomial of degree 15, has a low training error: as can be
seen, it fits the training samples almost perfectly. But it will have a high
test error, as it cannot represent new test data.
(Three panels, axes x and y; legend: Model, True function, Samples. Panel titles from left to right: MSE = 0.37, MSE = 0.04, MSE = 182212904.43)
Figure 8: Difference between under- and overfitting. From left to right: polynomial of degree 1, 4 and 15. Image adapted from sklearn 1
2.4 Bias - Variance
The question remains how the architect of a model can detect whether the model
is under- or overfitting. This can be seen by determining the bias and the
variance. Both are defined in terms of expected values. A random variable
associates numeric values with the different outcomes of an experiment, so its
value can change when the experiment is repeated; the expected value is the
average value over many repetitions. Repeating an experiment to get average
results is thus important. Bias is then the difference between the expected
value of the predicted outcome and the real target outcome.
$$\mathrm{Bias}(\hat{y}) = E[\hat{y}] - y$$
Here $\hat{y}$ denotes the predicted outcome and $y$ the real target outcome.
Bias measures how far off a model is from the correct output of the underlying,
unknown, function.
Variance measures the variability of the model's predictions with respect to
their expected value.
$$\mathrm{Var}(\hat{y}) = E[(\hat{y} - E[\hat{y}])^2]$$
1 http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
The dartboard analogy (Figure 9) gives a more visual idea of what bias
and variance mean. Imagine that someone is throwing darts and that the
bullseye represents a good model. If all the darts have been thrown and they
are spread out, and thus not close to each other, then there is high
variance. Bias on the other hand is the distance between the average position
of the darts and the bullseye. In a case of low bias and low variance all darts
are close to each other and directly on or close to the bullseye itself.
Figure 9: Dartboard analogy from (Sammut & Webb, 2011)
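The mean squared error, MSE, is the expected squared difference between the
prediction $\hat{y}$ of the model and the target $y$.
$$\mathrm{MSE}(\hat{y}) = E[(\hat{y} - y)^2]$$
The MSE can then be split into three terms, which is known as the bias-variance
decomposition.
$$\mathrm{MSE}(\hat{y}) = (E[\hat{y}] - y)^2 + E[(\hat{y} - E[\hat{y}])^2] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Var} + \mathrm{Error}$$
The last term, the irreducible error, represents the noise in the data.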
Figure 10: Bias Variance trade-off
When applying bias and variance to under- and overfitting, it can be
seen that underfitting corresponds to a bias that is too high: the model is too
simple and cannot learn the underlying function. Overfitting gives a high
variance, because the model is too complex and fits the noise instead of the
underlying function (Figure 10).
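To make the decomposition concrete, the sketch below estimates bias, variance and MSE from repeated predictions of a single target value. The two "models" are hypothetical stand-ins that simply draw predictions from a Gaussian; they are assumptions chosen to mimic an underfitting and an overfitting model. Because the target in this sketch is fixed and noise-free, the irreducible error term is zero and the MSE equals the squared bias plus the variance.

```python
import random


def bias_variance_mse(predictions, target):
    """Estimate bias, variance and MSE of repeated predictions of one target."""
    n = len(predictions)
    expected = sum(predictions) / n                      # E[y_hat]
    bias = expected - target                             # E[y_hat] - y
    variance = sum((p - expected) ** 2 for p in predictions) / n
    mse = sum((p - target) ** 2 for p in predictions) / n
    return bias, variance, mse


rng = random.Random(0)
target = 1.0
# Hypothetical "simple" model: systematically too low, but very stable.
simple_model = [0.6 + rng.gauss(0, 0.01) for _ in range(1000)]
# Hypothetical "complex" model: right on average, but very unstable.
complex_model = [1.0 + rng.gauss(0, 0.5) for _ in range(1000)]

for name, predictions in [("simple", simple_model), ("complex", complex_model)]:
    bias, variance, mse = bias_variance_mse(predictions, target)
    print(name, "bias^2 =", bias ** 2, "variance =", variance, "MSE =", mse)
```

The simple model is dominated by the squared bias and the complex model by the variance, which matches the underfitting and overfitting cases described above.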
2.5 Ensemble methods
One way to get a better performance out of a model is using ensemble methods.
These methods combine different models into one model that is more accurate
than a single model.
2.5.1 Bagging
Bootstrap aggregating, also known as bagging, is mostly used for reducing
the variance of a model. Bagging belongs to the class of averaging methods,
since it averages the results of several models to obtain a combined result.
It starts by taking random subsets of the training data. Because the models are
trained on different subsets, they will be different and predict differently.
Bagging then combines all separate models into one final model (Breiman, 1996).
An example of a bagging method is tree bagging, or its extension, the random
forest (Figure 11). It starts with training B trees, for example decision
trees. For each tree, the training samples are drawn uniformly at random, with
replacement, from the pool of training data. After all B trees are trained, the
models are combined into an ensemble by averaging, $\frac{1}{B}\sum_{b=1}^{B} f_b(x)$,
or by voting, where the majority decides. This decreases the variance without
increasing the bias. Random forests additionally select a random subset of the
features while learning the trees.
Figure 11: Random Forest
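The bagging procedure can be sketched as follows. Note the assumptions: to keep the example short and dependency-free, the base learner is a least-squares line instead of a decision tree, and the data set, the value B = 25 and all helper names are hypothetical. The bootstrap sampling and the averaging step $\frac{1}{B}\sum_b f_b(x)$ are the parts that correspond to the text.

```python
import random
from collections import Counter


def fit_line(samples):
    """Base learner: least-squares fit of y = a * x + b."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def bootstrap_sample(samples, rng):
    """Draw len(samples) points uniformly at random, with replacement."""
    return [rng.choice(samples) for _ in samples]


def bagged_predict(models, x):
    """Regression: average the predictions f_b(x) of all B models."""
    return sum(a * x + b for a, b in models) / len(models)


def majority_vote(labels):
    """Classification: return the label predicted by most models."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(0)
# Hypothetical training data: y = 2x + 1 plus Gaussian noise.
data = [(i / 50, 2 * (i / 50) + 1 + rng.gauss(0, 0.1)) for i in range(50)]

B = 25
models = [fit_line(bootstrap_sample(data, rng)) for _ in range(B)]
print("bagged prediction at x = 0.5:", bagged_predict(models, 0.5))
print("majority vote:", majority_vote(["A", "B", "A"]))
```

A random forest would additionally restrict each tree to a random subset of the features; that step is left out of this sketch.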
2.5.2 Boosting
Models can be divided into strong learners and weak learners. Weak learners
are defined as being only slightly better than a random prediction, which on
its own is not good enough. The idea of boosting is to combine multiple weak
learners to create a single strong model. The most popular boosting algorithm
is Adaptive Boosting, or AdaBoost (Freund & Schapire, 1997). AdaBoost combines
the results of weak learners into a weighted sum or majority rule.
2.6 Curse of dimensionality
One might think that the more features the data has, the better a learner or
model will perform. This is not true. Imagine that a €1 coin is dropped on a
straight line of 100 meters. The coin will be easily found. If the coin is
dropped on a surface of 100 x 100 meters, which is 10,000 m², finding it is
still possible but not so easy anymore. If the coin is dropped in a 3D space of
100 x 100 x 100 meters, which is 1,000,000 m³, it is more difficult than before
(Figure 12). This analogy only serves to illustrate the difficulty of finding a
coin in a multidimensional space. In machine learning the dimensionality can go
up to tens of thousands of dimensions, for example for DNA sequences. The
higher the dimension goes, the sparser the data becomes. One way to reduce
dimensions is using feature selection or feature extraction, for example with
autoencoders.
(a) 1D space (b) 2D space (c) 3D space
Figure 12: Searching in different dimensions
2.7 Evaluating models
After models are created, there is a need to evaluate them. Often, if sufficient
data is available, 70 or 80% of the data is taken at random and used to train
a model. The remaining data is used to predict and to see how far off the model
is. This method is not flawless: there is a chance that the test set happens to
contain mostly outliers, which would suggest that the model is bad when in fact
it is quite good. Therefore different methods have been invented to get an
average estimate of how good a model is.
2.7.1 Cross validation
The first method is cross validation. The goal is to see how effective a model
is. There are different cross validation methods, such as k-fold cross
validation and the holdout method. These methods are classified as
non-exhaustive. The first method, k-fold cross validation, splits the data in k
folds or subsets. In each of the k iterations, one subset is taken as the test
set and the other subsets are used as the training set. Afterwards the results
of all k iterations are averaged to estimate the error. The holdout method is
similar to k-fold cross validation with k = 2: each data point is assigned, at
random, to either the training set or the test set.
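As an illustration, k-fold cross validation can be sketched as below. The helper names `k_fold_indices` and `cross_validate`, the mean-predicting model and the toy data are assumptions made for this sketch; any model with a fit and an error function could be plugged in.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, k, fit, error):
    """Train and test k times, then average the k test errors."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx])
        errors.append(error(model, [samples[i] for i in test_idx]))
    return sum(errors) / k


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)


def mse(model, test):
    return sum((y - model) ** 2 for _, y in test) / len(test)


data = [(x, 5.0) for x in range(10)]  # toy data with a constant target
print(cross_validate(data, k=5, fit=fit_mean, error=mse))
```

Every data point ends up in the test set exactly once, so no single unlucky split decides the estimate.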
Chapter 3 Artificial Neural Networks
Artificial Neural Networks (ANN) are machine learning models inspired by
the human brain. The brain consists of approximately $10^{11}$ neurons; a
neuron is a cell that transmits information to other neurons. The connection
between two neurons is called a synapse, and there are approximately $10^{14}$
synapses. These neurons with their synapses can make decisions based on
their input, for example a human can recognize family members immediately
when seeing them. This is exactly why researchers wanted to create
artificial neurons with a mathematical model for handling information.
3.1 Perceptrons
One type of Artificial Neural Network is the perceptron (Rosenblatt, 1958),
which is a binary classifier (Figure 13).
Figure 13: Example of a perceptron
Perceptrons take real-valued inputs and construct one single binary
output. The output is calculated from a linear combination of real-valued
weights (w) and inputs (x), which results in a value that, depending on a
certain threshold, is mapped to zero or one. This can be written as the
following function:
$$f(\vec{x}) = \begin{cases} 0 & \text{if } \vec{w} \cdot \vec{x} + b \le 0 \\ 1 & \text{if } \vec{w} \cdot \vec{x} + b > 0 \end{cases}$$
where $\vec{w} \cdot \vec{x}$ is the dot product of the two vectors. Note that
the bias b can be folded into this vector notation by setting $x_0$ equal to 1.
The bias influences how easy it is to get a 0 or a 1 as output. For example, if
the bias is negative, the dot product of the vectors $\vec{x}$ and $\vec{w}$
must have a value greater than the absolute value of the bias to get over the
threshold. The bias can thus adjust the decision boundary. Note that for
perceptrons only a linear decision boundary is possible. Bitwise operations,
like AND and OR, can be implemented by one single perceptron by adapting the
weights or the bias. Figures 14a and 14b show an example of how the perceptron
can separate the outputs of the bitwise operations AND and OR. Both axes show
the values a bit can take, the color denotes whether the result of the
operation is 0 or 1, and the black line is a decision boundary.
Not all operations can be represented by one perceptron. XOR, for example, is
not linearly separable (see Figure 14c) and thus needs more layers of
perceptrons to solve this problem.
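A single perceptron for AND and OR can be written down directly. The weights and biases below are hand-picked for illustration and are an assumption of this sketch; many other values give the same decision boundaries.

```python
def perceptron(weights, bias, inputs):
    """Return 1 if w . x + b > 0, otherwise 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def bit_and(x1, x2):
    # Both inputs are needed to get over the threshold: 1 + 1 - 1.5 > 0.
    return perceptron([1.0, 1.0], -1.5, [x1, x2])


def bit_or(x1, x2):
    # One active input is enough: 1 - 0.5 > 0.
    return perceptron([1.0, 1.0], -0.5, [x1, x2])


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", bit_and(x1, x2), "OR:", bit_or(x1, x2))
```

Only the bias differs between the two operations, which shows how the bias shifts the decision boundary.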
3.2 Training perceptrons
The difficult part of perceptrons is setting the weights in such a way that the
perceptron’s output is correct. There are several ways to learn the weights.
The first way is called the perceptron training rule, where all weights are
initialized at random. The next step is iterating over all the training
examples, and whenever the classification is wrong the weights are updated by
the following rule:
$$w_j = w_j + \Delta w_j$$
where
$$\Delta w_j = \eta (t - o) x_j$$
(Three panels, axes: Bit 1 and Bit 2; legend: 1, 0, Boundary)
(a) AND operator (b) OR operator (c) XOR operator
Figure 14: Bitwise operations
This is repeated until all training examples are classified correctly. The rule
takes the difference between the target output t and the perceptron’s output
o, which is then multiplied by a learning rate η and the input $x_j$. It can be
seen that whenever the perceptron’s output is equal to the correct output,
the update is equal to 0 and thus no weights are changed. It has been
proven that the perceptron training rule will converge (Minsky & Papert,
1969) if the learning rate is sufficiently small and the data is linearly
separable.
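The perceptron training rule can be sketched as follows for the AND operation. The learning rate, the random initialization range, the epoch limit and the function names are assumptions of this sketch; the bias is handled by prepending $x_0 = 1$ to every input.

```python
import random


def predict(weights, x):
    """Perceptron output o for input x, where x[0] = 1 carries the bias."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """Apply w_j <- w_j + eta * (t - o) * x_j until nothing is misclassified."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:  # when o == t the update would be zero anyway
                mistakes += 1
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights


# The AND operation, with x0 = 1 prepended to each input vector.
and_examples = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
weights = train_perceptron(and_examples)
print("learned weights:", weights)
print("outputs:", [predict(weights, x) for x, _ in and_examples])
```

Running the same function on the XOR examples would never reach zero mistakes and would simply stop at the epoch limit, since XOR is not linearly separable.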
It is often unknown whether the data is linearly separable. If it is not, the
delta rule uses gradient descent to search for the weights that give the best
approximation of all target outputs. The idea is to minimize the following
error:
$$E = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where E is the squared error and D the set of all training examples.
Note that the $\frac{1}{2}$ is used to cancel out the exponent when
differentiating. The error is always non-negative because of the square. If the
error is small, the perceptron’s output represents the target output well. To
find the minimum of E, the derivative with respect to each of the weights is
taken.
$$\nabla E = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
The gradient gives the direction of the steepest increase of E. To find the
steepest decrease, a negative sign is added. The learning rule then becomes:
$$\vec{w} = \vec{w} + \Delta \vec{w}$$
where
$$\Delta \vec{w} = -\eta \nabla E(\vec{w})$$
For each individual weight this can be written out as
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id})$$
$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$$
where $x_{id}$ is the i-th input of training example d and the output is the
unthresholded linear combination $o_d = \vec{w} \cdot \vec{x}_d$. The η
determines how big the step size is in the gradient descent search.
Another variation is stochastic, or incremental, gradient descent, where the
update is calculated for each training example separately instead of summing
over all examples.
$$\Delta w_i = \eta (t - o) x_i$$
Standard (or batch) gradient descent thus goes through all examples before
updating the weights, while stochastic gradient descent takes one example and
updates the weights based on that example. Batch gradient descent becomes very
costly when the number of training samples is large. Stochastic gradient
descent usually improves much faster and often converges sooner, but its final
error is typically not as low as that of batch gradient descent.
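The difference between the two update schemes can be sketched for a single linear unit. The data (noise-free samples of t = 1 + 2x), the learning rate and the number of epochs are assumptions chosen so that both variants converge; the function names are hypothetical.

```python
def output(weights, x):
    """Linear unit: o = w . x, where x[0] = 1 carries the bias."""
    return sum(w * xi for w, xi in zip(weights, x))


def squared_error(weights, examples):
    """E = 1/2 * sum over all examples d of (t_d - o_d)^2."""
    return 0.5 * sum((t - output(weights, x)) ** 2 for x, t in examples)


def batch_gradient_descent(examples, eta=0.05, epochs=2000):
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        # Sum the contribution of every example, then update once.
        delta = [0.0] * len(weights)
        for x, t in examples:
            err = t - output(weights, x)
            delta = [d + eta * err * xi for d, xi in zip(delta, x)]
        weights = [w + d for w, d in zip(weights, delta)]
    return weights


def stochastic_gradient_descent(examples, eta=0.05, epochs=2000):
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            # Update immediately after every single example.
            err = t - output(weights, x)
            weights = [w + eta * err * xi for w, xi in zip(weights, x)]
    return weights


# Hypothetical noise-free data generated by t = 1 + 2 * x.
examples = [([1.0, i / 10], 1 + 2 * (i / 10)) for i in range(11)]

w_batch = batch_gradient_descent(examples)
w_sgd = stochastic_gradient_descent(examples)
print("batch:     ", w_batch, "E =", squared_error(w_batch, examples))
print("stochastic:", w_sgd, "E =", squared_error(w_sgd, examples))
```

On this small, noise-free data set both variants end up at practically the same weights. The trade-off described above only becomes visible with large or noisy data sets, where one pass of the batch variant is expensive and the stochastic variant keeps fluctuating around the minimum.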
3.3 Multilayer perceptron
As explained previously, a single perceptron cannot represent data that is not
linearly separable, like XOR. Multilayer perceptrons, or MLPs, can represent
this by using multiple layers of perceptrons. This results in, for example, two
different decision boundaries for XOR (Figure 15). The layers of an MLP are
fully connected, and each perceptron outside the input layer has a non-linear
activation function (Figure 16).
(Plot: XOR; axes: Bit 1 and Bit 2; legend: 1, 0, Boundary)
Figure 15: Example of XOR with two decision boundaries learnt by an MLP
Figure 16: Example of a multilayer perceptron with 4 input nodes, 2 hidden
layers with each 5 hidden nodes and 3 output nodes
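The two decision boundaries for XOR can be made explicit with a small network whose weights are set by hand. This is only an illustrative sketch: the weights are assumptions, not weights learnt by an MLP, and a step function is used as the non-linear activation.

```python
def step(z):
    """Step activation: 1 if the weighted sum is positive, otherwise 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    # Hidden layer: two perceptrons, each drawing one decision boundary.
    h_or = step(x1 + x2 - 0.5)        # active when at least one bit is 1
    h_nand = step(-x1 - x2 + 1.5)     # active unless both bits are 1
    # Output layer: AND of the two hidden units.
    return step(h_or + h_nand - 1.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "XOR:", xor_mlp(x1, x2))
```

The input is classified as 1 only when it lies between the two boundaries, which a single perceptron with one boundary cannot express.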
3.4 Activation functions
The activation, ϕ in Figure 13, is a function, possibly non-linear, applied
to the sum of the inputs multiplied by their network weights. For example a
linear neuron, which uses a linear activation function, can set the output on
or off, which means the input belongs to class A or B if there are only two
classes. It thus activates the node or not. The problem with linear neurons is
that using multiple layers of linear neurons will still yield a linear result.
The step function, where the output is 0 or 1 depending on the threshold θ, has
its own limitation: it is not differentiable at the threshold. There is thus a
need for a unit that, given an input, yields an output which is a non-linear
result of its input. The advantage of the activations described in the
following sections is that their functions are all differentiable, which helps
to limit the computational load when training neural networks. The linear and
step function themselves are shown in Figure 17.
(Two panels: Activation function: Linear; Activation function: Step function)
Figure 17: Other activation functions: linear and step function
3.4.1 Sigmoid
The sigmoid unit replaces the hard threshold by a sigmoid function (Figure 18).
This results in a continuous function of its input:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid maps its input to an output between 0 and 1. The derivative of
the sigmoid function is:
$$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \left(1 - \frac{1}{1 + e^{-x}}\right)$$
(Plot: Activation function: Sigmoid)
Figure 18: Sigmoid activation function
3.4.2 Hyperbolic tangent
The same goes for the hyperbolic tangent, or tanh (Figure 19), which maps
the input to an output between -1 and 1.
$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{2x} - 1}{e^{2x} + 1}$$
$$\frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2$$
(Plot: Activation function: Tanh)
Figure 19: Hyperbolic tangent activation function
3.4.3 Rectified Linear Unit
Another, more recently introduced activation is the rectified linear unit
(Nair & Hinton, 2010), or ReLU (Figure 20). This has the advantage that in a
neural network with randomly initialized weights, only about 50% of the hidden
neurons will be activated. This results in a sparse activation. ReLU is not
differentiable at 0, but it can be differentiated at any other point; in
practice the derivative at 0 is simply set to 0. In recent years ReLU has grown
more popular in Deep Learning because networks with many layers learn much
faster with it (Y. LeCun, Bengio, & Hinton, 2015). Networks that use ReLU
without pre-training can also compete with networks that do use pre-training
(Glorot, Bordes, & Bengio, 2011).
$$\mathrm{relu}(x) = \max(0, x)$$
$$\frac{d}{dx}\mathrm{relu}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$
(Plot: Activation function: ReLU)
Figure 20: ReLU activation function
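The three activation functions and their derivatives can be checked numerically. The sketch below implements the formulas from the previous sections and compares each analytic derivative with a central finite difference; the helper `numerical_derivative` and the chosen sample points are assumptions of this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    return max(0.0, x)


def relu_prime(x):
    # Not defined at x = 0; by convention 0 is used there.
    return 1.0 if x > 0 else 0.0


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to check the formulas."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, -0.5, 0.5, 2.0):
    print(
        x,
        "sigmoid:", sigmoid_prime(x), numerical_derivative(sigmoid, x),
        "tanh:", tanh_prime(x), numerical_derivative(math.tanh, x),
        "relu:", relu_prime(x), numerical_derivative(relu, x),
    )
```

The point x = 0 is left out of the check on purpose, because the finite difference of ReLU there falls between the two one-sided slopes.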
3.4.4 Which is better?
The question then remains which activation function is better. Using a
non-linear function is essential when the data is not linearly separable and a
non-linear output is wanted. Unfortunately there is no activation function
that is best in all cases. The hyperbolic tangent is often preferred over the
sigmoid because its outputs, like normalized data, are centered around 0.
This often causes the hyperbolic tangent to converge faster than the sigmoid
function (Y. A. LeCun, Bottou, Orr, & Muller, 2012). Over the last few
years ReLU has typically been preferred over other activation functions in
deep networks, because it suffers much less from the vanishing gradient
problem. When learning weights in deep networks with backpropagation, it
is possible that the first layers learn slowly because of the number of
chain rule factors that must be multiplied before reaching the input layers.
Because so many factors are multiplied, the derivative can become a very small
number, which means updating can be very slow.
3.5 Tips and tricks
As can be seen, there are many parameters that can be set in a neural network,
and a given choice does not guarantee that the network will converge. There are
a few possibilities to speed up the process, although this does not mean it
will lead to a good solution. One of those possibilities is the choice between
batch and stochastic learning. In batch learning all the training data is
passed through the neural network, and only then is the gradient computed and
are the weights updated. This is different from stochastic training, where an
update is done after the forward pass of a single (random) input. Stochastic
training has the advantage that it is much quicker than batch training and it
is often reported to perform better, although this is not guaranteed.
Another option is to shuffle the inputs, so that two inputs presented one after
the other are as unrelated as, for example, $\vec{x}_1$ and $\vec{x}_{101}$,
instead of as related as $\vec{x}_1$ and $\vec{x}_2$. For example, two
consecutive RAM states from a game are related, but two random RAM states are
probably far less related.
It is often good practice to train more often on examples that give a large
error than on examples that give a small error. Another way to boost the
process is normalizing the input to mean 0. Y. A. LeCun et al. (2012) show that
when, for example, all inputs are positive, the weights of a node can only all
increase or all decrease together, which means the update rule can only zigzag
its way to the best weights. This makes the algorithm inefficient.
3.6 Backpropagation
Backpropagation is used to train a neural network by optimizing the weights
of the network with gradient descent. First the previously defined error E
needs to be redefined, because it was the error of only one unit. This can be
done by summing the squared differences between the target and the output over
all output units k and all training examples d:
$$E = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$
The difference with the previous gradient descent for one output unit is that
the error surface of E there contained only one minimum, whereas the error
surface of a multilayer network can contain multiple local minima. This means
that backpropagation will converge to one of those local minima, but it is not
certain that this local minimum is also the global minimum. This aside,
backpropagation still produces good results. The algorithm starts by fixing the
number of nodes and outputs and by initializing the weights to small random
values. For each training example the network calculates the output and the
error. It then computes the gradient of that error, followed by adapting the
weights of the network. This iteration will usually be repeated many times,
until the network can calculate the output decently. There are many criteria
that can be used to end the iteration, for example a fixed number of iterations
or looping until the error falls below some threshold. The weights are updated
with the following rule:
$$w_{ji} = w_{ji} + \Delta w_{ji}$$
where
$$\Delta w_{ji} = \eta \delta_j x_{ji}$$
This rule is an adapted version of the previously seen delta rule. For output
units the new δ is the previous (t − o), the target value minus the output
value, multiplied by the derivative of the activation function ϕ, evaluated at
the input of the unit.
$$\delta_k = \varphi' \, (t_k - o_k)$$
where
$$\varphi' = \frac{d}{dx}\varphi$$
For the inner nodes, assume that there are only two layers, one output layer
and one hidden layer. The δ of a hidden node is defined differently, since
there are no target values available for it. It is calculated by summing the δ
of the output units, each weighted by the weight that connects the hidden node
to that output unit.
$$\delta_h = \varphi' \sum_{k \in outputs} w_{kh} \delta_k$$
where $\varphi'$ is again the derivative of the activation function.
This can be extended to more than two layers by using the chain rule.
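The update rules above can be sketched for a very small network with two inputs, two sigmoid hidden units and one sigmoid output unit. The network size, the initial weights, the learning rate and the omission of bias weights are all assumptions made to keep the sketch short; for the sigmoid, the derivative of the activation is o(1 − o).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_out, x):
    """Return the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out


def error(w_hidden, w_out, x, t):
    """E = 1/2 * (t - o)^2 for one example and one output unit."""
    _, o = forward(w_hidden, w_out, x)
    return 0.5 * (t - o) ** 2


def backprop_updates(w_hidden, w_out, x, t, eta=0.5):
    """Return the weight changes eta * delta_j * x_ji for both layers."""
    hidden, o = forward(w_hidden, w_out, x)
    # Output unit: derivative of the activation times (t - o).
    delta_out = o * (1 - o) * (t - o)
    # Hidden units: derivative times the weighted delta of the output unit.
    delta_hidden = [h * (1 - h) * w * delta_out for h, w in zip(hidden, w_out)]
    d_out = [eta * delta_out * h for h in hidden]
    d_hidden = [[eta * dh * xi for xi in x] for dh in delta_hidden]
    return d_hidden, d_out


# Hypothetical small initial weights and a single training example.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.5]
x, t = [1.0, 0.5], 1.0

before = error(w_hidden, w_out, x, t)
d_hidden, d_out = backprop_updates(w_hidden, w_out, x, t)
w_hidden = [[w + d for w, d in zip(ws, ds)] for ws, ds in zip(w_hidden, d_hidden)]
w_out = [w + d for w, d in zip(w_out, d_out)]
after = error(w_hidden, w_out, x, t)
print("error before:", before, "error after one update:", after)
```

In a full training run this update would be repeated for every training example and for many iterations, until one of the stopping criteria mentioned above is met.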
3.7 Autoencoders
Autoencoders are artificial neural networks with the special property that
they do not need separate target values. This makes autoencoders an
unsupervised learning method, because the target values are set equal to the
input values, $\vec{y} = \vec{x}$ (Geoffrey E Hinton & Salakhutdinov, 2006; Ng,
2011). This forces the autoencoder to learn the identity function. This may
seem trivial, but setting constraints on the network, like limiting the number
of layers and nodes (see Figure 21), creates a bottleneck which forces the
autoencoder to reduce the input information, and thus yields a compression
technique. Real world examples of input data, such as the pixels of pictures,
DNA sequences, and so on, have a large number of input features. Therefore
there is a need for some kind of compression that reduces the number of
features.
Deep learning is a technique that can also be used together with
autoencoders. A deep network has multiple layers, where each layer learns
some abstraction of the input features, so that in the end a complex structure
of abstractions is created (Bengio, 2009; Y. LeCun et al., 2015; Schmidhuber,
2015). The layers of such a deep network can be initialized by first
training an autoencoder on the input layers. The weights of the trained
autoencoder then typically provide a good starting point for the deep network
weights.
There are different kinds of autoencoders. The first variation is the sparse
autoencoder, in which the hidden layers can have more hidden nodes than the
original input feature vector. Having more hidden nodes leads to
computationally heavier calculations. A sparsity parameter enforces that each
hidden node has a low average activation, so that it is inactive most of the
time. This introduces sparsity and can give interesting results (Ng, 2011). A
variation on sparse autoencoders is the k-Sparse autoencoder (Makhzani & Frey,
2013), which only keeps the k best activations and cancels out the rest by
setting them to zero. The denoising autoencoder (Vincent, Larochelle,
Bengio, & Manzagol, 2008) can be an alternative to sparsity or a bottleneck.
It corrupts the input data, and the autoencoder is trained to fill in the
missing parts and thus to reconstruct the original input data. This is done by
randomly removing features from the input while training the autoencoder.
Figure 21: Example of an autoencoder
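Two of the variations described above come down to a simple operation on a vector, which can be sketched without a full network. The function names, the example vectors and the corruption probability are assumptions of this sketch; it illustrates the k-sparse selection step and the input corruption of a denoising autoencoder, not the training of the autoencoder itself.

```python
import random


def k_sparse(activations, k):
    """Keep the k largest hidden activations and set all others to zero."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


def corrupt(inputs, drop_probability, rng):
    """Randomly set input features to zero, as done for a denoising autoencoder."""
    return [0.0 if rng.random() < drop_probability else x for x in inputs]


hidden = [0.1, 0.9, 0.3, 0.7, 0.2]
print(k_sparse(hidden, 2))  # only the two largest activations survive

rng = random.Random(0)
x = [1.0, 0.5, 0.25, 0.75]
noisy_x = corrupt(x, 0.5, rng)  # fed to the network; the target remains x
print(noisy_x)
```

In the denoising case the network receives `noisy_x` as input while the reconstruction error is still measured against the uncorrupted `x`.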
3.8 Conclusion
This thesis will primarily focus on autoencoders and their capabilities in
unsupervised feature extraction. Because autoencoders have the capability of
reducing dimensions, it is interesting to investigate how good the resulting
features are. Unfortunately there is no direct way to see whether these
features are good or what they mean, since they are somewhat of a black box.
These features have some numeric value that is not easily interpreted, unlike
for example a decision tree, which is easily readable by humans. By using
different activation functions we can also see the impact that each function
has on the reconstruction of the input data. Because the autoencoders start
from relatively high dimensions, 1024 or 128 depending on how the RAM state is
interpreted, it is also a good idea to experiment with adding dropout to an
autoencoder. Dropout randomly drops nodes with all their connections during
training. By doing so, the autoencoder is forced to learn through other
connections, which also helps to prevent the network from overfitting
(Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). The
downside of using dropout is a longer learning time.
Chapter 4 Reinforcement Learning
The history of Reinforcement Learning, RL, has its roots in psychology.
Edward Thorndike introduced the law of effect, which he defines as:
Of several responses made to the same situation, those which are
accompanied or closely followed by satisfaction to the animal will,
other things being equal, be more firmly connected with the
situation, so that, when it recurs, they will be more likely to recur;
those which are accompanied or closely followed by discomfort to
the animal will, other things being equal, have their connections
with that situation weakened, so that, when it recurs, they will be
less likely to occur. The greater the satisfaction or discomfort, the
greater the strengthening or weakening of the bond. (Thorndike,
1911)
This is one of the key points of Reinforcement Learning: positive
interactions are encouraged and negative interactions are discouraged,
but not rejected.
Skinner invented the Skinner’s Box (Skinner, 1938), in which animals have to
press a lever when receiving a signal. This signal can be anything from a light
pulse to a sound. When the animal presses the lever on the correct signal it
receives a reward, which will most likely be food (Figure 22). But it can also
receive a negative reward, like an electrical shock, when pressing the lever at
the wrong signal. Skinner is known to have performed this kind of test on
pigeons and rats (Skinner, 1951, 1948).
Figure 22: A Skinner’s Box from (Skinner, 1938)
Many other examples exist where animals are trained, like Pavlov’s dogs
(Todes, 2002), where dogs were trained to respond to a signal that announced
food by producing more saliva. They were trained by making a sound before
giving the dogs their food.
All the previous research, for example Pavlov’s dogs, forms the basis of how
dogs are trained now. A dog gets a biscuit if it performs a command correctly,
and if it does something bad it gets scolded. This is exactly what
Reinforcement Learning tries to recreate.
Reinforcement Learning is used in many current applications. An example is a
robot vacuum cleaner that learns when to dock itself to recharge and to restart
where it left off, or even adapts its motors to the material of the floor to
save energy and be more efficient. Reinforcement Learning is also widely used
in games. Researchers let an AI play backgammon against itself, so that it
learned from its own games and corrected its mistakes, which made it a
master-level player, close to the best backgammon players (Tesauro, 1994).
4.1 The setting
The Reinforcement Learning setting is summarized in Figures 23 and 24
(Sutton & Barto, 1998). An agent is an entity that can observe the
environment and act upon it, and by doing so learn from the interactions.
The environment is where the actions take place, and it yields a state and a
reward in response. The agent will eventually learn how to map situations onto
different kinds of actions based on what it has learnt. The goal of the agent
in Reinforcement Learning is to maximize its reward.
Figure 23: Agent Environment setting
Going back to Figure 23, an agent interacts with the environment at each time
step t = 0, 1, 2, 3, 4, ... At each time step t, the environment produces a
state $S_t \in \mathcal{S}$, where $\mathcal{S}$ contains all possible states.
Based upon this state, an action $A_t \in \mathcal{A}(S_t)$ is chosen and
taken, where $\mathcal{A}(S_t)$ is the set of all possible actions in the state
$S_t$. At the next time step t + 1, the environment yields a reward
$R_{t+1} \in \mathbb{R}$ together with a new state $S_{t+1}$.
Figure 24: Starting from a state, the agent will choose an action. The next
time step the agent will receive a reward and arrive in a new state. This is
done T times
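The interaction loop of Figures 23 and 24 can be sketched with a toy environment. Everything in this example is hypothetical: the `Corridor` environment, its reward of −1 per time step, and the two policies are assumptions chosen only to show how states, actions and rewards are passed back and forth.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk from position 0 to position 4."""

    def __init__(self):
        self.state = 0

    def actions(self, state):
        """All possible actions in the given state."""
        return ["left", "right"]

    def step(self, action):
        """Apply the action; return reward R_{t+1}, state S_{t+1} and a done flag."""
        move = 1 if action == "right" else -1
        self.state = max(0, min(4, self.state + move))
        done = self.state == 4
        return -1, self.state, done


def run_episode(env, policy, max_steps=1000):
    """Let the agent interact with the environment until a final state is reached."""
    state, total_reward = env.state, 0
    for _ in range(max_steps):
        action = policy(state, env.actions(state))   # agent chooses A_t
        reward, state, done = env.step(action)       # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)


def random_policy(state, actions):
    return rng.choice(actions)


def always_right(state, actions):
    return "right"


print("always right:", run_episode(Corridor(), always_right))
print("random policy:", run_episode(Corridor(), random_policy))
```

A learning agent would replace the fixed policies with one that is improved from the observed rewards; that part is deliberately left out of this sketch.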
Example 4.1.1. One of the best-known examples in Reinforcement
Learning is the mountain car (Figure 25). The agent has to drive the car to
the top of the mountain, but the car does not have the power to get to the goal
position in one go. Therefore the agent has to use gravity in order to get to
the goal as quickly as possible. The agent can do this by driving up the hill,
letting go and driving backwards to gain momentum. The state of the mountain
car consists of the position on the map, which is one dimensional, and the
velocity of the car. The actions are driving forward, driving backward and
doing nothing. The reward is negative for every time step until the car reaches
the goal. Since every extra time step costs reward, the agent maximizes its
total reward by reaching the goal in as few time steps as possible.
Figure 25: Mountain car; image from (RL-Library, n.d.)
Example 4.1.2. Another widely used example in Reinforcement Learning
is Pole Balancing, Figure 26 (Michie & Chambers, 1968; Barto, Sutton, &
Anderson, 1983). A pole is mounted on a cart at its center of mass, which
allows the pole to be balanced at an exact point. The cart itself can only
move left and right, and the pole can only be moved indirectly, through the
movement of the cart. The goal is to balance the pole in an upright position.
This can be done by moving the cart back and forth. The states are the
pole’s angle and angular velocity. The actions are moving left and right,
thereby creating a force that brings the pole to a balanced state. The reward
can for example be a reward of 1 for each time step until the cart fails.
Figure 26: Pole Balancing; image from (Anji, n.d.)
4.2 Rewards
The goal of Reinforcement Learning is to obtain a maximum reward over time.
The reward for an action taken at time step t is received at time step t + 1,
because the agent can only observe the reward one time step after doing the
action. This can be formally written as:
$$G_t = R_{t+1} + R_{t+2} + \ldots + R_T$$
where $G_t$ is the expected total reward. The agent does not know the exact
reward, it can only expect a certain reward. T is the final time step, at which
the agent enters an end state. Such a final time step exists when the
environment has a notion of an episode that starts and ends, like for example a
maze environment (Figure 27). The agent starts on the left and at each time
step can move to one adjacent square where there is no wall. The agent needs
to find a way out of the maze. This is called an episodic task: in each episode
the agent performs an action $A_t$ at every time step t until the episode ends.
Figure 27: A maze world where the agent starts from the left and needs to
find a way to get outside
An episodic task will eventually always reach a final state. When there
is no terminal state, it is called a continuous task. This means that the
formal definition of $G_t$ above no longer holds, because there is no final
time step T. $G_t$ can easily be adapted by replacing T with infinity, ∞. An
additional refinement of the expected reward is adding a discount factor. This
factor determines whether the agent is more interested in an immediate return
or in a future reward.
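$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
    &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \\
    &= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
or
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k}$$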
This means that γ, the discount factor, decides whether the agent seeks
long-term, future rewards or an immediate reward. The discount factor
is bounded by 0 ≤ γ ≤ 1. This is interpreted as follows. If
γ is equal to 0, then $G_t = R_{t+1}$, meaning that the agent only cares about
the reward it is about to receive. If γ = 1, rewards in the
future are equally important as the immediate reward. In a continuing task γ
must be strictly smaller than 1 for the infinite sum to stay finite, assuming
bounded rewards.
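The recursive form $G_t = R_{t+1} + \gamma G_{t+1}$ can be illustrated with a minimal sketch. The function name `discounted_return` and the example rewards are assumptions made for illustration only; they do not come from the experiments in this thesis.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards [R_{t+1}, R_{t+2}, ...].

    Works backwards through the list using G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards of one short episode
print(discounted_return(rewards, 0.0))  # gamma = 0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # later rewards count less
print(discounted_return(rewards, 1.0))  # gamma = 1: plain sum of all rewards
```

With γ = 0 only the first reward remains, while with γ = 1 all rewards are weighted equally, which matches the interpretation given above.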
In most cases the reward scheme is not given by the problem itself, meaning the
rewards are chosen by the designer of the implementation. In the example of
mountain car, Example 4.1.1, the reward is always −1 until the car reaches
the goal. This is not always the case: in this thesis the focus lies on the
reward scheme given by Space Invaders itself (Section 5.2).
4.3 Markov Decision Process
The Markov Property states that the state s the agent is in contains
all information needed to determine the next state s′ and its reward r. From
this information alone the agent can decide where to go next. It is said that, when
the reward and transition probabilities only depend on the current state and
action at the current time step, and not on the previously visited states, the problem has
the Markov Property. It can thus be defined as:
$$P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t) = P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t)$$
This states that the probability of the reward r and the next state s′
given the entire history is equal to the probability given only the current
state and action, which is exactly what the Markov Property defines.
A Reinforcement Learning task that has the Markov Property is called a
Markov Decision Process. It consists of:
• Set of States S: $S_0, S_1, \dots, S_n$
• Set of Actions A: $A_0, A_1, \dots, A_n$
• Transition function: $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
• Reward function: $r(s, a, s') = E[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$
The Transition function T gives the probability of a next state s′ given the
current state and action. The Reward function r gives the expected reward given
the current state, action and next state.
Applying this to Example 4.1.1, the mountain car, the first transition
can be going from the start state, in which the car stands still, with the
action accelerate, to a next state in which the car is higher on the mountain.
The reward scheme was designed as follows: only negative rewards are given
unless the goal is reached. Since the car is in the start state and the action is
accelerate, the expected reward will be −1, since it is the first time step
and the goal was not reached.
4.4 Value functions
A policy π describes the behaviour of an agent: it determines which
action the agent selects in a state at any given time. A good policy takes all
elements into consideration in order to maximize the reward. It maps each state
with a probability onto an action, π(a|s). The value of being in state s
and following the policy π from there on is denoted as $V^\pi(s)$ and is called
the state-value function for policy π. This can be formally written as:
$$
\begin{aligned}
V^\pi(s) &= E_\pi[G_t \mid S_t = s] \\
         &= E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s\Big]
\end{aligned}
$$
That is, the value is the expected return, E, given the state
the agent is currently in, when following π. Similarly, the action-value
function can be defined for a policy π, denoted as $Q^\pi(s, a)$. The
action-value function gives the expected return when starting in state s,
taking action a and following the policy π afterwards. The action-value
function can thus be defined as:
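$$
\begin{aligned}
Q^\pi(s, a) &= E_\pi[G_t \mid S_t = s, A_t = a] \\
            &= E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s, A_t = a\Big]
\end{aligned}
$$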
Value functions indicate whether going into a state is a good or a bad
option with regard to the future. These value functions can only be learnt from
experience, and the only way to get experience is to gather as much information
as possible by traversing the environment.
The state-value function has a special property: there is a recursive
relationship between the value of the current state and the values of the
successor states that follow from the action taken. The following equation is
named the Bellman equation for State Values. It looks at the state s, the
actions a that can be taken in it and all the states s′ that can follow from
action a. The same can be applied to state-action values.
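$$
\begin{aligned}
V^\pi(s) &= E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s\Big] \\
         &= E_\pi\Big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+2+k} \,\Big|\, S_t = s\Big] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s'} T(s, a, s')\,[r(s, a, s') + \gamma V^\pi(s')]
\end{aligned}
$$
$$
\begin{aligned}
Q^\pi(s, a) &= E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s, A_t = a\Big] \\
            &= \sum_{s'} T(s, a, s')\,[r(s, a, s') + \gamma V^\pi(s')]
\end{aligned}
$$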
The Bellman equation starts from a state and considers, for every
possible action, the successor states with their expected reward. It then
averages over all these possibilities, weighting each by its
probability of occurring.
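As an illustration of how the Bellman equation for state values can be used, the following minimal sketch applies it repeatedly as an update rule (iterative policy evaluation). The two-state MDP, the uniform random policy and all names in the code are assumptions chosen only for this example.

```python
def evaluate_policy(states, actions, policy, T, r, gamma, sweeps=200):
    """Repeatedly apply the Bellman equation for V^pi as an update rule.

    policy[s][a] is pi(a|s), T[(s, a)] maps next states to probabilities
    and r(s, a, s2) is the expected reward for that transition.
    """
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The right-hand side only reads the previous V (synchronous update).
        V = {
            s: sum(
                policy[s][a]
                * sum(p * (r(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V


# Hypothetical MDP: "stay" keeps the state, "go" switches it;
# entering or staying in B gives reward 1, anything else gives 0.
states = ["A", "B"]
actions = ["stay", "go"]
T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}
policy = {s: {"stay": 0.5, "go": 0.5} for s in states}


def reward(s, a, s2):
    return 1.0 if s2 == "B" else 0.0


V = evaluate_policy(states, actions, policy, T, reward, gamma=0.5)
print(V)
```

In this example both states satisfy v = 0.5 + 0.5 v, so the values converge to 1.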
Multiple policies exist, and thus also multiple state-value functions. To
solve the task at hand, the designer needs to find the optimal policy and the
optimal state-value function. Policies can be ordered as follows:
$$\pi \geq \pi' \quad \text{if and only if} \quad V^\pi(s) \geq V^{\pi'}(s) \;\; \forall s \in S$$
A policy is thus better than or equal to another policy only when its
state-value function is greater than or equal to that of the other policy in
every state. There is always at least one policy that is better than or equal
to all other policies. Such an optimal policy is denoted π∗, with the
associated optimal state-value function V∗ and optimal action-value function
Q∗. Note that an optimal policy is not necessarily unique, but the optimal
value functions are. They are defined as:
$$V^*(s) = \max_\pi V^\pi(s) \quad \forall s \in S$$
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad \forall s \in S \text{ and } \forall a \in A$$
Example 4.4.1. Take the gridworld as an example, where an agent is put
on the grid and needs to find the goal, here indicated by a green
square. The agent can only move right, left, up and down. In this example the
agent receives +100 when moving to the goal. The optimal state-values
are shown in Table 3. The optimal policy π∗ is the shortest
way to the goal. Every possible optimal path is identified by arrows;
multiple arrows indicate that multiple optimal paths exist. This is shown in
Table 4. To be sure that the agent finds the optimal policy, the agent must visit
every possible state. In the gridworld this is feasible, but when there
are millions of states this becomes impractical. Therefore these functions
can be approximated.
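A gridworld like the one in this example can be solved with value iteration, as in the following minimal sketch. It assumes a 5 × 5 grid with the goal in the middle of the bottom row, a reward of +100 for entering the goal, 0 otherwise, and a discount factor of 0.9. The discount factor behind Table 3 is not stated, so the numbers produced here are not expected to match the table exactly.

```python
def value_iteration(rows, cols, goal, gamma=0.9, goal_reward=100.0, sweeps=100):
    """Compute V* for a deterministic gridworld with a single terminal goal."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = [[0.0] * cols for _ in range(rows)]
    for _ in range(sweeps):
        new_V = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                if (i, j) == goal:
                    continue  # the terminal state keeps value 0
                best = float("-inf")
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < rows and 0 <= nj < cols):
                        ni, nj = i, j  # moving into the border leaves the agent in place
                    reward = goal_reward if (ni, nj) == goal else 0.0
                    best = max(best, reward + gamma * V[ni][nj])
                new_V[i][j] = best
        V = new_V
    return V


V = value_iteration(5, 5, goal=(4, 2))
for row in V:
    print(" ".join(f"{v:6.1f}" for v in row))
```

The squares next to the goal get the value 100, and the values shrink by the factor γ for every extra step that is needed, so the shortest path to the goal is always the most valuable one.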
4.5 Action Selection
54   63   72   63   54
63   72   81   72   63
72   81   90   81   72
81   90  100   90   81
90  100    0  100   90
Table 3: V∗(s)    Table 4: π∗(s)
Table 5: Gridworld Example

The problem remains which action to select and why to select a certain
action. A naive way to select an action is to always select the action with
the highest Q-value, $Q_t(A_t^*) = \max_a Q_t(a)$. This method always chooses
the action with the highest estimated value; this is also called exploitation.
It only uses what the agent has learnt and does
not explore other options. One of the disadvantages of this greedy method
is that the agent will never find another way that yields
more reward or is perhaps shorter. One way to force the greedy selection
method to explore is to initialize the Q-values to a different, for example
optimistically high, value. A more balanced way of exploring while still
exploiting is the ε-greedy method, in which a random action is selected with
probability ε and the greedy action is selected otherwise. There are again
different ways to tune this action selection method. The ε can be kept
fixed over all episodes, but it can also be kept high in
the beginning, to force the agent to explore as much as possible,
and be reduced after a certain time t, to shift the
agent from exploration towards exploitation. A disadvantage of the
ε-greedy method is that, when it explores, it chooses among all actions
with the same probability. This means that it could choose a very good action,
but also an extremely bad action. The softmax action selection solves this by
selecting actions with probabilities that are ranked according to their
estimated Q-values:
$$P(s, a) = \frac{e^{Q(s, a)/\tau}}{\sum_{i=0}^{n} e^{Q(s, a_i)/\tau}}$$
The parameter τ, or temperature, controls how much the agent explores:
the higher τ is, the more randomly the agent plays, and the
closer τ is to 0, the more greedily it acts. As with ε, τ can
be reduced over time.
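Both action selection methods can be sketched in a few lines. The function names and the example Q-values are assumptions for illustration. Subtracting the maximum Q-value before exponentiating is a common numerical precaution that does not change the resulting probabilities.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probabilities(q_values, tau):
    """Selection probabilities proportional to exp(Q / tau)."""
    highest = max(q_values)  # subtracted for numerical stability only
    exps = [math.exp((q - highest) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, tau, rng=random):
    """Sample an action according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 3.0]  # hypothetical Q-values of three actions in one state
print(softmax_probabilities(q, tau=1000.0))  # high temperature: close to uniform
print(softmax_probabilities(q, tau=0.01))    # low temperature: close to greedy
print(epsilon_greedy(q, epsilon=0.1))
```

Unlike ε-greedy, the softmax probabilities give a clearly bad action a lower chance of being explored than an action that is only slightly worse than the best one.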
Balancing the amount of exploration and exploitation is one of the
important elements of learning. The agent should not always exploit the same
path, because the first good path found is not necessarily the best path
overall. The same goes for exploration: always taking random actions will never
yield a good result, although with extensive exploration the agent does get to
know all possible paths that can be taken. There is no current research that
declares which action selection method is the best. Both ε-greedy and softmax
are widely used today in Reinforcement Learning. In current research
ε-greedy is used more often, simply because the ε parameter is easy to
understand and set, while setting the τ parameter requires knowledge of the
scale of the action values, since they appear in the exponent of e.
4.6 Incrementing Q-values
When using action selection methods, each action needs an estimated value
rather than only its immediate reward. A simple way of representing these values
is by averaging all rewards received for that action:
$$Q_t(s, a) = \frac{R_1 + R_2 + R_3 + \dots + R_{K_a}}{K_a}$$
The rewards are averaged over the K times that action a was selected before
time step t. When the agent just starts, K is equal to 0, which makes Q(s, a)
undefined. Therefore Q-values are always initialized to some number, for
example 0. The law of large numbers states that when K → ∞, Q(s, a) will
converge to Q∗(s, a). This method is also called the sample-average method.
As said, this is a fairly naive way of implementing these values. For
the method to work, the computer needs to remember all received rewards
in order to average them, and this memory requirement grows the longer the task
lasts. The same goes for computational power: each time a new action is taken
the computer needs to recalculate the entire average, and thousands of rewards
for a single action in a single state can overload a computer. One way to avoid
this problem is using incremental updates.
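$$
\begin{aligned}
Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} R_i \\
        &= \frac{1}{k} \Big(R_k + \sum_{i=1}^{k-1} R_i\Big) \\
        &= \frac{1}{k} \big(R_k + (k - 1) Q_k + Q_k - Q_k\big) \\
        &= \frac{1}{k} \big(R_k + k Q_k - Q_k\big) \\
        &= Q_k + \frac{1}{k} [R_k - Q_k]
\end{aligned}
$$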
The computer only needs to remember the values of $Q_k$ and k, which makes the
memory and computational load a lot smaller. This incremental update can be
generalized by using the following equation:
$$\text{Estimate}_{new} = \text{Estimate}_{old} + \text{stepsize}\,[\text{Target} - \text{Estimate}_{old}]$$
The difference between the Target and $\text{Estimate}_{old}$ can be seen as the
error between the estimated value of an action and the target. Usually in
Reinforcement Learning the stepsize is denoted by α. The α can be a
constant with 0 < α ≤ 1, which weights the current reward more heavily than
older rewards; the result is then called the weighted average.
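The incremental update and its constant step size variant can be sketched as follows. The function names and the reward sequences are assumptions for illustration.

```python
def incremental_mean(rewards):
    """Sample average computed with Q <- Q + (1/k) * (R - Q).

    Only the current estimate and the counter k are stored.
    """
    q, k = 0.0, 0
    for r in rewards:
        k += 1
        q += (r - q) / k
    return q


def constant_step_estimate(rewards, alpha, q0=0.0):
    """Estimate computed with a constant step size: Q <- Q + alpha * (R - Q)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


print(incremental_mean([1.0, 2.0, 3.0, 4.0]))             # same result as sum / count
print(constant_step_estimate([0.0, 0.0, 10.0], alpha=0.5))  # the recent reward dominates
```

With a constant α the last reward of 10 pulls the estimate to 5, whereas the plain average of the same rewards would be about 3.3.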
4.7 Monte Carlo & Dynamic Programming
Monte Carlo methods used in Reinforcement Learning do not need full
knowledge of an environment but only need experience. They can even learn from
simulated experience by sampling the environment, for which they only need
to generate sample transitions. Monte Carlo methods are based on averaging
sample returns. This means that averages can only be calculated when the
episodes are completed, assuming the task is episodic and the number of states
is finite. Monte Carlo methods can also be used to mimic policy iteration. The
first phase is Policy Evaluation, where given a policy π, the goal is to compute
$Q^\pi(s, a)$ or an approximation of it for all state-action pairs. These values
can be estimated by averaging the sampled returns. When running long enough,
Q will approximate $Q^\pi$.
The next phase is Policy Improvement, where a greedy policy is calculated
with respect to Q. The greedy policy returns, for a given state s, the action a
that maximizes the state-action value, which yields a new policy. Monte Carlo
methods are more complicated to use in non-episodic tasks, because averaging
is only done after the episode is finished. Because Monte Carlo methods use
complete sampled returns, they are unbiased, but when the returns have high
variance, convergence will be slower because more samples are needed.
Bootstrapping, which is a technique from Dynamic Programming, is on the other
hand a biased learner, because it updates after one single step and these
updates are calculated from estimates. In finite and discrete cases the
estimates still converge to their true values.
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$$
vs.
$$R_t = r_{t+1} + \gamma V(S_{t+1})$$
These equations show the difference between a Monte Carlo method (the first
equation), which needs all rewards of an episode, and the bootstrapping
method (the second equation), which builds its target from an existing estimate.
4.8 Temporal Difference
Temporal Difference (TD) learning is a mix between Dynamic Programming
and Monte Carlo methods. TD methods learn from experience, sampled by following
some policy π, without any knowledge of the environment, and their updates are
based on other estimates. TD methods only need the next time step to update,
while Monte Carlo methods need the whole episode before updating.
$$V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$$
vs.
$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
The first update is a Monte Carlo method, because
it must wait until it has the value of $G_t$, which is only available after a
whole episode. The second update is a TD method, because it can be applied after
the next time step. It uses the bootstrapping technique: the target is built
from the observed reward and the current estimate of the value of the next
state. This is a useful feature which lowers the computational load.
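The two update rules can be compared in a minimal sketch. The state names, rewards and parameter values are assumptions for illustration; `V` is a plain dictionary from states to value estimates.

```python
def mc_update(V, episode, alpha, gamma):
    """Monte Carlo update after a finished episode.

    episode is a list of (state, reward) pairs, where reward is R_{t+1},
    the reward received after leaving that state.
    """
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g  # full return G_t from this state onwards
        V[state] += alpha * (g - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma, terminal=False):
    """TD update after a single step, bootstrapping on V(next_state)."""
    target = reward + (0.0 if terminal else gamma * V[next_state])
    V[state] += alpha * (target - V[state])


V_td = {"a": 0.0, "b": 0.0}
td0_update(V_td, "a", 1.0, "b", alpha=0.5, gamma=0.9)  # usable right after one step
print(V_td)

V_mc = {"a": 0.0, "b": 0.0}
mc_update(V_mc, [("a", 1.0), ("b", 2.0)], alpha=0.5, gamma=0.5)  # needs the whole episode
print(V_mc)
```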
Before going into algorithms, a distinction needs to be made between
two kinds of learning: on-policy and off-policy. On-policy learning is when an
agent improves the policy it is currently following to select its actions.
Off-policy learning is learning the value of a policy independently of the
actions the agent actually takes.
4.8.1 Q-Learning
An example of off-policy learning is Q-learning (C. J. C. H. Watkins, 1989).
The Q-values approximate the optimal action-value function independently
of the policy that is being followed, which makes it off-policy. Q-learning
converges as long as all state-action pairs continue to be visited and updated
(C. J. Watkins & Dayan, 1992).
Algorithm 1 Q-Learning
Initialize all Q(s, a) for s ∈ S, a ∈ A
Repeat (for every episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s′
        Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal
The algorithm goes as follows. First all Q(s, a) values are initialized arbitrarily.
For every episode the agent starts in a start state s. For every step
of that episode the agent chooses an action a using a policy like ε-greedy.
The agent takes the action and observes a new state s′ and a reward
for going from state s to s′. Then the Q-value is updated by using the
following rule. The agent takes the reward it received and adds the maximum
Q-value of the next state, which is the estimate of the future reward,
multiplied by a discount factor. From this the old Q-value of the state-action
pair the agent started from is subtracted. The result is
multiplied by a learning rate and then added to the old Q-value. The
agent then moves to the observed new state s′ and the iteration is restarted
from the new state.
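The steps of Algorithm 1 can be sketched for a toy problem. The corridor environment used here (states 0 to 4, start in state 0, goal in state 4, reward 1 for reaching the goal) is a hypothetical example and not one of the environments of this thesis. All parameter values are assumptions, and ties between equal Q-values are broken randomly so that the untrained agent does not favour one direction.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a small corridor; actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def step(s, a):
        s2 = max(0, s - 1) if a == 0 else s + 1
        return (1.0 if s2 == goal else 0.0), s2

    def choose(s):
        if rng.random() < epsilon:  # explore
            return rng.randrange(2)
        best = max(Q[s])            # exploit, random tie-breaking
        return rng.choice([a for a in (0, 1) if Q[s][a] == best])

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = choose(s)
            r, s2 = step(s, a)
            # The goal row of Q is never updated, so max(Q[goal]) stays 0.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q


Q = q_learning()
for s, row in enumerate(Q):
    print(s, [round(v, 3) for v in row])
```

After training, moving right should have the higher Q-value in every non-terminal state, even though the agent still takes exploratory left moves during learning.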
4.8.2 SARSA
SARSA, previously named modified Q-learning (Rummery & Niranjan, 1994)
and renamed to SARSA by Sutton (1996), is an on-policy method. The
name stands for State Action Reward State Action and comes from the
sequence the agent goes through: it is in state s1, chooses action a1 and
receives reward r. After taking action a1 the agent ends up in state s2,
where it chooses its next action a2.
Algorithm 2 SARSA
Initialize all Q(s, a) for s ∈ S, a ∈ A
Repeat (for every episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
        s ← s′; a ← a′
    until s is terminal
The SARSA algorithm starts in the same way as Q-learning:
all Q-values are initialized arbitrarily. Then for every episode a state s is
chosen and immediately the first action a is derived from a policy like
ε-greedy. The agent then goes into a loop until the state s is terminal. The
agent takes the action a and observes the reward r and the new state s′.
From this new state it chooses a new action a′, again derived from a policy like
ε-greedy. The Q-value is then updated by taking the reward it received
plus the Q-value of the new state-action pair multiplied by a discount factor.
The old state-action value is subtracted from this. The result is
multiplied by a learning rate and then added to the old state-action value.
The agent then continues with the new state s′ and the new action a′.
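The only difference between the two update rules is the target. The following sketch puts both side by side; the function names and the small Q-table are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, terminal=False):
    """On-policy: the target uses the action a2 that was actually chosen in s2."""
    target = r + (0.0 if terminal else gamma * Q[s2][a2])
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s2, alpha, gamma, terminal=False):
    """Off-policy: the target uses the best action in s2, whatever is chosen next."""
    target = r + (0.0 if terminal else gamma * max(Q[s2]))
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical situation: in state 1 the agent picks the exploratory action 0,
# although action 1 has the higher Q-value.
Q_sarsa = [[0.0, 0.0], [1.0, 3.0]]
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s2=1, a2=0, alpha=0.5, gamma=0.9)

Q_ql = [[0.0, 0.0], [1.0, 3.0]]
q_learning_update(Q_ql, s=0, a=0, r=1.0, s2=1, alpha=0.5, gamma=0.9)

print(Q_sarsa[0][0], Q_ql[0][0])
```

SARSA takes the exploratory choice into account and ends up with a lower estimate than Q-learning, which assumes that the greedy action will be taken next.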
4.9 Eligibility traces
TD methods use the current reward together with the estimated value of the
next state, while Monte Carlo methods use the actual rewards, but only after the
episode is finished. There is also a method in between, the n-step method
(C. J. C. H. Watkins, 1989), in which a number of steps (or backups) n is
chosen: the rewards of the next n steps are used before falling back on the
estimated value.
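$$
\begin{aligned}
G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
G_t^{(3)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) \\
G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
\end{aligned}
$$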
The first equation is simply bootstrapping. The second equation is called
the 2-step method, the third the 3-step method and so on. It can
be seen that whenever n is equal to the number of steps left in an episode, the
return no longer contains an estimate but only actual rewards, which means it
is the Monte Carlo method.
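The n-step return can be computed as in the following sketch. The indexing convention (`rewards[i]` holds $R_{i+1}$ and `values[i]` holds $V(S_i)$) and the example numbers are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for an episode of T = len(rewards) steps.

    rewards[i] is R_{i+1}, the reward received after step i,
    and values[i] is the current estimate V(S_i).
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        # The episode has not ended after n steps: bootstrap on V(S_{t+n}).
        g += gamma ** n * values[end]
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical episode of three steps
values = [0.5, 0.5, 0.5]   # hypothetical estimates V(S_0), V(S_1), V(S_2)
for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.5))
```

For n = 3 the whole episode is covered, so no estimate is used and the result equals the Monte Carlo return.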
With eligibility traces each state receives an extra variable, e, called the
eligibility trace. When the agent enters a state s, the eligibility trace of
that state is incremented, while the traces of all other states decay.
$$
e_t(s) =
\begin{cases}
\gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\
\gamma \lambda e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}
$$
Figure 28 shows what happens to the trace of a state. Every time the state is
visited its eligibility trace is incremented, and when the agent does not visit
the state, its trace automatically decays. The speed of the decay is set by the
decay parameter, denoted as λ. As a result, when learning happens,
frequently and recently visited states are affected more than other states.
When λ = 0, bootstrapping happens,
because only the trace of the current state is non-zero and all other traces
are zero. When λ = 1, the method mimics the Monte Carlo
methods.
Figure 28: Eligibility trace; image from (Sutton & Barto, 1998)
Eligibility traces can also be applied to SARSA, which is then called SARSA(λ).
The idea of the original SARSA remains the same and the method is still on-policy.
Only now the state-action values are updated using their eligibility trace and
a TD error, which is:
$$\delta = r_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
The TD error can also be calculated for Q(s, a) values instead of V(s).
Algorithm 3 SARSA(λ)
Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
Repeat (for every episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    until s is terminal
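One step of the inner loop of Algorithm 3 can be sketched as follows. The table sizes and the two hand-picked transitions are assumptions for illustration; Q and E are lists indexed as `Q[state][action]`.

```python
def sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha, gamma, lam, terminal=False):
    """Perform one SARSA(lambda) update and return the TD error delta."""
    next_q = 0.0 if terminal else Q[s2][a2]
    delta = r + gamma * next_q - Q[s][a]
    E[s][a] += 1.0  # accumulating trace for the visited pair
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            Q[i][j] += alpha * delta * E[i][j]  # every pair learns in proportion to its trace
            E[i][j] *= gamma * lam              # and its trace then decays
    return delta


Q = [[0.0, 0.0] for _ in range(3)]
E = [[0.0, 0.0] for _ in range(3)]

# Step 1: state 0 -> state 1 without reward; nothing is learnt yet, but a trace is left.
sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s2=1, a2=1, alpha=0.5, gamma=0.9, lam=0.5)
# Step 2: state 1 -> terminal state 2 with reward 1.
sarsa_lambda_step(Q, E, s=1, a=1, r=1.0, s2=2, a2=0, alpha=0.5, gamma=0.9, lam=0.5,
                  terminal=True)
print(Q)
```

Because of the trace left in step 1, the reward of step 2 also updates the value of the earlier state-action pair, which plain SARSA would only reach in a later episode.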
The same can be applied to Q-learning, which gives Q(λ), with one adaptation.
As long as Q-learning follows the greedy action selection, the
experience can be traced back, but not when a random or otherwise non-greedy
action is selected. When a non-greedy action is selected, the eligibility
traces are reset to zero.
Algorithm 4 Q-Learning(λ)
Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
Repeat (for every episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        a∗ ← argmax_b Q(s′, b) (if a′ ties for the max, then a∗ ← a′)
        δ ← r + γQ(s′, a∗) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            If a′ = a∗
                then e(s, a) ← γλe(s, a)
                else e(s, a) ← 0
        s ← s′; a ← a′
    until s is terminal
Sometimes better performance can be achieved by using replacing traces
instead of the standard traces, where:
$$
e_t(s) =
\begin{cases}
\gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\
1 & \text{if } s = s_t
\end{cases}
$$
Figure 29: Replacing traces; image from (Sutton & Barto, 1998)
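The difference between the standard (accumulating) traces and replacing traces can be sketched as follows. The state names and parameter values are assumptions for illustration.

```python
def update_traces(traces, visited, gamma, lam, replacing=False):
    """Decay all traces, then mark the visited state.

    Accumulating traces add 1 to the decayed trace of the visited state,
    replacing traces reset it to exactly 1.
    """
    for s in traces:
        traces[s] *= gamma * lam
    traces[visited] = 1.0 if replacing else traces[visited] + 1.0
    return traces


accumulating = {"a": 0.0, "b": 0.0}
replacing = {"a": 0.0, "b": 0.0}
for _ in range(3):  # state "a" is visited three times in a row
    update_traces(accumulating, "a", gamma=1.0, lam=0.5)
    update_traces(replacing, "a", gamma=1.0, lam=0.5, replacing=True)
print(accumulating, replacing)
```

With accumulating traces a state that is revisited often can build up a trace larger than 1, while a replacing trace is capped at 1.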
4.10 Function approximation
Previously it was assumed that all Q-values are stored in a table, in which
each Q(s, a) pair has some value. This is a feasible method
when the number of states and actions is small. If there are millions of
state-action pairs, this requires a lot of memory, but also time and data
to accurately compute them. Think for example of the difference between the
state-space of backgammon, $10^{20}$, and the state-space of a robotic helicopter.
The robotic helicopter cannot map the whole world in its table and thus has
a continuous state-space. The solution to this problem is to generalize
from previously visited states to the complete
set of states, even to states that are not yet visited. This generalization is
called function approximation: it takes samples from the value function and
generalizes from them, thereby constructing an approximation of
the function. From now on these functions are approximated and
parametrized by a weight vector $\mathbf{w} \in \mathbb{R}^n$:
$$v(s, \mathbf{w}) \approx v_\pi(s)$$
$$q(s, a, \mathbf{w}) \approx q_\pi(s, a)$$
This new function v(s, w) can be computed by, for example, a linear combination
of features, a neural network in which w are the weights, or a decision tree in
which w are the split points and leaves.
Learning such a function approximation can be done by gradient descent, where
$\mathbf{w} = (w_1, w_2, \dots, w_n)^T$ and J(w) denotes a differentiable
function of w, here the squared error between the true value and v(s, w).
Each time step the agent observes a state $S_t$ and its true
value under the policy, $v_\pi(S_t)$. With those values the gradient of the
error can be calculated, and the weights are moved a small step in the
direction that reduces the error the most, towards a
local minimum:
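$$
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla_{\mathbf{w}_t} \big[v_\pi(S_t) - v(S_t, \mathbf{w}_t)\big]^2 \\
                 &= \mathbf{w}_t + \alpha \big[v_\pi(S_t) - v(S_t, \mathbf{w}_t)\big] \nabla_{\mathbf{w}_t} v(S_t, \mathbf{w}_t)
\end{aligned}
$$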
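where α is the step size and $\nabla_{\mathbf{w}_t} J(\mathbf{w}_t)$ is the vector of partial derivatives, defined as:
$$\nabla_{\mathbf{w}_t} J(\mathbf{w}_t) = \left(\frac{\partial J(\mathbf{w}_t)}{\partial w_{t,1}}, \frac{\partial J(\mathbf{w}_t)}{\partial w_{t,2}}, \dots, \frac{\partial J(\mathbf{w}_t)}{\partial w_{t,n}}\right)^T$$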
The goal is to find a local minimum of the error by repeatedly updating the
weights in this way. A value function can also be represented by a linear
combination of features x(s) and weights w.
This can be written as:
$$v(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s) = \sum_{i=1}^{n} w_i x_i(s)$$
where each state has a vector of features $\mathbf{x}(s) = (x_1(s), x_2(s), \dots, x_n(s))^T$ with
the same number of weights. The gradient with respect to w is
then:
$$\nabla_{\mathbf{w}} v(s, \mathbf{w}) = \mathbf{x}(s)$$
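For the linear case the gradient is simply the feature vector, so one gradient descent step reduces to a few lines. In the following sketch the feature vector, the target value and the step size are assumptions for illustration.

```python
def predict(w, x):
    """Linear value estimate v(s, w) = w^T x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_step(w, x, target, alpha):
    """One update w <- w + alpha * (target - v(s, w)) * x(s)."""
    error = target - predict(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 2.0]  # hypothetical feature vector x(s) of one state
target = 5.0    # hypothetical true value of that state
for _ in range(5):
    w = gradient_step(w, x, target, alpha=0.1)
    print(w, predict(w, x))
```

With these numbers every step halves the remaining error, so the prediction moves from 0 to 2.5, 3.75 and so on towards 5.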
These features can be constructed by using different methods. One example
of such a method is Coarse Coding, where the state space is continuous,
in this example a two dimensional space (Figure 30). Each
feature in this example indicates whether the state lies inside a certain
circle or not: the feature is 1 if the state lies inside that circle and 0 if
it does not. These features can overlap, because the state can be in multiple
circles at once. Gradient descent will update the weights of all the circles the
agent is in. The update of the approximate value function will therefore affect
every point within the union of those circles, with a greater effect on points
that have more circles in common with the current state.
Figure 30: Coarse coding; image from (Sutton & Barto, 1998)
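The binary features of Coarse Coding can be computed as in the following sketch. The circles, given as (center x, center y, radius), and the example points are assumptions for illustration.

```python
def coarse_code(point, circles):
    """Return one binary feature per circle: 1 if the point lies inside it, else 0."""
    px, py = point
    return [
        1 if (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2 else 0
        for (cx, cy, radius) in circles
    ]


# Hypothetical layout: two overlapping circles and one separate circle.
circles = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (3.0, 3.0, 1.0)]
print(coarse_code((0.5, 0.0), circles))  # inside the two overlapping circles
print(coarse_code((3.0, 3.5), circles))  # inside the separate circle only
```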
Chapter 5 Experiments and results
5.1 ALE
The environment that this thesis is based on is the Arcade Learning
Environment (M. G. Bellemare, Naddaf, Veness, & Bowling, 2013), abbreviated
ALE. It allows anyone to write AI-agents that can interact with Atari 2600
games. ALE is written on top of Stella¹, which is an open-source Atari
emulator. ALE enables interaction with the Stella emulator, which permits the
user to gather all sorts of data, like RAM and frame states, while the
game is being played, and to send data, like action moves, to the game.
The Atari 2600 console was released in 1977. The hardware of the
console is rather simple compared to consoles today: it has a CPU running at
1.19 MHz and 128 bytes of RAM. Games only had a screen of 160 pixels wide
and 210 pixels high, with a maximum of 128 colors. The screen thus has 33600
pixels in total. The ALE system allows an agent to observe the current game
screen and/or the RAM state of the console. The advantage of frames is
that they are human interpretable (Figure 31b). Unfortunately, frames
provide an agent with only partial information, as a single frame does not
provide information about the movement of objects. The RAM is not
human interpretable, but has more information and even holds the complete
state of the game (Figure 31a). The console has a joystick with 18 different
possible moves, but not all of them are used in every game. Because
the console is not powerful in terms of hardware, it can easily be emulated.
This makes it an excellent testbed for AI-agents, because of the possibilities
offered by the frames, the RAM and the limited number of possible actions. This
is in contrast to current games, which have millions of pixels and multiple
gigabytes of RAM. This does not mean that Atari 2600 games can easily be learnt.
Take for example a game where only 4 actions are valid: when the game is
running at 60 frames per second, looking only one second
ahead means searching through $4^{60}$ (roughly $1.3 \times 10^{36}$) different
action sequences.
¹ http://stella.sourceforge.net
Figure 31: The difference between RAM and Frames. (a) RAM, (b) Frames
5.2 Space Invaders
The game of Space Invaders (Figure 32) was chosen as a test bed for
combining Reinforcement Learning with autoencoders. Space Invaders
is one of the most used games as a test bed for RL-agents (Mnih et al., 2013;
M. G. Bellemare et al., 2013). It is known that Reinforcement Learning
agents can beat a human-level player (Mnih et al., 2015).
Space Invaders, designed by Tomohiro Nishikado, was first released in 1978;
since then many different adaptations have appeared. The player controls a
space ship and can fire missiles. The goal of the game is to hit all rows of
aliens and go to the next level. The player can hide behind walls to shield
himself from the lasers coming from the aliens. The player can only move left,
move right, shoot or do nothing. When the player misses a shot, he must wait
until the missile is off the screen before he can fire the next one. Once all
rows of aliens are cleared, the game goes to the next level, where the aliens
move more quickly. The Command Alien Ship appears at random and, when shot,
yields more points than the basic alien ships. When the aliens come too close
to the shields, the shields disappear, and when the aliens eventually come
too close to the player's ship, the game ends and restarts. The player has
a total of 3 lives before the game starts from scratch. The player only
receives a reward when hitting an alien spaceship.
Figure 32: Space Invaders screen
5.3 Reconstruction
When using autoencoders for feature extraction and dimensionality
reduction, it is essential that they are trained properly and that the
autoencoders in question can reconstruct the input from their hidden layers.
Using the Mean Square Error we can see how far off the reconstruction of an
autoencoder is from the input values.
$$MSE(\vec{x}, \vec{y}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2$$
Here $\vec{x}$ is the input, the original RAM state, and $\vec{y}$ is the
reconstruction of $\vec{x}$ by the autoencoder. The input values are gathered
by running an agent with SARSA(λ) and saving all encountered RAM states. The
agent plays a total of 3000 episodes, each consisting of a number of steps that
is not known in advance: an episode only ends when the agent has died three
times in the game. Each step the agent receives a RAM state, which is then saved.
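The Mean Square Error between an input and its reconstruction can be computed as in the following sketch. The short example vectors are assumptions for illustration; in the setting described above they would be a RAM state and the output of the autoencoder for that state.

```python
def mse(x, y):
    """Mean Square Error between an input x and its reconstruction y."""
    if len(x) != len(y):
        raise ValueError("input and reconstruction must have the same length")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


original = [0.0, 1.0, 1.0, 0.0]       # hypothetical input values
reconstructed = [0.0, 1.0, 0.0, 0.0]  # hypothetical autoencoder output
print(mse(original, reconstructed))   # one of the four values is wrong
print(mse(original, original))        # a perfect reconstruction has error 0
```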
The dots in Figure 33 show the MSE of autoencoders trained on an input of 128
bytes of RAM state. Autoencoders can be trained in two ways, the direct and the
indirect way. The direct way is to go from the input dimension of 128 to a
specified number of hidden nodes and back to 128 output nodes. The
direct autoencoder thus has only 1 hidden layer. Figure 33 shows going from
128 → Number of nodes → 128, where each arrow denotes the interconnection
between two layers. When training autoencoders with different numbers of hidden
nodes, the number of hidden nodes is always halved from one setting to the
next. As can be seen, the lower the number of hidden nodes, with the
lowest being 128 → 1 → 128 and the highest 128 → 128 → 128, the higher
the MSE will be. This is a logical conclusion: 1 hidden node cannot
perform as well as 128 hidden nodes, because too much information is lost when
going from a high number of dimensions to a very low number of dimensions.
It can be argued, however, that an error of 0.086 for 1 hidden node in 1 hidden
layer is not that high. One way to counteract the loss of going from a high
dimension directly to a much lower dimension is adding multiple layers. The red
dots show the MSE when going to a lower dimension through intermediate layers.
For example, going from 128 bytes to 1 node gives:
128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 → 2 → 4 → 8 → 16 → 32 → 64 → 128
This also means that the training time of an autoencoder with multiple hidden
layers is higher than that of a direct autoencoder. But it can be seen that with
multiple hidden layers the autoencoder can achieve a lower
MSE than a direct autoencoder. Note that the indirect autoencoder
128 → 64 → 128 is omitted, since it does not use multiple layers.
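The layer sizes of the direct and indirect autoencoders follow a simple halving pattern, which can be generated as in the following sketch. The function names are assumptions for illustration, and the sketch only produces the list of layer sizes, not the autoencoders themselves.

```python
def direct_layer_sizes(input_size, hidden):
    """Layer sizes of a direct autoencoder: input -> hidden -> output."""
    return [input_size, hidden, input_size]


def indirect_layer_sizes(input_size, bottleneck):
    """Layer sizes of an indirect autoencoder that halves down to the bottleneck."""
    sizes = [input_size]
    while sizes[-1] > bottleneck:
        sizes.append(sizes[-1] // 2)
    if sizes[-1] != bottleneck:
        raise ValueError("bottleneck must be reachable by repeatedly halving the input size")
    # Mirror the encoder (without repeating the bottleneck) to get the decoder.
    return sizes + sizes[-2::-1]


print(direct_layer_sizes(128, 1))
print(indirect_layer_sizes(128, 1))
print(indirect_layer_sizes(128, 32))
```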
Figure 33: Mean Square Error of a trained autoencoder from an input layer
of 128 bytes to a smaller layer, directly and indirectly (x-axis: layer size,
1 to 128; y-axis: MSE, 0.00 to 0.10)
When experimenting with RAM states and autoencoders, it is also a good idea
to train the autoencoders not only on the byte form but also on the bit
form of the RAM, thus training the autoencoder with an input of 1024 bits
instead of 128 bytes (Figure 34). The blue dots show going from
1024 → Number of nodes → 1024 and the red dots show the MSE with multiple
hidden layers. The same conclusion can be drawn here as in the case of
autoencoders with 128 bytes: the lower the dimension an autoencoder compresses
to, the more information is lost, and this can be compensated for by using
multiple layers. Comparing the settings with 128 bytes and 1024 bits as input
layer, it can be seen that 128 bytes performs better in reconstructing the
input, that is, in going to a lower dimension and then back to the same
dimension.
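Going from the byte form to the bit form of a RAM state can be sketched as follows. The function name and the bit order (most significant bit first) are assumptions for illustration; the source does not state which bit order was used.

```python
def ram_bytes_to_bits(ram):
    """Expand a list of byte values (0..255) into a flat list of bits.

    Each byte becomes 8 bits, most significant bit first (an assumed order).
    """
    bits = []
    for byte in ram:
        if not 0 <= byte <= 255:
            raise ValueError("every RAM value must fit in one byte")
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits


ram = [0] * 128  # hypothetical RAM state of 128 bytes
ram[0] = 5
bits = ram_bytes_to_bits(ram)
print(len(bits))   # 128 bytes become 1024 bits
print(bits[:8])    # the bits of the first byte
```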
Figure 34: Mean Square Error of a trained autoencoder from an input layer
of 1024 bits to a smaller layer, directly and indirectly (x-axis: layer size,
8 to 1024; y-axis: MSE, 0.00 to 0.10)
5.4 Flow of experiments
All experiments follow the same phases but with different settings. The
first phase is the preparation phase, in which SARSA(λ) with the manual
features is run for 3000 episodes and all RAM states are captured. The
second phase is the preprocessing phase, in which the autoencoder is
trained. The settings of the autoencoder must be specified: number of
layers, hidden nodes, and so on. The number of epochs is set to 15; this is
how many times all training examples are put through the autoencoder, so
one epoch is one training cycle. The batch size, the number of training
examples put through before the weights are updated, is set to the number
of input dimensions. So if the input dimension is 1024, from 1024 bits of
RAM, the batch size is set to 1024. Additionally, the loss function and
activation function can be specified. After the autoencoder is trained,
the last phase starts. The agent receives a RAM state, which goes through
the trained autoencoder. Depending on the experiment, a specified layer is
extracted and used as the features. The agent then learns with these
features.
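To make the epoch and batch-size settings concrete, the sketch below counts how many weight updates they imply. The number of captured RAM states (100,000) is a made-up figure for illustration, and counting a final partial batch as one update is an assumption that depends on the training framework.

```python
import math


def weight_updates(n_samples, input_dim, n_epochs=15):
    """Return (batch_size, updates_per_epoch, total_updates).

    The batch size equals the input dimension, as described in the text.
    Assumption: a final, smaller batch still triggers one weight update.
    """
    batch_size = input_dim
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return batch_size, updates_per_epoch, updates_per_epoch * n_epochs


# Hypothetical data set of 100,000 captured RAM states.
print(weight_updates(100_000, 1024))  # bit representation
print(weight_updates(100_000, 128))   # byte representation
```

Under these assumptions the byte representation receives about eight times as many weight updates in the same 15 epochs, because its batches are eight times smaller.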
5.5 Manual features and basic RAM
Naddaf (2010) and M. G. Bellemare et al. (2013) perform
manual feature extraction by concatenating the original RAM state with the
pairwise logical AND of every possible pair. Figure 35 shows the difference
between the two feature sets. It also shows random performance, where
the agent chooses a random action no matter which features are presented.
The x-axis denotes the number of episodes played and the y-axis the
rewards that ALE returns for the chosen actions. As can be seen, the RAM
states with the pairwise AND perform better than the basic RAM states.
This pairwise AND feature construction is done manually: the designer of
the algorithm must implement the pairwise algorithm, and many experiments
are needed before the designer can decide that the pairwise AND performs
better than the basic RAM states. The aim of this thesis is to skip this
search for good features and let the autoencoders handle the feature
extraction. From now on, the RAM concatenated with the pairwise AND
will be called the manual features and the standalone RAM will be called
basic RAM.
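The manual feature construction can be sketched as follows. This is an illustration only: it assumes the RAM is given as a list of 0/1 bits and that "every possible pair" means every unordered pair of distinct bits; the function name is hypothetical.

```python
from itertools import combinations


def pairwise_and_features(bits):
    """Concatenate the bits with the logical AND of every unordered pair."""
    pairs = [a & b for a, b in combinations(bits, 2)]
    return list(bits) + pairs


print(pairwise_and_features([1, 0, 1]))
print(len(pairwise_and_features([0] * 1024)))
```

Under these assumptions a 1024-bit RAM state grows to 1024 + 1024 · 1023 / 2 features, which shows how quickly the manual feature vector becomes large compared to the basic RAM.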
[Plot "Difference between RAM & RAM + AND": rewards (y-axis, 0 to 450) against episodes (x-axis, 0 to 3000); series: RAM + pairwise AND, Random, RAM.]
Figure 35: The difference between RAM combined with the pairwise logical
AND and RAM alone
5.6 Difference between bits and bytes
When working with RAM states we can choose how to represent the RAM
state: as bytes or as bits. Note that bytes are normalized by dividing them
by 255, so that they lie in the range [0, 1]. With normalized input values,
convergence is usually faster than with non-normalized data
(Y. A. LeCun et al., 2012).
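The two representations can be sketched as below. The helper names are hypothetical, and the bit order within a byte (most significant bit first) is an assumption; the thesis does not state which order was used.

```python
def ram_as_normalized_bytes(ram):
    """Scale each byte (0..255) into the range [0, 1]."""
    return [byte / 255 for byte in ram]


def ram_as_bits(ram):
    """Unpack each byte into 8 bits, most significant bit first (assumed)."""
    return [(byte >> shift) & 1 for byte in ram for shift in range(7, -1, -1)]


ram = [0, 5, 255]  # three example bytes; a full Atari 2600 RAM state has 128
print(ram_as_normalized_bytes(ram))
print(ram_as_bits(ram))
```

A full RAM state of 128 bytes gives 128 inputs in the first form and 1024 inputs in the second, both already within [0, 1].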
Figure 36 shows the results when the input values are the normalized bytes
and the hidden layer has 128 nodes. The autoencoder thus approximates the
identity function, with as many hidden nodes as input values. As can be
seen, it does not translate the RAM bytes into a good feature vector. There
is a wide range of possible reasons why the bytes do not yield good
features. For example, the batch size may have been too low or too high, a
denoising autoencoder might have helped, or even different activation
functions combined over multiple layers with the same number of hidden
nodes. Of course, with enough time and effort spent on tuning all the
hyperparameters we would eventually get a better result. This is not the
goal of this thesis: we want to find an autoencoder that is as simple as
possible, without too much tweaking, and that still finds a good feature
vector. Another possible explanation is that the agent simply does not have
enough information available in the extracted feature vector, and that
valuable information that was available in the basic RAM has been lost. The
agent still learns better than random play, but not as well as with the
manual features.
[Plot "Autoencoders trained from 128 -> 128": rewards (y-axis, 50 to 400) against episodes (x-axis, 0 to 3000); series: 128->128 Lin, 128->128 Sig, 128->128 Rel, Manual features, Random.]
Figure 36: Autoencoders on 128 bytes
In Figure 37 we see the results of an autoencoder whose input is the RAM
state represented in bits. The same autoencoder was used as with bytes,
with exactly the same settings. As can be seen, the agent could use the
extra information available, in contrast with the 128-byte autoencoder, and
could actually learn from the extracted feature vector. Given this result,
the rest of this thesis investigates the bit version of the RAM states.
[Plot "Autoencoders trained from 1024 -> 1024": rewards (y-axis, 0 to 400) against episodes (x-axis, 0 to 3000); series: 1024->1024 Lin, 1024->1024 Rel, 1024->1024 Sig, Manual features, Random.]
Figure 37: Autoencoders on 1024 bits
5.7 Comparing different activation functions
As said previously, choosing the right activation function can help in
getting better results. Table 6 compares autoencoders which use different
activation functions. For a more visual representation, see Appendix A,
Figures 47, 48 and 49. The table shows the average reward over the last
1000 episodes with the standard deviation. Note that when an activation
function is set, all layers use the same activation. It is also possible to
use different activation functions in different layers, but this was not
investigated. Each activation function has been tested with an autoencoder
going from the input of 1024 to a chosen bottleneck and back to the
original input size. Note that each layer is half the size of the previous
one. So an autoencoder denoted 1024 → 256 uses three hidden layers:
encoding 1024 → 512 to the bottleneck of 256 and decoding back
512 → 1024. As can be seen, a linear activation function performs best for
encoding the original 1024 state to an encoded version of 512. Going deeper
with the linear activation function yields, in this case, NaN values,
because a linear activation is unbounded and the activations can keep
growing. This is in contrast with the sigmoid function, which is bounded
between [0, 1], and ReLU, which sets negative inputs to zero so that only
part of the neurons, roughly 50 %, is active. Note that an autoencoder with
a linear activation function is nearly equivalent to PCA, Principal
Component Analysis. PCA is a linear technique for dimensionality reduction
that finds the principal components: the directions in which the data is
most spread out and has the largest variance. Linear autoencoders can only
return a linear encoding because the activation is also linear; therefore
we focus our further research on non-linear activation functions.
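The difference in boundedness between the three activation functions can be illustrated with a toy calculation. This is only a sketch: it pushes a single value through a chain of one-node layers with a fixed weight of 3, which is a made-up setting and does not reproduce the NaN values observed in the experiments.

```python
import math


def linear(x):
    return x


def sigmoid(x):
    # Numerically stable form for negative and positive inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def propagate(x, activation, weight, depth):
    """Apply `depth` one-node layers: x <- activation(weight * x)."""
    for _ in range(depth):
        x = activation(weight * x)
    return x


for name, fn in [("linear", linear), ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, propagate(1.0, fn, weight=3.0, depth=7))
```

With the linear activation the value is multiplied by the weight at every layer and grows without bound as the depth increases, while the sigmoid output always stays between 0 and 1. ReLU is also unbounded for positive inputs, so boundedness alone does not explain all differences in Table 6.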
As can be seen, the ReLU activation does not perform well compared to the
other activation functions. Sigmoid performs well when using a hidden layer
with 1024 nodes, as does the linear activation. To confirm the differences
statistically we used the Mann-Whitney U test, which does not assume that
the data is normally distributed. The first test was between the Manual
features and the Basic features and results in a p-value of
9.63357008643e-07. Since this p-value is smaller than 0.05, we reject, at
the 5% significance level, the hypothesis that the Manual features and the
Basic features perform the same. This is exactly what can be seen in
Figure 35.
Autoencoder        Linear             Sigmoid            ReLU
1024 → 1024        323.43 (± 47.11)   325.01 (± 43.96)   288.85 (± 44.41)
1024 → 512         323.83 (± 45.44)   290.53 (± 38.64)   230.74 (± 39.11)
1024 → 256         NA                 250.09 (± 35.35)   267.08 (± 43.84)
1024 → 128         NA                 250.9 (± 41.42)    191.86 (± 30.04)
1024 → 64          NA                 152.75 (± 23.83)   116.1 (± 25.64)
Manual features    330.87 (± 35.26)
Basic              301.92 (± 36.39)
Table 6: Comparing different activation functions against the number of
hidden layers and nodes
To test statistically whether the encoded features differ from the manual
features, we test the manual features against the different activation
functions for 1024 → 1024, 1024 → 512 and 1024 → 256 (Table 7).
Autoencoder    Linear             Sigmoid              ReLU
1024 → 1024    0.0322843539108    0.00798380208929     1.6703515625e-06
1024 → 512     0.293818666313     6.30184822139e-08    1.6703515625e-06
1024 → 256     NA                 5.73303143758e-07    2.99746184625e-06
Table 7: P-values of the Mann-Whitney U test
As can be seen, almost all p-values are lower than 0.05, which means that
at the 5% significance level these features perform differently from the
manual features. This does not say whether they are better or worse
features. The exception is the autoencoder with a linear activation
function and 1024 → 512, for which we cannot conclude that there is a
difference.
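For readers unfamiliar with the test, the following is a simplified sketch of a two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, so its p-values will not match the ones in Table 7 exactly; the thesis does not state which implementation was used (a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be the usual choice). The reward lists are made-up toy data.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j.
        mean_rank = (i + 1 + j) / 2
        rank_sum_x += mean_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


manual = [330, 341, 325, 352, 338, 347]  # hypothetical episode rewards
basic = [301, 296, 310, 289, 305, 299]   # hypothetical episode rewards
print(mann_whitney_u(manual, basic))
```

The test only compares ranks, which is why it needs no normality assumption; a small p-value says the two reward samples differ, not which feature set is better.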
5.8 Initializing Q-values
When designing a SARSA agent it is important to choose good initial
Q-values. The initialization of the Q-values influences the speed of
learning and the efficiency of the algorithm (Koenig & Simmons, 1996). When
the agent is put in a setting, for example the grid world, it needs to find
the goal before it can even search for a good policy. One way to do this is
by letting the agent explore the whole world; while exploring, it adapts
the Q-values so that they record whether taking a certain action in a
certain state is good or not. If we have some prior knowledge we can also
set the Q-values via some rule. For example, if we know the goal of the
setting, we can set the Q-values to a higher or lower value to reduce the
exploring. For example:
Q(s, a) = \begin{cases} 0 & \text{if } s \in G,\ a \in A \\ q & \text{if } s \in S \setminus G,\ a \in A \end{cases}
where Q(s, a) is set to zero when the state is a goal state and to some
value q when the state is not a goal state. The agent then starts learning
from the given Q-values, in a more optimistic way; as a result, the
learning time and the amount of exploration are lower than when everything
is initialized to the same number.
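For a tabular setting such as the grid world, the initialization rule can be sketched as follows. The 2 × 2 grid, the action names and the value q = 1.0 are made up for illustration; the experiments in this thesis use feature-based SARSA(λ) rather than a table.

```python
def init_q_values(states, actions, goal_states, q):
    """Q(s, a) = 0 for goal states and q for all other states."""
    return {
        (s, a): (0.0 if s in goal_states else q)
        for s in states
        for a in actions
    }


# Hypothetical 2 x 2 grid world with the goal in the corner (1, 1).
states = [(row, col) for row in range(2) for col in range(2)]
actions = ["up", "down", "left", "right"]
q_table = init_q_values(states, actions, goal_states={(1, 1)}, q=1.0)
print(q_table[((0, 0), "up")], q_table[((1, 1), "up")])
```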
Unfortunately Space Invaders is a never-ending game, so setting a different
value on the goal state cannot be done. Even if a goal state were known, we
could not set it differently from other states because the features are
blackbox and do not mean anything to a human. We can, however, initialize
all Q-values to some other number and see how learning evolves and whether
the agent can learn more optimistically. In all previous graphs and tables
the Q-values were initialized to zero. These experiments were run with
sigmoid, so we know our values will be between [0, 1]. Averaging all
Q-values over the last 500 episodes of our best autoencoders gives a value
of approximately 0.57. So initializing the Q-values to −1 and 1 should
affect the speed of learning.
Figure 38 shows the results when the Q-values are initialized to
Q(s, a) = −1 and Figure 39 when they are initialized to Q(s, a) = 1. We
can immediately see the difference in how quickly the agent learns. Take
for example, in Figure 38 and Figure 39, the autoencoder trained from
1024 → 1024, thus learning the identity function. When the Q-values are −1
the agent learns incredibly slowly; so slowly that only after 3000 episodes
does it reach the same reward as random play. At episode 500, in contrast,
the agent whose Q-values are initialized to 1 already has 4 times more
reward than the agent whose Q-values are initialized to −1. Generally
speaking, the rewards tend to the same result as with Q = 0, as long as the
experiments run long enough.
[Plot "Sigmoid activation with Q-values=-1": rewards (y-axis, 0 to 450) against episodes (x-axis, 0 to 3000); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]
Figure 38: Q = −1
[Plot "Sigmoid activation with Q-values=1": rewards (y-axis, 0 to 450) against episodes (x-axis, 0 to 3000); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]
Figure 39: Q = 1
Table 8 shows the average reward received when using different
autoencoders. This average was taken over the last 500 episodes. As can be
seen, Q = −1 does not perform well: it takes too much time to learn.
Between Q = 0 and Q = 1 the picture is mixed. According to Table 8, the
autoencoders trained to 512 and 256 perform slightly better when the
Q-values are initialized to 1, while the identity autoencoder and the
deeper autoencoders with more hidden layers tend to learn better with the
initialization Q = 0. Since we are investigating how deep we can go with
deep learning before our unsupervised feature extraction method loses too
much information, we will continue from now on with the Q-values
initialized to Q = 0.
Autoencoder    Q = −1             Q = 0              Q = 1
1024 → 1024    64.78 (± 17.71)    325.01 (± 43.96)   267.42 (± 37.49)
1024 → 512     216.06 (± 31.57)   290.53 (± 38.64)   296.39 (± 40.5)
1024 → 256     152.03 (± 24.25)   250.09 (± 35.35)   253.82 (± 39.31)
1024 → 128     239.04 (± 35.52)   250.9 (± 41.42)    238.25 (± 37.14)
1024 → 64      158.91 (± 26.29)   152.75 (± 23.83)   150.84 (± 29.8)
Table 8: The difference between different initial Q-values
5.9 Pretraining and extracting other layers
In the previous experiments only the bottleneck was used as the extracted
feature vector. But since we are experimenting with deep learning, and thus
with multiple layers, it could also be useful to train to a very small
bottleneck and extract a different layer than the bottleneck. This was also
done in previous research (Stadie, Levine, & Abbeel, 2015), where they did
not take the bottleneck layer. Figure 40 illustrates this: the third hidden
layer, marked with the red box, is extracted instead of the bottleneck.
Figure 40: Example of an autoencoder with another layer extracted than the
bottleneck
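Extracting a layer other than the bottleneck amounts to stopping the forward pass early. The sketch below shows this for a toy 4 → 2 → 1 → 2 → 4 sigmoid autoencoder with made-up constant weights; the function names and the weight layout (one row of weights per output node) are assumptions for illustration, not the thesis implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected sigmoid layer; weights holds one row per output node."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def extract_layer(inputs, layers, layer_index):
    """Run the input through layers 0..layer_index and return that activation."""
    activation = list(inputs)
    for weights, biases in layers[: layer_index + 1]:
        activation = dense(activation, weights, biases)
    return activation


# Toy autoencoder with made-up weights of 0.1 and zero biases.
sizes = [4, 2, 1, 2, 4]
layers = [
    ([[0.1] * n_in for _ in range(n_out)], [0.0] * n_out)
    for n_in, n_out in zip(sizes, sizes[1:])
]
state = [1, 0, 1, 1]
print(extract_layer(state, layers, 0))  # first hidden layer, 2 features
print(extract_layer(state, layers, 1))  # bottleneck, 1 feature
```

The autoencoder is still trained down to the small bottleneck; only the features handed to the agent are taken from an earlier, wider layer.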
When going into deep learning it is also a good idea to pretrain the
network. Pretraining is when each layer is trained separately and the
layers are then stacked together. For example, if we want a pretrained
autoencoder from 1024 → 256, we first train a separate autoencoder from
1024 → 512. Then all the weights are saved together with the encoded form
of every input, so our new input layer has size 512. The next step is
creating an autoencoder from 512 → 256, which is trained with our new,
encoded, input features. Afterwards a whole new autoencoder is created with
the weights that were saved for each layer. The autoencoder can then be
fine-tuned by training again on the whole network. Note that this is very
time-consuming because multiple autoencoders are trained.
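The order of the pretraining stages can be written down as a plan. This sketch only lists which shallow autoencoders would be trained and what the final stacked network looks like, assuming halving layer sizes; it does not train anything, and `pretraining_plan` is a hypothetical helper name.

```python
def pretraining_plan(input_size, bottleneck):
    """Return (stages, stacked_sizes) for layer-wise pretraining.

    Each stage (a, b, a) is a shallow autoencoder a -> b -> a that is
    trained on the encoded output of the previous stage.
    """
    sizes = [input_size]
    while sizes[-1] > bottleneck:
        sizes.append(sizes[-1] // 2)
    if sizes[-1] != bottleneck:
        raise ValueError("bottleneck is not reachable by halving")
    stages = [(a, b, a) for a, b in zip(sizes, sizes[1:])]
    # The fine-tuned network stacks all encoders, then all decoders in reverse.
    stacked = sizes + sizes[-2::-1]
    return stages, stacked


stages, stacked = pretraining_plan(1024, 256)
print(stages)
print(stacked)
```

Going down to a bottleneck of 4 requires eight such stages before fine-tuning, which is why pretraining is so time-consuming.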
[Boxplots "Training deep with pretraining and extracting layer 512": rewards (y-axis, 50 to 400) for 1024 -> 256: 512, 1024 -> 128: 512, 1024 -> 64: 512, 1024 -> 32: 512, 1024 -> 16: 512, 1024 -> 8: 512, 1024 -> 4: 512, Manual features, Basic and Random; average lines: Manual features, Random, Basic.]
Figure 41: Pretraining with extraction of layer 512
Figure 41 shows the results when autoencoders are trained with pretraining
to a very small layer and each time the layer of 512 nodes is extracted.
Boxplots are shown for the last 1000 episodes, together with boxplots of
the Basic, Manual features and Random with their average lines for
comparison. As can be seen, the deeper the autoencoder, going to layers of
32, 16, 8 and 4, the more information is lost. This results in rewards
which are not good compared to the results of Manual features and Basic.
But pretraining has helped in training the autoencoder of 1024 → 64: its
performance is better than the Basic but still below the Manual features.
See Appendix A, Figure 50 for the detailed plot.
Figure 42 shows the result of training to a layer with 4 nodes. This means
that there are a total of 8 layers that can be extracted. Training to a
layer with 4 features and extracting those 4 features does not yield a good
score: too much information is lost when going from 1024 features to only
4. But when a layer with more hidden nodes than these 4 is extracted from
the same autoencoder, it carries more information and gives a higher
result. The reason that extracting 512 nodes here does not yield a higher
score than just training to one layer of 512 nodes is the training error.
As previously mentioned, the deeper a network is trained the more
information is lost (Section 5.3).
[Boxplots "Training deep to a layer with 4 nodes": rewards (y-axis, 50 to 400) for 1024 -> 4: 512, 1024 -> 4: 256, 1024 -> 4: 128, 1024 -> 4: 64, 1024 -> 4: 32, 1024 -> 4: 16, 1024 -> 4: 8, 1024 -> 4: 4, Manual features, Basic and Random; average lines: Manual features, Random, Basic.]
Figure 42: Pretraining with extraction to a hidden layer of 4 nodes
A more detailed table of all the autoencoders with all their possible layers
extracted is depicted in Appendix A Table 9 with their result and standard
deviation of the 1000 last episodes.
As suggested by Srivastava et al. (2014), adding dropout to a deep network
can prevent the network from overfitting. Remember that when the network is
trained on samples it will try to fit the data perfectly. When the network
can mimic the training samples almost perfectly but cannot mimic the test
samples, or new samples from our agent, it is overfitting. By adding
dropout, and thus randomly dropping nodes and their connections, the
network is forced to learn the samples via different nodes and connections.
Figure 43 shows what happens to the performance when adding dropout. With
fewer hidden layers, and therefore fewer hidden nodes in total, the rewards
gained by the agent are worse than before. But the autoencoders
1024 → 32, 16, 8 perform better than before. The network was probably
overfitting and trying to recreate all training samples exactly; by using a
dropout of 30% this can be avoided. Although the boxplots show that the
autoencoder 1024 → 256 : 512 with dropout has a lower reward after 3000
episodes than the same autoencoder without dropout, Figure 44 shows that
its learning curve, the black line, has not converged and is still
increasing. So adding dropout makes learning slower, for the autoencoder as
well as for the agent.
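The effect of dropout on a single layer can be sketched as below. This uses the "inverted" variant, where the surviving activations are scaled up during training, which is an assumption made here; Srivastava et al. (2014) describe scaling the weights at test time instead, and the thesis does not state which variant its framework used.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest.

    Assumption: inverted dropout, so the expected activation is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.6, 0.3, 0.8]
print(dropout(hidden, rate=0.3, rng=rng))
```

Because a different random subset of nodes is dropped for every training sample, no single node can be relied on to memorize the training data, at the price of slower training.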
[Boxplots "Training deep with pretraining and extracting layer 512: dropout": rewards (y-axis, 50 to 400) for 1024 -> 256: 512, 1024 -> 128: 512, 1024 -> 64: 512, 1024 -> 32: 512, 1024 -> 16: 512, 1024 -> 8: 512, 1024 -> 4: 512, Manual features, Basic and Random; average lines: Manual features, Random, Basic.]
Figure 43: Pretraining with extraction of layer 512 with dropout
[Plot "Training deep with pretraining and extracting layer 512 with dropout": rewards (y-axis, 0 to 500) against episodes (x-axis, 0 to 3000); series: 1024 -> 4: 512, 1024 -> 8: 512, 1024 -> 16: 512, 1024 -> 32: 512, 1024 -> 64: 512, 1024 -> 128: 512, 1024 -> 256: 512, Manual features, Random.]
Figure 44: Pretraining with extraction of layer 512 with dropout
5.10 Combination of RAM and layer
Combining the RAM with the encoded version of the RAM could tell us how
much the encoded version of the RAM is contributing. Figure 45 shows the
results; for a more detailed plot see Appendix A, Figure 51. Adding the RAM
state gives a boost to a poorer feature extraction. Note that it is
important that the RAM state lies in [0, 1], because the sigmoid activation
function also limits the encoded values to [0, 1]. Nonetheless, with a
weaker feature extraction the original RAM state takes over and is used
rather than the features extracted by the autoencoder. Figure 45 also shows
the difference between the boxplots of 1024 → 512 + RAM and 1024 → 512.
The features from the autoencoder combined with the RAM perform a little
better than the features from the autoencoder alone. This means that the
autoencoder has not captured all valuable information that was in the RAM;
if it had, the performance would have been the same. It can be argued,
however, that the difference is minimal, so it has captured most of the
valuable information.
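The combination itself is a simple concatenation, sketched here with a range check that reflects the [0, 1] requirement mentioned above. The function name and the toy values are hypothetical.

```python
def combine_ram_and_encoding(ram_bits, encoded):
    """Concatenate RAM bits with encoded features, checking the [0, 1] range."""
    features = list(ram_bits) + list(encoded)
    if any(not 0.0 <= value <= 1.0 for value in features):
        raise ValueError("all features must lie in [0, 1]")
    return features


ram_bits = [1, 0, 0, 1]   # toy stand-in for the 1024 RAM bits
encoded = [0.12, 0.87]    # toy stand-in for a sigmoid-encoded layer
print(combine_ram_and_encoding(ram_bits, encoded))
```

For the 1024 → 512 autoencoder this gives the agent 1024 + 512 features per state.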
[Boxplots "Combining RAM and encoded RAM": rewards (y-axis, 50 to 400) for 1024 -> 512 + RAM, 1024 -> 256 + RAM, 1024 -> 128 + RAM, 1024 -> 64 + RAM, Manual features, Basic, 1024 -> 512 and Random; average lines: Manual features, Random, Basic.]
Figure 45: Combining the original layer with the encoded version
5.11 Visualizing high dimensional data
It is also possible to visualize our high-dimensional data using a
technique called t-SNE, t-Distributed Stochastic Neighbor Embedding (Van
der Maaten & Hinton, 2008). It maps the high-dimensional data onto a
two-dimensional space such that states that are very similar end up close
together. We take our best autoencoder, 1024 → 512 with the sigmoid
function, and save all the encoded states. These are then mapped to a
two-dimensional space using t-SNE, which results in a scatter plot. Every
point then gets a color using the following:
colors = max(φ · θ)
where φ is the encoding of the RAM states and θ the state-action weights.
φ has dimension (samples × nodes), where nodes is the number of values in
our autoencoder encoding, and θ has dimension (nodes × actions), where
actions is the number of possible actions that the agent can take. Taking
the dot product gives an array of dimension (samples × actions); taking the
maximum over the actions then gives a one-dimensional array, which holds
the maximum Q-value for each input state. Figure 46 shows the result for
the last 10,000 RAM states, encodings and state-action values. As can be
seen, there are clusters with the same color: red, some blue-ish and even
some green. This means that there are RAM states that are comparable and
that are mapped close together by the autoencoder. This is an indication
that the features we use to learn values are in fact relevant features for
the task, even though the values are not used to learn the features.
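The coloring rule can be written out as a small matrix computation. The sketch below uses plain Python lists and made-up numbers for three states, two encoding nodes and two actions; `max_q_per_state` is a hypothetical helper name.

```python
def max_q_per_state(phi, theta):
    """Return max over actions of (phi . theta) for every sample.

    phi:   samples x nodes  (encoded states)
    theta: nodes x actions  (state-action weights)
    """
    n_actions = len(theta[0])
    colors = []
    for features in phi:
        q_values = [
            sum(f * theta[node][action] for node, f in enumerate(features))
            for action in range(n_actions)
        ]
        colors.append(max(q_values))
    return colors


phi = [[1, 0], [0, 1], [1, 1]]    # 3 samples, 2 nodes (toy values)
theta = [[0.5, 2.0], [3.0, 1.0]]  # 2 nodes, 2 actions (toy values)
print(max_q_per_state(phi, theta))
```

Each entry of the result is the value that determines the color of one point in the scatter plot.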
[Scatter plot of the t-SNE embedding of the encoded states, with points colored by their maximum Q-value.]
Figure 46: t-SNE
Chapter 6
Conclusions
We have developed a method for unsupervised feature extraction that
outperforms the use of raw input features and almost matches the manual
feature encoding methods. Our method is based on the use of autoencoding
neural networks to learn a compressed representation of the input data. We
have compared multiple autoencoder-based approaches empirically. A number
of conclusions can be drawn from these experiments. The non-linear
autoencoder is in this case not better than a linear autoencoder. The
linear autoencoder can compete with the Manual features, but it could just
as well have been replaced by PCA, which would yield nearly the same
results. It does pay off to investigate different activation functions
because, as can be seen in the graphs, they make a big difference.
When fine-tuning autoencoders and reducing from a large dimension to a very
small dimension through many layers, it is a good idea to add pretraining
and dropout. These mechanisms are needed so that the autoencoder does not
overfit on the training data.
In the visualization of the autoencoder we can indeed see some clusters,
thus the autoencoder does find a representation where the input RAM
dimension is well represented by the encoded states together with the
SARSA(λ) values.
When using autoencoders as a feature extraction method, different layers,
activation functions and even different input representations must be
investigated to get a wide range of possibilities for choosing the best
autoencoder. This research shows that when working on blackbox data,
because RAM is not humanly interpretable, it is possible to get a better
result than with the plain features.
6.1 Future work
This thesis is entirely based on RAM states. Because RAM states are
blackbox, it is difficult to see or interpret what happens. We know that
the RAM state contains the entire state of a game: it records where the
agent is, whether the laser is fired and in what direction. Unfortunately
it is practically impossible to read these things from the RAM state. This
is in contrast with frames. ALE also offers the possibility to receive
frames, which consist of pixels with different color values. In Atari 2600
games each color stands for a specific item, for example green is the
player's ship and orange the shield. These are useful features that can be
used to learn in a better way. They can be obtained by removing the
background and the static colors like the score, the khaki base and so on.
But when the agent receives pixels, it does not know everything that
happens: a frame does not contain the entire state of the game. For
example, in Figure 32 the agent receives a frame, but it cannot determine
from a single frame where the laser is going. This laser can come from the
agent itself, from a few time steps back, or from the aliens. To overcome
this problem multiple frames can be used instead of a single frame, unlike
in this thesis, where we used only 1 RAM state per time step.
Appendices
Appendix A
Extended graphs and tables
[Plot "Gameplay with autoencoders and linear activation function": rewards (y-axis, 0 to 400) against episodes (x-axis, 0 to 3000); series: 1024->512, 1024->1024, Manual features, Random.]
Figure 47: Autoencoders with multiple hidden layers with a Linear activation
function
[Plot "Gameplay with autoencoders and Sigmoid activation function": rewards (y-axis, 50 to 400) against episodes (x-axis, 0 to 3000); series: 1024->16, 1024->32, 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]
Figure 48: Autoencoders with multiple hidden layers with a Sigmoid activa-
tion function
[Plot "Gameplay with autoencoders and ReLU activation function": rewards (y-axis, 50 to 400) against episodes (x-axis, 0 to 3000); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]
Figure 49: Autoencoders with multiple hidden layers with a ReLU activation
function
[Plot "Training deep with pretraining and extracting layer 512": rewards (y-axis, 0 to 500) against episodes (x-axis, 0 to 3000); series: 1024 -> 4: 512, 1024 -> 8: 512, 1024 -> 16: 512, 1024 -> 32: 512, 1024 -> 64: 512, 1024 -> 128: 512, 1024 -> 256: 512, Manual features, Random.]
Figure 50: Pretraining with extraction of layer 512
[Plot "Combining the encoded RAM + original RAM": rewards (y-axis, 50 to 400) against episodes (x-axis, 0 to 3000); series: 1024->64, 1024->128, 1024->256, 1024->512, Manual features, Random, RAM.]
Figure 51: Combining the original layer with the encoded version
Extracted   1024 → 256        1024 → 128        1024 → 64         1024 → 32         1024 → 16         1024 → 8          1024 → 4
Layer 512   291.67 (± 33.03)  297.2 (± 35.01)   306.86 (± 38.53)  253.74 (± 32.01)  251.57 (± 30.44)  249.99 (± 32.69)  236.42 (± 29.71)
Layer 256   294.01 (± 37.12)  289.75 (± 35.69)  270.12 (± 34.52)  258.19 (± 30.82)  240.92 (± 33.65)  247.75 (± 34.49)  251.18 (± 31.93)
Layer 128   -                 272.67 (± 34.38)  232.15 (± 30.43)  233.08 (± 30.86)  242.41 (± 31.82)  210.0 (± 32.45)   238.56 (± 29.51)
Layer 64    -                 -                 228.96 (± 27.87)  219.76 (± 26.28)  249.08 (± 35.29)  204.68 (± 28.71)  215.38 (± 28.74)
Layer 32    -                 -                 -                 240.66 (± 30.14)  223.82 (± 28.25)  213.47 (± 28.38)  212.85 (± 27.6)
Layer 16    -                 -                 -                 -                 238.78 (± 32.97)  248.57 (± 31.52)  185.06 (± 27.21)
Layer 8     -                 -                 -                 -                 -                 191.69 (± 31.47)  153.39 (± 23.84)
Layer 4     -                 -                 -                 -                 -                 -                 145.6 (± 29.45)
Table 9: Training to a specific layer and extracting a chosen layer
Chapter 7
Bibliography
Anji. (n.d.). Pole balance. Retrieved April 29, 2016, from
http://anji.sourceforge.net/polebalance.htm
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive
elements that can solve difficult learning control problems. Systems,
Man and Cybernetics, IEEE Transactions on, (5), 834–846.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013, June). The
arcade learning environment: an evaluation platform for general agents.
Journal of Artificial Intelligence Research, 47, 253–279.
Bengio, Y. (2009). Learning deep architectures for ai. Foundations and
trends® in Machine Learning, 2 (1), 1–127.
Breiman, L. (1996). Bagging predictors. Machine learning, 24 (2), 123–140.
Campbell, M., Hoane, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial intel-
ligence, 134 (1), 57–83.
Collobert, R. & Weston, J. (2008). A unified architecture for natural language
processing: deep neural networks with multitask learning. In Proceed-
ings of the 25th international conference on machine learning (pp. 160–
167). ACM.
Cruz, J. A. & Wishart, D. S. (2006). Applications of machine learning in
cancer prediction and prognosis. Cancer informatics, 2.
Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of computer
and system sciences, 55 (1), 119–139.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural net-
works. In International conference on artificial intelligence and statis-
tics (pp. 315–323).
Google. (n.d.). Google self-driving car project. Retrieved April 29, 2016, from
https://www.google.com/selfdrivingcar/reports/
Hinton, G. E. [Geoffrey E] & Salakhutdinov, R. R. (2006). Reducing the
dimensionality of data with neural networks. Science, 313 (5786), 504–
507.
Hinton, G. E. [Geoffrey E.] & Salakhutdinov, R. R. (2008). Using deep belief
nets to learn covariance kernels for gaussian processes. In J. C. Platt,
D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural in-
formation processing systems 20 (pp. 1249–1256). Curran Associates,
Inc.
Koenig, S. & Simmons, R. G. (1996). The effect of representation and knowl-
edge on goal-directed exploration with reinforcement-learning algorithms.
Machine Learning, 22 (1-3), 227–250.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification
with deep convolutional neural networks. In F. Pereira, C. J. C. Burges,
L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information
processing systems 25 (pp. 1097–1105). Curran Associates, Inc.
LeCun, Y. A., Bottou, L., Orr, G. B., & Muller, K.-R. (2012). Efficient back-
prop. In Neural networks: tricks of the trade (pp. 9–48). Springer.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 (7553),
436–444.
RL-Library. (n.d.). Mountain car. Retrieved April 29, 2016, from http://
library.rl-community.org/wiki/Mountain Car (Java)
Makhzani, A. & Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.
Michie, D. & Chambers, R. A. (1968). Boxes: an experiment in adaptive
control. Machine intelligence, 2 (2), 137–152.
Minsky, M. & Papert, S. (1969). Perceptrons. MIT press.
Mitchell, T. (1997). Machine learning. McGraw-Hill International Editions.
McGraw-Hill.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra,
D., & Riedmiller, M. (2013). Playing atari with deep reinforcement
learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare,
M. G., . . . Ostrovski, G., et al. (2015). Human-level control through
deep reinforcement learning. Nature, 518 (7540), 529–533.
Naddaf, Y. et al. (2010). Game-independent ai agents for playing atari 2600
console games (Doctoral dissertation, University of Alberta).
Nair, V. & Hinton, G. E. [Geoffrey E]. (2010). Rectified linear units improve
restricted boltzmann machines. In Proceedings of the 27th international
conference on machine learning (icml-10) (pp. 807–814).
Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes, 72, 1–19.
Quinlan, J. R. (1987). Simplifying decision trees. International journal of
man-machine studies, 27 (3), 221–234.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information
storage and organization in the brain. Psychological review, 65 (6), 386.
Rummery, G. A. & Niranjan, M. (1994). On-line q-learning using connec-
tionist systems.
Sammut, C. & Webb, G. I. (2011). Encyclopedia of machine learning. Springer
Science & Business Media.
Schaeffer, J., Culberson, J., Treloar, N., Knight, B., Lu, P., & Szafron, D.
(1992). A world championship caliber checkers program. Artificial In-
telligence, 53 (2), 273–289.
Schmidhuber, J. (2015). Deep learning in neural networks: an overview.
Neural Networks, 61, 85–117.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche,
G., . . . Lanctot, M., et al. (2016). Mastering the game of Go with deep
neural networks and tree search. Nature, 529 (7587), 484–489.
Skinner, B. F. (1938). The behavior of organisms: an experimental analysis.
Skinner, B. F. (1948). Superstition in the pigeon. Journal of experimental
psychology, 38 (2), 168.
Skinner, B. F. (1951). How to teach animals. Freeman.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov,
R. (2014). Dropout: a simple way to prevent neural networks from
overfitting. The Journal of Machine Learning Research, 15 (1),
1929–1958.
Stadie, B. C., Levine, S., & Abbeel, P. (2015). Incentivizing exploration
in reinforcement learning with deep predictive models. arXiv preprint
arXiv:1507.00814.
Sutton, R. S. (1996). Generalization in reinforcement learning: successful
examples using sparse coarse coding. Advances in neural information
processing systems, 1038–1044.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: an introduction.
MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves
master-level play. Neural computation, 6 (2), 215–219.
Thorndike, E. L. (1911). Animal intelligence: an experimental study of the
associative processes in animals.
Todes, D. P. (2002). Pavlov’s physiology factory: experiment, interpretation,
laboratory enterprise. JHU Press.
Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for
character recognition - a survey. Pattern recognition, 29 (4), 641–662.
Van der Maaten, L. & Hinton, G. (2008). Visualizing data using t-SNE.
Journal of Machine Learning Research, 9, 2579–2605.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting
and composing robust features with denoising autoencoders. In
Proceedings of the 25th international conference on machine learning
(pp. 1096–1103). ACM.
Watkins, C. J. & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4),
279–292.
Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral
dissertation, University of Cambridge, England).