unsupervised feature extraction for reinforcement learning · 2017-05-12 · unsupervised feature...


Faculty of Science and Bio-Engineering Sciences

Department of Computer Science

Unsupervised Feature Extraction for

Reinforcement Learning

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in de Ingenieurswetenschappen: Computerwetenschappen

Yoni Pervolarakis

Promotor: Prof. Dr. Peter Vrancx

Prof. Dr. Ann Nowe

June 2016

Abstract

With high-dimensional data, chances are that most of the features are not relevant to a specific problem. Several approaches exist to eliminate those features and potentially find better ones: feature extraction transforms the original input features into a new, lower-dimensional feature set, whereas feature selection keeps only the features that are more important than the others. Both can be done in a supervised or an unsupervised manner. In this thesis, we investigate whether autoencoders can be used as an unsupervised feature extraction method on data that is not necessarily interpretable. The data consists of RAM states, which are a black box since we cannot understand them. The autoencoders receive a high-dimensional feature set and transform it into a lower-dimensional one. These new features are then given to a Reinforcement Learning agent that tries to learn from them. The results are compared with a manual feature selection method and with no feature selection at all.

Acknowledgements

First and foremost I would like to thank Prof. Dr. Peter Vrancx for helping me find a subject I am passionate about, for taking the time for weekly updates, and for all his suggestions and numerous conversations on how this subject could be tackled.

Secondly, I would like to thank Prof. Dr. Ann Nowe for piquing my interest in the master Artificial Intelligence when I took her course in my first year at the Vrije Universiteit Brussel.

And finally I would like to thank my mother for supporting me in pursuing my studies at university level and my girlfriend for her endless support.

Contents

1 Introduction
    1.1 Research Question
2 Machine Learning
    2.1 Supervised learning
        2.1.1 Classification
        2.1.2 Regression
    2.2 Unsupervised learning
    2.3 Underfitting and overfitting
    2.4 Bias - Variance
    2.5 Ensemble methods
        2.5.1 Bagging
        2.5.2 Boosting
    2.6 Curse of dimensionality
    2.7 Evaluating models
        2.7.1 Cross validation
3 Artificial Neural Networks
    3.1 Perceptrons
    3.2 Training perceptrons
    3.3 Multilayer perceptron
    3.4 Activation functions
        3.4.1 Sigmoid
        3.4.2 Hyperbolic tangent
        3.4.3 Rectified Linear Unit
        3.4.4 Which is better?
    3.5 Tips and tricks
    3.6 Backpropagation
    3.7 Autoencoders
    3.8 Conclusion
4 Reinforcement Learning
    4.1 The setting
    4.2 Rewards
    4.3 Markov Decision Process
    4.4 Value functions
    4.5 Action Selection
    4.6 Incrementing Q-values
    4.7 Monte Carlo & Dynamic Programming
    4.8 Temporal Difference
        4.8.1 Q-Learning
        4.8.2 SARSA
    4.9 Eligibility traces
    4.10 Function approximation
5 Experiments and results
    5.1 ALE
    5.2 Space Invaders
    5.3 Reconstruction
    5.4 Flow of experiments
    5.5 Manual features and basic RAM
    5.6 Difference between bits and bytes
    5.7 Comparing different activation functions
    5.8 Initializing Q-values
    5.9 Pretraining and extracting other layers
    5.10 Combination of RAM and layer
    5.11 Visualizing high dimensional data
6 Conclusions
    6.1 Future work
Appendices
A Extended graphs and tables
7 Bibliography

List of Figures

1 Architecture of data processing
2 Example of a decision tree
3 Classification
4 Regression
5 Data of two features
6 k-means clustering
7 Unsupervised learning: reduction of dimensions
    7a MNIST example of the number 2
    7b MNIST reduction of dimensions
8 Difference between under and overfitting
9 Dartboard analogy from (Sammut & Webb, 2011)
10 Bias Variance trade-off
11 Random Forest
12 Searching in different dimensions
    12a 1D space
    12b 2D space
    12c 3D space
13 Example of a perceptron
14 Bitwise operations
    14a AND operator
    14b OR operator
    14c XOR operator
15 XOR with decision boundaries by learnt MLP
16 Multilayer perceptron
17 Other activation functions: linear and step function
18 Sigmoid activation function
19 Hyperbolic tangent activation function
20 ReLU activation function
21 Example of an autoencoder
22 A Skinner's Box from (Skinner, 1938)
23 Agent Environment setting
24 Another view of the agent environment setting
25 Mountain car; image from (RL-Library, n.d.)
26 Pole Balancing; image from (Anji, n.d.)
27 Maze world
28 Eligibility trace; image from (Sutton & Barto, 1998)
29 Replacing traces; image from (Sutton & Barto, 1998)
30 Coarse coding; image from (Sutton & Barto, 1998)
31 The difference between RAM and Frames
    31a RAM
    31b Frames
32 Space Invaders screen
33 MSE of autoencoder with 128 bits input
34 MSE Autoencoder from 1024 bits input
35 Difference RAM and RAM with AND
36 Autoencoders on 128 bytes
37 Autoencoders on 1024 bytes
38 Q = −1
39 Q = 1
40 Extraction of a layer other than the bottleneck
41 Pretraining with extraction of layer 512
42 Pretraining with extraction to a hidden layer of 4 nodes
43 Pretraining with extraction of layer 512 with dropout
44 Pretraining with extraction of layer 512 with dropout
45 Combining the original layer with the encoded version
46 t-SNE
47 Linear activation function on an autoencoder
48 Sigmoid activation function on an autoencoder
49 ReLU activation function on an autoencoder
50 Pretraining with extraction of layer 512
51 Combining the original layer with the encoded version

List of Tables

1 Classification of animals
2 Predicting the price of a house
3 V∗(s)
4 π∗(s)
5 Gridworld Example
6 Comparing different activation functions
7 P-values of the Mann-Whitney U test
8 The difference between setting different Q-values
9 Training to a specific layer and extracting a chosen layer

List of Algorithms

1 Q-Learning
2 SARSA
3 SARSA(λ)
4 Q-Learning(λ)

Chapter 1 Introduction

Artificial Intelligence (AI) is a field of computer science that covers a wide range of topics such as Machine Learning, Reinforcement Learning and a newly rising topic, Deep Learning. AI is far more a part of daily life now than it was two decades ago. Take for example robotic vacuum cleaners: the robot knows when to clean the house, when it must return to the charging station to recharge its battery, and even where to pick up again after recharging. More than ten years ago vacuum cleaner robots were not seen as AI, because the robot would simply perform a random walk; if a random walk through a house goes on long enough, the whole house will eventually be cleaned. With newer algorithms, the robot can map the house to vacuum efficiently and work out a detour when an object is suddenly in the way. The only way to gather all this data is to perceive as many features as possible.

Another example is the new generation of smart thermostats, such as the Nest thermostat, developed by Nest Labs (acquired by Google in 2014), or the ATAG ONE thermostat. These thermostats know when the house is empty and when the owners leave for work and come back. By learning the behaviour of the owners, the thermostat adapts automatically: the heating is turned up just before the owners come home and turned down after they leave for work or go to sleep. This can ultimately have a great impact on energy consumption.

All these new domestic devices make daily life easier and feel natural to use. Under the hood there is often a complicated AI that uses many features: measurements from sensory inputs such as velocity, battery usage, an IR detector or a thermometer. These features can be very specific and comprehensive, and can in general consist of thousands or millions of different inputs. Not all of them are equally important; that depends on the task that must be completed.

Going back to the example of the robotic vacuum cleaner, features like the texture of the floor will have an impact on the duration of the task: vacuuming a carpet is harder than vacuuming a concrete floor. Features like the outside temperature will have little to no impact. It is therefore of the utmost importance to select only the features that matter to the task at hand. For simple tasks with few features, manual feature selection is feasible, but when millions of features are in play it is not. DNA microarrays, for example, store an enormous number of features. Manually selecting which features are important for some task is a horrendous job, not only because the person who selects them needs knowledge about both the task and the features, but also because features that seem unimportant in isolation can have a strong influence on the result when combined.

Feature extraction is one of the key concerns in Machine Learning. Many problems may arise when using many features, such as the curse of dimensionality, overfitting, and a longer training time with a much larger chance of getting stuck in local minima. When too many features are used, it is also likely that many of them are redundant and add nothing to, for example, a classification. Many feature selection or extraction methods rely on supervision. There are different feature selection methods, such as entropy, which is sometimes used in Decision Trees; correlation techniques, which find the features that are highly correlated and thus useful for a certain task; and dimension reduction techniques like PCA, which is a linear transformation of the data. All of these techniques are either linear or need some form of supervision.

One supervised feature extraction technique is template matching, where the similarities (or, equivalently, the dissimilarities) between the input data and the labelled data are measured and used for classification. This method is often used in Optical Character Recognition (OCR) software (Trier, Jain, & Taxt, 1996). Researchers have also combined deep learning with ImageNet, an online public database of more than 14 million manually labelled images in roughly 21,000 categories, for classification. Using a deep convolutional neural network with more than 650,000 neurons and 60 million parameters, they were able to classify images from this database (Krizhevsky, Sutskever, & Hinton, 2012). Such neural networks perform supervised feature extraction because their layers learn abstractions of the raw inputs, for example from pixels to edges to objects. Learned features have also been used in regression (Geoffrey E. Hinton & Salakhutdinov, 2008), where unlabelled data is used to learn a good covariance kernel. Autoencoders (Section 3.7) can be used to find features by reducing the dimensionality and extracting the compressed features (Geoffrey E Hinton & Salakhutdinov, 2006; Ng, 2011).

Solving problems is what keeps AI interesting. Researchers in the field of AI have a particular interest in games, because games provide a challenging search space but still have a clear set of rules, and the performance of the AI can be compared directly with human performance. The program Chinook (Schaeffer et al., 1992) was one of the first AIs to play checkers at championship level, beating expert players by using heuristics and search trees. Deep Blue is a chess engine that combines databases of game data with parallel search (Campbell, Hoane, & Hsu, 2002) and defeated the reigning world chess champion. Another example is TD-Gammon, which reached the level of the best human backgammon players by using neural networks and TD(λ). It achieved this by playing repeatedly against itself (Tesauro, 1994) and thereby training itself.

One example of how popular Artificial Intelligence, and Deep Learning in particular, has become is Go. Google DeepMind succeeded in defeating one of the world's top Go players (Silver et al., 2016). Go is a board game with relatively simple rules: players take turns placing black or white stones on the board. Nevertheless, Go is one of the hardest games for an AI to learn, because there are more possible board configurations than there are atoms in the observable universe. Traditional AI algorithms build trees of all possible moves and positions and look for the move that gives the agent the highest chance of winning. Because of the sheer number of choices in Go, this is simply not feasible. Google DeepMind first trained neural networks to predict the moves in recorded games of top Go players. They then used Reinforcement Learning to let these networks play against themselves and learn new moves. Finally, they used Monte Carlo tree search to estimate the value of a state instead of browsing through the whole tree.

Google DeepMind has also created DQN, which combines deep neural networks with reinforcement learning and experience replay (Mnih et al., 2015) and has succeeded in beating human players on different games. Deep Learning is popular not only for classification and regression tasks but also in the field of Natural Language Processing, where a deep network can return tags, semantic roles and even semantic similarity given a sentence (Collobert & Weston, 2008). In this thesis we consider the problem of applying machine learning methods to computer games, using autoencoders as a feature extraction method.

1.1 Research Question

In this thesis we develop automatic feature extraction methods that can be used in combination with Reinforcement Learning (RL). This is an important problem, as the performance of an RL agent depends strongly on the representation used for learning. Selecting good features is challenging because it requires knowledge of the problem domain and of the task to be solved. This thesis investigates the use of unsupervised learning methods to replace manual feature selection. A current example is the blackbox challenge (http://blackboxchallenge.com), where contestants receive a dataset that we, as humans, do not understand. At every time step the agent perceives a new state and a set of actions it can take. The outcomes can be stochastic, and rewards may arrive long after the action was taken. The challenge was designed so that contestants do not know how to interpret the data and therefore cannot select features manually: the data is a black box.

We consider the problem of learning to play Atari games using the RAM state of the game as input. As humans we cannot interpret the RAM state, so the step of manual feature selection is skipped and replaced by unsupervised feature extraction with autoencoders. Figure 1 shows the usual pipeline when dealing with too many dimensions. The idea is to replace the middle box, manual feature selection, with an unsupervised feature extraction method. The autoencoders are trained with different settings and different levels of dimension reduction. The new features are then used by an RL agent with SARSA(λ) that plays Space Invaders on an Atari 2600 emulator. The game shows how well these new features perform in comparison with manual feature selection.

Figure 1: Current infeasible setting when dealing with too many dimensions

This thesis first covers the background of Machine Learning (Chapter 2), Artificial Neural Networks (Chapter 3) and Reinforcement Learning (Chapter 4). These chapters are followed by the experiments (Chapter 5), and the last chapter contains the final conclusions together with some possibilities for future work (Chapter 6).

Chapter 2 Machine Learning

Machine Learning is a broad term that covers many subfields. Giving a single definition is difficult, and many different definitions exist. This thesis adopts the definition of Tom Mitchell, who describes Machine Learning as:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 1997)

Applying this definition to this thesis gives a better understanding. This thesis researches the unsupervised training of autoencoders and, through it, unsupervised feature extraction. The new features are then used to train an agent with Reinforcement Learning. The research question can therefore be divided into two parts.

• The unsupervised autoencoder training (Section 3.7), where the task T is to learn a form of compression and thereby feature extraction. The experience E consists of the RAM or frame states received from our gameplay, and the performance P is the Mean Square Error (MSE, Section 2.1.2), which measures how good the reconstruction of the RAM or frame states is and thus how good an autoencoder is.

• Reinforcement Learning (Chapter 4) of Atari games, where the task T is to learn to play a game and obtain a score as high as possible. The experience E consists of the interactions with the game and the results that come from them. The performance P is measured by the score itself and the total reward.

Machine Learning is highly relevant to many current problems, for example detecting and classifying cancer (Cruz & Wishart, 2006), self-driving cars (Google, n.d.) and speech recognition such as Siri from Apple.

There are three major settings for learning a task T: supervised learning, unsupervised learning and sequential decision making.

2.1 Supervised learning

Supervised learning is the task of receiving input data X together with output data, or labels, y, and constructing a function y = f(x) that maps the input values to the output values. There are two kinds of supervised learning: classification and regression. Classification assigns examples to a small, discrete number of groups, for example the species of an animal. In regression problems, on the contrary, the number of possible outputs can be very large or even continuous.

Supervised learning searches for a function h(x), also known as the hypothesis, that returns an estimated output value for the data x. An example is the linear hypothesis:

$$h(\vec{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

The linear hypothesis has parameters θ that can be optimized through learning. It returns a value $h(\vec{x})$ that can be compared with our labelled data, y. By using different techniques, which will be explained later on, the θ-values can be tweaked so that $h(\vec{x})$ comes as close as possible to y.
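As a rough illustration of the linear hypothesis above, the following Python sketch evaluates h(x) for a given parameter vector. The function name `linear_hypothesis` and all numeric values are made up for this example and do not come from the thesis.

```python
def linear_hypothesis(theta, x):
    """Evaluate h(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    if len(theta) != len(x) + 1:
        raise ValueError("theta needs one more entry than x (the intercept theta_0)")
    # theta[0] is the intercept; theta[1:] holds one weight per input feature.
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


# Hypothetical parameters and feature vector, chosen only for illustration.
theta = [1.0, 2.0, -0.5]
x = [3.0, 4.0]
prediction = linear_hypothesis(theta, x)  # 1.0 + 2.0*3.0 - 0.5*4.0
```

Learning then amounts to adjusting the entries of `theta` so that `prediction` moves closer to the label y of each training example.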

Below we discuss two classes of supervised learning problems: classification and regression.

2.1.1 Classification

A classification problem is a problem where the data is classified, or labelled, into different classes. Take for example input data consisting of features of animals (Table 1): the number of feet, the color and whether the animal has wings. The class is then the species of the animal, in this case a dog, a duck or a spider.

Example   Feet   Color   Wings   y
x1        4      Brown   No      Dog
x2        2      White   Yes     Duck
x3        8      Black   No      Spider
...       ...    ...     ...     ...
x100      8      Brown   No      ?

Table 1: Classification of animals

The classifier tries to determine a decision boundary between the dogs, ducks and spiders. Take the last example in Table 1, an animal with 8 feet, a brown color and no wings. Since there is no label, the classifier must determine what animal x100 is. To a human it is clear that, with only three possible animals, the unknown animal must be a spider, since the spider is the only animal with 8 feet. For the classifier this is not so easy to determine.

Figure 2: Example of a decision tree

One example of a supervised learning method is the decision tree. A decision tree consists of nodes that each ask a question. The answer leads either to another question or to a leaf, and a leaf represents the classification of an example. Much depends on which question is asked first: asking about the most informative feature first has the most potential to produce a shorter and more precise decision tree. The most informative feature can be found with, for example, entropy and information gain. Figure 2 shows an example of a decision tree for the input data of Table 1. This tree could be made shorter by removing the color question that follows the answer yes to the wings question, because in our example an animal with wings is automatically a duck. Different adaptations of decision trees exist that optimize trees by, for example, pruning (Quinlan, 1987).
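To make the notion of "most informative feature" concrete, here is a minimal sketch of entropy and information gain in Python. The first three rows are taken from Table 1; the last three rows, the dictionary keys and the function names are assumptions added purely for illustration, so that each class occurs more than once.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target="label"):
    """Reduction in label entropy obtained by splitting the rows on one feature."""
    before = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return before - remainder


rows = [
    {"feet": 4, "color": "Brown", "wings": "No", "label": "Dog"},
    {"feet": 2, "color": "White", "wings": "Yes", "label": "Duck"},
    {"feet": 8, "color": "Black", "wings": "No", "label": "Spider"},
    # Hypothetical extra examples, not part of Table 1.
    {"feet": 4, "color": "Black", "wings": "No", "label": "Dog"},
    {"feet": 2, "color": "Brown", "wings": "Yes", "label": "Duck"},
    {"feet": 8, "color": "Brown", "wings": "No", "label": "Spider"},
]

gains = {feature: information_gain(rows, feature) for feature in ("feet", "color", "wings")}
best_feature = max(gains, key=gains.get)
```

On this toy data the number of feet separates the three species completely, so it has the highest gain and would be the first question in the tree; color is the least informative of the three features.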

Figure 3 shows a classification with two features, x1 and x2. The red points belong to Class 1 and the blue points to Class 2. The classifier tries to find a decision boundary in the input space such that all data points, or at least as many as possible, fall on the side of their correct class. In an ideal situation the decision boundary separates the classes exactly. In real-world data this is highly unlikely, since data is often noisy and/or corrupted. The classifier then needs to find the boundary for which the cost of misclassification is lowest.

[Plot: horizontal axis x1, vertical axis x2; legend: decision boundary, Class 1, Class 2]

Figure 3: Classification of 2 features into 2 classes, separated by a decision boundary

2.1.2 Regression

Regression problems cannot be divided into classes; instead they have a continuous target value. Take for example the prediction of house prices from features like the number of bedrooms, kitchens, gardens and garages (Table 2). Obviously the price cannot be treated as a class label, and predicting the output of x100 is not that simple.

Example   Bedroom   Kitchen   Garden   Garage   Bathroom   y
x1        1         1         0        1        1          € 153.314
x2        3         1         1        2        2          € 317.135
x3        6         2         1        3        4          € 683.562
...       ...       ...       ...      ...      ...        ...
x100      2         1         1        0        1          € ?

Table 2: Predicting the price of a house

The question remains how a regression model predicts values. A linear model tries to fit a line through the data points, which in the example above are the numbers of rooms of each type. This line is called the regression line. Figure 4 shows an example where the blue dots are the input data and the green line is a regression line. Multiple regression lines are possible, but not all of them are equally good. A well-known simple linear regression function is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \dots, n$, where n is the number of data entries. The disturbance term $\varepsilon_i$ represents the noise in the data values.
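One common way to obtain such a regression line is ordinary least squares, which has a closed-form solution in the single-feature case. The sketch below is an illustration under assumptions, not the procedure used in the thesis: it uses only the bedroom column of Table 2, it reads a price such as "€ 153.314" as 153314 euros (a dot as thousands separator), and the function name is invented for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Bedrooms and prices of x1..x3 in Table 2 (prices read as whole euros, an assumption).
bedrooms = [1, 3, 6]
prices = [153314, 317135, 683562]
beta0, beta1 = fit_simple_linear_regression(bedrooms, prices)

# Estimated price for x100, which has 2 bedrooms.
predicted_price = beta0 + beta1 * 2
```

With only three examples and a single feature this estimate is of course very rough; it only shows how the fitted line turns an input into a continuous prediction.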

Supervised learning tries to create a function h(x) and, by optimizing its parameter values, to predict y as well as possible. To get an idea of how good a model is, a cost function is needed. A commonly used cost function for regression is the Mean Square Error, or MSE.

$$\mathrm{MSE} = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y(x_i) \right)^2$$

The MSE takes the difference between the predicted output h(x) and the true value y. This difference is squared so that its sign does not matter. The additional factor 2 in the denominator cancels out when differentiating, which will be used for neural networks. The lower the error, the better the hypothesis fits the data.
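The cost function can be written in a few lines of Python. This is a sketch that follows the 1/(2m) convention of the formula above; the function name and the numbers are invented for the example.

```python
def mean_square_error(predictions, targets):
    """MSE with the 1/(2m) convention: sum of squared differences divided by 2m."""
    m = len(targets)
    return sum((h - y) ** 2 for h, y in zip(predictions, targets)) / (2 * m)


# Hypothetical predictions h(x_i) and true values y for three examples.
predictions = [2.0, 4.0, 6.0]
targets = [1.0, 4.0, 8.0]
error = mean_square_error(predictions, targets)  # (1 + 0 + 4) / (2 * 3)
```

For a linear hypothesis, differentiating this cost with respect to a parameter θ_j gives (1/m) Σ (h(x_i) − y_i) x_ij: the 2 that comes down from the square cancels against the 2 in the denominator, which is the simplification the text refers to.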

[Plot: horizontal axis "Inputs, X", vertical axis "Output, y"; legend: input-output vector, prediction]

Figure 4: A regression line between the input values X and the output values y

2.2 Unsupervised learning

Unlike supervised learning, unsupervised learning has no target output y but only input data X (Figure 5). Because there is no target output, it is the job of an unsupervised learning model to find relationships or structure in the input data. This structure can be used to group data or even to reduce dimensions.

One example of finding structure in data and grouping it is k-means clustering (Figure 6), where k clusters are formed. Each cluster has a mean, also called a centroid. First, k random centroids are placed among the data points. In each iteration, every data point is assigned to the closest centroid. When all data points are assigned, each centroid is recalculated as the mean of its cluster and moved there. This is repeated until the centroids no longer move.
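The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation behind Figure 6: starting from k randomly chosen data points is only one possible way to place the initial centroids, and the toy points, the seed and the function names are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster points (a list of tuples) into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: every point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # centroids no longer move: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated toy groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = k_means(points, k=2)
```

On data that is less clearly separated, the result depends on the initial centroids, which is why k-means is usually run several times with different starting points.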

[Plot: horizontal axis x1, vertical axis x2]

Figure 5: Data of two features

Another example of unsupervised learning is dimension reduction with autoencoders (Section 3.7). Autoencoders are a form of artificial neural network trained with a target output equal to their input. In doing so, the autoencoder learns the identity function, and in its internal representation it learns to compress the data. MNIST is a database of handwritten digits in raw feature form. Each digit is a 28x28 image and thus has 784 pixels, or dimensions (Figure 7a). Autoencoders can be used to go from 784 dimensions to 2 dimensions and thereby perform dimension reduction. Each point shown in Figure 7b is the image of a digit like the one in Figure 7a. These points were reduced from 784 to 2 dimensions, and the colors indicate the class to which the digit belongs. It can be seen that the compression maps images of the same digit close to each other.
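The idea of training a network to reproduce its input through a narrow middle layer can be sketched with a deliberately small linear autoencoder in NumPy. This is not the architecture used in the thesis: the toy data, the 8-to-2 layer sizes, the learning rate, the number of steps and the absence of activation functions are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples with 8 features that really lie on a 2-dimensional subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing
X = X / X.std()  # rescale so that a fixed learning rate behaves well

n_inputs, n_hidden = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))  # encoder weights: 8 -> 2
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))  # decoder weights: 2 -> 8


def reconstruction_error(X, W_enc, W_dec):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))


initial_error = reconstruction_error(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(2000):
    code = X @ W_enc            # compressed representation (the bottleneck)
    residual = code @ W_dec - X  # reconstruction minus target; the target is the input itself
    # Gradients of the squared reconstruction error (up to a constant factor).
    grad_dec = code.T @ residual / len(X)
    grad_enc = X.T @ (residual @ W_dec.T) / len(X)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X, W_enc, W_dec)
features = X @ W_enc  # extracted 2-dimensional features, one row per sample
```

After training, only the encoder half is needed for feature extraction: `features` plays the role of the 2-dimensional points in Figure 7b, here for toy data instead of MNIST digits.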

[Plot: horizontal axis x1, vertical axis x2; legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, centroids]

Figure 6: k-means clustering

(a) MNIST example of the number 2 (b) MNIST reduction of dimensions

Figure 7: Unsupervised learning: reduction of dimensions

2.3 Underfitting and overfitting

There are different ways to build models, and not every model is good. Take for example a set of data points generated by some underlying function that we do not know. Each data point consists of one input feature x and an output y with some random noise, y = f(x) + ε. A model can then be built to approximate this underlying function. Figure 8 plots different functions fitted with linear regression. The blue dots denote the samples, the green line represents the unknown underlying function and the blue line is the hypothesis h(x) of our model. The first plot shows a model that is underfitting: it cannot represent the underlying function at all. The model is too simple, since the underlying function cannot be represented by a straight line, which here is a polynomial of degree 1. The second plot shows a model that has learnt the true underlying function, although without knowing the underlying function it remains a hypothesis. In this case it is a polynomial of degree 4. The last plot shows a model that is overfitting: it uses a polynomial of degree 15 and fits every training point too closely. When this model tries to predict unseen data it will fail, because it does not generalize over the dataset but tries to fit the training data perfectly. Note that neither underfitting nor overfitting is desirable.

A good way to test whether a model is good or bad is to divide the data into a training set and a test set. The model is trained on the training set, and when training is done it predicts on the test set. How much the predicted output differs from the actual output of the test set gives a good indication. A common measure is the Mean Squared Error (MSE): the smaller the error, the better the model fits the data. There is a difference between the training error and the test error. The training error is measured on the data the model is trained on: the model receives an input value, predicts the output, and is adapted if the prediction is wrong. The test error is measured when the model is done training: a new set of data is presented to the model, the model predicts the output, and the error indicates how far off the model is.
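As a small illustration of the training and test error, the sketch below splits a data set at random and computes the Mean Squared Error on both parts. The function names, the 80/20 split, the toy data and the fixed stand-in model are assumptions chosen for the example.

```python
import random

def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into a training and a test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: y = 2x plus a small fixed disturbance of +/- 0.05.
samples = [(x / 10, 2 * x / 10 + (0.05 if x % 2 else -0.05)) for x in range(20)]
train, test = train_test_split(samples)

def model(x):
    return 2 * x  # stands in for a trained hypothesis h(x)

train_error = mean_squared_error([y for _, y in train], [model(x) for x, _ in train])
test_error = mean_squared_error([y for _, y in test], [model(x) for x, _ in test])
```

Because the stand-in model matches the underlying function, both errors are equal to the squared disturbance; an overfitted model would instead show a low training error and a high test error.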

Figure 8 presents different models for an unknown underlying function, shown as the green line. The values predicted by the trained model are shown as the blue line, so when a test set is presented to the model, the blue line shows which outcome the model will predict. The samples on which the model has been trained are the blue dots. It can be seen that the first model has a high training error as well as a high test error. A polynomial of degree 1 cannot represent the training data and can certainly not represent a new test set. The next model, a polynomial of degree 4, has a very low training and test error. It fits the training data and predicts the test set fairly well, because the function of the model closely matches the true underlying function. The last model, a polynomial of degree 15, has a low training error: as can be seen, it fits the training samples almost perfectly. But it has a high test error, as it cannot represent the new test data.

Figure 8: Difference between under- and overfitting. From left to right: polynomials of degree 1, 4 and 15, with MSE = 0.37, MSE = 0.04 and MSE = 182212904.43 (axes: x, y; legend: Model, True function, Samples). Image adapted from sklearn ¹

2.4 Bias - Variance

The question remains how the architect of a model can detect whether the model is under- or overfitting. This can be seen by determining the bias and variance. First, we need the expected value, which is the average value of a random variable. A random variable associates numeric values with the different outcomes of an experiment, and its value can change when the experiment is repeated. Repeating an experiment to get average results is thus important. Bias is then the difference between the expected value of the predicted outcome ŷ and the real target outcome y.

\[ \mathrm{Bias}(\hat{y}) = E(\hat{y}) - y \]

Bias measures how far off a model is, on average, from the correct output of the underlying, unknown, function.

Variance measures the variability of the model's predictions with respect to the expected prediction.

\[ \mathrm{Var}(\hat{y}) = E[(\hat{y} - E(\hat{y}))^2] \]

¹ http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

The dartboard analogy (Figure 9) gives a more visual idea of what bias and variance mean. Imagine that someone is throwing darts and that the bullseye represents a good model. If all the darts have been thrown and they are spread out, and thus not close to each other, then the variance is high. Bias, on the other hand, is the average distance to the bullseye. In the case of low bias and low variance, all darts are close to each other and on or close to the bullseye itself.

Figure 9: Dartboard analogy from (Sammut & Webb, 2011)

The mean squared error, MSE, gives us a squared result of how good the model is.

\[ \mathrm{MSE}(\hat{y}) = E[(\hat{y} - y)^2] \]

The MSE can then be decomposed with the bias-variance decomposition.

\[ \mathrm{MSE}(\hat{y}) = (E(\hat{y}) - y)^2 + E[(\hat{y} - E(\hat{y}))^2] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Var} + \mathrm{Error} \]

The last term, the irreducible error, represents the noise in the data.
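The decomposition can be checked numerically. The sketch below assumes a single fixed target without noise (so the irreducible error is 0) and a handful of made-up predictions from separately trained models; the function name and the numbers are illustrative only.

```python
def bias_variance(predictions, target):
    """Return (bias, variance, mse) of repeated predictions for one fixed target."""
    n = len(predictions)
    expected = sum(predictions) / n                      # E(y_hat)
    bias = expected - target                             # E(y_hat) - y
    variance = sum((p - expected) ** 2 for p in predictions) / n
    mse = sum((p - target) ** 2 for p in predictions) / n
    return bias, variance, mse

# Hypothetical predictions of five separately trained models for the target y = 3.0.
bias, variance, mse = bias_variance([2.5, 3.5, 4.0, 3.0, 4.5], target=3.0)
```

For these numbers the bias is 0.5, the variance is 0.5 and the MSE is 0.75, which equals the squared bias plus the variance.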


Figure 10: Bias Variance trade-off

When applying bias and variance to under- and overfitting, it can be seen that underfitting corresponds to a bias that is too high: the model is too simple and cannot learn the underlying function. Overfitting gives a high variance, because the model is too complex and fits the noise instead of the underlying function (Figure 10).

2.5 Ensemble methods

One way to get a better performance is to use ensemble methods. These methods combine different models into one model that is typically more accurate than any single one of them.

2.5.1 Bagging

Bootstrap aggregating, also known as bagging, is mostly used for reducing the variance of a model. Bagging belongs to the class of averaging methods, since it averages the results of several models to obtain a combined result. It starts by taking random subsets of the training data. Because the models are trained on different subsets, they will be different and predict differently. Bagging then combines all separate models into one concluding model (Breiman, 1996). An example of a bagging method is tree bagging or its extension, random forest (Figure 11). It starts with training B trees, for example decision trees. For each tree, a training set is drawn uniformly at random, with replacement, from the pool of training data. After all B trees are trained, the models are ensembled by using the average $\frac{1}{B}\sum_{b=1}^{B} f_b(x)$ or by voting, where the majority rules. This decreases the variance without increasing the bias. Random forests additionally use a random subset of the features while learning the trees.
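The bootstrap and averaging steps can be sketched as follows. To keep the example short, the base learner is a deliberately trivial model that predicts the mean target of its bootstrap sample instead of a decision tree; that substitution, the function names and the toy data are assumptions.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_mean_model(sample):
    """Trivial stand-in for a decision tree: always predict the mean target."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y

def bagging(data, n_models, seed=0):
    """Train n_models base learners on bootstrap samples and average their outputs."""
    rng = random.Random(seed)
    models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / len(models)

# Hypothetical training data: (input, target) pairs with targets 0, 1 or 2.
data = [(x, float(x % 3)) for x in range(30)]
ensemble = bagging(data, n_models=25)
prediction = ensemble(4)
```

Swapping `fit_mean_model` for a real tree learner would turn this into tree bagging; the sampling and averaging code would stay the same.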

Figure 11: Random Forest

2.5.2 Boosting

As in all models, there are strong learners and weak learners. Weak learners are defined as being only slightly better than a random prediction. The idea of boosting is to combine multiple weak learners into a single strong model. The most popular boosting algorithm is Adaptive Boosting, or AdaBoost (Freund & Schapire, 1997). AdaBoost combines the results of the weak learners into a weighted sum or weighted majority vote.

2.6 Curse of dimensionality

One might think that the more features the data has, the better a learner or model will perform. This is not true. Imagine that a €1 coin is dropped on a straight line of 100 meters. The coin will easily be found. If the coin is dropped on a surface of 100 x 100 meters, which is 10000 m², finding it is still possible but not so easy anymore. If the coin is dropped in a 3D space of 100 x 100 x 100 meters, which is 1000000 m³, it is even more difficult than before (Figure 12). This analogy only illustrates the difficulty of finding a coin in a multidimensional space. In machine learning the dimensionality can go up to tens of thousands of dimensions, for example with DNA sequences. The higher the dimension, the sparser the data becomes. One way to reduce dimensions is feature selection or feature extraction, for example with autoencoders.

(a) 1D space (b) 2D space (c) 3D space

Figure 12: Searching in different dimensions

2.7 Evaluating models

After models are created, there is a need to evaluate them. Often, if sufficient data is available, 70 or 80% of the data is taken at random and used to train a model. The remaining data is used to predict and see how far off the model is. This method is not flawless: there is a chance that mostly outliers end up in the test set, which can suggest that the model is bad when in fact it is quite good. Therefore different methods have been devised to get an average estimate of how good a model is.

2.7.1 Cross validation

The first method is cross validation. The goal is to see how effective a model is. There are different cross validation methods, such as k-fold cross validation and the holdout method. These methods are classified as non-exhaustive. The first method, k-fold cross validation, splits the data into k folds or subsets. In each of the k iterations, one subset is taken as the test set and the other subsets are used as the training set. Afterwards, the results of all k iterations are averaged to estimate the error. The holdout method is the same as k-fold cross validation with k = 2: each data point is assigned, at random, to either the training set or the test set.
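A minimal sketch of the k-fold procedure is given below. It only produces the index splits and averages a user-supplied error; the function names and the `evaluate` callback are assumptions made for the illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, k, evaluate):
    """Average the error returned by evaluate(train, test) over all k folds."""
    errors = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)

splits = list(k_fold_indices(10, k=5))
```

Every data point appears in exactly one test set, so each point is used for testing once and for training k - 1 times.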

Chapter 3 Artificial Neural Networks

Artificial Neural Networks (ANN) are machine learning models inspired by the human brain. The brain consists of approximately $10^{11}$ neurons; a neuron is a cell that transmits information to other neurons. The connection between two neurons is called a synapse, and there are approximately $10^{14}$ synapses. These neurons with their synapses can make decisions based on their input; for example, a human can recognize family members immediately when seeing them. This is why researchers wanted to create artificial neurons with a mathematical model for handling information.

3.1 Perceptrons

One type of Artificial Neural Network is the perceptron (Rosenblatt, 1958), which is a binary classifier (Figure 13).

Figure 13: Example of a perceptron

Perceptrons take real-valued inputs and produce one single binary output. The output is calculated from a linear combination of real-valued weights (w) and inputs (x), which results in a value that, depending on a certain threshold, yields zero or one. This can be written as the following function:

\[ f(\vec{x}) = \begin{cases} 0 & \text{if } \vec{w} \cdot \vec{x} + b \le 0 \\ 1 & \text{if } \vec{w} \cdot \vec{x} + b > 0 \end{cases} \]

Here $\vec{w} \cdot \vec{x}$ is the dot product of the two vectors. Alternatively, the bias b can be folded into the weight vector as $w_0$ by setting $x_0$ equal to 1. The bias influences how easy it is to get a 0 or 1 as output. For example, if the bias is negative, the dot product of the vectors $\vec{x}$ and $\vec{w}$ must have a value greater than the absolute value of the bias to get over the threshold. The bias can thus adjust the decision boundary. Note that for perceptrons only a linear decision boundary is possible. Bitwise operations, like AND and OR, can be implemented by one single perceptron by adapting the weights or the bias. Figures 14a and 14b show an example of how the perceptron can represent the bitwise operations AND and OR. The two axes show all values the two bits can take, the color denotes whether the result is 0 or 1 depending on the operation, and the black line is a decision boundary. Not all operations can be represented by one perceptron. XOR, for example, is not linearly separable (see Figure 14c) and thus needs more layers of perceptrons.
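The function above, together with the AND and OR operations, can be written down directly. The weights and biases below are hand-picked for the illustration; many other choices give the same decision boundaries.

```python
def perceptron(weights, bias, inputs):
    """Return 1 if w . x + b > 0 and 0 otherwise."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

# Hand-picked (hypothetical) weights and biases.
def AND(a, b):
    return perceptron([1.0, 1.0], -1.5, [a, b])

def OR(a, b):
    return perceptron([1.0, 1.0], -0.5, [a, b])

bits = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = [AND(a, b) for a, b in bits]
or_table = [OR(a, b) for a, b in bits]
```

The two operations share the same weights and differ only in the bias, which shifts the decision boundary.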

3.2 Training perceptrons

The difficult part of perceptrons is setting the weights in such a way that the perceptron produces the correct output. There are several ways to learn the weights. The first is called the perceptron training rule, where all weights are initialized at random. The next step is iterating over all the training examples, and whenever the classification is wrong the weights are updated by the following rule:

\[ w_j = w_j + \Delta w_j \]

where

\[ \Delta w_j = \eta (t - o) x_j \]

(a) AND operator (b) OR operator (c) XOR operator

Figure 14: Bitwise operations (axes: Bit 1, Bit 2; legend: 1, 0, Boundary)

This is done until all training examples are classified correctly. The rule takes the difference between the target output t and the perceptron’s output o, which is then multiplied by a learning rate η and the input $x_j$. It can be seen that whenever the perceptron’s output is equal to the correct output, the update is equal to 0 and thus no weights are changed. It has been proven that the perceptron training rule converges (Minsky & Papert, 1969), provided the learning rate is sufficiently small and the data is linearly separable.
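A minimal sketch of the perceptron training rule is shown below, applied to the OR operation. For reproducibility the weights start at zero instead of at random values, and the bias is stored as the weight of a constant input $x_0 = 1$; both choices, like the function names, are assumptions made for the example.

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron training rule; examples is a list of (inputs, target) pairs."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)  # weights[0] is the bias, paired with x0 = 1
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            x = [1.0] + list(inputs)
            output = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            if output != target:
                mistakes += 1
                # w_j = w_j + eta * (t - o) * x_j
                weights = [w + eta * (target - output) * xi for w, xi in zip(weights, x)]
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights

def predict(weights, inputs):
    x = [1.0] + list(inputs)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_perceptron(or_examples)
```

Because OR is linearly separable, the loop stops after a few passes over the four examples.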

It is often unknown whether the data is linearly separable. The delta rule therefore uses gradient descent to search for a good approximation of all target outputs, even if the data is not linearly separable. The idea is to minimize the following error:

\[ E = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \]

Here E is the squared error and D the set of all training examples. Note that the factor $\frac{1}{2}$ is used to cancel out the exponent when differentiating. The error is always non-negative because of the square. If the error is small, the perceptron’s output represents the target output well. To find the minimum of E, the derivative with respect to the weights can be taken.

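\[ \nabla E = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right] \]

The gradient gives the direction of the steepest increase of E. To find the steepest decrease, the gradient is negated. The learning rule then becomes:

\[ w = w + \Delta w \]

where

\[ \Delta w = -\eta \nabla E(w) \]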

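For a linear unit with output $o_d = \vec{w} \cdot \vec{x}_d$, this can be written out per weight as

\[ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id}) \]

\[ \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id} \]

The learning rate η determines how big the step size is in the gradient descent search.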
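The batch update can be sketched for a linear unit as follows. The data set, the learning rate and the number of epochs are made-up values for the illustration; the first input of every example is fixed to 1 so that the first weight acts as the bias.

```python
def batch_gradient_descent(examples, eta=0.05, epochs=500):
    """Delta rule for a linear unit o = w . x, summing over all examples per update."""
    n = len(examples[0][0])
    weights = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:
            o = sum(w * xi for w, xi in zip(weights, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]  # eta * sum_d (t_d - o_d) * x_id
        weights = [w + d for w, d in zip(weights, delta)]
    return weights

def squared_error(weights, examples):
    """E = 1/2 * sum over all examples of (t - o)^2."""
    return 0.5 * sum((t - sum(w * xi for w, xi in zip(weights, x))) ** 2
                     for x, t in examples)

# Hypothetical data generated by t = 1 + 2 * x1, with x0 = 1 as the constant input.
examples = [((1.0, x1), 1.0 + 2.0 * x1) for x1 in (0.0, 0.5, 1.0, 1.5, 2.0)]
weights = batch_gradient_descent(examples)
```

With these settings the weights approach (1, 2), the coefficients that generated the data.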

Another variation is called stochastic or incremental gradient descent, where the update is calculated for each training example separately instead of summing over all examples.

\[ \Delta w_i = \eta (t - o) x_i \]

Standard (or batch) gradient descent thus goes through all examples before updating the weights, while stochastic gradient descent takes one example and updates the weights based on that example. Batch gradient descent is a very costly algorithm when the number of training samples is large. Stochastic gradient descent typically makes progress much faster, because it updates the weights after every example, but its updates are noisier and its final error is often not as low as that of batch gradient descent.
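For comparison, the stochastic variant updates the weights after every single example. The sketch below again assumes a linear unit and made-up data; the examples are visited in a new random order in every epoch.

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    """Delta rule for a linear unit, with one weight update per training example."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for x, t in order:
            o = sum(w * xi for w, xi in zip(weights, x))
            weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
    return weights

# Hypothetical data generated by t = 1 + 2 * x1, with x0 = 1 as the constant input.
sgd_examples = [((1.0, x1), 1.0 + 2.0 * x1) for x1 in (0.0, 0.5, 1.0, 1.5, 2.0)]
sgd_weights = stochastic_gradient_descent(sgd_examples)
```

The data here is noise-free, so the stochastic updates settle on the generating coefficients; with noisy targets the weights would keep fluctuating around the minimum.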

3.3 Multilayer perceptron

As explained previously, a single perceptron cannot represent data that is not linearly separable, like XOR. Multilayer perceptrons, or MLPs, can represent this by using multiple layers of perceptrons. This results in, for example, two different decision boundaries for XOR (Figure 15). The layers of an MLP are fully connected, and each perceptron outside the input layer has a non-linear activation function (Figure 16).
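To make the two decision boundaries concrete, the sketch below builds XOR from two hidden threshold units and one output unit. The weights are hand-picked for the illustration, not learnt, and the step activation is used only to keep the example exact.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(a, b):
    """Two-layer network with hand-picked (hypothetical) weights that computes XOR."""
    h_or = step(1.0 * a + 1.0 * b - 0.5)    # hidden unit 1: first boundary (OR)
    h_and = step(1.0 * a + 1.0 * b - 1.5)   # hidden unit 2: second boundary (AND)
    return step(1.0 * h_or - 1.0 * h_and - 0.5)  # output: OR but not AND

xor_table = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden unit draws one straight line in the input space, and the output unit combines the two half-planes into the region between the lines.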

Figure 15: Example of XOR with two decision boundaries learnt by a MLP (axes: Bit 1, Bit 2; legend: 1, 0, Boundary)

Figure 16: Example of a multilayer perceptron with 4 input nodes, 2 hidden layers with each 5 hidden nodes and 3 output nodes

3.4 Activation functions

The activation function, ϕ in Figure 13, is a function, possibly non-linear, applied to the weighted sum of the inputs. For example, a linear neuron, which uses a linear activation function, can be used to switch the output on or off, meaning the input belongs to class A or B if there are only two classes. It thus activates the node or not. The problem with linear neurons is that stacking multiple layers of linear neurons still yields a linear function. A step function, where the output is 0 or 1 depending on the threshold θ, has its own problem: it is not differentiable at the threshold and its gradient is zero everywhere else, so it cannot be trained with gradient descent. There is thus a need for a unit whose output is a non-linear, differentiable function of its input. The advantage of the activation functions described below is that they are all differentiable, with derivatives that are cheap to compute, which keeps the computational load low when training neural networks. The linear and step functions are shown in Figure 17.

Figure 17: Other activation functions: linear and step function

3.4.1 Sigmoid

The sigmoid unit replaces the threshold with a sigmoid function (Figure 18). This results in a continuous function of its input:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

The output maps the input to a value between 0 and 1. The derivative of the sigmoid function is:

\[ \frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) \]
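The derivative formula can be checked against a numerical approximation. The helper `numerical_derivative` (a central difference) is an assumption added for the check and is not part of the network itself.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma(x) * (1 - sigma(x))

def numerical_derivative(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

differences = [abs(sigmoid_derivative(x) - numerical_derivative(sigmoid, x))
               for x in (-2.0, 0.0, 1.5)]
```

The derivative is largest at x = 0, where it equals 0.25, and shrinks towards 0 for large positive or negative inputs.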

Figure 18: Sigmoid activation function

3.4.2 Hyperbolic tangent

The same goes for the hyperbolic tangent or tanh (Figure 19), which maps the input to an output between -1 and 1.

\[ \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{2x} - 1}{e^{2x} + 1} \]

\[ \frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2 \]

Figure 19: Hyperbolic tangent activation function

3.4.3 Rectified Linear Unit

Another, more recently introduced activation is the rectified linear unit (Nair & Hinton, 2010), or ReLU (Figure 20). It has the advantage that in a neural network with randomly initialized weights, only about 50% of the hidden neurons are activated, which results in a sparse activation. ReLU is not differentiable at 0, but it can be differentiated at any other point. In recent years ReLU has grown more popular in Deep Learning, because networks with many layers learn much faster with it (Y. LeCun, Bengio, & Hinton, 2015). Networks that use ReLU without pre-training can also compete with networks that do use pre-training (Glorot, Bordes, & Bengio, 2011).

\[ \mathrm{relu}(x) = \max(0, x) \]

\[ \frac{d}{dx}\mathrm{relu}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases} \]
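A short sketch of ReLU and its derivative is given below, together with a rough check of the sparsity claim. The check assumes that the pre-activations of the hidden neurons are symmetric around 0, which is modelled here with standard normal samples; setting the derivative to 0 at x = 0 is a common convention, not something stated in the text.

```python
import random

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0  # convention (assumption): derivative 0 at x = 0

# Hypothetical pre-activations of 10000 hidden neurons, symmetric around 0.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10000)]
active_fraction = sum(1 for z in pre_activations if relu(z) > 0) / len(pre_activations)
```

Under this assumption roughly half of the neurons output exactly 0, which is the sparse activation described above.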

Figure 20: ReLU activation function

3.4.4 Which is better?

The question then remains which activation function is better. Using a non-linear function is essential when the data is not linearly separable and a non-linear output is wanted. Unfortunately, there is no activation function that is best in all cases. The hyperbolic tangent is often preferred over the sigmoid because its output is centered around 0, as is the input data if it is normalized. This causes the hyperbolic tangent to often converge faster than the sigmoid function (Y. A. LeCun, Bottou, Orr, & Muller, 2012). Over the last few years ReLU has typically been preferred over other activation functions in deep networks, because it suffers much less from the vanishing gradient problem. When learning weights in deep networks with backpropagation, the first layers may learn slowly because of the number of chain rule applications needed before reaching the input layers. Each application multiplies the gradient by a derivative, and after many such multiplications the gradient can become a very small number, which means updating can be very slow.
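The effect of many chain rule applications can be illustrated with a simplified calculation that ignores the weights and multiplies only the activation derivatives of 10 layers. The setup (identical pre-activations in every layer, no weights) is an assumption made to isolate the effect.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_scale(derivative, pre_activation, n_layers):
    """Product of n_layers identical activation derivatives (weights ignored)."""
    return derivative(pre_activation) ** n_layers

# The sigmoid derivative is at most 0.25, reached at x = 0.
sigmoid_scale = gradient_scale(sigmoid_derivative, 0.0, 10)            # 0.25 ** 10
relu_scale = gradient_scale(lambda x: 1.0 if x > 0 else 0.0, 1.0, 10)  # 1.0 ** 10
```

Even in the best case for the sigmoid, the gradient shrinks by a factor of about a million over 10 layers, while an active ReLU passes it on unchanged.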


3.5 Tips and tricks

As can be seen, there are many parameters that can be set for a neural network, and a given setting does not guarantee that the network will converge. There are a few possibilities to speed up the process, although this does not mean it will lead to a good solution. One of those possibilities is the choice between batch and stochastic learning. In batch learning, all the training data is passed through the neural network and only then is the gradient computed and are the weights updated. This is different from stochastic training, where an update is done after a forward pass on a single (random) input. Stochastic learning has the advantage that it is usually much quicker than batch training and often leads to better solutions, although this is not guaranteed.

Another option is to randomize the order of the input, so that two inputs presented after each other are less likely to be related, just as $\vec{x}_1$ and $\vec{x}_{101}$ are less likely to be related than $\vec{x}_1$ and $\vec{x}_2$. For example, two consecutive RAM states from a game are related, but two random RAM states are probably not as related as the two consecutive ones.

It is often good practice to present examples that produce a larger error more often than examples that produce a lower error. Another way to speed up the process is normalizing the input to mean 0. Y. A. LeCun et al. (2012) show that whenever the inputs are, for example, all positive, the weights of a node can only all increase or all decrease together, which means the update rule can only zigzag its way to the best weights. This makes the algorithm inefficient.
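Two of these tips, shuffling the training order and centering the input around 0, can be sketched as follows. The function names and the small table of made-up states are assumptions for the illustration.

```python
import random

def normalize_columns(rows):
    """Shift every feature (column) so that its mean over the data set is 0."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]

def shuffled(rows, seed=0):
    """Return a copy of the rows in random order."""
    copy = list(rows)
    random.Random(seed).shuffle(copy)
    return copy

# Hypothetical consecutive states with three features each.
states = [[0, 255, 10], [2, 250, 10], [4, 245, 10], [6, 240, 10]]
centered = normalize_columns(states)
training_order = shuffled(centered)
```

After centering, every feature takes both positive and negative values (or is constantly 0), so the inputs are no longer all positive.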

3.6 Backpropagation

Backpropagation is used to train a neural network by optimizing the weights of the network with gradient descent. First, the previously defined error E needs to be redefined, because it was the error of only one unit. This can be done by summing the squared differences between the target and the output over all output units k and all training examples d:

\[ E = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 \]

The difference between backpropagation and the previous gradient descent for one output unit is that the error surface of E for a single unit contains only one minimum, while the error surface of a multilayer network can have multiple local minima. This means that backpropagation will converge to one of those local minima, but it is not certain that this local minimum is also the global minimum. This aside, backpropagation still produces good results. The algorithm starts by fixing the number of nodes and outputs and initializing the weights to small random values. For each training example the network calculates the output and the error. It then computes the gradient of that error and adapts the weights of the network accordingly. This is repeated many times, until the network can calculate the output decently. There are many criteria that can be set to end the iteration, for example a fixed number of iterations or looping until the error falls below some threshold. The weights are updated with the following rule:

\[ w_{ji} = w_{ji} + \Delta w_{ji} \]

where

\[ \Delta w_{ji} = \eta \, \delta_j \, x_{ji} \]

This rule is an adapted version of the previously seen delta rule. For output units, the new δ is the previous (t − o), the target value minus the output value, multiplied by the derivative of the activation function ϕ.

\[ \delta_k = \varphi \, (t_k - o_k) \quad \text{where} \quad \varphi = \frac{d}{dx}\phi \]

For the inner nodes, assuming there are only two layers, one output layer and one hidden layer, the δ is defined differently, since there are no target values available. The δ is then calculated by summing the δ of the outputs, weighted by the weights that connect the hidden node to those outputs.

\[ \delta_h = \varphi \sum_{k \in output} w_{kh} \delta_k \quad \text{where} \quad \varphi = \frac{d}{dx}\phi \]

This can be extended to more than two layers by using the chain rule.
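The update rules above can be put together in a small sketch: a network with one hidden layer of sigmoid units and one sigmoid output, trained per example. The network size, the learning rate, the number of epochs and the OR data set are all assumptions chosen to keep the example short; index 0 of every weight list holds the bias.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Return (hidden activations, output); index 0 of each weight list is the bias."""
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    output = sigmoid(w_out[0] + sum(wi * hi for wi, hi in zip(w_out[1:], hidden)))
    return hidden, output

def total_error(w_hidden, w_out, data):
    return 0.5 * sum((t - forward(w_hidden, w_out, x)[1]) ** 2 for x, t in data)

def train(data, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    initial_error = total_error(w_hidden, w_out, data)
    for _ in range(epochs):
        for x, t in data:
            hidden, o = forward(w_hidden, w_out, x)
            # Output unit: derivative of the sigmoid times (t - o).
            delta_o = o * (1 - o) * (t - o)
            # Hidden units: derivative times the weighted delta of the output.
            delta_h = [h * (1 - h) * w_out[j + 1] * delta_o for j, h in enumerate(hidden)]
            w_out = [w_out[0] + eta * delta_o] + [
                w + eta * delta_o * h for w, h in zip(w_out[1:], hidden)]
            for j, d in enumerate(delta_h):
                w_hidden[j] = [w_hidden[j][0] + eta * d] + [
                    w + eta * d * xi for w, xi in zip(w_hidden[j][1:], x)]
    return w_hidden, w_out, initial_error

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w_hidden, w_out, initial_error = train(or_data)
final_error = total_error(w_hidden, w_out, or_data)
```

The hidden deltas are computed with the output weights from before the update, so that all gradients belong to the same forward pass.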


3.7 Autoencoders

Autoencoders are artificial neural networks with the special property that they do not need separate target values. This makes autoencoders an unsupervised learning method, because the target values are set equal to the input values, $\vec{y} = \vec{x}$ (Geoffrey E Hinton & Salakhutdinov, 2006; Ng, 2011). This forces the autoencoder to learn the identity function. This may seem trivial, but setting constraints on the network, like limiting the number of layers and nodes (see Figure 21), creates a bottleneck which forces the autoencoder to reduce the input information and thus to act as a compression technique. Real world examples of input data, such as the pixels of pictures, DNA sequences, and so on, have large numbers of input features. Therefore there is a need for some kind of compression that reduces the number of features.
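The bottleneck idea can be shown on a very small scale. The sketch below trains a linear autoencoder with 2 inputs, 1 hidden node and 2 outputs on made-up points that lie on a line, so that one hidden value is enough to reconstruct both features. The architecture, the initial weight range, the learning rate and the data are assumptions for the illustration and far smaller than the networks used in this thesis.

```python
import random

def train_autoencoder(data, eta=0.05, epochs=500, seed=0):
    """Linear autoencoder with 2 inputs, 1 hidden node and 2 outputs, trained with SGD."""
    rng = random.Random(seed)
    enc = [rng.uniform(0.1, 0.5) for _ in range(2)]  # input -> hidden weights
    dec = [rng.uniform(0.1, 0.5) for _ in range(2)]  # hidden -> output weights
    for _ in range(epochs):
        for x in data:
            h = enc[0] * x[0] + enc[1] * x[1]            # compressed representation
            err = [dec[i] * h - x[i] for i in range(2)]  # output minus target (= input)
            grad_h = err[0] * dec[0] + err[1] * dec[1]
            dec = [dec[i] - eta * err[i] * h for i in range(2)]
            enc = [enc[i] - eta * grad_h * x[i] for i in range(2)]
    return enc, dec

def reconstruction_error(enc, dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        h = enc[0] * x[0] + enc[1] * x[1]
        total += sum((dec[i] * h - x[i]) ** 2 for i in range(2))
    return total / len(data)

# Hypothetical data: the second feature is always half of the first.
data = [(a, 0.5 * a) for a in (-1.0, -0.5, 0.5, 1.0)]
enc, dec = train_autoencoder(data)
error = reconstruction_error(enc, dec, data)
```

The hidden value `h` is the extracted feature: a single number from which both input features can be reconstructed.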

Deep learning is a technique that can also be used together with autoencoders. A deep network has multiple layers, where each layer learns some abstraction of the input features, so that in the end a complex structure of abstractions is created (Bengio, 2009; Y. LeCun et al., 2015; Schmidhuber, 2015). The layers of such a deep network can be initialized by first training an autoencoder on the input layers. The weights of the trained autoencoder then typically provide a good starting point for the deep network weights.

There are different kinds of autoencoders. The first variation is the sparse autoencoder, in which the hidden layers can have more hidden nodes than the original input feature vector. Having more hidden nodes leads to computationally heavier calculations. A sparsity parameter enforces that the average activation of each hidden node stays small, so that a node is active for only a small part of the inputs. This introduces sparsity and can give interesting results (Ng, 2011). Another variation on sparse autoencoders is the k-sparse autoencoder (Makhzani & Frey, 2013), which keeps only the k largest activations and cancels out the rest by setting them to zero. The denoising autoencoder (Vincent, Larochelle, Bengio, & Manzagol, 2008) can be an alternative to sparsity or a bottleneck. It corrupts the input data and trains the autoencoder to fill in the missing parts, and thus to reconstruct the original input data. The corruption is done by randomly removing features during training.
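Two of these variations come down to a simple operation on a vector, sketched below: keeping the k largest hidden activations, and corrupting the input by randomly setting features to zero. The function names, the drop probability and the example vectors are assumptions; the sketch shows only these operations, not a full training procedure.

```python
import random

def k_sparse(activations, k):
    """Keep the k largest activations and set all others to zero."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

def corrupt(features, drop_probability, rng):
    """Denoising-style corruption: randomly set input features to zero."""
    return [0.0 if rng.random() < drop_probability else f for f in features]

hidden = [0.1, 0.9, 0.3, 0.7, 0.2]
sparse_hidden = k_sparse(hidden, k=2)

clean_input = [1.0, 2.0, 3.0, 4.0]
noisy_input = corrupt(clean_input, 0.5, random.Random(0))
```

A denoising autoencoder would receive `noisy_input` as input while still using `clean_input` as the target.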


Figure 21: Example of an autoencoder

3.8 Conclusion

This thesis will primarily focus on autoencoders and their capabilities in unsupervised feature extraction. Because autoencoders have the capability of reducing dimensions, it is interesting to investigate how good the resulting features are. Unfortunately, there is no direct way to see whether these features are good or what they mean, since they are somewhat of a black box. The features are numeric values that are not easily interpreted, unlike for example a decision tree, which is easily readable by humans. By using different activation functions we can also see the impact that a function has on the reconstruction of the input data. Because the autoencoders start in relatively high dimensions, 1024 or 128 depending on how the RAM state is interpreted, it is also a good idea to experiment with adding dropout to an autoencoder. Dropout makes the autoencoder randomly drop nodes together with all their connections. By doing so, the autoencoder is forced to learn alternative connections, which also helps to prevent overfitting (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). The downside of using dropout is a longer learning time.
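The dropping of nodes can be sketched as a random mask over the activations of a layer. The rescaling of the surviving activations (often called inverted dropout) and the drop probability of 0.5 are assumptions made for the example.

```python
import random

def dropout(activations, drop_probability, rng):
    """Randomly zero nodes; survivors are rescaled (inverted dropout, an assumption)."""
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Hypothetical layer of 1000 nodes that all output 1.0.
dropped = dropout([1.0] * 1000, 0.5, random.Random(0))
```

A new mask is drawn for every training example, so the network cannot rely on any single node being present.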

Chapter 4 Reinforcement Learning

The history of Reinforcement Learning, RL, has its roots in psychology. Ed-

ward Thorndike introduced the law of effect, which he defines as:

Of several responses made to the same situation, those which are

accompanied or closely followed by satisfaction to the animal will,

other things being equal, be more firmly connected with the situ-

ation, so that, when it recurs, they will be more likely to recur;

those which are accompanied or closely followed by discomfort to

the animal will, other things being equal, have their connections

with that situation weakened, so that, when it recurs, they will be

less likely to occur. The greater the satisfaction or discomfort, the

greater the strengthening or weakening of the bond. (Thorndike,

1911)

This is one of the key points of Reinforcement Learning: positive interactions are encouraged and negative interactions are discouraged, but not ruled out.

Skinner invented the Skinner’s Box (Skinner, 1938), where animals have to

press a lever when receiving a signal. This can be anything from a light pulse

to a sound. When the animal presses the lever on the correct signal it will

receive a reward, which will most likely be food (Figure 22). But it can also

receive a negative reward like electrical shocks when pressing the lever at the

wrong signal. Skinner is known to have performed this kind of test on pigeons and rats (Skinner, 1951, 1948).

Figure 22: A Skinner’s Box from (Skinner, 1938)

Many other examples exist where animals are trained, like Pavlov’s dogs (Todes, 2002), where dogs were trained to respond to the prospect of receiving food by producing more saliva. This was trained by sounding a signal before giving the dogs their food.

All the previous research, for example Pavlov’s dogs, forms the basis of how dogs are trained today. A dog gets a biscuit if it performs a command correctly, and it gets scolded if it does something bad. This is exactly what Reinforcement Learning tries to recreate.

Reinforcement Learning is used in many current applications. An example is a robot vacuum cleaner that learns when to dock itself to recharge and to restart where it left off, or that even adapts its motors to the material of the floor to save energy and be more efficient. Reinforcement Learning is also widely used in games. Researchers let an AI play backgammon against itself, and by learning from these games and correcting its mistakes it became a master-level player, close to the best backgammon players (Tesauro, 1994).

4.1 The setting

The Reinforcement Learning setting is summarized in Figures 23 and 24 (Sutton & Barto, 1998). An agent is an entity that can observe the environment and act upon it, and by doing so learn from the interactions. The environment is where the actions take place, and it yields a state and a reward. The agent will eventually learn how to map situations onto actions, based on what it has learnt. The goal of the agent is to maximize its reward.

Figure 23: Agent Environment setting

Going back to Figure 23, an agent can interact with the environment at each time step t = 0, 1, 2, 3, 4, ... At each time step t, the environment produces a state $s_t \in S$, where S contains all possible states. Based on this state, an action $a_t \in A(s_t)$ is chosen and taken, where $A(s_t)$ contains all possible actions in state $s_t$. At the next time step t + 1, the environment yields a reward $R_{t+1} \in \mathbb{R}$ together with a new state $s_{t+1}$.
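The interaction between agent and environment can be sketched as a simple loop. The corridor environment, its reward of -1 per step and the two policies below are hypothetical and only serve to show the flow of states, actions and rewards.

```python
import random

class CorridorEnvironment:
    """Hypothetical toy environment: states 0..4 on a line, the goal is state 4."""

    def __init__(self):
        self.state = 0

    def actions(self, state):
        return [-1, +1]  # A(s_t): step left or step right

    def step(self, action):
        self.state = min(4, max(0, self.state + action))
        reward = 0.0 if self.state == 4 else -1.0
        return self.state, reward, self.state == 4  # s_{t+1}, R_{t+1}, terminal?

def run_episode(env, policy, max_steps=1000):
    """Let the policy act until the goal is reached; return the list of rewards."""
    state, rewards = env.state, []
    for t in range(max_steps):
        action = policy(state, env.actions(state))  # a_t chosen from A(s_t)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards

rng = random.Random(0)
random_rewards = run_episode(CorridorEnvironment(), lambda s, acts: rng.choice(acts))
greedy_rewards = run_episode(CorridorEnvironment(), lambda s, acts: +1)
```

The policy that always steps right reaches the goal in four steps; a random policy collects at least as many negative rewards before it arrives.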

Figure 24: Starting from a state, the agent chooses an action. At the next time step the agent receives a reward and arrives in a new state. This is done T times

Example 4.1.1. One of the best known examples in Reinforcement Learning is the mountain car (Figure 25). The agent has to drive the car to the top of the mountain, but the car does not have the power to get to the goal position in one go. Therefore the agent can use gravity in order to get to the goal as quickly as possible. The agent can do this by driving up the hill, letting go and driving backwards to gain momentum. The state of the mountain car consists of the position on the map, which is one dimensional, and the velocity of the car. The actions are driving forward, driving backward and doing nothing. The reward is negative for every time step until the car reaches the goal. Since every reward is negative, the agent maximizes its total reward by reaching the goal in as few time steps as possible.

Figure 25: Mountain car; image from (RL-Library, n.d.)

Example 4.1.2. Another widely used example in Reinforcement Learning is Pole Balancing (Figure 26) (Michie & Chambers, 1968; Barto, Sutton, & Anderson, 1983). A pole is mounted on a cart at its center of mass, which allows the pole to be balanced at an exact point. The cart itself can only move left and right, and the pole can only be moved indirectly, through the movement of the cart. The goal is balancing the pole in an upright position. This can be done by moving the cart back and forth to get to that point. The states are the pole’s angle and angular velocity. The actions are moving left and right, which creates a force that brings the pole to a balanced state. The reward can for example be 1 for each time step until the cart fails to keep the pole balanced.

Figure 26: Pole Balancing; image from (Anji, n.d.)


4.2 Rewards

The goal of Reinforcement Learning is to maximize the reward collected over time. The reward for an action taken at time step $t$ is received at time step $t + 1$, because the agent can only observe the outcome of an action at the next time step. The total reward from time step $t$ onwards can be written as:

\[ G_t = R_{t+1} + R_{t+2} + \dots + R_T \]

where $G_t$ is the return, the total reward that follows time step $t$. The agent does not know the future rewards exactly, so it can only work with the expected return. $T$ is the final time step, at which the agent enters an end state. A final time step exists when the environment has a notion of episodes that start and end, as for example in a maze environment (Figure 27). The agent starts on the left and at each time step it can move to one adjacent square that is not blocked by a wall. The agent needs to find a way out of the maze. This is called an episodic task, because the interaction breaks down into episodes, in each of which the agent takes an action $A_t$ at every time step $t$ until the end state is reached.

Figure 27: A maze world where the agent starts from the left and needs to

find a way to get outside

An episodic task will always eventually reach a final state. When there is no terminal state, the task is called a continuing task. The definition of $G_t$ given above then no longer holds, because there is no final time step $T$; it can be adapted by letting the sum run to infinity instead of to $T$. A further refinement of the return is to add a discount factor. This factor determines whether the agent is more interested in immediate rewards or in future rewards.

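\[
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
    &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \\
    &= R_{t+1} + \gamma G_{t+1}
\end{aligned}
\]

or

\[ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \]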

The discount factor $\gamma$ thus decides whether the agent seeks long-term, future reward or immediate reward. It is bounded by $0 \le \gamma \le 1$. If $\gamma = 0$, then $G_t = R_{t+1}$, meaning that the agent only cares about the reward it is about to receive. If $\gamma = 1$, rewards in the future are just as important as the immediate reward.

In most cases the reward scheme is not given by the problem itself, meaning the rewards are chosen by the designer of the implementation. In the mountain car of Example 4.1.1, the reward is $-1$ at every time step until the car reaches the top of the mountain. This is not always the case: in this thesis the focus lies on the reward scheme given by Space Invaders itself (Section 5.2).
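The effect of the discount factor can be illustrated with a small sketch. The function name `discounted_return` and the example reward list are assumptions made for illustration only; the code simply applies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ to a finite list of rewards.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # illustrative rewards, not taken from the thesis
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 3.0: all rewards count equally
print(discounted_return(rewards, 0.5))  # 1.75
```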

4.3 Markov Decision Process

The Markov Property states that the state $s$ the agent is in contains all the information needed to determine the next state $s'$ and its reward $r$. The agent can decide where to go next from this information alone. When the reward and transition probabilities depend only on the current state and action, and not on the previously visited states, the problem is said to have the Markov Property. It can thus be defined as:

\[
\begin{aligned}
& P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t) \\
&= P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t)
\end{aligned}
\]

This states that the probability of the reward $r$ and the next state $s'$ given the complete history is equal to the probability given only the current state and action, which is exactly what the Markov Property defines.

A Reinforcement Learning task that has the Markov Property is called a Markov Decision Process. It consists of:

• Set of states $S$: $S_0, S_1, \dots, S_n$

• Set of actions $A$: $A_0, A_1, \dots, A_n$

• Transition function: $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$

• Reward function: $r(s, a, s') = E[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$

The Transition function T gives the probability of a state s′ given the current

state and action. The Reward function r gives the expected reward given

the current state, action and next state.

Applying this to the mountain car of Example 4.1.1, a first transition can be going from the start state, in which the car is standing still, with the action accelerate, to a next state in which the car is higher on the mountain. The reward scheme gives a negative reward at every time step until the goal is reached. Since the car is in the start state and the action is accelerate, the expected reward is $-1$, because the goal cannot yet have been reached after the first time step.

4.4 Value functions

A policy $\pi$ describes the behaviour of an agent: it determines which action the agent selects in a state at any given time. A good policy takes everything into consideration that helps to maximize the reward. Formally, a policy maps each state onto a probability of selecting each action, $\pi(a \mid s)$. The value of being in state $s$ and following the policy $\pi$ from there on is denoted as $V^\pi(s)$ and is called the state-value function for policy $\pi$. This can be formally written as:

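\[
\begin{aligned}
V^\pi(s) &= E_\pi[G_t \mid S_t = s] \\
         &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\middle|\, S_t = s\right]
\end{aligned}
\]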

This means that the value of a state is the expected return, with $E$ denoting the expectation, given the state the agent is currently in. Similarly, the action-value function for a policy $\pi$ can be defined, denoted as $Q^\pi(s, a)$. The action-value function gives the expected return when starting in state $s$, taking action $a$ and following the policy $\pi$ afterwards. The action-value function can thus be defined as:

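\[
\begin{aligned}
Q^\pi(s, a) &= E_\pi[G_t \mid S_t = s, A_t = a] \\
            &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\middle|\, S_t = s, A_t = a\right]
\end{aligned}
\]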

Value functions indicate whether going into a state is a good or a bad option with regard to the future. These value functions can only be learnt from experience, and the only way to get experience is to gather as much information as possible by traversing the environment.

The state-value function has a special property: there is a recursive relationship between the value of the current state and the values of the successor states reached by the actions taken. The following equation is called the Bellman equation for state values. It looks at the state $s$, the actions $a$ available in it and all the states $s'$ that can follow from taking action $a$. The same can be applied to state-action values.

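\[
\begin{aligned}
V^\pi(s) &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\middle|\, S_t = s\right] \\
         &= E_\pi\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+2+k} \,\middle|\, S_t = s\right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s'} T(s, a, s') \left[ r(s, a, s') + \gamma V^\pi(s') \right]
\end{aligned}
\]

\[
\begin{aligned}
Q^\pi(s, a) &= E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\middle|\, S_t = s, A_t = a\right] \\
            &= \sum_{s'} T(s, a, s') \left[ r(s, a, s') + \gamma V^\pi(s') \right]
\end{aligned}
\]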

The Bellman equation starts from a state and considers, for every possible action, the successor states with their expected reward. It then averages over all these possibilities, each weighted by its probability of occurring.
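As a minimal sketch of how the Bellman equation for state values can be used, the code below repeatedly applies it as an update rule (iterative policy evaluation) on a tiny, made-up Markov Decision Process. The two states, the two actions, the dictionaries `T`, `r` and `pi`, and the function name are all assumptions chosen for illustration; they do not come from the thesis.

```python
def policy_evaluation(states, actions, T, r, pi, gamma, sweeps=200):
    """Approximate V^pi by sweeping the Bellman equation over all states."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            # Sum over actions (weighted by the policy) and over successor
            # states (weighted by the transition probability).
            V[s] = sum(
                pi[s][a] * sum(
                    p * (r[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items()
                )
                for a in actions
            )
    return V


# Hypothetical two-state MDP: "s1" is absorbing and yields no reward.
states = ["s0", "s1"]
actions = ["go", "wait"]
T = {
    ("s0", "go"): {"s1": 1.0},
    ("s0", "wait"): {"s0": 1.0},
    ("s1", "go"): {"s1": 1.0},
    ("s1", "wait"): {"s1": 1.0},
}
r = {
    ("s0", "go", "s1"): 1.0,
    ("s0", "wait", "s0"): 0.0,
    ("s1", "go", "s1"): 0.0,
    ("s1", "wait", "s1"): 0.0,
}
pi = {s: {"go": 0.5, "wait": 0.5} for s in states}  # uniform random policy

V = policy_evaluation(states, actions, T, r, pi, gamma=0.5)
print(V)
```

For this example the equation can also be solved by hand: $V^\pi(s_0) = 0.5 \cdot 1 + 0.5 \cdot 0.5 \cdot V^\pi(s_0)$, which gives $V^\pi(s_0) = 2/3$.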

Multiple policies exist and thus also multiple state-value functions. To solve the task, the designer needs to find the optimal policy and the optimal state-value function. Policies can be ordered by their value functions:

\[ \pi \ge \pi' \quad \text{if and only if} \quad V^\pi(s) \ge V^{\pi'}(s) \quad \forall s \in S \]

A policy is better than or equal to another policy only when its state-value function is greater than or equal to that of the other policy in every state. There is always at least one policy that is better than or equal to all other policies; this is an optimal policy $\pi^*$, with the associated optimal state-value function $V^*$ and optimal action-value function $Q^*$. Note that an optimal policy is not necessarily unique, but the optimal action-value function is. These are defined as:

\[ V^*(s) = \max_\pi V^\pi(s) \quad \forall s \in S \]

\[ Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad \forall s \in S \text{ and } \forall a \in A \]

Example 4.4.1. Take the gridworld as an example, where an agent is put on the grid and needs to find the goal, which is indicated by a green square. The agent can only move right, left, up and down. In this example the agent receives +100 when moving into the goal. The optimal state values are shown in Table 3. The optimal policy $\pi^*$ follows a shortest path to the goal. Every optimal action is indicated by an arrow, and multiple arrows in a cell indicate that multiple optimal paths exist; this is shown in Table 4. To be sure that the agent finds the optimal policy, the agent must visit every possible state, which is feasible in the gridworld. When there are millions of states, however, this becomes impractical, and the value functions are approximated instead.

4.5 Action Selection

54   63   72   63   54
63   72   81   72   63
72   81   90   81   72
81   90  100   90   81
90  100    0  100   90

Table 3: $V^*(s)$    Table 4: $\pi^*(s)$

Table 5: Gridworld Example

The problem remains which action to select, and why. A naive way to select an action is to always select the action with the highest Q-value, $Q_t(A_t^*) = \max_a Q_t(a)$. This method always takes the action with the highest estimated value, which is called exploitation. It only uses what the agent has learnt so far and does not explore other options. A disadvantage of this greedy method is that the agent will never discover another path that yields more reward or is perhaps shorter. One way to force the greedy method to explore is to initialize the Q-values differently, for example optimistically, so that untried actions look attractive. A more explicit way of exploring while still exploiting is the ε-greedy method, which selects a random action with probability ε and the greedy action otherwise. There are again different ways to tune this method. The ε can be kept fixed over all episodes, but it can also be kept high at the beginning of learning, to force the agent to explore as much as possible, and be reduced after a certain time $t$, so that the agent shifts from exploration to exploitation. A disadvantage of the ε-greedy method is that, when it explores, it chooses among all actions with the same probability. This means that it is as likely to choose a very good action as an extremely bad one. Softmax action selection solves this by selecting each action with a probability that depends on its estimated Q-value:

\[ P(s, a) = \frac{e^{Q(s, a)/\tau}}{\sum_{i=0}^{n} e^{Q(s, a_i)/\tau}} \]

The parameter $\tau$, or temperature, determines how much the agent explores: the higher $\tau$ is, the more randomly the agent plays, and the closer $\tau$ is to 0, the more greedily it plays. As with ε, $\tau$ can be reduced over time.

Balancing the amount of exploration and exploitation is one of the important elements of learning. Always exploiting the same path is not a good idea, because the first good path found is not necessarily the best path overall. The same goes for exploration: always taking random actions will never yield a good result, even though extensive exploration lets the agent discover all possible paths. There is no current research that declares which action selection method is the best. Both ε-greedy and softmax are widely used in Reinforcement Learning today. In current research ε-greedy is used more often, simply because the ε parameter is easy to understand and set, while setting the τ parameter requires knowledge of the magnitude of the action values and of the exponential function.
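The two action selection methods can be sketched as follows. This is only an illustration under stated assumptions: the Q-values of one state are given as a plain list indexed by action, and the function names are hypothetical. The softmax variant subtracts the maximum Q-value before exponentiating, which does not change the probabilities but avoids overflow.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))


def softmax_probabilities(q_values, tau):
    """Probabilities proportional to exp(Q / tau) for each action."""
    m = max(q_values)  # shift for numerical stability; result is unchanged
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, tau, rng=random):
    """Sample an action according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 3.0]  # illustrative Q-values for three actions
print(epsilon_greedy(q, epsilon=0.1))
print(softmax_probabilities(q, tau=1.0))
print(softmax_action(q, tau=1.0))
```

With a very small τ almost all probability mass goes to the greedy action, and with a very large τ the probabilities become nearly uniform, which matches the description of the temperature above.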

4.6 Incrementing Q-values

Action selection methods need a value for each action that goes beyond the most recent reward. A simple way of estimating these values is to average all the rewards received for that action:

\[ Q_t(s, a) = \frac{R_1 + R_2 + R_3 + \dots + R_{K_a}}{K_a} \]

Here $K_a$ is the number of times the action $a$ was selected before time step $t$. When the agent has just started, $K_a$ is equal to 0, which makes $Q(s, a)$ undefined. Therefore the Q-values are always initialized to some number, for example 0. By the law of large numbers, $Q(s, a)$ converges to $Q^*(s, a)$ as $K_a \to \infty$. This method is called the sample-average method.

As said, this is a fairly naive way of computing these values. For the method to work, the computer needs to remember all received rewards in order to average them, and this memory requirement grows the longer the task lasts. The same goes for computation: each time an action is taken the computer needs to recalculate the entire average, and with thousands of rewards for just one action in one state this can overload a computer. One way to avoid this problem is to use incremental updates.

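\[
\begin{aligned}
Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k} R_i \\
        &= \frac{1}{k} \left( R_k + \sum_{i=1}^{k-1} R_i \right) \\
        &= \frac{1}{k} \left( R_k + (k - 1) Q_k + Q_k - Q_k \right) \\
        &= \frac{1}{k} \left( R_k + k Q_k - Q_k \right) \\
        &= Q_k + \frac{1}{k} \left[ R_k - Q_k \right]
\end{aligned}
\]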

The computer only needs to remember $Q_k$ and $k$, which makes the memory and computational load a lot smaller. This incremental update can be generalized to the following form:

\[ \text{Estimate}_{\text{new}} = \text{Estimate}_{\text{old}} + \text{stepsize} \, [\text{Target} - \text{Estimate}_{\text{old}}] \]

The difference between the Target and $\text{Estimate}_{\text{old}}$ can be seen as the error between the target and the current estimate of the value of an action. In Reinforcement Learning the step size is usually denoted by $\alpha$. When $\alpha$ is a constant with $0 < \alpha \le 1$, recent rewards are weighted more heavily than older rewards, and the estimate is called a weighted average.
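A short sketch of both update rules is given below. The function names and the example rewards are assumptions for illustration. The first function uses a step size of $1/k$, where $k$ counts the rewards seen so far, and therefore reproduces the plain average without storing the rewards; the second uses a constant step size α.

```python
def incremental_mean(rewards):
    """Sample average computed with the incremental update, step size 1/k."""
    q, k = 0.0, 0
    for r in rewards:
        k += 1
        q += (r - q) / k  # estimate += stepsize * (target - estimate)
    return q


def constant_step_average(rewards, alpha, q0=0.0):
    """Weighted average with a constant step size alpha, 0 < alpha <= 1."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


rewards = [2.0, 4.0, 6.0]  # illustrative rewards
print(incremental_mean(rewards))            # 4.0, equal to the plain average
print(constant_step_average(rewards, 0.5))  # 4.25, the last reward weighs most
```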

4.7 Monte Carlo & Dynamic Programming

Monte Carlo methods in Reinforcement Learning do not need full knowledge of the environment, only experience. They can even learn from simulated experience, for which only sample transitions need to be generated. Monte Carlo methods are based on averaging sample returns. This means that the averages can only be calculated once an episode is completed, which assumes that the task is episodic and that every episode terminates. Monte Carlo methods can also be used to mimic policy iteration. The first phase is Policy Evaluation, where, given a policy $\pi$, the goal is to compute $Q^\pi(s, a)$, or an approximation of it, for all state-action pairs. These values can be estimated by averaging the sampled returns. When run long enough, $Q$ will approximate $Q^\pi$. The next phase is Policy Improvement, where a greedy policy is computed with respect to $Q$: for each state $s$ the new policy selects the action $a$ that maximizes the state-action value. Monte Carlo methods are more complicated to use in non-episodic tasks, because the averaging is only done after an episode has finished. Monte Carlo estimates are unbiased, because they are averages of actual returns, but when the returns have high variance convergence is slower, because more samples are needed. Bootstrapping, a technique from Dynamic Programming, on the other hand updates after one single step. These updates are computed from other estimates, which makes bootstrapping a biased learner. In finite and discrete cases the estimates nevertheless converge to their true values.

\[ R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T \]

vs.

\[ R_t = r_{t+1} + \gamma V(S_{t+1}) \]

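These equations show the difference between a Monte Carlo method and bootstrapping. The first equation is the Monte Carlo target, which needs all rewards until the end of the episode. The second equation is the bootstrapping target, which only needs the next reward and builds on the current estimate of the value of the next state.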

4.8 Temporal Difference

Temporal Difference (TD) learning is a mix of Dynamic Programming and Monte Carlo methods. Like Monte Carlo methods, TD methods learn from experience, sampled by following some policy $\pi$, without any knowledge of the environment. Like Dynamic Programming, they update estimates based on other estimates. TD methods only need the next time step to update, while Monte Carlo methods need the whole episode before updating.

\[ V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] \]

vs.

\[ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \]

The first update is a Monte Carlo update, because it must wait for the value of $G_t$, which is only available after a whole episode. The second update is a TD update, because it can be performed after the next time step. It uses bootstrapping: the target is built from the observed reward and the current estimate of the value of the next state. This is a useful feature which lowers the computational load.
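The two update rules can be written as two small functions. This is a sketch under assumptions: the value function is stored in a dictionary `V` keyed by state, and the function names are hypothetical.

```python
def mc_update(V, state, G, alpha):
    """Monte Carlo update: needs the full return G of the episode."""
    V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD update: needs only the next reward and the next state's estimate."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


V = {"a": 0.0, "b": 1.0}  # illustrative value estimates
td0_update(V, "a", reward=1.0, next_state="b", alpha=0.5, gamma=0.5)
print(V["a"])  # 0.75
mc_update(V, "b", G=3.0, alpha=0.5)
print(V["b"])  # 2.0
```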

Before going into algorithms, a distinction needs to be made between two kinds of learning: on-policy and off-policy. An on-policy method improves the policy that the agent is currently following. An off-policy method learns the value of a policy independently of the actions the agent takes.

4.8.1 Q-Learning

An example of off-policy learning is Q-learning (C. J. C. H. Watkins, 1989). The Q-values approximate the optimal action-value function independently of the policy being followed, which makes it off-policy. Q-learning converges as long as all state-action pairs continue to be visited and updated (C. J. Watkins & Dayan, 1992).

Algorithm 1 Q-Learning
Initialize all Q(s, a) for s ∈ S, a ∈ A
Repeat (for every episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s′
        Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal

The algorithm works as follows. First all $Q(s, a)$ values are initialized arbitrarily. At the start of every episode the agent is placed in a start state $s$. At every step of that episode the agent chooses an action $a$ using a policy such as ε-greedy. The agent takes the action and observes a new state $s'$ and a reward for going from state $s$ to $s'$. Then the Q-value is updated with the following rule. The target is the reward the agent received plus the maximum Q-value of the next state, the estimate of the future reward, multiplied by the discount factor. The old Q-value is subtracted from this target, the difference is multiplied by the learning rate, and the result is added to the old Q-value. The agent then moves to the observed new state $s'$ and the iteration is repeated from there.
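A minimal tabular sketch of these steps is given below. The environment is a hypothetical corridor of five states invented for this illustration (it is not one of the environments used in the thesis): the agent starts in state 0, can move left or right, and receives a reward of 1 when it reaches the terminal state 4. Ties between equal Q-values are broken at random, which is an added assumption so that the untrained agent does not get stuck.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(s, a):
    """Hypothetical corridor dynamics: returns (reward, next_state)."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return r, s2


def choose(Q, s, epsilon, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # initialize all Q(s, a)
    for _ in range(episodes):
        s = 0                                   # initialize s
        while s != N_STATES - 1:                # until s is terminal
            a = choose(Q, s, epsilon, rng)
            r, s2 = step(s, a)
            # Off-policy target: maximum over the actions in the next state.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q


Q = q_learning()
greedy_policy = [Q[s].index(max(Q[s])) for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this corridor the learnt greedy policy should choose "right" in every non-terminal state, since that is the only way to reach the rewarding terminal state.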

47

4.8.2 SARSA

SARSA, originally named modified Q-learning (Rummery & Niranjan, 1994) and renamed to SARSA by Sutton (1996), is an on-policy method. The name stands for State Action Reward State Action and comes from the sequence the agent goes through: it is in state $s_1$, chooses action $a_1$ and receives reward $r$; it then arrives in state $s_2$ and chooses its next action $a_2$.

Algorithm 2 SARSA
Initialize all Q(s, a) for s ∈ S, a ∈ A
Repeat (for every episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]
        s ← s′; a ← a′
    until s is terminal


The SARSA algorithm starts in the same way as Q-learning, with all Q-values initialized arbitrarily. Then for every episode a start state $s$ is chosen and immediately the first action $a$ is derived from a policy such as ε-greedy. The agent then enters a loop that runs until the state $s$ is terminal. The agent takes the action $a$ and observes the reward $r$ and the new state $s'$. From this new state it chooses a new action $a'$, again derived from a policy such as ε-greedy. The Q-value is then updated: the target is the reward received plus the Q-value of the new state-action pair multiplied by the discount factor. The old state-action value is subtracted from this target, the result is multiplied by the learning rate and then added to the old state-action value. The agent then moves to the new state $s'$ and uses the new action $a'$.
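The difference between the on-policy SARSA update and the off-policy Q-learning update comes down to a single term in the target, as the sketch below shows. The function names and the nested-list Q-table are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: the target uses the action a2 actually chosen in s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])


def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy: the target uses the best action in s2, whatever is taken."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])


Q = [[0.0, 0.0], [1.0, 2.0]]  # illustrative table: 2 states, 2 actions
sarsa_update(Q, s=0, a=0, r=1.0, s2=1, a2=0, alpha=0.5, gamma=0.5)
q_learning_update(Q, s=0, a=1, r=1.0, s2=1, alpha=0.5, gamma=0.5)
print(Q[0])  # [0.75, 1.0]
```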

4.9 Eligibility traces

TD methods use the current reward together with the estimated value of the next state, while Monte Carlo methods use the actual rewards, but only after the episode is finished. There is also a method in between, the n-step method (C. J. C. H. Watkins, 1989), in which the number of steps (or backups) taken before using the estimated value can be chosen.

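\[
\begin{aligned}
G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\
G_t^{(3)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) \\
G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
\end{aligned}
\]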

The first equation is simply one-step bootstrapping. The second equation is called the 2-step method, the third the 3-step method, and so on. When $n$ is at least the number of steps remaining in the episode, the target no longer contains an estimate but only actual rewards, which means it is the Monte Carlo method.
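The n-step return can be computed with a few lines of code. In this sketch the list layout is an assumption: `rewards[k]` holds $R_{k+1}$, `values[k]` holds the current estimate $V(S_k)$, and `values` has one entry more than `rewards` because the terminal state is given the value 0.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for one recorded episode."""
    T = len(rewards)
    steps = min(n, T - t)  # never look past the end of the episode
    g = sum(gamma ** i * rewards[t + i] for i in range(steps))
    # Bootstrap from the estimate of the state reached after `steps` steps.
    return g + gamma ** steps * values[t + steps]


rewards = [1.0, 2.0, 3.0]         # illustrative episode of three steps
values = [10.0, 20.0, 30.0, 0.0]  # last entry: terminal state, value 0
print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))  # 11.0
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))  # 9.5
print(n_step_return(rewards, values, t=0, n=3, gamma=0.5))  # 2.75
```

For $n = 3$ the whole episode is covered, so the result equals the Monte Carlo return and no longer depends on the value estimates.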

With eligibility traces each state receives an extra variable, $e$, called the eligibility trace. When the agent enters a state $s$, the eligibility trace of that state is incremented, while the traces of all other states decay.

\[
e_t(s) =
\begin{cases}
\gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\
\gamma \lambda e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}
\]

Figure 28 shows what happens to the trace of a state. Every time the state is visited its eligibility trace is incremented, and when the agent does not visit the state, the trace automatically decays. The decay is controlled by the decay parameter, denoted as $\lambda$. As a result, when learning happens, some states are affected more than others, namely those that were visited frequently or recently. When $\lambda = 0$, only one-step bootstrapping happens, because only the trace of the current state is non-zero and all other traces are zero. When $\lambda = 1$, the method mimics the Monte Carlo methods.

Figure 28: Eligibility trace; image from (Sutton & Barto, 1998)


Eligibility traces can also be applied to SARSA, which is then called SARSA(λ). The idea of the original SARSA remains the same and the method is still on-policy. The only difference is that the state-action values are now updated using their eligibility trace and a TD error, which is:

\[ \delta = r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \]

The TD error can also be calculated for $Q(s, a)$ values instead of $V(s)$.

Algorithm 3 SARSA(λ)
Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
Repeat (for every episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    until s is terminal


The same can be applied to Q-learning, which gives Q(λ), with one adaptation: the traces are only kept as long as the agent follows the greedy action. When a random or otherwise non-greedy action is selected, the eligibility traces are reset to zero.

Algorithm 4 Q-Learning(λ)
Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
Repeat (for every episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        a∗ ← argmax_b Q(s′, b) (if a′ ties for the max, then a∗ ← a′)
        δ ← r + γQ(s′, a∗) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            If a′ = a∗
                then e(s, a) ← γλe(s, a)
                else e(s, a) ← 0
        s ← s′; a ← a′
    until s is terminal


Sometimes better performance can be obtained by using replacing traces instead of the standard accumulating traces, where:

\[
e_t(s) =
\begin{cases}
\gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\
1 & \text{if } s = s_t
\end{cases}
\]

Figure 29: Replacing traces; image from (Sutton & Barto, 1998)
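Both kinds of traces can be sketched with one small function. The dictionary layout and the function name are assumptions for illustration: every trace first decays by γλ, after which the trace of the visited state is either incremented (standard trace) or set to 1 (replacing trace).

```python
def update_traces(traces, visited, gamma, lam, replacing=False):
    """Decay all eligibility traces, then mark the visited state."""
    for s in traces:
        traces[s] *= gamma * lam
    if replacing:
        traces[visited] = 1.0   # replacing trace: reset to 1
    else:
        traces[visited] += 1.0  # standard trace: increment by 1
    return traces


# Standard traces: visiting "a" twice lets its trace grow above 1.
standard = {"a": 0.0, "b": 0.0}
for state in ["a", "a", "b"]:
    update_traces(standard, state, gamma=1.0, lam=0.5)
print(standard)  # {'a': 0.75, 'b': 1.0}

# Replacing traces: the trace of a visited state never exceeds 1.
replaced = {"a": 0.0, "b": 0.0}
for state in ["a", "a"]:
    update_traces(replaced, state, gamma=1.0, lam=0.5, replacing=True)
print(replaced)  # {'a': 1.0, 'b': 0.0}
```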

4.10 Function approximation

So far it was assumed that all Q-values are stored in a table, in which each $Q(s, a)$ pair has its own value. This is feasible when the number of states and actions is small. With millions of state-action pairs it would require a lot of memory, and also a lot of time and data to compute all values accurately. Compare for example the state space of backgammon, $10^{20}$ states, with the state space of a robotic helicopter. The robotic helicopter cannot map the whole world into a table and thus has a continuous state space. The solution to this problem is to generalize from previously visited states to the complete set of states, including states that have not yet been visited. This generalization is called function approximation: it takes samples of the value function and generalizes from them in order to construct an approximation of the whole function. From now on the value functions are approximated and parametrized by a weight vector $\mathbf{w} \in \mathbb{R}^n$:

\[ v(s, \mathbf{w}) \approx v_\pi(s) \]

\[ q(s, a, \mathbf{w}) \approx q_\pi(s, a) \]

The function $v(s, \mathbf{w})$ can be computed by a linear combination of features, by a neural network, where $\mathbf{w}$ are the weights, or by a decision tree, where $\mathbf{w}$ are the split points and leaf values.

Such a function approximation can be learnt by gradient descent. Let $\mathbf{w} = (w_1, w_2, \dots, w_n)^T$ and let $v(s, \mathbf{w})$ be differentiable with respect to $\mathbf{w}$. The error to be minimized is denoted as $J(\mathbf{w})$, here the squared difference between the true value and the approximation. At each time step the agent observes a state $S_t$ and its true value under the policy, $v_\pi(S_t)$. With those values the gradient of the error can be calculated, and the weights are moved a small step in the direction that reduces the error most, towards a local minimum:

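\[
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla_{\mathbf{w}_t} \left[ v_\pi(S_t) - v(S_t, \mathbf{w}_t) \right]^2 \\
                 &= \mathbf{w}_t + \alpha \left[ v_\pi(S_t) - v(S_t, \mathbf{w}_t) \right] \nabla_{\mathbf{w}_t} v(S_t, \mathbf{w}_t)
\end{aligned}
\]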

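where $\alpha$ is the step size and $\nabla_{\mathbf{w}_t} J(\mathbf{w}_t)$ is the vector of partial derivatives, defined as:

\[ \nabla_{\mathbf{w}_t} J(\mathbf{w}_t) = \left( \frac{\partial J(\mathbf{w}_t)}{\partial w_{t,1}}, \frac{\partial J(\mathbf{w}_t)}{\partial w_{t,2}}, \dots, \frac{\partial J(\mathbf{w}_t)}{\partial w_{t,n}} \right)^T \]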

The goal is to find a local minimum of the error by repeatedly updating the weights in this way. A value function can also be represented by a linear combination of state features $\mathbf{x}(s)$ and weights $\mathbf{w}$. This can be written as:

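\[ v(s, \mathbf{w}) = \mathbf{w}^T \mathbf{x}(s) = \sum_{i=1}^{n} w_i x_i(s) \]

where each state has a vector of features $\mathbf{x}(s) = (x_1(s), x_2(s), \dots, x_n(s))^T$ with the same number of weights. The gradient with respect to $\mathbf{w}$ is then:

\[ \nabla_{\mathbf{w}} v(s, \mathbf{w}) = \mathbf{x}(s) \]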
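For the linear case the gradient descent update becomes very simple, because the gradient is just the feature vector. The sketch below assumes that weights and features are plain Python lists and that a target value for the state is available; the function names and numbers are illustrative only.

```python
def linear_value(w, x):
    """v(s, w) = w^T x(s) for a feature vector x = x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_step(w, x, target, alpha):
    """One gradient descent step: w <- w + alpha * (target - v) * x."""
    error = target - linear_value(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]  # initial weights
x = [1.0, 2.0]  # illustrative feature vector of one state
w = gradient_step(w, x, target=5.0, alpha=0.1)
print(w)                   # approximately [0.5, 1.0]
print(linear_value(w, x))  # approximately 2.5, closer to the target than 0.0
```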

These features can be constructed with different methods. One example of such a method is Coarse Coding, where the state space is continuous, in this example a two-dimensional space (Figure 30). Each feature in this example indicates whether the state lies inside a certain circle or not: the feature is 1 if the state is inside the circle and 0 if it is not. The features can overlap, because the state can be inside multiple circles at once. Gradient descent updates the weights of all the circles the state is in. An update therefore affects the approximate value of every point in the union of those circles, with a greater effect on points that have more circles in common with the state.

Figure 30: Coarse coding; image from (Sutton & Barto, 1998)
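A minimal sketch of coarse coding in two dimensions is shown below. The circles, given as hypothetical `(center_x, center_y, radius)` tuples, and the function name are assumptions for illustration.

```python
def coarse_features(point, circles):
    """Binary feature vector: 1.0 for each circle that contains the point."""
    px, py = point
    return [
        1.0 if (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2 else 0.0
        for cx, cy, radius in circles
    ]


# Three illustrative circles; the first two overlap.
circles = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (5.0, 5.0, 1.0)]
print(coarse_features((0.5, 0.0), circles))  # [1.0, 1.0, 0.0]
print(coarse_features((5.0, 5.5), circles))  # [0.0, 0.0, 1.0]
```

The resulting vector can be used directly as the feature vector $\mathbf{x}(s)$ of a linear value function.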


Chapter 5 Experiments and results

5.1 ALE

The environment that this thesis is based on is the Arcade Learning Environment (M. G. Bellemare, Naddaf, Veness, & Bowling, 2013), abbreviated ALE. It allows anyone to write AI agents that interact with Atari 2600 games. ALE is built on top of Stella (http://stella.sourceforge.net), an open-source Atari emulator. ALE enables interaction with the Stella emulator, which permits the user to gather all sorts of data, like RAM and frame states, while the game is playing, and to send data, like actions, to the game.

The Atari 2600 console was released in 1977. The hardware of the console is rather simple compared to consoles today: it has a CPU running at 1.19 MHz and 128 bytes of RAM. Games only had a screen of 160 pixels wide and 210 pixels high, with a maximum of 128 colors. The screen thus has 33600 pixels in total. The ALE system allows an agent to observe the current game screen and/or the RAM state of the console. The advantage of frames is that they are human-interpretable (Figure 31b). Unfortunately, frames provide an agent with only partial information, as a single frame does not provide information about the movement of objects. The RAM is not human-interpretable, but it contains more information and even holds the complete state of the game (Figure 31a). The console has a joystick with 18 different possible actions, but not all of them are used in every game. Because the console hardware is not powerful, it can easily be emulated. This makes it an excellent testbed for AI agents, because of the small frames, the small RAM and the limited number of possible actions. This is in contrast to current games, which have millions of pixels and multiple gigabytes of RAM. It does not mean that Atari 2600 games can easily be learnt. Take for example a game in which only 4 actions are valid: when the game is running at 60 frames per second, looking only one second ahead already means searching through $4^{60}$ different action sequences.

Figure 31: The difference between RAM and Frames. (a) RAM (b) Frames

5.2 Space Invaders

The game of Space Invaders (Figure 32) was chosen as a test bed for combining Reinforcement Learning with autoencoders. Space Invaders is one of the most used games as a test bed for RL agents (Mnih et al., 2013; M. G. Bellemare et al., 2013). It is known that Reinforcement Learning agents can beat a human-level player (Mnih et al., 2015).

Space Invaders, designed by Tomohiro Nishikado, was first released in 1978, and many different adaptations have appeared since then. The player controls a space ship and can fire missiles. The goal of the game is to hit all rows of aliens and go to the next level. The player can hide behind shields to protect himself from the lasers coming from the aliens. The player can only move left, move right, shoot and do nothing. When a player misses his shot, he must wait until the missile is off the screen before he can fire the next one. Once all rows of aliens are cleared the game goes to the next level, where the aliens move more quickly. The Command Alien Ship appears at random moments and yields more points when shot than the basic alien ships. When the aliens come too close to the shields, the shields disappear, and when the aliens eventually come too close to the player's ship, the game ends and restarts. The player has a total of 3 lives before the game starts from scratch. The player only receives a reward when hitting an alien spaceship.

Figure 32: Space Invaders screen

5.3 Reconstruction

When using autoencoders for feature extraction and dimensionality reduction, it is essential that they are trained properly and that the autoencoders in question can reconstruct their input from their hidden layers. The Mean Square Error shows how far the reconstruction of an autoencoder is from the input values.

\[ \mathrm{MSE}(\vec{x}, \vec{y}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 \]

where $\vec{x}$ is the input, the original RAM state, $\vec{y}$ is the autoencoder's reconstruction of $\vec{x}$, and $n$ is the number of elements in $\vec{x}$. The input values are gathered by running an agent with SARSA(λ) and saving all RAM states it encounters. The agent plays a total of 3000 episodes, each consisting of a number of steps that is not known in advance: an episode ends when the agent has died three times in the game. At each step the agent receives a RAM state, which is then saved.
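The Mean Square Error between an input and its reconstruction can be computed as in the sketch below. The function name and the example vectors are assumptions for illustration; how the RAM values are scaled before training is not specified here.

```python
def mse(x, y):
    """Mean Square Error between an input x and its reconstruction y."""
    if len(x) != len(y):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


original = [0.0, 0.5, 1.0, 1.0]        # illustrative input vector
reconstruction = [0.0, 0.5, 1.0, 0.0]  # one element reconstructed wrongly
print(mse(original, original))         # 0.0: perfect reconstruction
print(mse(original, reconstruction))   # 0.25
```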

The dots in Figure 33 show the MSE of autoencoders trained on the 128-byte RAM state. Autoencoders can be trained in two ways, directly and indirectly. The direct way goes from the input dimension of 128 to a specified number of hidden nodes and back to 128 output nodes, so a direct autoencoder has only one hidden layer. Figure 33 shows the results for 128 → number of nodes → 128, where each arrow denotes the interconnection between two layers. When training autoencoders with different numbers of hidden nodes, the number of hidden nodes is halved from one configuration to the next. As can be seen, the lower the number of hidden nodes, with the lowest being 128 → 1 → 128 and the highest 128 → 128 → 128, the higher the MSE. This is to be expected: one hidden node cannot perform as well as 128 hidden nodes, because too much information is lost when going from a high number of dimensions to a very low one. It can nevertheless be argued that an error of 0.086 for one hidden node in one hidden layer is not that high.

One way to counteract the loss caused by going from a large dimension immediately to a much lower one is to add multiple layers. The red dots show the MSE when going to a lower dimension through intermediate layers. For example, going from 128 bytes to 1 node becomes:

128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 → 2 → 4 → 8 → 16 → 32 → 64 → 128

This also means that the training time of an autoencoder with multiple hidden layers is higher than that of a direct autoencoder. However, it can be seen that with multiple hidden layers the autoencoder can achieve a lower MSE than a direct autoencoder. Note that the indirect autoencoder 128 → 64 → 128 is omitted since it does not use multiple layers.
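The two architectures can be made concrete with a small helper that lists the layer sizes of a direct and an indirect autoencoder. This is only a sketch of the halving scheme described above; the function name and its arguments are hypothetical.

```python
def layer_sizes(input_dim, bottleneck, indirect):
    """Return the layer sizes of an autoencoder from input to output.

    direct:   input -> bottleneck -> input
    indirect: the size is halved layer by layer down to the bottleneck
              and then mirrored back up to the input size.
    """
    if bottleneck < 1:
        raise ValueError("bottleneck must be at least 1")
    if not indirect:
        return [input_dim, bottleneck, input_dim]
    encoder = [input_dim]
    while encoder[-1] > bottleneck:
        encoder.append(encoder[-1] // 2)
    if encoder[-1] != bottleneck:
        raise ValueError("bottleneck must be reachable by repeated halving")
    # Mirror the encoder (without repeating the bottleneck) to get the decoder.
    return encoder + encoder[-2::-1]


direct = layer_sizes(128, 1, indirect=False)
indirect = layer_sizes(128, 1, indirect=True)
```

For a bottleneck of 64 both variants give the same three layers, which is why that indirect configuration is left out of the comparison.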

Figure 33: Mean squared error of autoencoders trained from an input layer of 128 bytes to a smaller layer, directly and indirectly. [Plot "Trained autoencoder from 128 to another layer": x-axis layer size (1 to 128), y-axis MSE (0.00 to 0.10); series: Directly, Indirectly.]

When experimenting with RAM states and autoencoders, it is also a good idea to train the autoencoders not only on the byte form but also on the bit form, that is, with an input of 1024 bits instead of 128 bytes (Figure 34). The blue dots show the MSE for 1024 → number of nodes → 1024 and the red dots show the MSE with multiple hidden layers. The same conclusion can be drawn as for the autoencoders with 128 bytes: the smaller the layer an autoencoder compresses to, the more information is lost, and this can be compensated for by using multiple layers. Comparing the 128-byte and 1024-bit input layers, it can be seen that the 128-byte setting performs better at reconstructing the input, that is, at going to a lower dimension and then back to the original dimension.

Figure 34: Mean squared error of autoencoders trained from an input layer of 1024 bits to a smaller layer, directly and indirectly. [Plot "Trained autoencoder from 1024 to another layer": x-axis layer size (8 to 1024), y-axis MSE (0.00 to 0.10); series: Directly, Indirectly.]

5.4 Flow of experiments

All experiments follow the same phases but with different settings. The first phase is the preparation phase, in which SARSA(λ) with the manual features is run for 3000 episodes and all RAM states are captured. The second phase is the preprocessing phase, in which the autoencoder is trained. The settings of the autoencoder must be specified: the number of layers, the number of hidden nodes, and so on. The number of epochs is set to 15; this is how many times all training examples are passed through the autoencoder, so one epoch is one training cycle. The batch size, the number of training examples passed through before the weights are updated, is set to the number of input dimensions. So if the input dimension is 1024, from the 1024-bit RAM, the batch size is set to 1024. Additionally, the loss function and the activation function can be specified. After the autoencoder is trained, the last phase starts. The agent receives a RAM state, which is passed through the trained autoencoder. Depending on the experiment, a specified layer is extracted and used as the features. The agent uses these features to learn.
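The preprocessing and extraction phases can be sketched with a very small autoencoder written in plain NumPy. This is an illustrative sketch under several assumptions, not the implementation used for the experiments: it has a single sigmoid hidden layer, uses the MSE loss with plain mini-batch gradient descent, and the learning rate, weight initialization and random example data are hypothetical. Only the 15 epochs and the batch size equal to the input dimension follow the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(data, n_hidden, n_epochs=15, batch_size=None, lr=0.5, seed=0):
    """Train an input -> n_hidden -> input sigmoid autoencoder on the MSE loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_inputs = data.shape
    batch_size = batch_size or n_inputs  # batch size = input dimension, as in the text
    w_enc = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
    b_dec = np.zeros(n_inputs)
    for _ in range(n_epochs):  # one epoch = one pass over all training examples
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            x = data[order[start:start + batch_size]]
            h = sigmoid(x @ w_enc + b_enc)  # encoding
            y = sigmoid(h @ w_dec + b_dec)  # reconstruction
            # Gradient of the mean squared error through both sigmoid layers.
            d_y = 2.0 * (y - x) / x.size * y * (1.0 - y)
            d_h = (d_y @ w_dec.T) * h * (1.0 - h)
            w_dec -= lr * (h.T @ d_y)
            b_dec -= lr * d_y.sum(axis=0)
            w_enc -= lr * (x.T @ d_h)
            b_enc -= lr * d_h.sum(axis=0)
    return w_enc, b_enc, w_dec, b_dec


def extract_features(state, w_enc, b_enc):
    """Last phase: pass a state through the encoder and use the hidden layer."""
    return sigmoid(state @ w_enc + b_enc)


# Hypothetical stand-in for captured RAM states: 64 random 16-bit vectors.
rng = np.random.default_rng(1)
states = rng.integers(0, 2, size=(64, 16)).astype(float)
w_enc, b_enc, w_dec, b_dec = train_autoencoder(states, n_hidden=8)
features = extract_features(states[0], w_enc, b_enc)
```

The agent would then receive `features` instead of the raw state. With a sigmoid hidden layer every feature lies strictly between 0 and 1.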

5.5 Manual features and basic RAM

Naddaf (2010) and M. G. Bellemare et al. (2013) perform manual feature extraction by concatenating the original RAM state with the pairwise logical AND of every possible pair. Figure 35 shows the difference between the two feature sets. It also shows the performance of a random agent, which chooses a random action regardless of the features presented. The x-axis denotes the number of episodes played and the y-axis the rewards ALE returns for the chosen actions. As can be seen, the RAM states with the pairwise AND perform better than the basic RAM states. This pairwise AND feature construction is done manually: the designer of the algorithm must implement the pairwise algorithm, and many experiments are needed before he can decide that the pairwise AND performs better than the basic RAM states. The aim of this thesis is to skip this search for good features and let the autoencoders handle the feature extraction. From now on, the RAM concatenated with the pairwise AND is referred to as the manual features and the standalone RAM as the basic RAM.
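The manual feature construction can be sketched as follows. The sketch assumes that the RAM state is given as a vector of bits and that "every possible pair" means every pair of distinct positions (i < j); both are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np


def ram_with_pairwise_and(bits):
    """Concatenate a bit vector with the logical AND of every pair (i < j)."""
    bits = np.asarray(bits, dtype=np.uint8)
    i, j = np.triu_indices(bits.size, k=1)  # all index pairs with i < j
    return np.concatenate([bits, bits[i] & bits[j]])


# Hypothetical 4-bit state; a real RAM state would have 1024 bits.
bits = np.array([1, 0, 1, 1], dtype=np.uint8)
features = ram_with_pairwise_and(bits)
```

Under these assumptions a state of n bits yields n + n(n − 1)/2 features, so the feature vector grows quadratically with the size of the state.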

Figure 35: The difference between RAM combined with the pairwise logical AND and RAM alone. [Plot "Difference between RAM & RAM + AND": x-axis episodes (0 to 3000), y-axis rewards (0 to 450); series: RAM + pairwise AND, Random, RAM.]

5.6 Difference between bits and bytes

When working with RAM states we can choose how to represent the RAM state: as bytes or as bits. Note that bytes are normalized by dividing them by 255, so that they lie in the range [0, 1]. With normalized input values, convergence is usually faster than with unnormalized data (Y. A. LeCun et al., 2012).
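Both representations can be derived from the same byte array, as in the sketch below. The function names and the four-byte example are hypothetical, and the bit order within a byte (most significant bit first, the NumPy default) is an assumption.

```python
import numpy as np


def ram_as_normalized_bytes(ram):
    """Scale each byte from [0, 255] to [0, 1]."""
    return np.asarray(ram, dtype=np.uint8).astype(np.float64) / 255.0


def ram_as_bits(ram):
    """Unpack each byte into 8 bits, most significant bit first."""
    return np.unpackbits(np.asarray(ram, dtype=np.uint8)).astype(np.float64)


# Hypothetical 4-byte state; the Atari 2600 RAM has 128 bytes (1024 bits).
ram = np.array([0, 255, 128, 1], dtype=np.uint8)
as_bytes = ram_as_normalized_bytes(ram)
as_bits = ram_as_bits(ram)
```

The bit form has eight times as many inputs, each of which is exactly 0 or 1, while the byte form packs the same information into fewer, finer-grained values.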

Figure 36 shows the results when the input values are the normalized bytes and the hidden layer has 128 nodes. With as many hidden nodes as input values, the autoencoder simulates the identity function. As can be seen, it does not translate the RAM bytes into a good feature vector. There is a wide range of possible reasons why the bytes do not yield good features. For example, the batch size may have been too low or too high, a denoising autoencoder might have helped, or different activation functions could have been combined across multiple layers with the same number of hidden nodes. Of course, with enough time and effort spent on tuning all the hyperparameters we would eventually get a better result. This is not the goal of this thesis: we want to find an autoencoder that is as simple as possible, without too much tweaking, and that still finds a good feature vector. Another possible explanation is that the extracted feature vector simply does not give the agent enough information, and that valuable information that was available in the basic RAM has been lost. The agent still learns better than random play, but is not as good as with the manual features.

Figure 36: Autoencoders on 128 bytes. [Plot "Autoencoders trained from 128 -> 128": x-axis episodes (0 to 3000), y-axis rewards (50 to 400); series: 128->128 Lin, 128->128 Sig, 128->128 Rel, Manual features, Random.]

Figure 37 shows the results of an autoencoder whose input is the RAM state represented in bits. The same autoencoder was used as for the bytes, with exactly the same settings. As can be seen, in contrast with the 128-byte autoencoder, the agent could use all the extra information available and could actually learn from the extracted feature vector. Given this result, the rest of this thesis investigates the bit version of the RAM states.

Figure 37: Autoencoders on 1024 bits. [Plot "Autoencoders trained from 1024 -> 1024": x-axis episodes (0 to 3000), y-axis rewards (0 to 400); series: 1024->1024 Lin, 1024->1024 Rel, 1024->1024 Sig, Manual features, Random.]

5.7 Comparing different activation functions

As said previously, choosing the right activation function can help in obtaining better results. Table 6 compares autoencoders that use different activation functions. For a more visual representation, see Appendix A, Figures 47, 48 and 49. The table shows the average reward over the last 1000 episodes together with the standard deviation. Note that when an activation function is set, all layers use the same activation. It is also possible to use different activation functions in different layers, but this was not investigated. Each activation function has been tested with an autoencoder going from the 1024-dimensional input to a chosen bottleneck and back to the original input size. Note that each layer is half the size of the previous one. An autoencoder denoted as 1024 → 256 therefore uses three hidden layers: it encodes from 1024 → 512 to the bottleneck of 256 and decodes back through 512 → 1024. As can be seen, the linear activation function performs best when encoding the original 1024-dimensional state into an encoded version of 512. Going deeper with the linear activation function yields, in this case, NaN values, because a linear activation has no upper limit and its outputs can keep growing. This is in contrast with the sigmoid function, which is bounded to [0, 1], and ReLU, which forces neurons to be approximately 50% active. Note that an autoencoder with a linear activation function is nearly equivalent to PCA, Principal Component Analysis. PCA is a linear technique that can be used for dimensionality reduction by finding the principal components, the directions in which the data is most spread out and has the largest variance. Linear autoencoders can only return a linear encoding because the activation is also linear; therefore we will focus our research on non-linear activation functions.
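The difference in output range between the three activation functions can be seen in a few lines. The definitions below are the standard ones; the example input values are hypothetical.

```python
import numpy as np


def linear(z):
    """Identity activation: the output is unbounded in both directions."""
    return z


def sigmoid(z):
    """Logistic activation: the output always lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Rectified linear unit: negative inputs become 0, positive ones pass through."""
    return np.maximum(0.0, z)


z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
lin_out, sig_out, relu_out = linear(z), sigmoid(z), relu(z)
```

Only the sigmoid keeps large inputs within a fixed range, which is consistent with the observation that the deeper linear autoencoders produced NaN values.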

As can be seen, the ReLU activation does not perform well compared to the other activation functions. Sigmoid performs well with a hidden layer of 1024 nodes, as does the linear activation. To confirm such differences statistically we used the Mann-Whitney U test, which does not assume that the data is normally distributed. The first test was between the Manual features and the Basic features and results in a p-value of 9.63357008643e-07. Since this p-value is smaller than 0.05, we can conclude at the 5% significance level that there is a difference between the Manual features and the Basic features, which is exactly what can be seen in Figure 35.
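For readers unfamiliar with the test, the sketch below computes a two-sided Mann-Whitney U test with the normal approximation. It is a simplified illustration that assumes no tied values, applies no continuity correction and uses small synthetic samples; it is not the routine used to obtain the p-values in this chapter, for which a statistics library would normally be used.

```python
import math

import numpy as np


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no ties assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    combined = np.concatenate([a, b])
    # Rank all observations together: the smallest value gets rank 1.
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined)] = np.arange(1, n1 + n2 + 1)
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u1, p_value


# Synthetic example data, not results from the experiments.
separated = mann_whitney_u(np.arange(1, 11), np.arange(11, 21))
interleaved = mann_whitney_u(np.arange(1, 20, 2), np.arange(2, 21, 2))
```

Two samples that do not overlap at all give a p-value well below 0.05, while two interleaved samples give a large p-value, so no difference can be concluded for them.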

                  Linear             Sigmoid            ReLU
1024 → 1024       323.43 (± 47.11)   325.01 (± 43.96)   288.85 (± 44.41)
1024 → 512        323.83 (± 45.44)   290.53 (± 38.64)   230.74 (± 39.11)
1024 → 256        NA                 250.09 (± 35.35)   267.08 (± 43.84)
1024 → 128        NA                 250.9 (± 41.42)    191.86 (± 30.04)
1024 → 64         NA                 152.75 (± 23.83)   116.1 (± 25.64)

Manual features   330.87 (± 35.26)
Basic             301.92 (± 36.39)

Table 6: Comparing different activation functions against the number of hidden layers and nodes

To show statistically that there is a difference between the manual features and the encoded features, we test the manual features against the different activation functions for 1024 → 1024, 1024 → 512 and 1024 → 256 (Table 7).

                  Linear             Sigmoid             ReLU
1024 → 1024       0.0322843539108    0.00798380208929    1.6703515625e-06
1024 → 512        0.293818666313     6.30184822139e-08   1.6703515625e-06
1024 → 256        NA                 5.73303143758e-07   2.99746184625e-06

Table 7: P-values of the Mann-Whitney U test

As can be seen, all p-values but one are lower than 0.05, which means that at the 5% significance level these autoencoder features differ from the manual features. This does not say whether they are better or worse features. The exception is the autoencoder with a linear activation function and 1024 → 512, for which we cannot conclude that the results differ.

5.8 Initializing Q-values

When designing a SARSA agent it is important to choose good initial Q-values. The initialization of the Q-values influences the speed of learning and the efficiency of the algorithm (Koenig & Simmons, 1996). When the agent is put in a setting such as the grid world, it needs to find the goal before it can even search for a good policy. One way to do this is to let the agent explore the whole world; while exploring, it adapts the Q-values so that they record whether taking a certain action in a certain state is good or not. If we have some prior knowledge we can also set the Q-values via some rule. For example, if we know the goal of the setting, we can set the Q-values to a higher or lower value to reduce the amount of exploration. For example:

\[
Q(s, a) =
\begin{cases}
0 & \text{if } s \in G,\ a \in A \\
q & \text{if } s \in S \setminus G,\ a \in A
\end{cases}
\]

where Q(s, a) is set to zero when the state is a goal state and to some value q when the state is not a goal state. The agent then starts learning from these given Q-values, which lets it learn in a more optimistic way; as a result the learning time and the amount of exploration are lower than when everything is initialized to the same number.
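The initialization rule can be written down directly for a small tabular setting. The sketch below uses a hypothetical 2 × 2 grid world with a single goal state; the function name, the state and action encodings and the value q = 1 are illustrative assumptions.

```python
def initial_q_values(states, actions, goal_states, q):
    """Q(s, a) = 0 for goal states and q for every other state."""
    return {
        (s, a): (0.0 if s in goal_states else q)
        for s in states
        for a in actions
    }


# Hypothetical 2 x 2 grid world with the goal in the bottom-right corner.
states = [(row, col) for row in range(2) for col in range(2)]
actions = ["up", "down", "left", "right"]
q_table = initial_q_values(states, actions, goal_states={(1, 1)}, q=1.0)
```

Such a table only makes sense when the goal states can be identified, which requires interpretable states.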


Unfortunately Space Invaders is a never-ending game, so there is no goal state whose value could be set differently. Even if a goal state were known, we could not initialize it differently from other states, because the features are a black box and do not mean anything to a human. What we can do is initialize all Q-values to some other number and see how learning evolves and whether the agent can learn more optimistically. In all previous graphs and tables the Q-values were initialized to zero. These experiments were run with the sigmoid activation, so we know the encoded values lie in [0, 1]. Averaging all Q-values over the last 500 episodes of our best autoencoders gives a value of approximately 0.57, so initializing the Q-values to −1 or 1 should affect the speed of learning. Figure 38 shows the results when the Q-values are initialized to Q(s, a) = −1 and Figure 39 when they are initialized to Q(s, a) = 1. We can immediately see the difference in how quickly the agent learns. Take for example, in Figure 38 and Figure 39, the autoencoder trained from 1024 → 1024, which learns the identity function. When the Q-values are initialized to −1 the agent learns very slowly, so slowly that only after 3000 episodes does it reach the same reward as random play. With the Q-values initialized to 1, the agent at episode 500 already obtains 4 times more reward than with the Q-values initialized to −1. Generally speaking, the results tend towards those of Q = 0 as long as the experiments run long enough.

Figure 38: Q = −1. [Plot "Sigmoid activation with Q-values=-1": x-axis episodes (0 to 3000), y-axis rewards (0 to 450); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]

Figure 39: Q = 1. [Plot "Sigmoid activation with Q-values=1": x-axis episodes (0 to 3000), y-axis rewards (0 to 450); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]

Table 8 shows the average reward obtained with the different autoencoders, taken over the last 500 episodes. As can be seen, Q = −1 does not perform well overall: learning takes too long. Q = 0 and Q = 1 are closer to each other. Autoencoders trained to 1024 and 512 perform better when the Q-values are initialized to 1, but autoencoders trained deeper, with multiple hidden layers, tend to learn better with the initialization Q = 0. Since we are investigating how deep we can go with deep learning before our unsupervised feature extraction method loses too much information, we will from now on use Q-values initialized to Q = 0.

                  Q = −1             Q = 0              Q = 1
1024 → 1024       64.78 (± 17.71)    325.01 (± 43.96)   267.42 (± 37.49)
1024 → 512        216.06 (± 31.57)   290.53 (± 38.64)   296.39 (± 40.5)
1024 → 256        152.03 (± 24.25)   250.09 (± 35.35)   253.82 (± 39.31)
1024 → 128        239.04 (± 35.52)   250.9 (± 41.42)    238.25 (± 37.14)
1024 → 64         158.91 (± 26.29)   152.75 (± 23.83)   150.84 (± 29.8)

Table 8: The difference between different initial Q-values

5.9 Pretraining and extracting other layers

In the previous experiments only the bottleneck was used as the extracted features. But since we are experimenting with deep learning, and thus with multiple layers, it could also be useful to train to a very small bottleneck and extract a different layer than the bottleneck. This was also done in previous research (Stadie, Levine, & Abbeel, 2015), where the bottleneck layer was not the one taken. Figure 40 illustrates this: the third hidden layer, marked with the red box, is extracted instead of the bottleneck.

Figure 40: Example of an autoencoder with another layer extracted than the

bottleneck

When going into deep learning it is also a good idea to pretrain the network. In pretraining, each layer is trained separately and the layers are then stacked together. For example, if we want a pretrained autoencoder from 1024 → 256, we first train a separate autoencoder from 1024 → 512. Its weights are saved together with the encoded form of the inputs, so the new input layer has 512 dimensions. The next step is to create an autoencoder from 512 → 256, which is trained on our new, encoded input features. Afterwards a whole new autoencoder is created with the weights that were saved for each layer. This autoencoder can then be fine-tuned by training the whole network again. Note that this is very time-consuming because multiple autoencoders are trained.
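The layer-wise procedure can be sketched in NumPy as follows. This is a simplified illustration under assumptions, not the code used for the experiments: every layer is a sigmoid layer trained on the MSE loss with full-batch gradient descent, the learning rate and the tiny 16 → 8 → 4 architecture are hypothetical, and the final fine-tuning step is left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_single_autoencoder(data, n_hidden, n_epochs=15, lr=0.5, seed=0):
    """Train one input -> n_hidden -> input autoencoder; return (encoder, decoder)."""
    rng = np.random.default_rng(seed)
    n_inputs = data.shape[1]
    w_enc = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
    b_enc = np.zeros(n_hidden)
    w_dec = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
    b_dec = np.zeros(n_inputs)
    for _ in range(n_epochs):
        h = sigmoid(data @ w_enc + b_enc)
        y = sigmoid(h @ w_dec + b_dec)
        # Gradient of the mean squared error through both sigmoid layers.
        d_y = 2.0 * (y - data) / data.size * y * (1.0 - y)
        d_h = (d_y @ w_dec.T) * h * (1.0 - h)
        w_dec -= lr * (h.T @ d_y)
        b_dec -= lr * d_y.sum(axis=0)
        w_enc -= lr * (data.T @ d_h)
        b_enc -= lr * d_h.sum(axis=0)
    return (w_enc, b_enc), (w_dec, b_dec)


def pretrain_stack(data, hidden_sizes):
    """Train one autoencoder per hidden size and stack them into a deep one."""
    encoders, decoders = [], []
    current = data
    for n_hidden in hidden_sizes:
        encoder, decoder = train_single_autoencoder(current, n_hidden)
        encoders.append(encoder)
        decoders.append(decoder)
        # The encoded data becomes the input of the next autoencoder.
        current = sigmoid(current @ encoder[0] + encoder[1])
    # Encoders in training order, decoders in reverse order.
    return encoders + decoders[::-1]


def forward(x, layers):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for weights, bias in layers:
        activations.append(sigmoid(activations[-1] @ weights + bias))
    return activations


# Hypothetical data: 32 random 16-bit states, pretrained as 16 -> 8 -> 4 -> 8 -> 16.
rng = np.random.default_rng(1)
states = rng.integers(0, 2, size=(32, 16)).astype(float)
layers = pretrain_stack(states, hidden_sizes=[8, 4])
activations = forward(states, layers)
bottleneck = activations[2]   # the 4-node bottleneck
other_layer = activations[1]  # the 8-node layer, extracted instead of the bottleneck
```

Because `forward` returns every intermediate activation, extracting a layer other than the bottleneck is just a matter of picking a different index.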

Figure 41: Pretraining with extraction of layer 512. [Boxplots "Training deep with pretraining and extracting layer 512": y-axis rewards (50 to 400); x-axis: 1024 -> 256: 512, 1024 -> 128: 512, 1024 -> 64: 512, 1024 -> 32: 512, 1024 -> 16: 512, 1024 -> 8: 512, 1024 -> 4: 512, Manual features, Basic, Random; average lines for Manual features, Random and Basic.]

Figure 41 shows the results when autoencoders are pretrained down to a very small layer and layer 512 is extracted each time. Boxplots are shown for the last 1000 episodes, together with boxplots of the Basic, Manual features and Random agents and their average lines for comparison. As can be seen, the deeper the autoencoder, going down to layers of 32, 16, 8 and 4 nodes, the more information is lost. This results in rewards that are poor compared to those of the Manual features and Basic. Pretraining has, however, helped in training the autoencoder 1024 → 64: its performance is better than Basic, although it still underperforms in comparison to the Manual features. See Appendix A, Figure 50 for the detailed plot.

Figure 42 shows the result of training to a layer with 4 nodes. This means that there are a total of 8 possible layers that can be extracted. Training to a layer with 4 features and extracting those 4 features does not yield a good score: too much information is lost when going from 1024 features to only 4. But when a layer with more hidden nodes than these 4 is extracted from the same autoencoder, it carries more information and yields a higher result. The reason that extracting the 512-node layer does not yield a higher score than training an autoencoder with a single 512-node layer is the training error. As previously mentioned, the deeper a network is trained, the more information is lost (Section 5.3).

Figure 42: Pretraining with extraction to a hidden layer of 4 nodes. [Boxplots "Training deep with to a layer with 4 nodes": y-axis rewards (50 to 400); x-axis: 1024 -> 4: 512, 1024 -> 4: 256, 1024 -> 4: 128, 1024 -> 4: 64, 1024 -> 4: 32, 1024 -> 4: 16, 1024 -> 4: 8, 1024 -> 4: 4, Manual features, Basic, Random; average lines for Manual features, Random and Basic.]

A more detailed table of all the autoencoders with all their possible extracted layers is given in Appendix A, Table 9, with the average reward and standard deviation over the last 1000 episodes.

As suggested by Srivastava et al. (2014), adding dropout to a deep network can prevent the network from overfitting. Recall that a network trained on samples tries to fit the data as well as possible. When the network can mimic the training samples almost perfectly but cannot mimic the test samples, or new samples from our agent, it is overfitting. By adding dropout, and thus randomly dropping nodes and their connections, the network is forced to learn the samples via different nodes and connections. Figure 43 shows what happens to the performance when dropout is added. With fewer hidden layers, and therefore fewer hidden nodes, the rewards gained by the agent are worse than before. But the autoencoders 1024 → 32, 16 and 8 perform better than before. Without dropout the network is probably overfitting and trying to recreate all training samples exactly; a dropout of 30% avoids this. The boxplots show that the autoencoder 1024 → 256 : 512 with dropout has a lower reward after 3000 episodes than the same autoencoder without dropout, but Figure 44 shows that its learning curve, the black line, has not converged and is still increasing. This means that adding dropout makes learning slower, both for the autoencoder and for the agent.
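A dropout step during training can be sketched as a random mask applied to the activations of a layer. The sketch uses the "inverted" variant, which rescales the surviving activations during training so that nothing has to change at test time; this variant, the function name and the example layer are assumptions, and only the rate of 30% comes from the text.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest."""
    keep = rng.random(activations.shape) >= rate  # True with probability 1 - rate
    return activations * keep / (1.0 - rate)


# Hypothetical layer output: 1000 samples with 50 hidden nodes, all equal to 1.
rng = np.random.default_rng(0)
hidden = np.ones((1000, 50))
dropped = dropout(hidden, rate=0.3, rng=rng)
fraction_dropped = float(np.mean(dropped == 0.0))
```

Because a different mask is drawn for every batch, the network cannot rely on any single node, at the cost of needing more updates to converge.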

Figure 43: Pretraining with extraction of layer 512 with dropout. [Boxplots "Training deep with pretraining and extracting layer 512: dropout": y-axis rewards (50 to 400); x-axis: 1024 -> 256: 512, 1024 -> 128: 512, 1024 -> 64: 512, 1024 -> 32: 512, 1024 -> 16: 512, 1024 -> 8: 512, 1024 -> 4: 512, Manual features, Basic, Random; average lines for Manual features, Random and Basic.]

Figure 44: Pretraining with extraction of layer 512 with dropout. [Plot "Training deep with pretraining and extracting layer 512 with dropout": x-axis episodes (0 to 3000), y-axis rewards (0 to 500); series: 1024 -> 4: 512, 1024 -> 8: 512, 1024 -> 16: 512, 1024 -> 32: 512, 1024 -> 64: 512, 1024 -> 128: 512, 1024 -> 256: 512, Manual features, Random.]

5.10 Combination of RAM and layer

Combining the RAM with the encoded version of the RAM can tell us how much the encoded version contributes. Figure 45 shows the results; for a more detailed plot see Appendix A, Figure 51. Adding the RAM state gives a boost to a poorer feature extraction. Note that it is important that the RAM state lies in [0, 1], because the sigmoid activation function also limits the encoded values to [0, 1]. Nonetheless, with a weaker feature extraction the original RAM state takes over and is relied on more than the features extracted by the autoencoder. Figure 45 also shows the difference between the boxplots of 1024 → 512 + RAM and 1024 → 512. The combination of autoencoder features and RAM performs slightly better than an agent that uses only the features from the autoencoder. This means that the autoencoder has not captured all the valuable information that was in the RAM; if it had, the performance would have been the same. However, since the difference is minimal, it can be argued that it has captured most of the valuable information.
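Combining the two feature sets amounts to a concatenation, as in the sketch below. The function name, the range check and the random example vectors are hypothetical; the sizes 1024 and 512 correspond to the 1024 → 512 autoencoder.

```python
import numpy as np


def combine_features(ram_bits, encoded):
    """Concatenate the RAM state with its encoding; both must lie in [0, 1]."""
    ram_bits = np.asarray(ram_bits, dtype=float)
    encoded = np.asarray(encoded, dtype=float)
    for name, values in (("ram_bits", ram_bits), ("encoded", encoded)):
        if values.min() < 0.0 or values.max() > 1.0:
            raise ValueError(name + " must lie in [0, 1]")
    return np.concatenate([ram_bits, encoded])


# Hypothetical inputs: a 1024-bit RAM state and a 512-value sigmoid encoding.
rng = np.random.default_rng(0)
ram_bits = rng.integers(0, 2, size=1024)
encoded = rng.random(512)
combined = combine_features(ram_bits, encoded)
```

Keeping both parts on the same scale prevents one of them from dominating the learned weights purely because of its magnitude.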

Figure 45: Combining the original layer with the encoded version. [Boxplots "Combining RAM and encoded RAM": y-axis rewards (50 to 400); x-axis: 1024 -> 512 + RAM, 1024 -> 256 + RAM, 1024 -> 128 + RAM, 1024 -> 64 + RAM, Manual features, Basic, 1024 -> 512, Random; average lines for Manual features, Random and Basic.]

5.11 Visualizing high dimensional data

It is also possible to visualize our high-dimensional data using a technique called t-SNE, t-Distributed Stochastic Neighbor Embedding (Van der Maaten & Hinton, 2008). This technique maps the high-dimensional data onto a two-dimensional space such that states that are very similar end up close together. We take our best autoencoder, 1024 → 512 with the sigmoid function, and save all the encoded states. These are then mapped to a two-dimensional space with t-SNE, which results in a scatter plot. Each point is then given a color using the following:

colors = max(φ · θ)

Here φ is the encoding of the RAM state and θ the state-action weights. φ has dimension (samples × nodes), where nodes is the number of values in the encoding produced by our autoencoder, and θ has dimension (nodes × actions), where actions is the number of possible actions the agent can take. Taking the dot product gives an array of dimension (samples × actions). We then take the maximum over the actions for each sample, which gives a one-dimensional array containing the maximum Q-value for each input state. Figure 46 shows the result for the last 10,000 RAM states, encodings and state-action values. As can be seen, there are clusters of points with the same color, such as red, some blueish and even some green. This means that there are RAM states that are comparable and that are mapped close together by the autoencoder. This is an indication that the features we use to learn values are in fact relevant for the task, even though the values are not used to learn the features.
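The coloring rule can be illustrated with random stand-ins for φ and θ. The array sizes below (5 samples, 512 nodes, 6 actions) and the random values are hypothetical; only the shapes and the row-wise maximum follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_nodes, n_actions = 5, 512, 6  # hypothetical sizes

phi = rng.random((n_samples, n_nodes))         # encoded RAM states, one row per sample
theta = rng.normal(size=(n_nodes, n_actions))  # state-action weights, one column per action

q_values = phi @ theta          # shape (samples, actions)
colors = q_values.max(axis=1)   # maximum Q-value per sample, shape (samples,)
```

Each entry of `colors` would then be mapped to a color for the corresponding point in the scatter plot.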

Figure 46: t-SNE. [Scatter plot; both axes range from −150 to 150.]

Chapter 6

Conclusions

We have developed a method for unsupervised feature extraction that outperforms the use of raw input features and almost matches the manual feature encoding methods. Our method is based on the use of autoencoding neural networks to learn a compressed representation of the input data. We have compared multiple autoencoder-based approaches empirically. A number of conclusions can be drawn from these experiments. The non-linear autoencoder is in this case not better than a linear autoencoder. The linear autoencoder can compete with the Manual features, but PCA could just as easily have been used and would be expected to yield similar results. Investigating different activation functions is worthwhile because, as can be seen in the graphs, they make a large difference. When reducing a large input dimension to a very small dimension through many layers, it is a good idea to add pretraining and dropout. These mechanisms are needed so that the autoencoder does not overfit on the training data.

The visualization of the autoencoder indeed shows some clusters; the autoencoder thus finds a representation in which the input RAM is well represented by the encoded states together with the SARSA(λ) values.

When using autoencoders as a feature extraction method, different layers, activation functions and even different input representations must be investigated to obtain a wide range of options from which to choose the best autoencoder. This research shows that when working on black-box data, since RAM is not humanly interpretable, it is possible to get a better result than with the plain features.


6.1 Future work

This thesis is entirely based on RAM states. Because RAM states are a black box, it is difficult to see or interpret what happens. We know that the RAM state contains the entire state of a game: it encodes where the agent is, whether the laser has been fired and in what direction. Unfortunately it is practically impossible to read these things from the RAM state. This is in contrast with frames. ALE also offers the possibility to receive frames, which consist of pixels with different color values. In Atari 2600 games each color stands for a specific item; for example, green is the player's ship and orange the shield. These are useful features that can be used to learn in a better way. They can be extracted by removing the background and the static elements such as the score, the khaki base and so on. But when the agent receives pixels, it does not know everything that is happening: a frame does not contain the entire state of the game. For example, in Figure 32 the agent receives the frame, but it cannot determine from a single frame where the laser is going. The laser can come from the agent itself, fired a few time steps back, or from the aliens. To overcome this problem multiple frames can be used instead of a single frame, whereas in this thesis we used only one RAM state per time step.

Appendices

Appendix A

Extended graphs and tables

Figure 47: Autoencoders with multiple hidden layers with a Linear activation function. [Plot "Gameplay with autoencoders and linear activation function": x-axis episodes (0 to 3000), y-axis rewards (0 to 400); series: 1024->512, 1024->1024, Manual features, Random.]

Figure 48: Autoencoders with multiple hidden layers with a Sigmoid activation function. [Plot "Gameplay with autoencoders and Sigmoid activation function": x-axis episodes (0 to 3000), y-axis rewards (50 to 400); series: 1024->16, 1024->32, 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]

Figure 49: Autoencoders with multiple hidden layers with a ReLU activation function. [Plot "Gameplay with autoencoders and ReLU activation function": x-axis episodes (0 to 3000), y-axis rewards (50 to 400); series: 1024->64, 1024->128, 1024->256, 1024->512, 1024->1024, Manual features, Random.]

Figure 50: Pretraining with extraction of layer 512. [Plot "Training deep with pretraining and extracting layer 512": x-axis episodes (0 to 3000), y-axis rewards (0 to 500); series: 1024 -> 4: 512, 1024 -> 8: 512, 1024 -> 16: 512, 1024 -> 32: 512, 1024 -> 64: 512, 1024 -> 128: 512, 1024 -> 256: 512, Manual features, Random.]

Figure 51: Combining the original layer with the encoded version. [Plot "Combining the encoded RAM + original RAM": x-axis episodes (0 to 3000), y-axis rewards (50 to 400); series: 1024->64, 1024->128, 1024->256, 1024->512, Manual features, Random, RAM.]

            1024 → 256         1024 → 128         1024 → 64          1024 → 32          1024 → 16          1024 → 8           1024 → 4
Layer 512   291.67 (± 33.03)   297.2 (± 35.01)    306.86 (± 38.53)   253.74 (± 32.01)   251.57 (± 30.44)   249.99 (± 32.69)   236.42 (± 29.71)
Layer 256   294.01 (± 37.12)   289.75 (± 35.69)   270.12 (± 34.52)   258.19 (± 30.82)   240.92 (± 33.65)   247.75 (± 34.49)   251.18 (± 31.93)
Layer 128   -                  272.67 (± 34.38)   232.15 (± 30.43)   233.08 (± 30.86)   242.41 (± 31.82)   210.0 (± 32.45)    238.56 (± 29.51)
Layer 64    -                  -                  228.96 (± 27.87)   219.76 (± 26.28)   249.08 (± 35.29)   204.68 (± 28.71)   215.38 (± 28.74)
Layer 32    -                  -                  -                  240.66 (± 30.14)   223.82 (± 28.25)   213.47 (± 28.38)   212.85 (± 27.6)
Layer 16    -                  -                  -                  -                  238.78 (± 32.97)   248.57 (± 31.52)   185.06 (± 27.21)
Layer 8     -                  -                  -                  -                  -                  191.69 (± 31.47)   153.39 (± 23.84)
Layer 4     -                  -                  -                  -                  -                  -                  145.6 (± 29.45)

Table 9: Training to a specific layer and extracting a chosen layer

Chapter 7Bibliography

Anji. (n.d.). Pole balance. Retrieved April 29, 2016, from http : / / anji .

sourceforge.net/polebalance.htm

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive

elements that can solve difficult learning control problems. Systems,

Man and Cybernetics, IEEE Transactions on, (5), 834–846.

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013, June). The

arcade learning environment: an evaluation platform for general agents.

Journal of Artificial Intelligence Research, 47, 253–279.

Bengio, Y. (2009). Learning deep architectures for ai. Foundations and trends R©in Machine Learning, 2 (1), 1–127.

Breiman, L. (1996). Bagging predictors. Machine learning, 24 (2), 123–140.

Campbell, M., Hoane, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial intel-

ligence, 134 (1), 57–83.

Collobert, R. & Weston, J. (2008). A unified architecture for natural language

processing: deep neural networks with multitask learning. In Proceed-

ings of the 25th international conference on machine learning (pp. 160–

167). ACM.

Cruz, J. A. & Wishart, D. S. (2006). Applications of machine learning in

cancer prediction and prognosis. Cancer informatics, 2.

Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of

on-line learning and an application to boosting. Journal of computer

and system sciences, 55 (1), 119–139.

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural net-

works. In International conference on artificial intelligence and statis-

tics (pp. 315–323).

82

http://anji.sourceforge.net/polebalance.htm

http://anji.sourceforge.net/polebalance.htm

Google. (n.d.). Google self-driving car project. Retrieved April 29, 2016, from

https://www.google.com/selfdrivingcar/reports/

Hinton, G. E. [Geoffrey E] & Salakhutdinov, R. R. (2006). Reducing the

dimensionality of data with neural networks. Science, 313 (5786), 504–

507.

Hinton, G. E. [Geoffrey E.] & Salakhutdinov, R. R. (2008). Using deep belief

nets to learn covariance kernels for gaussian processes. In J. C. Platt,

D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural in-

formation processing systems 20 (pp. 1249–1256). Curran Associates,

Inc.

Koenig, S. & Simmons, R. G. (1996). The effect of representation and knowl-

edge on goal-directed exploration with reinforcement-learning algorithms.

Machine Learning, 22 (1-3), 227–250.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification

with deep convolutional neural networks. In F. Pereira, C. J. C. Burges,

L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information

processing systems 25 (pp. 1097–1105). Curran Associates, Inc.

LeCun, Y. A., Bottou, L., Orr, G. B., & Muller, K.-R. (2012). Efficient back-

prop. In Neural networks: tricks of the trade (pp. 9–48). Springer.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 (7553),

436–444.

RL-Library. (n.d.). Mountain car. Retrieved April 29, 2016, from

http://library.rl-community.org/wiki/Mountain_Car_(Java)

Makhzani, A. & Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.

Michie, D. & Chambers, R. A. (1968). Boxes: an experiment in adaptive

control. Machine intelligence, 2 (2), 137–152.

Minsky, M. & Papert, S. (1969). Perceptrons. MIT press.

Mitchell, T. (1997). Machine learning. McGraw-Hill International Editions.

McGraw-Hill.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra,

D., & Riedmiller, M. (2013). Playing atari with deep reinforcement

learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare,

M. G., . . . Ostrovski, G., et al. (2015). Human-level control through

deep reinforcement learning. Nature, 518 (7540), 529–533.

Naddaf, Y. et al. (2010). Game-independent ai agents for playing atari 2600

console games (Doctoral dissertation, University of Alberta).

Nair, V. & Hinton, G. E. [Geoffrey E]. (2010). Rectified linear units improve

restricted boltzmann machines. In Proceedings of the 27th international

conference on machine learning (icml-10) (pp. 807–814).

Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes, 72, 1–19.

Quinlan, J. R. (1987). Simplifying decision trees. International journal of

man-machine studies, 27 (3), 221–234.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information

storage and organization in the brain. Psychological review, 65 (6), 386.

Rummery, G. A. & Niranjan, M. (1994). On-line q-learning using connec-

tionist systems.

Sammut, C. & Webb, G. I. (2011). Encyclopedia of machine learning. Springer

Science & Business Media.

Schaeffer, J., Culberson, J., Treloar, N., Knight, B., Lu, P., & Szafron, D.

(1992). A world championship caliber checkers program. Artificial In-

telligence, 53 (2), 273–289.

Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neu-

ral Networks, 61, 85–117.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche,

G., . . . Lanctot, M., et al. (2016). Mastering the game of go with deep

neural networks and tree search. Nature, 529 (7587), 484–489.

Skinner, B. F. (1938). The behavior of organisms: an experimental analysis.

Skinner, B. F. (1948). Superstition in the pigeon. Journal of experimental

psychology, 38 (2), 168.

Skinner, B. F. (1951). How to teach animals. Freeman.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov,

R. (2014). Dropout: a simple way to prevent neural networks from

overfitting. The Journal of Machine Learning Research, 15 (1), 1929–

1958.

Stadie, B. C., Levine, S., & Abbeel, P. (2015). Incentivizing exploration

in reinforcement learning with deep predictive models. arXiv preprint

arXiv:1507.00814.

Sutton, R. S. (1996). Generalization in reinforcement learning: successful

examples using sparse coarse coding. Advances in neural information

processing systems, 1038–1044.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: an introduction.

MIT press.

Tesauro, G. (1994). Td-gammon, a self-teaching backgammon program, achieves

master-level play. Neural computation, 6 (2), 215–219.

Thorndike, E. L. (1911). Animal intelligence: an experimental study of the

associative processes in animals.

Todes, D. P. (2002). Pavlov’s physiology factory: experiment, interpretation,

laboratory enterprise. JHU Press.

Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for

character recognition-a survey. Pattern recognition, 29 (4), 641–662.

Van der Maaten, L. & Hinton, G. (2008). Visualizing data using t-SNE.

Journal of Machine Learning Research, 9, 2579–2605.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extract-

ing and composing robust features with denoising autoencoders. In

Proceedings of the 25th international conference on machine learning

(pp. 1096–1103). ACM.

Watkins, C. J. & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4),

279–292.

Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dis-

sertation, University of Cambridge, England).
