

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Music Predictions Using Deep Learning. Could LSTM Networks be the New Standard for Collaborative Filtering?

EMIL KESKI-SEPPÄLÄ AND MICHAEL SNELLMAN

KTH, SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Music Predictions Using Deep Learning. Could LSTM Networks be the New Standard for Collaborative Filtering?

Emil Keski-Seppälä and Michael Snellman

Degree Project in Computer Science, DD143X
Supervisor: Dilian Gurov
Examiner: Örjan Ekeberg
CSC, KTH, 2016-05-11


Abstract

Predicting which product a customer would like to buy is an increasingly important field of study, and there are several different recommender system models used to make recommendations for users. Deep learning has shown effective results in a variety of predictive tasks, but there has not been much research concerning its usage in recommender systems. This thesis studies the effectiveness of using a long short-term memory (LSTM) implementation of a recurrent neural network (RNN) as a recommender system by comparing it to one of the most common recommender system implementations, the matrix factorization method. A radio playlist dataset is used to train both the LSTM and the matrix factorization models with the intent of generating accurate predictions. We were unable to create an LSTM model with good performance, and because of that we are unable to make any significant conclusions regarding whether or not LSTM networks outperform matrix factorization models.


Sammanfattning

Predicting what a customer is most likely to buy is an important field of study, and there are a number of different recommender systems used to create recommendations for users. Deep learning has shown effective results in a variety of predictive tasks, but not much research has been done on its use in recommender systems. This report studies whether a Long Short-Term Memory (LSTM) implementation of a Recurrent Neural Network (RNN) is better at predicting which music users want to listen to than matrix factorization, one of the most popular models used in today's recommender systems. We use a dataset consisting of songs collected from different radio stations to train our two models, with the aim of creating models that can predict what a user wants to listen to. We did not succeed in creating an LSTM model that performed as well as we expected, and because of that we are unable to draw any interesting conclusions about whether LSTM networks perform better than matrix factorization.


Contents

List of Figures
1 Introduction
  1.1 Problem Statement
2 Background
  2.1 Recommender Systems
    2.1.1 Introduction
    2.1.2 Collaborative Filtering
      2.1.2.1 User Based Collaborative Filtering
      2.1.2.2 Item Based Collaborative Filtering
    2.1.3 Matrix Factorization
    2.1.4 Content-Based Filtering
    2.1.5 Other Recommender Systems
      2.1.5.1 Demographic Filtering
      2.1.5.2 Knowledge-Based Filtering
    2.1.6 Hybrid Techniques
    2.1.7 Cold Start Problem
  2.2 Deep Learning
    2.2.1 Feed-Forward Neural Networks
    2.2.2 Recurrent Neural Networks
  2.3 LSTM Networks
    2.3.1 LSTM Architecture
    2.3.2 LSTM Enhancements
      2.3.2.1 Gradient Calculation
      2.3.2.2 Additions to Memory Cell
    2.3.3 LSTM Equations
      2.3.3.1 Forward Pass
      2.3.3.2 Backward Pass
3 Methods
  3.1 Dataset
  3.2 Frameworks
  3.3 Matrix Factorization
  3.4 LSTM Structure
    3.4.1 Input
    3.4.2 Hidden Layer
      3.4.2.1 The Amount of Cells and Layers
      3.4.2.2 Dropout
    3.4.3 Output Layer
    3.4.4 Optimizer
  3.5 LSTM Loss Function
  3.6 LSTM Evaluation
4 Results
  4.1 LSTM Results
  4.2 Matrix Factorization Results
5 Discussion and Analysis
  5.1 Discussion on LSTM Results
  5.2 Discussion on Matrix Factorization Results
  5.3 Connection to Problem Statement
6 Conclusion
  6.1 Suggestions for Future Research
7 References


List of Figures

Figure 2.1: A visual representation of neighborhoods. Source: https://commons.wikimedia.org/wiki/File:Map1NN.png
Figure 2.2: A user-item matrix. Source: http://limn.it/wp-content/uploads/Grid.jpg
Figure 2.3: A visualisation of how matrix factorization reduces the matrix. Source: http://x-wei.github.io/static/sparkmooc_note_lab4/pasted_image002.png
Figure 2.4: Example of two neural networks. The top feedforward neural network is not "deep" since it only has one hidden layer; the bottom feedforward neural network is "deep" since it has many hidden layers. Source: http://neuralnetworksanddeeplearning.com/chap5.html
Figure 2.5: A GPU-based max-pooling convolutional neural network (GPU-MPCNN) won the MICCAI 2013 Grand Challenge on Mitosis Detection. Source: http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html
Figure 2.6: Basic FNN with four input units, five hidden units and one output unit. Source: http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html
Figure 2.7: How the activation is calculated in a single unit. Inputs are multiplied by a specific weight element; they are then summed and placed in the sigmoid function. It is common to have a fixed input equal to 1 with its own weight element to act as a bias. Source: http://www.codeproject.com/Articles/175777/Financial-predictor-via-neural-network
Figure 2.8: A plot of the sigmoid function. Source: http://artint.info/html/ArtInt_180.html
Figure 2.9: To the left is a neural network with a cycle. To the right is the visual representation of unfolding it: the previous neural network A passes on a new input to its successor based on data from a previous timestep. Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Figure 2.10: Illustration of a modern LSTM memory cell (right) in comparison to a unit in a Simple Recurrent Network (SRN). Source: http://deeplearning4j.org/lstm.html
Figure 2.11: Illustration of how the LSTM network preserves information over time. The states of the input (bottom), forget (middle) and output (top) gates are shown. White circles represent open gates and straight lines represent closed gates. Note that as long as the forget gate is open and the input gate is closed, the information from the previous input is preserved. Source: http://deeplearning4j.org/lstm.html
Figure 3.1: A visual representation of an LSTM network with two hidden layers, where every blank rectangle is an LSTM cell in a hidden layer. The dotted lines represent connections that have dropout applied to them. Source: https://tonydeep.wordpress.com/2015/11/17/paper-review/
Figure 3.2: Example graph displaying how to spot overfitting by observing training and validation accuracy. Source: http://cs231n.github.io/neural-networks-3/#sanitycheck
Figure 4.1: The training loss and validation loss of our LSTM model with 20% dropout in every layer.
Figure 4.2: The training loss and validation loss of our LSTM model with 50% dropout in every layer.
Figure 4.3: The training accuracy and validation accuracy of our LSTM model with 50% dropout in every layer. Around 20% accurate predictions on our training and validation sets.
Figure 4.4: The mean squared error distribution for different numbers of iterations.
Figure 4.5: The accuracy distribution for different numbers of iterations.


1 Introduction

What if you could predict what products your users want to buy? This question is of great importance to many internet-based service providers, as accurate recommendations of items that the customer is interested in can provide a significant increase in purchases. Many internet-based services such as Netflix and Spotify invest a lot of work and money into improving their recommendation systems. For example, between 2006 and 2009 the streaming service Netflix held a $1 million competition for whoever could make a 10% improvement to their recommender algorithm.

Deep learning is a collective term for various deep network models. These models have seen use in a variety of predictive tasks and shown great results in many of them. Image recognition, speech recognition, text generation and even playing Go are just some of the areas in which deep learning has shown exceptional results. Among the different deep learning models, recurrent neural networks (RNN), and more specifically the long short-term memory (LSTM) implementation, have shown exceptional results in predictive tasks, and much research has been done on them. An area where little research has been conducted, however, is deep learning's effectiveness as an item recommender. There have been studies that investigate the usage of RNNs to counteract the cold-start problem [1], where a new system, item or user does not have enough data to make recommendations from, which is common in standard recommender systems, and Spotify has been experimenting with the usage of RNNs for recommendations [2], but there still have not been any studies that evaluate the usage of RNNs as a recommendation system.

Standard recommender systems usually follow either the collaborative filtering approach, where large amounts of data about users' behaviour, preferences or activities are analyzed and compared to other users so that items similar users like can be recommended; the content-based filtering approach, where users' preferences are gathered and items that match those preferences are recommended; or a hybrid of these two models. The question is whether a deep learning implementation could provide more accurate recommendations than these standard models.

1.1 Problem Statement

A big problem in collaborative filtering based recommender systems, including matrix factorization models, is the cold-start problem, wherein a new system, or simply a newly added item or user, does not have enough data to create recommendations from. Matrix factorization has relatively good scalability compared to other recommender system implementations, but since the amount of data used in many modern implementations keeps growing, scale still turns into a problem when similarity is computed for very large datasets. Deep learning models, and especially RNNs, actually need a large amount of data to train on in order to get good results, and since they have proven to be effective in other predictive tasks, they may again produce good results and be an effective substitute for standard recommender systems when large amounts of data are involved, and could prove to be more effective than current recommender models.

The main purpose of this thesis is to measure the effectiveness of using a recurrent neural network in recommender systems by using an LSTM (Long Short-Term Memory) model and comparing it to a standard recommender system using the matrix factorization model. These results will hopefully give some insight into possible future uses for RNNs and whether they might be effective as standard recommender systems.

2 Background

2.1 Recommender Systems

2.1.1 Introduction

Recommender systems are today a widely studied field, and different recommender systems are used by most internet-based stores and service providers. Recommender systems analyze the patterns of customers to predict what else they might be interested in. This can be very important for internet-based stores, which keep customers interested by suggesting other articles they might like. An example of just how important recommender systems are right now is that a few years ago Netflix issued a challenge to improve their recommender algorithm's root mean squared error (RMSE) by 10 percent for a prize of a million dollars. What follows is a description of the most commonly used general techniques in recommender system models:

2.1.2 Collaborative Filtering

Collaborative filtering [3] is one of the most commonly used recommender techniques. It works by collecting a great amount of user data and analyzing purchases, ratings and other activities to predict what items may be of interest to the consumer. Collaborative filtering is based on the idea of finding similarities in data patterns to make predictions. In many cases user data such as purchases and ratings are analyzed in order to recommend other similar products, or products that users with similar purchases or ratings bought. It is today widely used in internet-based stores, comparing either users or items to make predictions, but it can be used in basically any context where different data can be compared, for instance social networks, where users are compared with other similar users to recommend friends or people to follow.

2.1.2.1 User Based Collaborative Filtering

In user based collaborative filtering [4] users are compared to other users to find those who have similar ratings, based on the idea that people with similar ratings will most likely rate other things similarly too. Since many collaborative filtering models have to handle a very high number of users, the computation can become slow; computing all correlations for one user takes O(mn), where m is the number of users and n the number of items. Therefore they often use a neighborhood-based model to make computations faster, selecting a number of users with similar ratings and creating a neighborhood where their ratings are compared to recommend other products that a user might like. Several different algorithms may be used to find similar neighbors, for instance the k-nearest neighbor algorithm (kNN). The kNN algorithm computes the similarity between all users and then creates a neighborhood of the k most similar users.


Figure 2.1: A visual representation of neighborhoods. Source: https://commons.wikimedia.org/wiki/File:Map1NN.png

The number of users in the neighborhood is also an important factor, as too few will not give enough data while too many will cause too much unnecessary noise and reduce efficiency. When a prediction is made it will not just be based on the most similar user but on a weighted average of users from the neighborhood, where the weight corresponds to how similar the users' ratings are. There are different methods to measure the correlation between users; a common one is the Pearson correlation coefficient:
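In its standard form, with $x_i$, $y_i$ the ratings and $\bar{x}$, $\bar{y}$ the mean ratings as defined below, the coefficient can be written as:

$$\mathrm{sim}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$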

In this formula users $x$ and $y$ are compared; their ratings for item $i$, out of the $n$ items that both of them have rated, are written as $x_i$ and $y_i$, while $\bar{x}$ and $\bar{y}$ are the mean values of their respective ratings.

2.1.2.2 Item Based Collaborative Filtering

A problem with the user based collaborative filtering model is data sparsity: there are too many items and not enough ratings to give a recommendation. On average an item will have many more ratings than an average user, which avoids the data sparsity problem common for user based models. Add to that that there might be millions of users, which makes computational performance suffer when users are compared. Therefore, in contrast to the user based approach described above, item based collaborative filtering [5] looks at the items instead of the users to find items similar to those that a user has already given a high rating. Item based collaborative filtering models will usually take the form of "Users who purchased this item also liked..." and are used in several internet-based stores, most notably by Amazon. To compute similarities between items, the model pre-constructs a table of similar items based on customers' purchases. While the construction of the similar-item table is time intensive, the completed model can very quickly find items similar to the user's purchases and ratings, pick the most fitting items and recommend them to the user in much the same way as the user-based model.

2.1.3 Matrix Factorization


Another model used in collaborative filtering is matrix factorization [6]. A basic ratings matrix represents both items and users as vectors based on rating information.

Figure 2.2: A user-item matrix. Source: http://limn.it/wp-content/uploads/Grid.jpg

But a matrix like that has high computational complexity, and a better model would use a more compact method. Therefore, in matrix factorization the matrix is dimensionally reduced to a small, compact model for fast computations by mapping both users and items to joint latent factors. This results in a faster method with a better neighbor network and a reduction of synonymity. Since matrix factorization is accurate, has good scalability and is flexible for modeling various situations, it has been gaining popularity lately.
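In the standard latent factor formulation [6], each item $i$ is mapped to a factor vector $q_i$ and each user $u$ to a factor vector $p_u$, and a rating is approximated by their inner product:

$$\hat{r}_{ui} = q_i^{T} p_u$$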


Figure 2.3: A visualisation of how matrix factorization reduces the matrix. Source: http://x-wei.github.io/static/sparkmooc_note_lab4/pasted_image002.png

The best data to use to fill the matrix is explicit user feedback, which will usually be ratings of products, but then we run into the data sparsity problem discussed previously, which results in a sparsely populated matrix since most users won't have that many ratings. That is why matrix factorization also utilizes implicit feedback in places where there is no explicit information. Implicit feedback is gathered by analyzing the user's behavior, for example what they purchased, what they browsed and what they searched for. With implicit feedback it is possible to create a much more densely filled matrix.

2.1.4 Content-Based Filtering

Besides collaborative filtering, one of the most well-known recommender system models is content-based filtering [7]. The key idea behind content-based filtering is to only recommend items that have attributes the user is interested in, by classifying items based on the user's preferences. For these reasons content-based filtering is very useful for recommending movies and other items that can easily be filtered by keywords, and it is used today by, for example, the Internet Movie Database. The content space is specified as a space where every keyword is a dimension. An item can be represented as a keyword vector in that space that specifies which keywords apply to said item. For every user there is a taste profile that is also represented as a vector in that space. To create a user profile, historical data about the user's preferences is gathered with either explicit or implicit feedback. The data is then analyzed and used to create the user's taste vector. The content-based filtering model will then be able to recommend items by comparing the user's taste vector to the closest item vectors, computing the cosine between the vectors. An interesting point for this thesis is that Aäron van den Oord [14] was able to utilize deep convolutional networks together with content-based filtering for predicting music recommendations as a way to counteract the cold-start problem explained in 2.1.7.
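The vector comparison mentioned above, computing the cosine between a user's taste vector $u$ and an item vector $v$, is the standard cosine similarity:

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$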


2.1.5 Other Recommendation Techniques

While collaborative and content-based filtering are the most commonly used recommender systems, there are several other systems that aren't as frequently used. We will briefly go over demographic and knowledge-based filtering:

2.1.5.1 Demographic Filtering

The demographic filtering [8] approach has similarities to user-based collaborative filtering in that it bases recommendations on what similar users like. The major difference, however, is that demographic filtering tries to fit the user into a specific demographic and will then base recommendations on what other users in that demographic like.

2.1.5.2 Knowledge-Based Filtering

In the knowledge-based filtering approach the model will instead try to find a good recommendation by querying the user about what they are searching for. The user chooses a starting product that they enjoy, and the model finds similar products that are then filtered by querying the user for tweaks to the recommendation; for instance, a user may be recommended a restaurant and can then choose "cheaper", which will tweak the results to display a new, similar but cheaper recommendation [9].

2.1.6 Hybrid Techniques

There are also hybrid approaches that utilise more than one recommender model to make their predictions. They can for instance use both collaborative and content-based filtering techniques. These hybrid approaches can make up for the downsides of specific models and will therefore usually produce better results. Most commonly, collaborative filtering will be combined with some other model to make up for the cold-start and data sparsity problems. Hybrid models can be created in different ways: by combining the scores of several techniques to produce recommendations, by switching between different techniques depending on the current situation, or by creating an algorithm containing features of different techniques [10].

2.1.7 Cold Start Problem


One difficult problem that is common in recommender systems, and most specifically in collaborative filtering, is the cold-start problem [11]. This problem concerns the absence of data to make recommendations from in the case of a new user, a new item or, most importantly, a newly implemented system. In the case of a new user who hasn't purchased or rated any items, the problem is that there isn't a profile to compare against other users or against items. In this case it will be hard to make any personalized recommendations; models will usually try to phase in personalization or offer popular default items as personalization options. When a new item is added there is a problem for collaborative filtering models, as there might not be any ratings for it, which means that it can't be recommended to anyone. This is one reason for the rising popularity of hybrid solutions, as the item in this case can still be recommended by a content-based filtering model. The hardest part of the cold-start problem is when a completely new system is launched. In this case there isn't any data to set up recommendations, and some options are to use data from a different source or to simply set up a data gathering system that can be used by a recommender system implemented later. The cold-start problem is a popular area of research with many different suggested solutions. There is research using hybrid solutions [12], social tags [13] and, most interestingly for this report, deep learning [14].

2.2 Deep Learning

Deep learning is a field of study in machine learning based on a class of statistical learning algorithms, loosely modeled after the neurons in the human brain, which use multiple levels of non-linear operations to find patterns in complex data such as images [15]. The non-linear operations are a crucial aspect that allows neural networks to model extremely complex structures. The field has experienced a surge in popularity since Geoffrey Hinton published a paper describing how to efficiently train deep belief networks (DBN) in 2006 [16]. Prior to this discovery, it was considered too difficult to train neural networks with many layers due to a problem known as the "vanishing gradient problem" [17]. Hinton solved this issue by stacking single-layer neural networks known as restricted Boltzmann machines on top of each other and training them in a greedy manner, one layer at a time. A neural network is considered to be "deep" as long as it has more than one hidden layer. An example of a deep neural network is shown in figure 2.4.

Figure 2.4: Example of two neural networks. The top feedforward neural network is not "deep" since it only has one hidden layer. However, the bottom feedforward neural network is "deep" since it has many hidden layers. Source: http://neuralnetworksanddeeplearning.com/chap5.html

Since this discovery a lot of breakthroughs have been made within deep learning in almost every aspect, whether regarding advances in neural network architectures, different methods for training neural networks, or new real-world applications for deep learning. The reason deep learning has become as popular as it is today is its successful application in fields such as computer vision, speech recognition, natural language understanding, handwriting recognition, audio processing, information retrieval, robotics and more [18]. Deep learning has also been used successfully to win many competitions within topics such as visual mitosis detection and optical character identification [19, 20]. Neural networks have also been used to set records in prosody contour prediction and social signal classification [21, 22]. Neural networks are also being applied in recommender systems, both within content-based filtering and collaborative filtering [23, 24]. In most cases, two different types of neural networks are used to achieve these results: ensembles of GPU-MPCNNs and LSTM-RNNs.

Figure 2.5: A GPU-based max-pooling convolutional neural network (GPU-MPCNN) won the MICCAI 2013 Grand Challenge on Mitosis Detection. Source: http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html

2.2.1 Feed-Forward Neural Networks

Feed-forward neural networks (FNN) are known by many names; another common name for them is "multilayer perceptron". A multi-layered FNN is a network that consists of multiple layers of units (also known as neurons), where each layer has a connection to another layer of the network. However, an FNN cannot have any cycles, so the connections can only go forward (hence the name feedforward). There are three types of layers (see figure 2.6): the input layer, which consists of the input units; the hidden layer, which consists of the hidden units; and the output layer, which consists of the output units.

Figure 2.6: Basic FNN with four input units, five hidden units and one output unit. Source: http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html

Every unit in an FNN has an activation level. We denote the activation level of unit k as a_k. For input units the activation level is the value of the input. For non-input units the activation level is calculated as a function of the units from the previous layer: first a weighted sum of the previous units' activations is computed (each connection in figure 2.6 has its own specific weight), which is then passed through a nonlinear activation function. In multilayer FNNs it is common to use the sigmoid function as the activation function. See figure 2.7 for a visual representation of a single unit and figure 2.8 for a graph of the sigmoid function.
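Written out, with $w_{jk}$ denoting the weight on the connection from unit $j$ to unit $k$ and $b_k$ a bias term (these symbol names are chosen here for illustration), the activation of a non-input unit is:

$$a_k = \sigma\!\left(\sum_{j} w_{jk} a_j + b_k\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$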

Figure 2.7 (left): How the activation is calculated in a single unit. Inputs are multiplied by a specific weight element; they are then summed and placed in the sigmoid function. It is common to have a fixed input equal to 1 with its own weight element to act as a bias. Source: http://www.codeproject.com/Articles/175777/Financial-predictor-via-neural-network

Figure 2.8 (right): A plot of the sigmoid function. Source: http://artint.info/html/ArtInt_180.html

So we have now explained how the units operate. However, how do we know what values our weights should have? Let us remember that we want our neural network to be able to recognize patterns in the real world. Therefore we need to train our network in order to prepare it for the real world. It is during this process that we feed empirical data of what we're trying to predict into our neural network and keep adjusting the weights in order to minimize the difference between the target output and the actual output generated by our neural network. A common technique called backpropagation is applied to minimize the errors by adjusting the weights. A common function to calculate the error is the least squares function, where t is the target output from our empirical data, y the actual output from our neural network and E the squared error.
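With that notation, the squared error for a single output can be written in its standard form (the factor 1/2 is a common convention that simplifies the derivative):

$$E = \frac{1}{2}(t - y)^2$$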

Backpropagation is a supervised learning technique with two steps. First you calculate a prediction using randomized weights and compare the prediction with the actual output to get the prediction error (forward pass). The second step is to calculate the gradients of the weights in every layer with respect to the prediction error by moving backwards through the network (backward pass). Once you have the gradients you can update the weights with the equation used for stochastic gradient descent. You then repeat the process until you have minimized the error to sufficient levels.
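The stochastic gradient descent update referred to above takes the standard form, where $\eta$ is a learning rate chosen by the practitioner:

$$w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}$$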

2.2.2 Recurrent Neural Networks

When discussing feedforward neural networks we made the distinction that there could not be any cycles. If you relax this restriction and allow cycles you end up with a recurrent neural network (RNN). The problem with FNNs is that they have no memory. If you wanted to classify a sequence of words it would be very beneficial to know which words came before the word you're trying to predict. Cycles allow previous information to linger within the network. Think of one neural network passing on information to the next neural network. One way of understanding cycles in recurrent neural networks is by unfolding the loop, see figure 2.9.

Figure 2.9: To the left is a neural network with a cycle. To the right is the visual representation of unfolding it: the previous neural network A passes on a new input to its successor based on data from a previous timestep. Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

This unfolding reveals why recurrent neural networks are so well-suited for understanding sequences and lists. RNNs are good at using information from previous timesteps; however, they can't go back very far in time, and past a certain point it becomes very difficult to connect the information [25].
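Formally, a simple recurrent network computes its hidden state from the current input and the previous hidden state. In a standard formulation (the weight matrix names below are illustrative), this reads:

$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$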

2.3 LSTM Networks

In a regular recurrent neural network a problem can arise where the gradient used in the backpropagation part of training becomes either too small or too large, as it is multiplied by the weights at every timestep. If the weights in the matrix are too small it can lead to the vanishing gradient problem, where the gradient becomes so small that learning is extremely slow or may even stop altogether; this also makes it hard to learn long-term dependencies. On the other hand, if the weights are too large we can instead end up with the exploding gradient problem, where the gradient is so large that learning diverges.


There were many attempts to solve this gradient problem in the 1990s, many of which used non-gradient based training algorithms. However, the current best solution is the Long Short-Term Memory (LSTM) architecture [26]. The purpose of this chapter is to give a background for LSTM, the RNN architecture that we use in our thesis. We will describe the basic structure of LSTM and why it solves the gradient problem. We will describe how we calculate the error gradient and what improvements have been made since 1997, when the first LSTM was created. We will also give all the necessary equations to train an LSTM network.

2.3.1 LSTM Architecture

To solve these problems the LSTM model uses memory cells. An LSTM network is formed exactly like a simple RNN, but with memory cells instead of the regular non-linear units. A memory cell is a structure made up of four main components: an input gate, a forget gate, an output gate and a neuron with a self-recurrent connection. The weight of the self-recurrent connection is set to 1.0, which ensures that without any outside interference the memory cell's state will remain constant between timesteps. The input, output and forget gates are used to modulate the interactions the memory cell has with its environment. The input gate can either allow or block incoming signals that would alter the state of the memory cell, conversely the output gate can decide whether the state of the memory cell will be allowed to affect other memory cells, and the forget gate can affect the memory cell's self-recurrent connection, allowing it to forget its previous state if needed. The hidden layer of an LSTM network can also be attached to any type of output layer, like any other neural network, depending on the task at hand, whether that task is classification, regression or something else. See figure 2.10 below for a visualisation of a memory cell in comparison to a unit in a recurrent neural network.

Figure 2.10: Illustration of a modern LSTM memory cell (right) in comparison to a unit in a Simple Recurrent Network (SRN). Source: http://deeplearning4j.org/lstm.html


The preservation of information over time is illustrated in figure 2.11.

Figure 2.11: Illustration of how the LSTM network preserves information over time. The states of the input (bottom), forget (middle) and output (top) gates are shown. White circles represent open gates and straight lines represent closed gates. Note that as long as the forget gate is open and the input gate is closed, the information from the previous input is preserved. Source: http://deeplearning4j.org/lstm.html

2.3.2 LSTM Enhancements

2.3.2.1 Gradient Calculation

Every neural network needs to be trained, meaning that the weights need to be updated until the differences between the guesses made by the model and the training data are very small. LSTM networks are trained using gradient descent. The original LSTM network from 1997 [26] used a combination of Real Time Recurrent Learning (RTRL) [27] and Backpropagation Through Time (BPTT) [28] to calculate an approximate error gradient. However, it is possible to calculate an exact LSTM error gradient using BPTT [29]. The exact gradient has not only been proven to be more accurate, but is also easier to debug since you can check the value numerically. In this thesis we only calculate the exact error gradient. Using the error gradient you can then update your weights.

2.3.2.2 Additions to Memory Cell


The original memory cell contained only input gates and output gates. The forget gates [30] and the peephole weights [31] connecting the gates to the memory cell were added later by other researchers. The purpose of the forget gate is to give memory cells a way to reset their states. As an example of why the ability to forget is useful, imagine that you are analyzing a large collection of playlists. Once you reach the end of a playlist you know that the next playlist is unrelated to the one you just finished looking at, and therefore the memory cell should be reset. Peephole connections improve the LSTM's ability to learn more difficult tasks that might require precise timing. In this thesis we are using the LSTM with forget gates and peephole connections added.

2.3.3 LSTM Equations

In this section we provide the equations needed for the activation (forward pass) and the exact error gradient calculation (backward pass) in a single cell. In the case of several memory cells you would just repeat these steps. It is also worth noting that only the cell output is visible to other memory cells; the rest happens internally. These equations are cited from a book focusing on RNNs and LSTMs by Alex Graves [32], and the explanation of the terms that follows is also taken from said book. $w_{ij}$ represents the weight of the connection between unit $i$ and unit $j$, the input from the network to a unit $j$ at time $t$ is $a_j^t$, and the value of unit $j$ at time $t$ after applying its activation function is $b_j^t$. The subscripts $\iota$, $\phi$ and $\omega$ refer to the input gate, the forget gate and the output gate of the cell. The peephole weights from cell $c$ to the input, forget and output gates are denoted $w_{c\iota}$, $w_{c\phi}$ and $w_{c\omega}$ respectively. $s_c^t$ refers to the state of a memory cell $c$ at time $t$, which is one of $C$ overall memory cells. $f$ is the activation function for the gates, and $g$ and $h$ are respectively the activation functions for the cell input and cell output. Let $I$ be the number of inputs, $K$ the number of outputs and $H$ the number of cells in the hidden layer. Let the subscript $h$ denote cell outputs from other memory cells in the hidden layer. We define $G$ as the total number of inputs to the hidden layer, including cells and gates, and we use the subscript $g$ to refer to these inputs when we do not need to distinguish between the input types. The forward pass is calculated for an input sequence $x$ of length $T$, where you start at $t = 1$ and recursively apply the update equations while incrementing $t$ after every step. The backward pass is calculated by starting at $t = T$ and recursively calculating the unit derivatives while decreasing $t$ with every step. To get the final weight derivatives you simply add up all the derivatives calculated at every timestep. We denote the derivative $\delta_j^t = \partial L / \partial a_j^t$, where $L$ stands for the loss function used to train the network, which is categorical cross-entropy and is discussed later in the report. The order of the equations is significant and every pass has to be done in the right order. All states and activations in the forward pass are initialized to 0 at $t = 0$, and all derivatives in the backward pass are initialized to 0 at $t = T + 1$.


2.3.3.1 Forward Pass
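Following the formulation in Graves [32] and using the notation defined above, the forward pass for a single memory cell is:

Input gate:
$$a_\iota^t = \sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} b_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}, \qquad b_\iota^t = f(a_\iota^t)$$

Forget gate:
$$a_\phi^t = \sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}, \qquad b_\phi^t = f(a_\phi^t)$$

Cell input and state:
$$a_c^t = \sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} b_h^{t-1}, \qquad s_c^t = b_\phi^t s_c^{t-1} + b_\iota^t\, g(a_c^t)$$

Output gate:
$$a_\omega^t = \sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t, \qquad b_\omega^t = f(a_\omega^t)$$

Cell output:
$$b_c^t = b_\omega^t\, h(s_c^t)$$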

2.3.3.2 Backward Pass
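Following the same source [32], define $\epsilon_c^t = \partial L / \partial b_c^t$ and $\epsilon_s^t = \partial L / \partial s_c^t$. The backward pass is then:

Cell outputs:
$$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{g=1}^{G} w_{cg}\, \delta_g^{t+1}$$

Output gates:
$$\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t$$

States:
$$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1} \epsilon_s^{t+1} + w_{c\iota}\, \delta_\iota^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t$$

Cells:
$$\delta_c^t = b_\iota^t\, g'(a_c^t)\, \epsilon_s^t$$

Forget gates:
$$\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t$$

Input gates:
$$\delta_\iota^t = f'(a_\iota^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t$$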


3 Methods

To examine the effectiveness of LSTM, a model was implemented using the Keras framework and trained on a large dataset partitioned into playlists. For comparison with the LSTM model, a matrix factorization model was implemented using Apache Spark. Our null hypothesis is that an LSTM-based recurrent neural network model won't be able to outperform regular recommender system models. Therefore, if the LSTM model produces better results than the regular recommender system, the null hypothesis can be rejected, as that would show that the LSTM implementation yields better results.

3.1 Dataset

For the dataset we used the R9 - Yahoo! Music Internet Radio Playlist, version 1.0 [33]. It is a dataset consisting of metadata collected from more than 4000 internet radio stations during a period of 15 days between September 22nd and October 6th, 2011. We decided to use this dataset partly because it was large and provided us with a lot of data, but also because it was easy to use, with clear formatting, and because the radio stations could serve as user playlists for our training and testing. We decided to predict the next artist instead of the next song, as there were too many different songs in the dataset. We partitioned the dataset into several different playlists, each with a size of fifty songs and with a step of 5 songs in the dataset for every new playlist. We added this semi-redundancy because the optimizer RMSProp performs better on redundant data; we discuss our optimizer in a later section. We also limited the number of unique artists to around 1000. We did this because in order to generate a prediction we have to generate a probability distribution across all possible predictions, and it becomes very difficult to generate this probability distribution with too many unique items. This resulted in around 150000 playlists containing roughly 500000 total songs.
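As an illustration of the partitioning step (the function and variable names below are illustrative and not taken from the thesis code), a sliding window of 50 songs with a step of 5 can be produced like this:

    def make_playlists(songs, window=50, step=5):
        # Split a long sequence of plays into overlapping playlists.
        #   songs:  list of artist ids in play order
        #   window: number of songs per playlist
        #   step:   offset between the starts of consecutive playlists
        playlists = []
        for start in range(0, len(songs) - window + 1, step):
            playlists.append(songs[start:start + window])
        return playlists

Each resulting playlist overlaps the previous one by 45 songs, which is the semi-redundancy mentioned above.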

3.2 Frameworks

To implement and train our LSTM model we used the neural network library Keras [34]. Keras is an open-source deep learning library that runs on Theano, a numerical computation library for Python. Keras is intended for fast and easy implementation of deep learning models, which made it a good fit for the project. As we also needed a regular recommender system model to compare against our RNN model, we also made use of Apache Spark's machine learning library MLlib [35]. Spark is an open-source cluster computing framework originally developed by researchers at University of California, Berkeley's AMPLab, but it is currently maintained by the Apache Software Foundation. Its machine learning library contains implementations for collaborative filtering, which were used in this project.

3.3 Matrix Factorization

To actually measure the effectiveness of the LSTM implementation, a standard recommender system was needed to compare the results against. For this, a collaborative filtering model utilizing matrix factorization was chosen, because collaborative filtering based approaches are the most common recommender system techniques, and within collaborative filtering matrix factorization has been proven to be more effective than classic techniques such as k-nearest neighbor, as demonstrated in the Netflix prize challenge and shown in Yehuda Koren's thesis [6]. The matrix factorization was implemented in Apache Spark and utilized the alternating least squares algorithm. There are several different algorithms that can be used in a matrix factorization model, such as singular value decomposition or stochastic gradient descent. Alternating least squares was used in this project because the model was trained on implicit data, and alternating least squares is the recommended algorithm in that case according to Yehuda Koren [6]. The effectiveness of the matrix factorization was computed using the mean squared error (MSE). MSE is used as a loss function and computes the difference between the predicted value and the actual value; it is one of the most widely used loss functions and is commonly used for recommender systems.
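A minimal sketch of what such an implementation can look like with Spark MLlib's Python API is shown below. The thesis does not state which language binding was used, and the variable names and toy data here are purely illustrative.

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="playlist-mf")

    # (station id, artist id, implicit play count) triples -- toy data
    triples = sc.parallelize([(0, 17, 3.0), (0, 42, 1.0), (1, 17, 2.0)])
    ratings = triples.map(lambda r: Rating(r[0], r[1], r[2]))

    # Train on implicit feedback with 10 latent factors and 12 iterations
    model = ALS.trainImplicit(ratings, rank=10, iterations=12)

    # Mean squared error between observed and predicted values
    pairs = ratings.map(lambda r: (r.user, r.product))
    preds = model.predictAll(pairs).map(lambda r: ((r.user, r.product), r.rating))
    truth = ratings.map(lambda r: ((r.user, r.product), r.rating))
    mse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()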

3.4 LSTM Structure

As we can see from figure 2.4 and figure 2.5, there are many ways to construct a neural network. There are many factors to consider and we will try to motivate many of them. Whilst it is true that your model depends a lot on your dataset, we will try to motivate why we decided to design our model the way we did, by referring to scientific journals and other LSTM implementations.

3.4.1 Input

In this section we describe what we do to our playlists to transform them into an input that we can feed into our LSTM network. We represent every item in a playlist as a one-hot vector of shape 1xN, which is used to distinguish between every artist in our catalogue [42]. The reason for using one-hot vectors is, first of all, that it makes our output very easy to calculate, and second, that our framework (Keras) uses one-hot vectors internally. Since every item is a vector, we represent every playlist as a matrix and the collection of playlists as a 3D tensor.


We represent this in our code as a 3D array. The reason for this representation is that the weights of a neural network are represented as matrices, and for matrix multiplication to take place our input needs to be in the form of matrices as well.

3.4.2 Hidden Layer

There are three important aspects of the hidden layers that we will address: the number of LSTM cells in every layer, the number of layers, and dropout, the practice of disregarding the calculations of randomly chosen cells. In figure 3.1 you can see an LSTM with two hidden layers.

Figure 3.1: A visual representation of an LSTM network with two hidden layers, where every blank rectangle is an LSTM cell in a hidden layer. The dotted lines represent connections that have dropout applied to them. Source: https://tonydeep.wordpress.com/2015/11/17/paper-review/

3.4.2.1 The Amount of Cells and Layers

You can vary the number of layers and the number of LSTM cells in every layer. These are choices left to the architect, since every dataset requires a different model, and we confirmed during our literature study that the number of cells and layers can vary greatly: we read papers where fewer than 100 cells were used and papers where more than 1000 cells were used [36, 37]. Our conclusion is that this is a matter where you have to use trial and error to find the optimal configuration. In an attempt to find the optimal structure we tried many different configurations, with cell numbers ranging from 100 to 500 cells per layer, with 3 to 5 hidden layers and the same number of cells in every hidden layer.


3.4.2.2 Dropout

Dropout is a technique that was proposed in 2014, in which you randomly choose a certain fraction of the cells in a layer and set their output to 0 [38]. The point of this technique is to reduce overfitting. During our literature study we discovered many papers trying to deduce the ideal amount of dropout and whether or not you need dropout after every layer [39]. We decided to try the same amount of dropout in every layer, as well as dropout only after the final hidden layer, setting it at either 0, 20% or 50%.

3.4.3 Output Layer

After the final hidden layer we reach the output layer. This is the layer where we generate our prediction for the most likely artist to listen to next, given a playlist. We use the softmax classifier to do this. The softmax activation function is given below.
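In its standard form, softmax maps a K-dimensional vector $z$ to a probability distribution:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$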

The function takes a K-dimensional vector z and squashes it into a probability distribution that sums to 1. The loss function that we need in this case is called categorical cross-entropy, also known as log loss. This loss function measures the "distance" between the probability distribution we generated and the "true" probability distribution. We go into more detail regarding our loss function in section 3.5. The reason for using the softmax classifier is that it is very well suited to our problem and is a very popular classifier in the text generation community as well.

3.4.4 Optimizer

After you have backpropagated through the network you need to update the weights, and there are several optimizer algorithms for this task. We use RMSProp, because it was the default optimizer in Keras and because it was recommended by Geoffrey Hinton [40]. Hinton recommends RMSProp if the dataset is large and contains a lot of redundancy, which we think is true of our playlists and dataset.
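To make the structure concrete, below is a minimal sketch of the kind of model described in this chapter, written against the Keras Sequential API. The layer sizes (4 LSTM layers of 500 cells with 50% dropout) correspond to one of the configurations reported in chapter 4, while n_artists, seq_len and the commented-out training call are illustrative assumptions rather than the exact thesis code.

    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense, Activation

    n_artists = 1000   # size of the one-hot artist vocabulary (illustrative)
    seq_len = 49       # songs seen before predicting the next artist (illustrative)

    model = Sequential()
    model.add(LSTM(500, return_sequences=True, input_shape=(seq_len, n_artists)))
    model.add(Dropout(0.5))
    model.add(LSTM(500, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(LSTM(500, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(LSTM(500))            # final LSTM layer returns only its last output
    model.add(Dropout(0.5))
    model.add(Dense(n_artists))     # output layer, one unit per artist
    model.add(Activation('softmax'))

    # Categorical cross-entropy loss and the RMSProp optimizer, as described above
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])

    # X: (num_playlists, seq_len, n_artists) one-hot tensor, y: (num_playlists, n_artists)
    # model.fit(X, y, validation_split=0.2)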

3.5 LSTM Loss Function The loss function that we used for our LSTM model was cross­entropy. The purpose of cross­entropy in our case is to essentially measure the distance between the probability distribution that our model generates for a certain playlist and the true distribution. Minimizing this distance would result in a stronger model.


Cross-entropy computes the average number of bits needed to identify an event drawn from a set of events when the coding scheme is based on an estimated probability distribution q rather than the true distribution p. The cross-entropy of p and q is defined as the entropy of p plus the Kullback-Leibler divergence from q to p. It can be summed up in the following function:
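Written out, the cross-entropy of the true distribution $p$ and the predicted distribution $q$ over the possible events $x$ is:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) = -\sum_{x} p(x) \log q(x)$$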

When used in machine learning, p signifies the true value of an event while q signifies the predicted value. Take the simple case of predicting, given an input $x$, whether a number $y$ will be either 0 or 1. The probability of the number being 1 or 0 is given by $q_{y=1} = \hat{y}$ and $q_{y=0} = 1 - \hat{y}$, where $\hat{y} = g(x \cdot w)$ and $w$ is a weight that is learned through some algorithm. The true probabilities can be described similarly as $p_{y=1} = y$ and $p_{y=0} = 1 - y$. With the notation in place we can simply insert the values into the function above. If there are $n$ different events, the loss can be computed by simply dividing the function by $n$.
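Inserting these values and averaging over $n$ events gives the familiar binary cross-entropy (log loss):

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$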

3.6 LSTM Evaluation

In order to measure the performance of our LSTM network during training we use a validation set. A validation set is usually 20% of the training data that you set aside and feed through the LSTM during its training process. The LSTM has not seen this validation data, so it is a useful test for deducing whether your model is overfitting. Overfitting is the issue where your model simply memorizes the training data instead of learning any patterns. If you use a validation set you can catch this problem by comparing the training loss to the validation loss: if your model has a low loss on training data but a high loss on validation data, you are overfitting [41]. Another valuable metric for performance evaluation is the training and validation accuracy [41]. The accuracy is given as the percentage of guesses that were correct on the training and validation data. This is useful because you can measure how large the overfitting is. See figure 3.2 below for an example graph.


Figure 3.2: Example graph displaying how to spot overfitting by observing training and validation accuracy. Source: http://cs231n.github.io/neural-networks-3/#sanitycheck

4 Results

4.1 LSTM Results

We tried many different structures in our attempts to get the LSTM model to converge. We have selected three graphs to display, since they illustrate our findings quite well. The first graph shows the training process of an LSTM network with 20% dropout after every hidden layer; it contained 500 hidden cells in every LSTM layer and had 4 LSTM layers. The second graph shows the training process of an LSTM network with 50% dropout after every hidden layer; it also contained 500 hidden cells in every LSTM layer and had 4 LSTM layers. The third graph shows the training and validation accuracy of the model used in graph two.

In figure 4.1 we see the training/validation loss of our LSTM model with 20% dropout. You can see that after only a few epochs the model begins to overfit quite heavily. Instead of learning any patterns or context, our model is simply memorizing our training data, which is why the validation loss started increasing. Also note how the validation error is lower than the training error. The reason for this is that Keras does not apply any dropout when it is testing on a validation set, giving the illusion that our model performs better on unknown data.


Figure 4.1: The training loss and validation loss of our LSTM model with 20% dropout in every layer.

In figure 4.2 we see the training/validation loss of our LSTM model with 50% dropout. As you can see, there is much less overfitting in this model. The problem with this model, however, is that it is not converging: it is essentially stuck on a very high training loss, which means that our model may have been unable to learn the more complicated contexts within our playlists.


Figure 4.2: The training loss and validation loss of our LSTM model with 50% dropout in every layer.

In figure 4.3 we see the training/validation accuracy of our LSTM model with 50% dropout. This graph demonstrates that the accuracy of our model on both datasets is quite close. This implies that there is little overfitting and that our model is learning a pattern.


Figure 4.3: The training accuracy and validation accuracy of our LSTM model with 50% dropout in every layer. The model reaches around 20% accurate predictions on both the training and validation sets.

4.2 Matrix Factorization Results We were able to implement a relatively well-functioning matrix factorization model that produced good results. The data was split into a training set and a validation set, where 80 percent of the data was used to train the model and the remaining 20 percent was used for testing it. We tried a few different settings for the number of iterations and the number of latent factors used by the model. Below is a graph illustrating the MSE for different numbers of iterations, with the best MSE value of 0.5566632127645209 produced at 12 iterations.
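As a rough sketch of how such a model could be trained and evaluated with the ALS implementation in Spark MLlib [35]; the file path, the rating format and the parameter values are assumptions for illustration, not the exact project code:

    # Matrix factorization with alternating least squares in Spark MLlib,
    # with an 80/20 train/validation split and MSE on the held-out pairs.
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="mf-sketch")

    # Assumed input format: "user_id,artist_id,play_count" per line (hypothetical).
    lines = sc.textFile("playlist_ratings.csv")
    ratings = lines.map(lambda l: l.split(',')) \
                   .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

    train, test = ratings.randomSplit([0.8, 0.2], seed=42)

    # rank = number of latent factors, iterations = number of ALS iterations.
    model = ALS.train(train, rank=10, iterations=12)

    # Predict ratings for the held-out (user, artist) pairs and compute the MSE.
    test_pairs = test.map(lambda r: (r.user, r.product))
    predictions = model.predictAll(test_pairs).map(lambda r: ((r.user, r.product), r.rating))
    actuals = test.map(lambda r: ((r.user, r.product), r.rating))
    mse = actuals.join(predictions).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
    print("MSE = %f" % mse)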


Figure 4.4: The mean squared error distribution for different numbers of iterations.

The following graph displays the accuracy of the matrix factorization model on the validation data set for different numbers of iterations. The accuracy stays relatively stable and shows very small changes between different numbers of iterations.

Figure 4.5: The accuracy distribution for different numbers of iterations.


5 Discussion

5.1 Discussion on LSTM Results The graphs in the previous section illustrate quite succinctly the problems we faced in trying to develop this model. There was a large span of time between our 20% dropout model and our 50% dropout model. The reason for this is that we had been led to believe that 20% dropout was the ideal amount and that anything more was not required, so in the beginning we spent a lot of time experimenting with other parameters in the hope of beating the overfitting. It was not until the very end of the project, when we experimented with 50% dropout, that we managed to beat the overfitting. However, as we can see in figure 4.2, the training loss and validation loss are very high. This indicates that our model is not large enough to handle the complex context of our data [41]. If we had made the model larger by adding more cells or LSTM layers we may have been able to get a better model. This would have been the natural continuation of our project, but time was not on our side and we simply did not have enough of it to explore this path.

We never anticipated how important time would be for this project, not just due to the deadlines but due to the time required to train our LSTM network. In our 50% dropout model one epoch would take 60 minutes to train, meaning that we would have to wait at least 15 to 20 hours before being able to tell whether a certain model was performing well. We never anticipated that training would take this long, since none of the neural network examples that we tried required nearly as much time to train. The difference was that we did not realize that the context of a certain song within a certain playlist is much more complex than in any of our examples. If we had understood the importance of time we would have made it a point to derive a detailed schedule; however, this is something that we did not do.

5.1.1 Limitations of the Softmax Classifier The softmax classifier is usually the classifier of choice for predictions. This classifier is not without its limitations, however, and we will discuss some of them in this section. First of all, the softmax classifier generates a probability distribution across all possible items that we want to predict, but our dataset contains around one million unique songs. When the number of items grows that large it becomes extremely expensive to generate a probability distribution using softmax. Our solution was to predict artists instead of unique songs, since there are far fewer unique artists than unique songs in our dataset. This meant that we were able to use the softmax classifier without any issues in our model. During our literature study we also found that this is a very active area of research within the text generation community.
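As a back-of-the-envelope illustration of why the output layer becomes unmanageable (our own sketch; the hidden size matches our model, while the song and artist counts are rounded assumptions):

    # Rough comparison of softmax output-layer sizes for songs versus artists.
    hidden_size = 500           # cells in the last LSTM layer
    num_songs = 1000000         # roughly one million unique songs in the dataset
    num_artists = 10000         # hypothetical, far fewer unique artists

    song_output_weights = hidden_size * num_songs       # 500,000,000 weights
    artist_output_weights = hidden_size * num_artists   # 5,000,000 weights

    print(song_output_weights, artist_output_weights)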


5.2 Discussion on Matrix Factorization Results Concerning the implementation of the matrix factorization model, as illustrated in the graphs of the results section, we were able to implement a model that produced satisfactory results. The accuracy rating of ~45% stayed stable for different numbers of iterations and for different numbers of latent factors used by the model. Given more time we could have been more thorough in investigating a larger range of values for the training of the model. A more advanced model could have been implemented, but due to the time constraints and the fact that the LSTM implementation is the main point of this thesis, we were not able to implement a more advanced matrix factorization model.

5.3 Connection to Problem Statement Because our LSTM model did not converge properly, we deduced that there would be no point in comparing the results of our models with each other, since we would be unable to draw any interesting conclusions from such a comparison. We can, however, discuss the differences when it came to the actual implementation of these models. Implementing the matrix factorization was a much smoother experience than implementing the LSTM. It made us realize that the ease of implementation most likely plays a large role in why matrix factorization is so popular in the recommender system community. Another factor that could be a reason why RNN models have not seen much use in recommender systems is how long it takes to train them. Recommender models need to be updated as new products, users and ratings are added, and while this process does not have to be very fast, it could still be a problem considering the time it took for us to train the LSTM model. Whilst deep learning is being used by big companies such as Spotify and Netflix, it may prove to be more difficult for a smaller company that does not already have a large dataset to achieve the same performance.

6 Conclusion We were unable to prove whether or not LSTM performs better than matrix factorization in the case of music prediction. However, given our results, we remain hopeful that if additional research were carried out, LSTM would outperform matrix factorization in the case of collaborative filtering. We can also conclude that matrix factorization is an easier model to implement than an LSTM network, and that implementing an LSTM network may require a lot of experimentation in order to find the proper configuration. However, by tracking the accuracy and loss of your training and validation sets, you are able to gain valuable insight into how your model works and what can be done to improve it.

6.1 Suggestions for future research Our model was unable to learn the context of our data because it was too small. That is why we would suggest that future researchers use a larger model than the one we used, with more cells and more layers. We believe that with a larger model the results would improve significantly. Future researchers may also want to recommend songs instead of artists; the normal softmax classifier would be unable to handle this task due to the sheer size of the collection of unique songs. That is why we would suggest looking into other options, such as hierarchical softmax. This is an active field of research, so new solutions may arise in the future. We are also interested in whether something such as word2vec could be applied to songs. Word2vec is used in natural language processing to generate word embeddings; words that appear in similar contexts have similar embeddings, and it would be very interesting to see if something like this could be applied to songs or artists.
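A speculative sketch of what that experiment could look like using gensim's Word2Vec implementation, treating each playlist as a "sentence" of artist identifiers. The playlists and parameters are placeholders, and parameter names vary between gensim versions (older versions use size instead of vector_size):

    # Word2vec-style embeddings trained on playlists rather than sentences.
    from gensim.models import Word2Vec

    # Hypothetical playlists: each is an ordered list of artist identifiers.
    playlists = [
        ["artist_12", "artist_7", "artist_33", "artist_7"],
        ["artist_33", "artist_98", "artist_12"],
        # ... one list per playlist in the dataset
    ]

    # Artists appearing in similar playlist contexts should end up with similar vectors.
    model = Word2Vec(playlists, vector_size=100, window=5, min_count=1, sg=1)

    # Artists most similar to a given artist according to the learned embeddings.
    print(model.wv.most_similar("artist_12"))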

7 References
[1] Hidasi, B, Karatzoglou, A, Baltrunas, L & Tikk, D 2016, 'Session-based Recommendations with Recurrent Neural Networks', ICLR 2016.
[2] Bernhardsson, E 2014, 'Recurrent Neural Networks for Collaborative Filtering'. Available at: http://erikbern.com/2014/06/28/recurrent-neural-networks-for-collaborative-filtering/ [16 April 2016].
[3] Herlocker, J, Konstan, J, Borchers, A & Riedl, J 1999, 'An Algorithmic Framework for Collaborative Filtering', Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 230-237. Available from: ACM Portal: ACM Digital Library. [25 February 2016].
[4] Zhao, Z-D & Shang, M-S 2010, 'User-based Collaborative-Filtering Recommendation Algorithms on Hadoop', Third International Conference on Knowledge Discovery and Data Mining, pp 478-481. Available from: IEEE Xplore Digital Library. [25 February 2016].

[5] Linden, G, Smith, B & York, J 2003, 'Amazon.com recommendations: Item-to-item collaborative filtering', IEEE Internet Computing, vol. 7, issue 1, pp 76-80. Available from: IEEE Xplore Digital Library. [25 February 2016].


[6] Koren, Y, Bell, R & Volinsky, C 2009, 'Matrix factorization techniques for recommender systems', Computer, vol. 42, issue 8, pp 30-37. Available from: IEEE Xplore Digital Library. [25 February 2016].
[7] Van Meteren, R & Van Someren, M 2000, 'Using Content-Based Filtering for Recommendation', presented at the ECML 2000 Workshop. Available from: http://users.ics.forth.gr/~potamias/mlnia/paper_6.pdf
[8] Aimeur, E, Brassard, G & Fernandez, JM 2006, 'Privacy-preserving demographic filtering', SAC '06 Proceedings of the 2006 ACM symposium on Applied computing, pp 872-878. Available from: IEEE Xplore Digital Library. [25 February 2016].
[9] Trewin, S 2000, 'Knowledge-based recommender systems', in Encyclopedia of library and information science, pp. 180-198. Available from: books.google.com. [25 February 2016].
[10] Burke, R 2002, 'Hybrid recommender systems: Survey and experiments', User Modeling and User-Adapted Interaction, November 2002, vol. 12, issue 4, pp 331-370. Available from: Springer Link. [25 February 2016].
[11] Konstan, J 2016, 'The Cold Start Problem', lecture notes distributed in Introduction to Recommender Systems at the University of Minnesota. Available from: https://www.coursera.org/learn/recommender-systems/home/welcome. [25 February 2016].
[12] Lam, XN, Vu, T, Le, TD & Duong, AD 2008, 'Addressing cold-start problem in recommendation systems', ICUIMC '08 Proceedings of the 2nd international conference on Ubiquitous information management and communication, pp 208-211. Available from: ACM Portal: ACM Digital Library. [25 February 2016].
[13] Zhang, ZK, Liu, C, Zhang, YC & Zhou, T 2010, 'Solving the cold-start problem in recommender systems with social tags', EPL (Europhysics Letters), vol. 92, num. 2. Available from: iopscience.iop.org. [25 February 2016].
[14] van den Oord, A, Dieleman, S & Schrauwen, B 2013, 'Deep content-based music recommendation', Advances in Neural Information Processing Systems 26 (2013), vol. 26, pp. 2643-2651. Available from: NIPS Proceedings. [25 February 2016].


[15] Deng, L & Yu, D 2014, 'Deep Learning: Methods and Applications', Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387.
[16] Hinton, GE & Osindero, S 2006, 'A fast learning algorithm for deep belief nets', Neural Computation, vol. 18, no. 7, pp. 1527-1554.
[17] Hochreiter, S 1998, 'The vanishing gradient problem during learning recurrent neural nets and problem solutions', International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116.
[18] Markoff, J 2012, 'Scientists See Promise in Deep-Learning Programs', The New York Times, 23 November. Available from: http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html
[19] Ciresan, DC, Giusti, A, Gambardella, LM & Schmidhuber, J 2013, 'Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks', Medical Image Computing and Computer-Assisted Intervention - MICCAI 13, vol. 8150, pp. 411-418.
[20] Ciresan, D & Schmidhuber, J 2013, 'Multi-Column Deep Neural Networks for Offline Handwritten Chinese Character Classification', arXiv. Available from: http://arxiv.org/pdf/1309.0261v1.pdf
[21] Fernandez, R, Rendel, A, Ramabhadran, B & Hoory, R 2014, 'Prosody Contour Prediction with Long Short-Term Memory, Bi-Directional, Deep Recurrent Neural Networks', Interspeech 2014. Available from: https://www.researchgate.net/publication/267154161_Prosody_Contour_Prediction_with_Long_Short-Term_Memory_Bi-Directional_Deep_Recurrent_Neural_Networks
[22] Brueckner, R & Schuller, B 2014, 'Social Signal Classification Using Deep BLSTM Recurrent Neural Networks', International Conference on Acoustics, Speech and Signal Processing - ICASSP 2014, pp. 4823-4827.
[23] Van Den Oord, A, Dieleman, S & Schrauwen, B 2013, 'Deep content-based music recommendation', Neural Information Processing Systems Conference - NIPS 2013, vol. 26, pp. 2643-2651.
[24] Wang, H, Wang, N & Yeung, DY 2015, 'Collaborative Deep Learning for Recommender Systems', KDD '15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235-1244.
[25] Bengio, Y, Simard, P & Frasconi, P 1994, 'Learning long-term dependencies with gradient descent is difficult', IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166.

[26] Hochreiter, S & Schmidhuber, J 1997, 'Long Short-Term Memory', Neural Computation, vol. 9, no. 8, pp. 1735-1780.
[27] Robinson, AJ & Fallside, F 1987, 'The Utility Driven Dynamic Error Propagation Network' (CUED/F-INFENG/TR.1), technical report, Engineering Department, Cambridge University, Cambridge, UK.
[28] Williams, RJ & Zipser, D 1995, 'Gradient-based learning algorithms for recurrent networks and their computational complexity', in Y. Chauvin & D. E. Rumelhart (eds), Back-propagation: Theory, Architectures and Applications, pp. 433-486, Lawrence Erlbaum Publishers, Hillsdale, N.J. Available from: citeseer.nj.nec.com/williams95gradientbased.
[29] Graves, A & Schmidhuber, J 2005, 'Framewise phoneme classification with bidirectional LSTM and other neural network architectures', Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2047-2052. Available from: IEEE Xplore Digital Library. [18 April 2016].
[30] Gers, F 2001, 'Long Short-Term Memory in Recurrent Neural Networks', PhD thesis, Ecole Polytechnique Fédérale de Lausanne. Available from: http://felixgers.de/papers/phd.pdf
[31] Gers, F, Schraudolph, N & Schmidhuber, J 2002, 'Learning precise timing with LSTM recurrent networks', The Journal of Machine Learning Research, vol. 3, pp. 115-143. Available from: ACM Portal: ACM Digital Library. [18 April 2016].


[32] Graves, A 2008, 'Supervised Sequence Labelling with Recurrent Neural Networks', PhD thesis, Technische Universität München.
[33] Yahoo! Webscope dataset ydata-ymusic-internet-radio-playlists-v1_0. Available from: http://labs.yahoo.com/Academic_Relations
[34] Chollet, F 2015, 'Keras', GitHub. Available from: https://github.com/fchollet/keras
[35] Zaharia, M 2014, Apache Spark (Version 1.6.1) [Computer program]. Available at: http://spark.apache.org/mllib/ (Accessed 09 May 2016).
[36] Wen, T-H, Gasic, M, Mrksic, N, Su, P-H, Vandyke, D & Young, S 2015, 'Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems', EMNLP 2015. Available from: http://arxiv.org/pdf/1508.01745v2.pdf (Accessed 11 May 2016).
[37] Zhang, S, Liu, C, Jiang, H, Wei, S, Dai, L & Hu, Y 2016, 'Feedforward Sequential Memory Networks: A New Structure to Learn Long-term Dependency'. Available from: http://arxiv.org/pdf/1512.08301.pdf (Accessed 11 May 2016).
[38] Srivastava, N, Hinton, G, Krizhevsky, A, Sutskever, I & Salakhutdinov, R 2014, 'Dropout: A Simple Way to Prevent Neural Networks from Overfitting', Journal of Machine Learning Research, vol. 15, pp. 1929-1958. Available from: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf (Accessed 11 May 2016).
[39] Pham, V, Bluche, T, Kermorvant, C & Louradour, J 2014, 'Dropout improves Recurrent Neural Networks for Handwriting Recognition', Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pp. 285-290. Available from: http://arxiv.org/pdf/1312.4569.pdf (Accessed 11 May 2016).
[40] Hinton, G, Srivastava, N & Swersky, K 2014, 'Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent'. Available from: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (Accessed 11 May 2016).
[41] Stanford University, 'CS231n Convolutional Neural Networks for Visual Recognition'. Available from: http://cs231n.github.io/neural-networks-3/ (Accessed 11 May 2016).
[42] Wikipedia, 'One-hot encoding'. Available from: https://en.wikipedia.org/wiki/One-hot (Accessed 11 May 2016).

