
MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Structure Discovery of Deep Neural Network Based on Evolutionary Algorithms

Shinozaki, T.; Watanabe, S.

TR2015-032 April 2015

Abstract

Deep neural networks (DNNs) are constructed by considering highly complicated configurations including network structure and several tuning parameters (number of hidden states and learning rate in each layer), which greatly affect the performance of speech processing applications. To reach optimal performance in such systems, deep understanding and expertise in DNNs is necessary, which limits the development of DNN systems to skilled experts. To overcome the problem, this paper proposes an efficient optimization strategy for DNN structure and parameters using evolutionary algorithms. The proposed approach parametrizes the DNN structure by a directed acyclic graph, and the DNN structure is represented by a simple binary vector. Genetic algorithm and covariance matrix adaptation evolution strategy efficiently optimize the performance jointly with respect to the above binary vector and the other tuning parameters. Experiments on phoneme recognition and spoken digit detection tasks show the effectiveness of the proposed approach by discovering the appropriate DNN structure automatically.

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2015
201 Broadway, Cambridge, Massachusetts 02139


STRUCTURE DISCOVERY OF DEEP NEURAL NETWORK BASED ON EVOLUTIONARY ALGORITHMS

Takahiro Shinozaki

Tokyo Institute of Technology
4259 Nagatsuta-cho, Midori-ku, Yokohama,

Kanagawa, 226-8502 JAPAN

Shinji Watanabe

Mitsubishi Electric Research Laboratories (MERL)
201 Broadway, Cambridge, MA, 02139 USA

ABSTRACT

Deep neural networks (DNNs) are constructed by considering highly complicated configurations including network structure and several tuning parameters (number of hidden states and learning rate in each layer), which greatly affect the performance of speech processing applications. To reach optimal performance in such systems, deep understanding and expertise in DNNs is necessary, which limits the development of DNN systems to skilled experts. To overcome the problem, this paper proposes an efficient optimization strategy for DNN structure and parameters using evolutionary algorithms. The proposed approach parametrizes the DNN structure by a directed acyclic graph, and the DNN structure is represented by a simple binary vector. Genetic algorithm and covariance matrix adaptation evolution strategy efficiently optimize the performance jointly with respect to the above binary vector and the other tuning parameters. Experiments on phoneme recognition and spoken digit detection tasks show the effectiveness of the proposed approach by discovering the appropriate DNN structure automatically.

Index Terms— Deep neural network, structure optimization, evolutionary algorithm, genetic algorithm, CMA-ES

1. INTRODUCTION

Recently, deep neural networks (DNNs) have become standard techniques in various speech and language applications including speech recognition and spoken term detection [1, 2], and well-trained DNNs make it possible to build high-performance systems for such tasks. However, DNNs have very complicated configurations including network topologies and several tuning parameters (number of hidden states and learning rate in each layer) that greatly affect the performance of speech processing applications. Therefore, to optimize performance in such systems, deep understanding and expertise in DNNs is necessary, thus limiting the development of DNN systems to skilled experts. This paper aims to overcome this problem by automatically optimizing DNNs used in acoustic models in speech processing without human expert elaboration, based on evolutionary algorithms.

Applications of evolutionary algorithms to acoustic modeling in speech processing have been studied by various researchers [3–7]. However, these approaches have mainly focused on model parameter optimization or several tuning parameters. A DNN has another important configuration tuned by human experts that deals with the topologies of DNNs. In fact, the progress in neural network speech processing arises from the discovery of appropriate DNN topologies, e.g., deep layers, deep stacking networks, and tandem concatenation of original inputs with neural network (bottleneck layer) outputs for GMM features [8–11]. We call this problem the structure optimization of DNNs. There are several studies of structure optimization of Gaussian based models [12–16]; however, these studies do not address the optimization of DNNs. With this background, this paper proposes efficient structure and tuning parameter optimization algorithms based on evolutionary algorithms for DNNs used in speech processing.

Evolutionary algorithms are a generic term for nonlinear/non-convex optimization algorithms inspired by several biological evolution mechanisms, including genetic algorithms and evolution strategies, and are widely used in various applications. This paper uses a popular genetic algorithm [17] and the covariance matrix adaptation evolution strategy (CMA-ES) [18, 19] as evolutionary algorithms. They have the common functions of 1) generating various vectors encoding system configurations to be tested and 2) scoring multiple DNN systems with these prepared configurations. These steps are iteratively performed to search for the optimal tuning parameters. The first, generation, step provides multiple hypotheses created by a process of simulated biological gene generation in a genetic algorithm and by probabilistic random number sampling in CMA-ES. The second, scoring, step includes training and evaluation of DNNs to obtain their scores, which usually takes a very long time, but this step can be parallelized over configurations. Although both approaches must represent a DNN structure configuration using a vectorized form (gene), this paper also proposes to formulate the DNN structure configuration with a binary vector based on a directed acyclic graph (DAG) representation, where each connection between neuron modules is represented by a binary value.

We conducted experiments on the TSUBAME supercomputer, a massively parallel computing platform developed at Tokyo Institute of Technology^1. Although each evaluation of a DNN is fairly simple (single-speaker phone recognition and spoken digit detection), the experiments were run using 62 GPGPUs in parallel for both evolutionary algorithms to address DNN structure discovery problems.

2. EVOLUTIONARY ALGORITHM

Let us first denote the nth DNN configuration as C_n. C_n is composed of DNN specifications of tuning parameters and network topologies, which is enough information to specify the DNN training. The D-dimensional vector x_n is a code to generate the configuration C_n and can be regarded as a gene in the evolutionary algorithm. This problem is formulated to find the optimal configuration C_{n*}:

$C_{n^*} = \operatorname*{argmax}_{\forall n} f(C_n)$.   (1)

^1 http://www.gsic.titech.ac.jp/en


Algorithm 1 Evolutionary algorithm
 1: Initialize x
 2: for n = 1, ..., N_0 do
 3:    Sample x_n
 4: end for
 5: while not convergence do
 6:    for n = 1, ..., N (N_0 for the 1st iteration) do
 7:       Decode gene vector x_n to configuration C_n
 8:       Evaluate configuration C_n to get score f(C_n)
 9:    end for
10:    Generate child gene vectors {x_n} (n = 1, ..., N) from the current (parent) gene vectors {x_n} and their scores {f(C_n)}
11: end while

where function f provides the performance of the DNN (i.e., the accuracy and size of the DNN). As speech processing systems are extremely complicated, there is no analytical form for f, and it is difficult to include specific knowledge in f. Such situations are best handled using an evolutionary algorithm, which is described in Algorithm 1.

The algorithm starts by providing initial gene vectors {x_n} (n = 1, ..., N). Then, the gene vectors are decoded to the corresponding configurations {C_n} and scored by function evaluations. Based on the scores, new gene vectors are generated. Genetic algorithms and CMA-ES have different strategies in this generation step, and the following sections describe the gene generation step of each method in greater detail.
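As a rough illustration, the overall loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation; the helpers sample_initial_gene, decode, train_and_score, and generate_children are hypothetical placeholders for the gene sampling, decoding, DNN training/evaluation, and GA or CMA-ES generation steps described in this section.

def evolutionary_search(sample_initial_gene, decode, train_and_score,
                        generate_children, n_init=5, n_generations=30):
    # Lines 1-4 of Algorithm 1: sample the initial gene vectors x_n.
    genes = [sample_initial_gene() for _ in range(n_init)]
    best = (float("-inf"), None)
    for _ in range(n_generations):                      # "while not convergence"
        # Lines 6-9: decode each gene to a configuration C_n and score it.
        # In the paper this loop is distributed over 62 GPGPUs; it is kept
        # sequential here for brevity.
        scores = [train_and_score(decode(x)) for x in genes]
        gen_best = max(zip(scores, genes), key=lambda p: p[0])
        if gen_best[0] > best[0]:
            best = gen_best
        # Line 10: generate the child (next-generation) gene vectors from the
        # scored parents, via GA operators or CMA-ES sampling.
        genes = generate_children(genes, scores)
    return best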

2.1. Genetic algorithm

The genetic algorithm is a search heuristic motivated by the natural evolution process. It is based on 1) selection of genes according to their scores, pruning inferior gene vectors for the next iteration (generation), 2) mating pairs of gene vectors to make child gene vectors that mix the properties of the parents, and 3) mutation of a part of a gene vector to produce new gene vectors.

In the selection process, this paper uses a tournament method, which first extracts a subset of M (< N) genes {x_{n'}} (n' = 1, ..., M) from a total of N genes randomly and then selects the best gene by the score (x_{n*} = argmax_{n'} f(C_{n'})). The random subset extraction step can provide variations of genes, while the best selection step in a subset guarantees the exclusion of inferior genes.
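A minimal sketch of this tournament selection, assuming genes and their scores are stored in parallel Python lists (the function name and interface are illustrative, not from the paper):

import random

def tournament_select(genes, scores, m=4):
    # Randomly extract a subset of M genes (random subset extraction step).
    subset = random.sample(range(len(genes)), m)
    # Select the best-scoring gene within the subset.
    winner = max(subset, key=lambda i: scores[i])
    return genes[winner]

The default m=4 matches the tournament size used for gene selection in the experiments of Section 4.1.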

For the mating process, the one-point crossover method is adopted, which first finds a pair of (parent) genes from the selected genes (x^p_{n1} and x^p_{n2}) and then swaps the d+1 to D elements of these two vectors to obtain a new (child) gene vector pair (x^c_{n1} and x^c_{n2}):

$\mathbf{x}^{c}_{n_1} = \left[ x^{p}_{n_1,1} \cdots x^{p}_{n_1,d}\; x^{p}_{n_2,d+1} \cdots x^{p}_{n_2,D} \right]^{\top}$
$\mathbf{x}^{c}_{n_2} = \left[ x^{p}_{n_2,1} \cdots x^{p}_{n_2,d}\; x^{p}_{n_1,d+1} \cdots x^{p}_{n_1,D} \right]^{\top}$   (2)

where ⊤ is the matrix transpose. The position d is randomly sampled. As the iteration increases, these processes provide appropriate gene vectors that encode optimal DNN configurations.

2.2. Covariance matrix adaptation evolution strategy

CMA-ES assumes that a gene vector x is generated from a multivariate Gaussian distribution N(x|μ, Σ), where μ and Σ are the mean vector and covariance matrix, respectively. After the function evaluation in each epoch, CMA-ES updates μ and Σ using the following equation:

$\boldsymbol{\mu} \leftarrow \boldsymbol{\mu} + \varepsilon_{\mu} \sum_{n=1}^{N} w(f(C_n)) (\mathbf{x}_n - \boldsymbol{\mu})$
$\boldsymbol{\Sigma} \leftarrow \boldsymbol{\Sigma} + \varepsilon_{\Sigma} \sum_{n=1}^{N} w(f(C_n)) \left( (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^{\top} - \boldsymbol{\Sigma} \right)$,   (3)

where ε_μ and ε_Σ are step sizes, and w() is a weighting function for a score f(C_n). This procedure is based on a well-known method of parameter update based on a gradient. These parameters are explained in detail in [20]. Once the parameters are updated, new gene vectors are sampled from the Gaussian distribution, i.e., {x_n} (n = 1, ..., N) ∼ N(x|μ, Σ).
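A minimal NumPy sketch of the update in Eq. (3), assuming genes is an (N, D) array of gene vectors, weights holds the values w(f(C_n)), and the step sizes are illustrative placeholders rather than the tuned settings of [20]:

import numpy as np

def cma_es_update(mu, sigma, genes, weights, eps_mu=0.5, eps_sigma=0.2):
    diffs = genes - mu                                   # x_n - mu, shape (N, D)
    # Mean update: mu <- mu + eps_mu * sum_n w_n (x_n - mu)
    mu_new = mu + eps_mu * np.sum(weights[:, None] * diffs, axis=0)
    # Covariance update: Sigma <- Sigma + eps_Sigma * sum_n w_n ((x_n - mu)(x_n - mu)^T - Sigma)
    outer = np.einsum("n,ni,nj->ij", weights, diffs, diffs)
    sigma_new = sigma + eps_sigma * (outer - weights.sum() * sigma)
    return mu_new, sigma_new

New genes for the next generation would then be drawn with np.random.multivariate_normal(mu_new, sigma_new, size=N).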

Compared with the genetic algorithm, CMA-ES assumes a continuous gene vector and provides a simple (well-known) gene generation algorithm. However, the actual DNN configuration is represented by binary and discrete values in addition to continuous values, and the genetic algorithm is more flexible for directly considering these variables.

3. STRUCTURE OPTIMIZATION

3.1. Representation of structured DNN

When applying structure optimization to DNNs, one issue is how to maintain the computational efficiency of the DNN while allowing flexible configurations. For a DNN formed by a simple cascade of multiple layers, the computation required for training and evaluation is based on layer-wise coarse-grain matrix operations. Therefore, it is efficiently computed on a GPGPU, which is a very important property for practical speech applications. However, if arbitrary connections are allowed between neurons, the computation is small grained, and efficient computation becomes difficult. To overcome this problem as well as to make compact gene expression possible, we consider DNNs formed as a directed acyclic graph (DAG) of neuron modules. This is an extension of the standard multi-layer-structured DNN that allows arbitrary connections between layers, with the constraint that there are no directed cycles. By analogy to neuroanatomy, we refer to the neuron module as a nucleus. The right-hand side of Figure 1 shows an example. In the figure, it is referred to as a "phenotype" in contrast to the "genotype" that is explained in the next subsection.

Arbitrary feed-forward structures can be represented without limitation if we choose the nucleus to be equal to a single neuron. However, if there are a large number of small nuclei, it is difficult to perform computations efficiently on a GPGPU. Therefore, we consider nuclei with a moderate number of neurons. A DAG of nuclei allows efficient application of RBM-based pre-training. That is, nuclei are pre-trained simply one by one from the input side to the output side of the DAG structure. If a nucleus has multiple inputs, they are concatenated to form a single input vector. The transformations to obtain the inputs are always already computed because there is no directed cycle. Fine tuning by back-propagation is also easily applied.
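The pre-training order can be sketched as below, assuming (hypothetically) that parents[i] lists the nuclei feeding nucleus i and that each nucleus object exposes pretrain() and transform() methods, e.g. wrapping an RBM:

import numpy as np

def pretrain_dag(nuclei, parents, features):
    # The input nucleus (index 0) simply forwards the input features.
    outputs = {0: features}
    # Because the nuclei are numbered in topological order (see Sec. 3.2), walking
    # them by increasing index goes from the input side to the output side.
    for i in range(1, len(nuclei)):
        srcs = [j for j in parents[i] if j in outputs]
        if not srcs:
            continue                      # defunct nucleus without an input connection
        # Multiple inputs are concatenated into a single input vector.
        x = np.concatenate([outputs[j] for j in srcs], axis=1)
        nuclei[i].pretrain(x)
        outputs[i] = nuclei[i].transform(x)
    return outputs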

3.2. Genetic representation

To efficiently represent arbitrary DAG structures as genes, we utilize the fact that the nodes of a DAG can be numbered so that all the directed connections face the same direction, from a lower numbered node to a higher numbered node. This property is easily confirmed by considering the lemmas that there exists at least one node that does not have a parent, and that a graph obtained by removing such a node is also a DAG.

Once all the nodes of a DAG are numbered, the connections can be represented by a lower triangular binary matrix C where an element c_{i,j} is one if nucleus i receives a connection from nucleus j and zero otherwise. A gene representation of the graph structure is obtained by arranging the elements of C into a vector in a predefined order. If a real number is used as a gene element, as in CMA-ES, binary values are obtained, for example, by first rounding the number to an integer and then taking the residue modulo two. We assumed the maximum number of nuclei to be pre-determined. However, genes are allowed to have defunct nuclei that lack input and/or output connections, and the effective number of nuclei can change during the optimization process. The matrix in Figure 1 is the genotype of the DNN structure shown in the same figure.
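A minimal sketch of this decoding, assuming the strictly lower-triangular elements of C are stored in the gene in row-major order (the paper only requires some fixed, predefined order):

import numpy as np

def decode_structure(gene, n_nuclei):
    # Real-valued genes (as in CMA-ES) are rounded to integers and reduced
    # modulo 2; binary genes pass through unchanged.
    bits = np.mod(np.rint(np.asarray(gene, dtype=float)).astype(int), 2)
    # c[i, j] = 1 means nucleus i receives a connection from nucleus j (j < i).
    c = np.zeros((n_nuclei, n_nuclei), dtype=int)
    rows, cols = np.tril_indices(n_nuclei, k=-1)      # strictly lower triangle
    c[rows, cols] = bits[:len(rows)]
    return c

Nuclei whose row and column in the resulting matrix are all zero correspond to the defunct nuclei mentioned above.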

Without loss of generality, it can be assumed that there is only one input nucleus that directly connects to the input, which makes implementation a bit easier. For this purpose, we introduced a special "input nucleus" that simply outputs the same vector as the input. Additionally, we introduced an "extraction nucleus" that extracts a part of an input vector to allow separate treatment of sub-dimensions. In the example in Figure 1, the DNN consists of one input nucleus (nucleus id = 0), seven extraction nuclei (id = 1, 2, ..., 7), four sigmoid nuclei (id = 8, 9, 10, 11), and one output softmax nucleus (id = 12). The input to the network is a concatenation of seven frames of speech features, and the output of the seven extraction nuclei corresponds to each of the original frames.

4. EXPERIMENTS

We performed two types of experiments. The first is a preliminary experiment based on frame-wise phone recognition. An in-house phone corpus was used in which utterances were given by a single male speaker. The second is the main experiment, where continuous DP matching based spoken digit detection was performed using DNN-based features. The AURORA2 [21] corpus was used for this task.

For the training and evaluation of DNNs utilizing GPGPUs, we implemented software using the Theano Python library [22]. Experiments were performed using the TSUBAME2.5 supercomputer. In the following experiments, 62 NVIDIA K20X GPGPUs were used in parallel through the message passing interface (MPI).

4.1. Preliminary experiment using single-speaker phone recognition

4.1.1. Experimental setup

The training and development sets were speech utterances given by a male speaker that totaled 12 minutes and 3.5 minutes, respectively. The speech features were 12 MFCCs, energy, and their deltas. Segment features concatenating five frames, with 130 (= 26 x 5) dimensions in total, were used as input to the DNNs. The genes specified a DNN structure with at most four plastic nuclei that were subject to optimization in terms of sizes, learning rates, and connections. Extraction nuclei were used to represent the subvectors of a composite feature vector. The input and extraction nuclei were fixed, and their connections were not changed during the optimization. Five (= N_0) initial genes were manually prepared. The DNNs were trained to predict 19 subsets of Japanese phonemes, and frame accuracy was evaluated. Both CMA-ES and GA used real number sequences as genes in this experiment. The mapping from the real value to the nucleus sizes was given by Equation (4), and the mapping to the learning rates was given by Equation (5). For the mutation process of GA, Gaussian noise with zero mean and 0.05 standard deviation was uniformly added to the gene.

$\mathrm{Size} = \mathrm{ceil}(10^{x})$,   (4)
$\mathrm{learnrate} = \mathrm{ceil}(10^{5x}) / 100000$.   (5)
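Read this way, the two mappings are a one-line transformation each; the sketch below assumes the ceiling in Eq. (5) applies to 10^(5x), which keeps the learning rate on a 1/100000 grid within (0, 1] for x in [0, 1]:

import math

def gene_to_size(x):
    # Eq. (4): nucleus size from a real-valued gene element.
    return math.ceil(10 ** x)

def gene_to_learnrate(x):
    # Eq. (5): learning rate from a real-valued gene element (assumed reading).
    return math.ceil(10 ** (5 * x)) / 100000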

Fig. 2. Preliminary experimental results.

CMA-ES used the frame accuracy of the development set as the fitness function. GA used a weighted sum of frame accuracy and network size, normalized in each generation so that a gene with 95% normalized accuracy performance was evaluated as equal to the top one if it had a 50% normalized network size. The network size was the sum of the sizes of the affine transformation matrices in a DNN. The population size N was set to 62 for both algorithms. The tournament size M was four for the gene selection in GA.

4.1.2. Results

Figure 2 shows the best frame accuracy in each generation obtained during the optimizations. The horizontal axis is the number of generations, and the vertical axis is the frame accuracy. In the early generations, CMA-ES gave better accuracy than GA, showing its efficiency in the search. After approximately 10 iterations, both of them gave similar performance, showing good improvement from the initial generation. Thus, the experiment verified the effectiveness of both evolutionary algorithms for the structure optimization.

4.2. Spoken digit detection

4.2.1. Experimental setup

Spoken digit detection was performed based on continuous DP matching using DNN-extracted digit posterior features. The digit posterior features had 13 dimensions to represent 0 to 9, "oh", and two types of silence. The evaluation was based on frame-wise mean average precision (MAP) [23]. In DP hypotheses, the same frame may be included in multiple candidate segments starting and ending at different time frames with different distance scores. For the MAP evaluation, the minimum of the matching distances is adopted at each frame.

DNN structure optimizations by evolutionary algorithms were performed using the training set of the AURORA2 corpus. Among the training set, utterances from five male and five female speakers were used to provide keywords to detect. Another five male and five female speakers were used to provide target utterances within which to search for the keywords and to evaluate the MAP score during the optimizations, where 50 keywords and 100 utterances per keyword were randomly selected to evaluate the MAP score. The rest of the training data, from 90 speakers, were used to pretrain and fine-tune the DNN parameters. The speech features were 13 MFCCs and an energy term. Seven frames were concatenated to form a composite feature vector with 98 dimensions in total.

The genes specified a DNN structure with at most five plastic nuclei. GA used binary values to represent network connections, which was more convenient for directly specifying the mutation flip rate than using real numbers. CMA-ES requires the use of real numbers. However, because the variance is automatically adjusted in the optimization process, it does not require manually setting the flipping probability.


Fig. 1. Example of DAG of nucleus DNN and a part of a gene corresponding to the structure. This structure was obtained by the genetic algorithm at the 30th generation using the GA(MAP, Size, Sim) strategy described in Section 4.2.2. Compared to one of the most similar initial DNNs, connections from nuclei 2 to 9 and 5 to 8 were added.

Fig. 3. Spoken digit detection results using the AURORA2 corpus. Clean condition training was used.

The population size was set to 62. As the fitness function, CMA-ES used the MAP score. For GA-based optimization, the following three strategies were explored.

GA(MAP) Uses MAP score to compare two genes.

GA(MAP, Size) Uses MAP score. However, gene a wins over gene b that has a higher MAP score if the difference of the scores is less than 0.02 and the network size of gene a is less than 90% of that of gene b.

GA(MAP, Size, Sim) In addition to the GA(MAP, Size) criteria, structure similarity is considered when mating two genes by selecting the best partner among 5 candidates.

This flexibility of selecting candidates with the above advanced criteria is an advantage of GA over CMA-ES. Training and evaluating a DNN took around 30 to 60 minutes depending on the individual.
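For concreteness, the GA(MAP, Size) comparison can be written as the following check, a minimal sketch assuming MAP scores and network sizes are passed in directly (the function name is illustrative):

def beats(map_a, size_a, map_b, size_b):
    # Gene a wins outright with the higher (or equal) MAP score.
    if map_a >= map_b:
        return True
    # Otherwise a still wins if the MAP gap is below 0.02 and a's network is
    # smaller than 90% of b's size.
    return (map_b - map_a) < 0.02 and size_a < 0.9 * size_b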

4.2.2. Results

Figure 3 shows the best MAP score in each generation using the clean training condition, and Figure 4 shows the sizes of the DNNs corresponding to Figure 3. As in the preliminary experiment, CMA-ES gave better results than the GA-based optimizations in the early generations. After approximately 15 iterations, the difference between CMA-ES and GA(MAP) was small, and both of them gave improvements from the initial generation, although the scores by CMA-ES looked more stable. When the three GA strategies were compared, GA(MAP) generally gave the best MAP score and GA(MAP, Size) the worst. However, the differences were not large. There were, in contrast, large differences in their network sizes. GA(MAP, Size, Sim) and GA(MAP, Size), which consider network size in the selection process, gave much smaller network sizes than GA(MAP) and CMA-ES. GA(MAP, Size, Sim) and GA(MAP, Size) are therefore advantageous when running spoken digit detection on resource-limited hardware. The structure shown in Figure 1 is the 30th-generation DNN with the best MAP score obtained by GA(MAP, Size, Sim).

Fig. 4. Network sizes of the DNNs that gave the best MAP score in each generation.

Table 1. MAP scores on the test sets. Optimizations were performed using the multicondition training set. DNN(ini) is the result of the first generation, DNN(GA) is the result of GA at the 30th generation, and DNN(expt) is the result of tuning by a human expert.

Features     setA (clean)   setA (SNR15)   setB (SNR15)
MFCC         0.38           0.22           0.21
DNN(ini)     0.44           0.36           0.33
DNN(GA)      0.81           0.75           0.74
DNN(expt)    0.80           0.73           0.73

Table 1 shows the test set MAP scores of DNNs optimized/trained using the multicondition training set of the AURORA2 corpus. For the structure optimization, GA(MAP, Size, Sim) was used. Test set A was noise-closed (noise types seen in training), and test set B was noise-open. As can be seen, DNN-based digit posterior features were already better than MFCC in the first generation, and further gains were obtained by GA. The MAP scores given by the DNN at the 30th generation were 0.81 for the clean test data and 0.75 and 0.74 for the noisy test data of sets A and B, respectively, which were comparable to or slightly better than the tuned results by a human expert.

5. SUMMARY

A DAG of nucleus DNN has been proposed that is suited for GPGPU computation and for compact gene expression. Joint structure and parameter optimizations have been explored based on evolutionary algorithms. The experimental results showed that both CMA-ES and GA gave large improvements from the initial generation without tuning by human experts. Future work will include applying the framework to large vocabulary speech recognition.

6. ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Numbers 26280055 and 25280058. Part of this work was also supported by the Kayamori Foundation of Information Science Advancement.


7. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[2] B. Kingsbury, J. Cui, X. Cui, M. J. F. Gales, K. Knill, J. Mamou, L. Mangu, D. Nolden, M. Picheny, B. Ramabhadran, et al., "A high-performance Cantonese keyword search system," in Proc. ICASSP. IEEE, 2013, pp. 8277–8281.

[3] C. W. Chau, S. Kwong, C. K. Diu, and W. R. Fahrner, "Optimization of HMM by a genetic algorithm," in Proc. ICASSP, 1997, vol. 3, pp. 1727–1730.

[4] S. Kwong, C. W. Chau, K. F. Man, and K. S. Tang, "Optimisation of HMM topology and its model parameters by genetic algorithms," Pattern Recognition, vol. 34, no. 2, pp. 509–522, 2001.

[5] F. Yang, C. Zhang, and T. Sun, "Comparison of particle swarm optimization and genetic algorithm for HMM training," in Proc. ICPR, 2008, pp. 1–4.

[6] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proc. ICASSP, 2013, pp. 8609–8613.

[7] S. Watanabe and J. Le Roux, "Black box optimization for automatic speech recognition," in Proc. ICASSP. IEEE, 2014, pp. 3256–3260.

[8] A. R. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.

[9] L. Deng, D. Yu, and J. Platt, "Scalable stacking and learning for building deep architectures," in Proc. ICASSP. IEEE, 2012, pp. 2133–2136.

[10] H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, 2000, vol. 3, pp. 1635–1638.

[11] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, 2007, pp. 757–760.

[12] K. Shinoda and T. Watanabe, "Acoustic modeling based on the MDL criterion for speech recognition," in Proc. Eurospeech, 1997, vol. 1, pp. 99–102.

[13] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian information criterion with applications in speech recognition," in Proc. ICASSP, 1998, vol. 2, pp. 645–648.

[14] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, "Variational Bayesian estimation and clustering for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 365–381, 2004.

[15] X. Liu and M. Gales, "Automatic model complexity control using marginalized discriminative growth functions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1414–1424, 2007.

[16] T. Shinozaki, S. Furui, and T. Kawahara, "Gaussian mixture optimization based on efficient cross-validation," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 3, pp. 540–547, 2010.

[17] L. Davis et al., Handbook of Genetic Algorithms, vol. 115, Van Nostrand Reinhold, New York, 1991.

[18] N. Hansen, S. D. Muller, and P. Koumoutsakos, "Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)," Evolutionary Computation, vol. 11, no. 1, pp. 1–18, 2003.

[19] N. Hansen, A. Auger, R. Ros, S. Finck, and P. Posik, "Comparing results of 31 algorithms from the black-box optimization benchmarking BBOB-2009," in Proc. the 12th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO), 2010, pp. 1689–1696.

[20] N. Hansen, "The CMA evolution strategy: a comparing review," in Towards a New Evolutionary Computation, pp. 75–102. Springer, 2006.

[21] D. Pearce and H. Hirsch, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ISCA ITRW ASR2000, 2000, pp. 29–32.

[22] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: a CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

[23] K. Kishida, "Property of average precision and its generalization: an examination of evaluation indicator for information retrieval," Tech. Rep., National Institute of Informatics, 2005.