ASR System & LIBDNN
Yen-Chen Wu, [email protected]
NTU Speech Lab (台大語音實驗室), Summer Research Project
DNN in Speech Recognition
DNN
TIMIT Introduction
How to use libdnn
DNN IN SPEECH RECOGNITION
Speech Recognition
In speech processing:
each word consists of syllables
each syllable consists of phonemes
In each time frame, an observation (a feature vector) is mapped to a phoneme.
(syllables) TSI --I N (phonemes) S--@ (phonemes)
Observation Sequences
(Digital Speech Processing, Lect. 2.0)
Sample rate: 16000 Hz
Sliding window: 25 ms long, shifted by 10 ms
Frames of features: Frame 1, Frame 2, Frame 3, ...
DNN in Speech Recognition
Goal: predict the phoneme given the features in each time frame.
Frame-wise prediction
Input: acoustic features (MFCC, FBANK or ...)
Output: pronunciation units (phonemes or ...)
To know more about Automatic Speech Recognition (ASR), please refer to http://speech.ee.ntu.edu.tw/DSP2015Spring/
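The framing step above (25 ms window, 10 ms shift, 16000 Hz sample rate) can be sketched in a few lines of Python. This is only an illustration of how the sliding window cuts a waveform into frames; the function name `frame_signal` and the choice to drop the incomplete last window are assumptions, and no feature extraction (MFCC, FBANK) is performed here.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Cut a waveform into overlapping frames (hypothetical helper)."""
    window = sample_rate * window_ms // 1000  # 400 samples per frame
    shift = sample_rate * shift_ms // 1000    # 160 samples between frame starts
    frames = []
    start = 0
    # Assumption: a trailing window that does not fit entirely is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


# One second of silence stands in for real audio.
one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 98 400
```

With these settings, one second of audio yields 98 frames of 400 samples each, and each frame would then be turned into one feature vector.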
Training
Deep Neural Network
Main Problems
Model initialization
Feedforward
Backpropagate
Update
Predict
Model Initialization
DNN training can get stuck in a poor local optimum, so initialization matters.
In practice, unsupervised pre-training techniques exist for initialization.
However, in this homework we recommend initializing the weights randomly, for simplicity and efficiency.
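A minimal sketch of random initialization, in plain Python. The layer sizes 69, 1024 and 39 are taken from the nn-init example later in these slides; the uniform range (`scale=0.1`), the zero biases and the function names are assumptions made for illustration, not what libdnn does internally.

```python
import random


def init_layer(n_in, n_out, rng, scale=0.1):
    """Return (weights, biases) for one fully connected layer."""
    # Small random weights break the symmetry between neurons.
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out  # assumption: biases start at zero
    return weights, biases


def init_network(layer_sizes, seed=0, scale=0.1):
    """Build one (weights, biases) pair per pair of adjacent layer sizes."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [init_layer(n_in, n_out, rng, scale)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# 69 input features, one hidden layer of 1024 neurons, 39 output classes.
network = init_network([69, 1024, 39])
print(len(network), len(network[0][0]), len(network[1][0]))  # 2 1024 39
```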
Feedforward
Backpropagate
Update
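The three steps above (feedforward, backpropagate, update) can be sketched for a network with one hidden layer. The equations on the original slides are not available here, so this is a generic textbook version under stated assumptions: sigmoid hidden units, a softmax output, cross-entropy loss, and plain stochastic gradient descent. All names, sizes and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feedforward(params, x):
    W1, b1, W2, b2 = params
    h = [sigmoid(z) for z in affine(W1, b1, x)]  # hidden activations
    y = softmax(affine(W2, b2, h))               # class posteriors
    return h, y


def backpropagate(params, x, h, y, label):
    W1, b1, W2, b2 = params
    # Softmax with cross-entropy: output error is y - onehot(label).
    d2 = [yk - (1.0 if k == label else 0.0) for k, yk in enumerate(y)]
    # Hidden error: sigmoid derivative times the back-propagated output error.
    d1 = [h[j] * (1.0 - h[j]) * sum(W2[k][j] * d2[k] for k in range(len(d2)))
          for j in range(len(h))]
    gW2 = [[dk * hj for hj in h] for dk in d2]
    gW1 = [[dj * xi for xi in x] for dj in d1]
    return gW1, d1, gW2, d2  # bias gradients equal the error terms


def update(params, grads, lr):
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    for W, gW in ((W1, gW1), (W2, gW2)):
        for row, grow in zip(W, gW):
            for i, g in enumerate(grow):
                row[i] -= lr * g
    for b, gb in ((b1, gb1), (b2, gb2)):
        for i, g in enumerate(gb):
            b[i] -= lr * g


def predict(params, x):
    _, y = feedforward(params, x)
    return max(range(len(y)), key=lambda k: y[k])


def mean_loss(params, data):
    return sum(-math.log(feedforward(params, x)[1][t]) for x, t in data) / len(data)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy problem: 2 input features, 4 hidden units, 2 classes.
params = (rand_matrix(4, 2), [0.0] * 4, rand_matrix(2, 4), [0.0] * 2)
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]

loss_before = mean_loss(params, data)
for _ in range(200):
    for x, t in data:
        h, y = feedforward(params, x)
        update(params, backpropagate(params, x, h, y, t), lr=0.5)
loss_after = mean_loss(params, data)
print(loss_before > loss_after, [predict(params, x) for x, _ in data])
```

Here the update is applied after every single example (batch size 1) to keep the sketch short; a real setup would usually average gradients over a mini-batch.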
Evaluation
Framewise phoneme prediction
Frame Accuracy
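Frame accuracy is the fraction of frames whose predicted phoneme matches the reference label. A small sketch, assuming the predictions and references are given as two equal-length lists; the function name and the phoneme labels in the example are made up for illustration.

```python
def frame_accuracy(predicted, reference):
    """Fraction of frames where the predicted label equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("need at least one frame")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


# Hypothetical labels for five consecutive frames.
predicted = ["sil", "s", "s", "ih", "n"]
reference = ["sil", "s", "ih", "ih", "n"]
print(frame_accuracy(predicted, reference))  # 0.8
```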
WHY DNN?
Basic Model in Deep Learning
Feature Extraction (Representation)
Variety of Structures (CNN, RNN, LSTM, NTM, etc.)
Network Structure
How many layers?
Number of neurons in each layer
Training Parameters
Learning Rate
Batch Size
Dataset and Format
Dataset
TIMIT (Texas Instruments and Massachusetts Institute of Technology)
Well-transcribed speech of American English speakers of different sexes and dialects.
Designed for the development and evaluation of ASR systems.
Dataset
Each instance consists of 3 parts, e.g. speaker faem0, sentence si1392, the 37th frame
Data Format
WAV file: Speaker-Sentence ID + .wav
Check it by ear
ARK file: Instance ID + features
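A sketch of reading one line of such a feature file. The exact file layout is not shown on the slides, so two things here are assumptions: that the three parts of an instance ID are joined with underscores (as in `faem0_si1392_37`), and that each line holds the instance ID followed by whitespace-separated feature values. The function names are hypothetical.

```python
def parse_instance_id(instance_id):
    """Split an instance ID into (speaker, sentence, frame index)."""
    # Assumption: the three parts are joined with underscores.
    parts = instance_id.split("_")
    if len(parts) != 3:
        raise ValueError("expected speaker_sentence_frame, got %r" % instance_id)
    speaker, sentence, frame = parts
    return speaker, sentence, int(frame)


def parse_feature_line(line):
    """Parse 'instance_id f1 f2 ...' into ((speaker, sentence, frame), features)."""
    # Assumption: one instance per line, fields separated by whitespace.
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    return parse_instance_id(fields[0]), [float(v) for v in fields[1:]]


# Made-up feature values; a real line would carry the full feature vector.
key, features = parse_feature_line("faem0_si1392_37 0.5 -1.25 3.0")
print(key, features)  # ('faem0', 'si1392', 37) [0.5, -1.25, 3.0]
```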
HOW TO USE LIBDNN
LIBDNN
libdnn: C++, CUDA
Ref: (Deep and Convolutional Neural Networks for Acoustic Modeling in Large Vocabulary Continuous Speech Recognition)
LibSVM format
01
Dense format
nn-init
nn-init [train_set_file] [options]
EX: nn-init -o init.model --input-dim 69 --struct 1024 --output-dim 39
nn-train
nn-train [valid_set_file] [model_out] [options]
EX: nn-train train.dat init.model --input-dim 69
nn-predict
nn-predict [output_file] [options]
EX: nn-predict test.dat train.model --input-dim 69
shell-script
WORK STATION: [email protected]
ssh -p 2822 [email protected]
/home/wyc2010/DNN_practice
run.sh
cp /home/wyc2010/DNN_practice/run.sh
sh run.sh