ASR System & libdnn
Yen-Chen Wu, [email protected]
NTU Speech Lab, Summer Research Project

Outline
  DNN in Speech Recognition
  DNN
  TIMIT Introduction
  How to use libdnn

DNN IN SPEECH RECOGNITION

Speech Recognition
In speech processing:
  each word consists of syllables
  each syllable consists of phonemes

Each time frame yields an observation (a feature vector) that is mapped to a phoneme.
[Example figure: words split into syllables and then into phonemes; surviving phoneme labels: TSI --I N, S--@]

Observation Sequences
(Source: Digital Speech Processing, Lect. 2.0)

[Figure: a sliding window over the waveform produces frames of features (Frame 1, Frame 2, Frame 3). Labels: 25 ms, 10 ms, Sample Rate: 16000]

DNN in Speech Recognition
Goal: predict the phoneme given the features in each time frame.
Frame-wise prediction
  Input: acoustic features (MFCC, FBANK, or ...)
  Output: pronunciation units (phonemes, or ...)
To learn more about Automatic Speech Recognition (ASR), please refer to http://speech.ee.ntu.edu.tw/DSP2015Spring/
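The framing step can be sketched as follows. This is a minimal illustration that assumes the 25 ms figure is the window length and the 10 ms figure is the window shift, at the 16000 Hz sample rate given on the slide. The function name `frame_bounds` is made up for this example and is not part of any toolkit.

```python
def frame_bounds(num_samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Return (start, end) sample indices for every full analysis frame."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    bounds = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + window <= num_samples:
        bounds.append((start, start + window))
        start += shift
    return bounds


# One second of audio at 16 kHz.
bounds = frame_bounds(16000)
print(len(bounds))   # number of frames
print(bounds[:3])    # first three frames overlap by 15 ms
```

Under these assumptions, consecutive frames overlap, and a feature vector (MFCC, FBANK, ...) would be computed from each frame.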

Training
Deep Neural Network

Main Problems
  Model initialize
  Feedforward
  Backpropagate
  Update
  Predict

Model Initialize
DNN training can get stuck in a poor local optimum, so initialization matters.
In practice, unsupervised pre-training can be used to initialize the model.
However, in this homework we recommend initializing the weights randomly, for simplicity and efficiency.
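Random initialization can be sketched as below. This is an illustration in plain Python, not how libdnn does it: the uniform range `scale=0.1`, the zero biases, and the names `init_layer` and `init_model` are all assumptions. The layer sizes 69, 1024, 39 are taken from the nn-init example in these slides.

```python
import random


def init_layer(n_in, n_out, rng, scale=0.1):
    """One layer: small uniform random weights, zero biases (assumed scheme)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def init_model(layer_sizes, seed=0):
    """Build a list of (weights, biases) pairs, one per pair of adjacent layers."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# 69 input features, one hidden layer of 1024 units, 39 output phonemes.
model = init_model([69, 1024, 39])
for weights, biases in model:
    print(len(weights), "x", len(weights[0]), "weights,", len(biases), "biases")
```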

Feedforward

Backpropagate

Update
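The three steps above (feedforward, backpropagate, update) can be sketched for a single training frame as follows. The slides do not specify the activation, loss, or learning rate, so this sketch assumes one sigmoid hidden layer, a softmax output with cross-entropy loss, and plain stochastic gradient descent. All names and sizes here are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feedforward(params, x):
    W1, b1, W2, b2 = params
    h = [sigmoid(z) for z in affine(W1, b1, x)]  # hidden activations
    y = softmax(affine(W2, b2, h))               # class probabilities
    return h, y


def sgd_step(params, x, label, lr=0.5):
    """One feedforward / backpropagate / update pass; returns the loss before the update."""
    W1, b1, W2, b2 = params
    h, y = feedforward(params, x)
    # Backpropagate: for softmax + cross-entropy the output error is y - one_hot(label).
    d2 = [yk - (1.0 if k == label else 0.0) for k, yk in enumerate(y)]
    # Hidden error: (W2^T d2) times the sigmoid derivative h * (1 - h).
    d1 = [sum(W2[k][j] * d2[k] for k in range(len(d2))) * h[j] * (1.0 - h[j])
          for j in range(len(h))]
    # Update: move each parameter against its gradient.
    for k in range(len(W2)):
        for j in range(len(h)):
            W2[k][j] -= lr * d2[k] * h[j]
        b2[k] -= lr * d2[k]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] -= lr * d1[j] * x[i]
        b1[j] -= lr * d1[j]
    return -math.log(y[label])


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy network: 3 inputs, 4 hidden units, 2 classes.
params = [rand_matrix(4, 3), [0.0] * 4, rand_matrix(2, 4), [0.0] * 2]
x, label = [0.2, -0.1, 0.5], 1
losses = [sgd_step(params, x, label) for _ in range(20)]
print(losses[0], losses[-1])  # the loss should shrink on this single example
```

In real training the same step would run over mini-batches of frames rather than one example repeated.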

Evaluation
Frame-wise phoneme prediction
Frame Accuracy
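Frame accuracy is the fraction of frames whose predicted phoneme matches the reference label. A minimal sketch, with made-up phoneme labels and a hypothetical function name:

```python
def frame_accuracy(predicted, reference):
    """Fraction of frames where the predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of frames")
    if not reference:
        raise ValueError("no frames to score")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Illustrative labels only: 4 of the 5 frames agree.
pred = ["sil", "s", "s", "ih", "n"]
ref = ["sil", "s", "ih", "ih", "n"]
print(frame_accuracy(pred, ref))
```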

WHY DNN?
  Basic Model in Deep Learning
  Feature Extraction (Representation)
  Variety of Structures (CNN, RNN, LSTM, NTM, etc.)

Network Structure
  How many layers?
  Number of neurons in each layer

Training Parameter
  Learning Rate
  Batch Size

Dataset and Format


Dataset
TIMIT (Texas Instruments and Massachusetts Institute of Technology)
Well-transcribed speech from American English speakers of different sexes and dialects.
Designed for the development and evaluation of ASR systems.


Dataset
Each instance ID consists of 3 parts:
  speaker (faem0), sentence (si1392), frame index (the 37th frame)
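Splitting an instance ID into its three parts can be sketched as below. The exact ID string is not shown in the text here, so this assumes the parts are joined with underscores, as in `faem0_si1392_37`. The separator and the names `InstanceId` and `parse_instance_id` are assumptions for illustration.

```python
from typing import NamedTuple


class InstanceId(NamedTuple):
    speaker: str
    sentence: str
    frame: int


def parse_instance_id(instance_id):
    """Split 'speaker_sentence_frame' (assumed underscore-separated) into its parts."""
    speaker, sentence, frame = instance_id.split("_")
    return InstanceId(speaker, sentence, int(frame))


parsed = parse_instance_id("faem0_si1392_37")
print(parsed.speaker, parsed.sentence, parsed.frame)
```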

Data Format
WAV file: Speaker-Sentence ID + .wav
  Check it by ear
ARK file: Instance ID + features
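Reading such a feature file can be sketched as below. The on-disk layout is not spelled out here, so this assumes a plain-text file with one instance per line: the instance ID followed by whitespace-separated feature values. That layout, the sample values, and the function names are assumptions, not a description of the actual files.

```python
def parse_ark_line(line):
    """Split one text line into (instance_id, feature_vector)."""
    fields = line.split()
    return fields[0], [float(v) for v in fields[1:]]


def read_ark(lines):
    """Parse an iterable of text lines, skipping blank ones."""
    return [parse_ark_line(line) for line in lines if line.strip()]


# Made-up values, shortened to 3 features per frame for readability.
sample = [
    "faem0_si1392_1 0.5 -1.25 3.0",
    "faem0_si1392_2 0.75 -1.0 2.5",
]
for instance_id, features in read_ark(sample):
    print(instance_id, features)
```

With a real file, `read_ark(open(path))` would work the same way under the same layout assumption.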

TODO

HOW TO USE LIBDNN

LIBDNN
libdnn: C++ / CUDA

Ref: (Deep and Convolutional Neural Networks for Acoustic Modeling in Large Vocabulary Continuous Speech Recognition)

[Slide text lost in extraction; surviving terms: LibSVM, 0 / 1, (dense)]

Commands:

nn-init
  nn-init [train_set_file] [options]
  EX: nn-init -o init.model --input-dim 69 --struct 1024 --output-dim 39
nn-train
  nn-train [valid_set_file] [model_out] [options]
  EX: nn-train train.dat init.model --input-dim 69
nn-predict
  nn-predict [output_file] [options]
  EX: nn-predict test.dat train.model --input-dim 69
shell-script

WORK STATION: [email protected]

ssh -p 2822 [email protected]
/home/wyc2010/DNN_practice
run.sh
cp /home/wyc2010/DNN_practice/run.sh
sh run.sh
