Using Recurrent Neural Networks Automatic Speech Recognition, Delft University of Technology, Faculty of Information Technology and Systems, Knowledge-Based Systems. Daan Nollen, 8 January 1998


Page 1:

Using Recurrent Neural Networks
Automatic Speech Recognition

Delft University of Technology
Faculty of Information Technology and Systems
Knowledge-Based Systems

Daan Nollen

8 January 1998

Page 2:

2

"Seven-month-old babies already distinguish rules in language use"

• Abstract rules versus statistical probabilities
• Example: "meter"

• Context is very important
  - sentence (grammar)
  - word (syntax)
  - phoneme

"Er staat 30 graden op de meter" ("The meter reads 30 degrees")

"De breedte van de weg is hier 2 meter" ("The road is 2 metres wide here")

Page 3:

3

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 4:

4

Problem definition:

Create an ASR using only ANNs

• Study Recnet ASR and train the recogniser
• Design and implement an ANN workbench
• Design and implement an ANN phoneme postprocessor
• Design and implement an ANN word postprocessor

Page 5:

5

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 6:

6

Automatic Speech Recognition (ASR)

• ASR contains 4 phases

A/D conversion → Preprocessing → Recognition → Postprocessing

Page 7:

7

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 8:

8

Recnet ASR

• TIMIT speech corpus with US-American samples

• Hamming windows of 32 ms with 16 ms overlap

• Preprocessor based on the Fast Fourier Transform
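The windowing and FFT step can be sketched as follows, assuming TIMIT's 16 kHz sampling rate; the exact feature set Recnet derives from the spectra is not described on this slide, so the function names and output are illustrative:

```python
import numpy as np

def frame_and_fft(signal, rate=16000, win_ms=32, hop_ms=16):
    """Split a signal into Hamming-windowed frames and take FFT magnitudes."""
    win = int(rate * win_ms / 1000)   # 512 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)   # 256-sample hop = 16 ms overlap
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hamming(win)
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one magnitude spectrum per frame

spectra = frame_and_fft(np.random.randn(16000))  # 1 second of noise
print(spectra.shape)  # (61, 257)
```

One second of audio yields 61 overlapping frames of 257 spectral magnitudes each; a real preprocessor would reduce these to a small feature vector per frame.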

Page 9:

9

Recnet Recogniser

• u(t): external input (output of the preprocessor)
• y(t+1): external output (phoneme probabilities)
• x(t): internal input
• x(t+1): internal output

x(0) = 0

x(1) = f( u(0), x(0))

x(2) = f( u(1), x(1)) = f( u(1), f( u(0), x(0)))
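The unrolling above can be sketched with a toy Elman-style recurrent update; the layer sizes, weights, and tanh nonlinearity here are illustrative assumptions, not Recnet's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_state = 4, 8                      # toy sizes, not Recnet's dimensions
W_u = rng.normal(size=(n_state, n_in))    # input-to-state weights
W_x = rng.normal(size=(n_state, n_state)) # state-to-state (recurrent) weights

def f(u, x):
    """One recurrent step: x(t+1) = f(u(t), x(t))."""
    return np.tanh(W_u @ u + W_x @ x)

x = np.zeros(n_state)        # x(0) = 0
for t in range(3):           # unroll: x(1), x(2), x(3)
    u = rng.normal(size=n_in)
    x = f(u, x)
print(x.shape)  # (8,)
```

The loop makes the nesting on the slide concrete: each new state folds the entire input history into a fixed-size vector.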

Page 10:

10

Performance Recnet Recogniser

• Output of Recnet: a probability vector

• 65% of phonemes correctly labelled

• Network trained on a parallel nCUBE2

/eh/ = 0.1
/ih/ = 0.5
/ao/ = 0.3
/ae/ = 0.1

Page 11:

11

Training Recnet on the nCUBE2

• Per training cycle: over 78 billion floating-point operations

• Training time on a 144 MHz SUN: 700 hours (1 month)

• Processing time on the nCUBE2 (32 processors): 70 hours

Page 12:

12

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 13:

13

Recnet Postprocessor

• Perfect recogniser:
/h/ /h/ /h/ /eh/ /eh/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /ow/  →  /h/ /eh/ /l/ /ow/

• Recnet recogniser:
/h/ /k/ /h/ /eh/ /ah/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /h/
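For a perfect recogniser, the postprocessing step is just collapsing runs of repeated frame labels; a minimal sketch:

```python
from itertools import groupby

def collapse(frames):
    """Collapse consecutive identical frame labels into one phoneme each."""
    return [phn for phn, _ in groupby(frames)]

perfect = ['/h/', '/h/', '/h/', '/eh/', '/eh/', '/eh/', '/eh/',
           '/l/', '/l/', '/ow/', '/ow/', '/ow/']
print(collapse(perfect))  # ['/h/', '/eh/', '/l/', '/ow/']
```

On real Recnet output this simple rule fails: the stray /k/, /ah/, and /h/ frames above survive as spurious phonemes, which is why a smoothing postprocessor is needed.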

Page 14:

14

Hidden Markov Models (HMMs)

• Model stochastic processes

• Maximum a posteriori state sequence determined by:

• Most likely path found by the Viterbi algorithm

[Figure: two-state HMM with states S1 and S2; self-transitions a11, a22 and cross-transitions a12, a21; each state emits symbols a, b, …, n with probabilities p(a) = pa1, …, p(n) = pn1 for S1 and p(a) = pa2, …, p(n) = pn2 for S2]

Q^* = \arg\max_{Q} \prod_{t=1}^{T} \Pr(q_t \mid q_{t-1}) \, p(u_t \mid q_t)

Page 15:

15

Recnet Postprocessing HMM

• Phonemes represented by 62 states

• Transitions trained on correct TIMIT labels

• Output of the Recnet phoneme recogniser determines the output probabilities

• Very suitable for smoothing task

[Figure: two-state HMM with states /eh/ and /ih/; transitions a11, a22, a12, a21; in both states the emission probabilities p(/eh/), p(/ih/), p(/ao/) are taken from the Recnet outputs y_/eh/, y_/ih/, y_/ao/]

Page 16:

16

Scoring a Postprocessor

• Number of frames is fixed, but the number of phonemes is not

• Three types of errors:
  - insertion:    /a/ /c/ /b/ /b/ /a/ /c/ /b/
  - deletion:     /a/ /a/ /a/ /a/ /a/ ...
  - substitution: /a/ /a/ /c/ /c/ /a/ /c/

• Optimal alignment:
  /a/ /a/ /b/ /b/ /a/ /b/
  /a/ /c/ /b/ /b/ /a/ /c/ /b/
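Optimal alignment is a minimum edit distance; a sketch of the standard dynamic-programming scorer, counting insertions, deletions, and substitutions with equal weight (the thesis's exact scoring algorithm may weight them differently):

```python
def align_errors(ref, hyp):
    """Minimum insertions + deletions + substitutions turning ref into hyp."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]

ref = ['/a/', '/a/', '/b/', '/b/', '/a/', '/b/']
hyp = ['/a/', '/c/', '/b/', '/b/', '/a/', '/c/', '/b/']
print(align_errors(ref, hyp))  # 2 (one substitution, one insertion)
```

On the slide's example the best alignment costs 2 errors, matching the intuition that /c/ once replaces /a/ and once is inserted.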

Page 17:

17

Scoring Recnet Postprocessing HMM

• No postprocessing vs Recnet HMM

• Removing repetitions

• Applying scoring algorithm

             Nothing   Recnet HMM
Correct       79.9%      72.8%
Insertion     46.1%       3.7%
Deletion      17.6%      21.2%
Substitution   2.5%       6.0%
Total errors  66.2%      30.9%

Page 18:

18

Elman RNN Phoneme Postprocessor (1)

• Context helps to smooth the signal

• Input: probability vectors

• Output: the most likely phoneme

Page 19:

19

RNN Phoneme Postprocessor (2)

• Calculates a conditional probability

• Probability changes through time

\mathrm{phon}_{\text{most likely}} = \arg\max_{0 \le i \le 61} p(\mathrm{phon}_i \mid \mathrm{input}, \mathrm{context})

Page 20:

20

Training RNN Phoneme Postprocessor(1)

• Highest probability determines the region

• 35% in an incorrect region

• Training dilemma:
  - perfect data: context is not used
  - Recnet output: 35% errors in the training data

• Mixing the training set is needed

Page 21:

21

• Two mixing methods applied
  - I:  Training set = norm( α·Recnet + (1−α)·Perfect prob )
  - II: Training set = norm( Recnet with p(phn_correct) = 1 )

             Method I   Method II   Recnet HMM
Correct       74.8%      75.1%       72.8%
Insertion     23.5%      21.9%        3.7%
Deletion      21.4%      21.1%       21.2%
Substitution   3.8%       3.8%        6.0%
Total errors  48.7%      46.8%       30.9%

Training RNN Phoneme Postprocessor(2)

Page 22:

22

• Reduces insertion errors by 50%

• Unfair competition with HMM
  - Elman RNN uses only previous frames (context)
  - HMM uses preceding and successive frames

Conclusion RNN Phoneme Postprocessor

Page 23:

23

Example HMM Postprocessing

[Figure: two-state HMM with states /a/ and /b/; self-transitions 0.8, cross-transitions 0.2; initial probabilities a = 1, b = 0]

Frame 1: p(/a/) = 0.1, p(/b/) = 0.9
p(/a/, /a/) = 1·0.8·0.1 = 0.08
p(/a/, /b/) = 1·0.2·0.9 = 0.18

Frame 2: p(/a/) = 0.9, p(/b/) = 0.1
p(/a/, /a/, /a/) = 0.08·0.8·0.9 = 0.0576
p(/a/, /a/, /b/) = 0.08·0.2·0.1 = 0.0016
p(/a/, /b/, /a/) = 0.18·0.2·0.9 = 0.0324
p(/a/, /b/, /b/) = 0.18·0.8·0.1 = 0.0144
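The path probabilities above can be reproduced with a few lines of Python; the dictionaries are a toy encoding of the slide's two-state example, not Recnet's actual data structures:

```python
# Two-state HMM from the slide: states 'a', 'b'; self-transition 0.8, switch 0.2.
trans = {('a', 'a'): 0.8, ('a', 'b'): 0.2, ('b', 'a'): 0.2, ('b', 'b'): 0.8}
start = {'a': 1.0, 'b': 0.0}
obs = [{'a': 0.1, 'b': 0.9},   # frame 1 output probabilities
       {'a': 0.9, 'b': 0.1}]   # frame 2 output probabilities

def path_prob(path):
    """Probability of one state path given the observed frame probabilities."""
    p = start[path[0]]
    for prev, cur, frame in zip(path, path[1:], obs):
        p *= trans[(prev, cur)] * frame[cur]
    return p

print(round(path_prob(('a', 'a', 'a')), 4))  # 0.0576
print(round(path_prob(('a', 'b', 'a')), 4))  # 0.0324
```

Note how the smoothing works: although frame 1 strongly suggests /b/, the path staying in /a/ ends up most likely once frame 2 is seen.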

Page 24:

24

• Reduces insertion errors by 50%

• Unfair competition with HMM
  - Elman RNN uses only previous frames (context)
  - HMM uses preceding and succeeding frames

• PPRNN works in real time

Conclusion RNN Phoneme Postprocessor

Page 25:

25

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 26:

26

Word postprocessing

• Output of the phoneme postprocessor is a continuous stream of phonemes

• Segmentation is needed to convert phonemes into words

• An Elman Phoneme Prediction RNN (PPRNN) can segment the stream and correct errors

Page 27:

27

Elman Phoneme Prediction RNN (PPRNN)

• Network trained to predict the next phoneme in a continuous stream

• Squared error S_e determined by:

S_e = \sum_{i=0}^{61} (y_i - z_i)^2
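A sketch of the squared error and a thresholded boundary detector built on it; the error trace and threshold below are illustrative toy values, and the thesis's actual segmentation rule may differ:

```python
import numpy as np

def squared_error(y, z):
    """S_e = sum over output nodes of (y_i - z_i)^2."""
    return float(np.sum((np.asarray(y) - np.asarray(z)) ** 2))

def boundaries(errors, threshold):
    """Hypothesise a word boundary wherever the prediction error spikes."""
    return [i for i, e in enumerate(errors) if e > threshold]

# Toy error trace: low within a word, one spike at the word boundary.
trace = [0.2, 0.3, 0.25, 1.8, 0.3, 0.2]
print(boundaries(trace, 1.0))  # [3]
```

The idea is that within a word the next phoneme is predictable, so S_e stays small; at a word boundary the prediction fails and S_e spikes.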

Page 28:

28

Testing PPRNN“hh iy - hv ae dcl - p l ah n jh dcl d - ih n tcl t ix - dh ix - dcl d aa r kcl k - w uh dcl d z - bcl b iy aa n dcl d”

[Plot: prediction distance (0 to 3) for each phoneme in the sequence above; axes labelled "Distance" and "Phoneme"]

Page 29:

29

Performance PPRNN parser

• Two error types:
  - insertion error: "helloworld" parsed as "he-llo-world"
  - deletion error: "hello world" parsed as "helloworld"

• Performance of the parser on the complete test set:
  - parsing errors: 22.9%

Page 30:

30

Error detection using PPRNN

• The prediction forms a vector in phoneme space

• Squared error S_e to the closest cornerpoint; squared error S_{e,most likely} to the most likely phoneme

• Error indicated by:

E_{indication} = S_e / S_{e,\text{most likely}}

Page 31:

31

Error correction using PPRNN

[Figure: phoneme stream Phn(x), Phn(x+1), Phn(x+2) compared with the most likely predictions Phn_ml(x+1), Phn_ml(x+2)]

• Insertion error
• Deletion error
• Substitution error

Page 32:

32

Performance Correction PPRNN

• Error rate reduced by 8.6%

             Recnet HMM   HMM + PPRNN
Correct        72.9%        80.7%
Insertion       4.1%         3.3%
Deletion       20.9%        14.8%
Substitution    6.2%         4.5%
Total errors   31.2%        22.6%

Page 33:

33

Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations

Page 34:

34

Conclusions

Possible to create an ANN ASR

• Recnet
  - documented and trained
• Implementation of RecBench
• ANN phoneme postprocessor
  - promising performance
• ANN word postprocessor
  - parsing 80% correct
  - error rate reduced by 9%

Page 35:

35

Recommendations

• ANN phoneme postprocessor
  - different mixing techniques
  - increase the frame rate
  - more advanced scoring algorithm
• ANN word postprocessor
  - results of an increase in vocabulary
• Phoneme-to-word conversion
  - autoassociative ANN
• Hybrid Time Delay ANN / RNN

Page 36:

36

A/D Conversion

Hello world

• Sampling the signal

1000101001001100010010101100010010111011110110101101

Page 37:

37

Preprocessing

101101

010110

111101

111100

• Splitting the signal into frames

• Extracting features

1000101001001100010010101100010010111011110110101101

Page 38:

38

Recognition

101101

010110

111101

111100

• Classification of frames

101101

010110

111101

111100

H E L L

Page 39:

39

Postprocessing

• Reconstruction of the signal

101101

010110

111101

111100

H E L L

“Hello world”

Page 40:

40

Training Recnet on the nCUBE2

• Training an RNN is computationally intensive

• Trained using Backpropagation Through Time

• nCUBE2 hypercube architecture

• 32 processors used during training

[Figure: nCUBE2 hypercube topology with numbered processor nodes]

Page 41:

41

Training results nCUBE2

• Training time on the nCUBE2: 50 hours

• Performance of the trained Recnet: 65.0% correctly classified phonemes

[Plot: training curves over 60 cycles; series: alpha, energy, senergy, ncorrect, sncorrect]

Page 42:

42

Viterbi algorithm

• A posteriori most likely path

• Recursive algorithm, where O(t) is the observation at time t:

\delta_0(j) = \pi_j

\delta_t(j) = \max_{1 \le i \le N} [\, \delta_{t-1}(i) \, a_{ij} \,] \, b_j(O(t))

j

SN

...S7

S6

S5

S4

S3

S2

S1

0 1 2 3 4 ... T

Page 43:

43

Elman Character prediction RNN (2)

• Trained on words:
"Many years a boy and girl lived by the sea. They played happily"

• During training, words are picked randomly from the vocabulary

• During parsing, the squared error S_e is determined by

S_e = \sum_{i=0}^{61} (y_i - z_i)^2

with y_i the correct output of node i and z_i the actual output of node i

• S_e is expected to decrease within a word

Page 44:

44

Elman Character prediction RNN (3)