Speech Recognition
Multispeech team
Loria / Inria, Nancy, France
Information carried by speech
• Linguistic content (words)
  • Speech recognition
    • Recognition of all uttered words, or just some keywords
    • Vocal commands, speech transcription, vocal indexing, etc.
• Speaker (who speaks)
  • Speaker recognition
    • Speaker identification, or speaker authentication
    • Diarization (associating speech segments with speakers), etc.
• Language
  • Language recognition
    • Identification of the spoken language, or of the dialect, accent, etc.
• Paralinguistic information
  • Emotions
    • Neutral speech, joy, sadness, anger, etc.
  • Speaking style
    • Spontaneous vs. read speech, sports commentary, etc.
Automatic speech recognition system
• Input: audio file
• Output: transcription (text)
[Diagram: the Recognition Engine combines a Language Model, an Acoustic Model and a Phonetic Lexicon to produce the Transcription]
Acoustic models
• Hidden Markov Models (HMM)
  • A finite state automaton with N states, defined by three components: A, B, Π
    • A = [a_ij]: transition probability matrix (N×N)
    • Π = [π_i]: initial state probabilities (N)
    • B = [b_j(o)]: observation probabilities
[Diagram: HMM states with transition probabilities a_ij, initial probabilities π_i, and observation probabilities b_j(O1), b_j(O2)]
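As a quick illustration of these components, here is a toy two-state HMM scored with the standard forward algorithm; all numbers (A, B, Π) are made up for the sketch.

```python
# Toy HMM (N = 2 states) with the components A, B, Pi defined above.
# All probability values are invented for illustration.
A  = [[0.7, 0.3],          # a_ij: transition probabilities (N x N)
      [0.4, 0.6]]
Pi = [0.6, 0.4]            # pi_i: initial state probabilities
B  = [[0.9, 0.1],          # b_j(o): probability of observation o in state j
      [0.2, 0.8]]

def forward(observations):
    """Likelihood of an observation sequence, summed over all state paths."""
    alpha = [Pi[j] * B[j][observations[0]] for j in range(2)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][o]
                 for j in range(2)]
    return sum(alpha)

likelihood = forward([0, 1, 0])
```

The forward recursion sums over every possible hidden state path, which is exactly what makes decoding feasible despite the states being hidden.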
Why hidden?
• Because we see only observations
• We don't know the state sequence that produced the observation sequence
Speech modeled by an HMM
• We assume that speech is produced by a Markov system
HMM observation probability
• Two possibilities:
  • GMM (Gaussian Mixture Model): the observation probability is modeled by a mixture of M Gaussians
  • DNN (Deep Neural Network): the observation probability is modeled by a deep neural network
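A minimal sketch of the GMM option for a single HMM state, with made-up weights, means and diagonal variances (M = 2 components, 2-dimensional observations):

```python
import numpy as np

# Toy diagonal-covariance GMM with M = 2 components for one HMM state.
# Weights, means and variances are made-up numbers for illustration.
weights = np.array([0.6, 0.4])
means   = np.array([[0.0, 0.0], [2.0, 2.0]])   # one row per component
varis   = np.array([[1.0, 1.0], [0.5, 0.5]])

def gmm_likelihood(x):
    """b_j(x): weighted sum of M Gaussian densities."""
    diff = x - means
    norm = np.prod(2 * np.pi * varis, axis=1) ** -0.5
    dens = norm * np.exp(-0.5 * np.sum(diff ** 2 / varis, axis=1))
    return float(np.dot(weights, dens))

p = gmm_likelihood(np.array([0.0, 0.0]))
```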
Deep Neural Network (DNN)
• A DNN is defined by three types of parameters:
  • The interconnection pattern between the different layers of neurons
  • The training process for updating the weights w_i of the interconnections
  • The activation function f that converts a neuron's weighted input to its output activation
Architecture example of a DNN acoustic model
• MLP (Multi-Layer Perceptron)
  • 6 hidden layers
  • 2048 neurons in each hidden layer
  • Input: size of the acoustic parameter vector (39)
  • Output: number of HMM states (4048 context-dependent phone states)
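A sketch of this MLP's forward pass in numpy: the hidden layers are scaled down (64 units instead of 2048) to keep the example light, but the 39-dim input, 6 hidden layers and 4048-state softmax output match the slide; the weights are random, not trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down sketch of the MLP acoustic model: the slide uses 6 hidden
# layers of 2048 neurons; 64-unit layers are used here for brevity.
sizes = [39, 64, 64, 64, 64, 64, 64, 4048]   # input, 6 hidden, output
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward_mlp(x):
    """Map a 39-dim acoustic frame to a distribution over HMM states."""
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)       # ReLU hidden activation
    W, b = layers[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max())        # softmax over the states
    return e / e.sum()

probs = forward_mlp(rng.standard_normal(39))
```

In decoding, these state posteriors are converted to scaled likelihoods and plugged into the HMM in place of the GMM densities.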
TDNN – Time Delay Neural Network [Peddinti et al., 2015]

[Peddinti et al., 2015] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey, S. Khudanpur. JHU ASpIRE system: Robust LVCSR with TDNNs, i-vector adaptation and RNN-LMs. Proc. IEEE ASRU, 2015.
Automatic speech recognition system
• Input: audio file
• Output: transcription (text)
[Diagram: the Recognition Engine combines a Language Model, an Acoustic Model and a Phonetic Lexicon to produce the Transcription]
Language model
• Computes the probability of a word given the previous words
• Two possibilities:
  • N-gram
  • Recurrent Neural Network (RNN)
N-gram
• An n-gram model gives the probability of a word w_i given the n−1 previous words:
  P(w_i | w_{i−n+1}, …, w_{i−1})
• Advantages
  • Easy to compute
  • Rare events are taken into account
• Drawbacks
  • Only 3-grams or 4-grams can be estimated (short-term dependency)
  • No generalization
    • In the training corpus: "a blue car", "a red Ferrari". The probability of "a blue Ferrari" (never seen) will be badly estimated.
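The generalization drawback can be reproduced with maximum-likelihood bigram counts on the slide's toy corpus (lowercased here for simplicity):

```python
from collections import Counter

# Tiny corpus illustrating the generalization problem from the slide.
corpus = "a blue car a red ferrari".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w, prev):
    """Maximum-likelihood bigram estimate P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

seen = p_bigram("car", "blue")        # "blue car" observed in the corpus
unseen = p_bigram("ferrari", "blue")  # "blue ferrari" never seen -> 0
```

The unseen but perfectly plausible bigram gets probability zero; smoothing mitigates this, but the count-based model still cannot exploit the similarity between "car" and "Ferrari".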
Recurrent Neural Network Language Model (RNNLM)
• Continuous space representation (word embeddings)
  • A neural network projects words into a continuous space
• Takes into account the temporal structure of language (word sequences)
  • Recurrent Neural Networks [Chen et al., 2015]

[Chen et al., 2015] Xie Chen, Xunying Liu, Mark J. F. Gales, and Philip C. Woodland. Improving the training and evaluation efficiency of recurrent neural network language models. Proc. ICASSP, 2015.
Some properties of word embeddings
• Vector relations between words (e.g. king − man + woman ≈ queen)
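A toy illustration of such vector relations, with made-up 3-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dim "embeddings", chosen by hand just to illustrate the
# vector-relation idea; real embeddings are learned, not hand-crafted.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "car":   np.array([0.8, 0.05, 0.05]),
}

def closest(v, exclude):
    """Word whose embedding has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

analogy = closest(emb["king"] - emb["man"] + emb["woman"],
                  exclude={"king", "man", "woman"})
```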
Long-term dependency
• To take into account long-term dependencies in a sentence
  • Ex.: "les étudiantes inscrites à la conférence ACL sont arrivées" (French: "the students registered for the ACL conference have arrived")
    • Short-term dependency: bigram (the agreement "étudiantes" → "inscrites")
    • Long-term dependency: 8-gram (the agreement "étudiantes" → "arrivées")
• Add a memory mechanism
  • Long Short-Term Memory (LSTM)

[Kumar et al., 2017] S. Kumar, M. A. Nirschl, D. Holtmann-Rice, H. Liao, A. Theertha Suresh, and F. Yu. Lattice rescoring strategies for long short term memory language models in speech recognition. ASRU Workshop, 2017.
[Li et al., 2020] K. Li, Z. Liu, T. He, H. Huang, F. Peng, D. Povey, S. Khudanpur. An Empirical Study of Transformer-Based Neural Language Model Adaptation. ICASSP, 2020.
Language model for speech recognition
• Combination of LMs
  • Advantage of the N-gram: rare events are taken into account
  • Advantage of the RNNLM: generalization capacity
  P(w|h) = λ · P_ngram(w|h) + (1 − λ) · P_RNNLM(w|h)
• The improvement in WER is about 20% relative [Sundermeyer & Ney 2015]

[Sundermeyer & Ney 2015] M. Sundermeyer, H. Ney, R. Schlüter. From Feedforward to Recurrent LSTM Neural Networks for Language Modeling. IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 3, March 2015.
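The interpolation formula is straightforward to sketch; the component probabilities below are made-up numbers, not outputs of real models:

```python
# Linear interpolation of an n-gram LM with an RNNLM:
#   P(w|h) = lambda * P_ngram(w|h) + (1 - lambda) * P_RNNLM(w|h)
def interpolate(p_ngram, p_rnnlm, lam=0.5):
    """Combined LM probability for one word given its history h."""
    return lam * p_ngram + (1.0 - lam) * p_rnnlm

# An event unseen by the n-gram but plausible for the RNNLM still
# receives a nonzero combined probability.
p = interpolate(p_ngram=0.0, p_rnnlm=0.01, lam=0.5)
```

The weight λ is typically tuned on held-out data to minimize perplexity.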
Automatic speech recognition system
• Input: audio file
• Output: transcription (text)
[Diagram: the Recognition Engine combines a Language Model, an Acoustic Model and a Phonetic Lexicon to produce the Transcription]
Phonetic lexicon
• Examples in English (from cmudict):
  soften     S AA F AH N        sorbet     S AO R B EY
  soften(2)  S AO F AH N        sorbet(2)  S AO R B EH T
• The lexicon specifies the list of words known by the ASR system [Sheikh 2016]
  • An ASR system cannot recognize words that are not in the lexicon
  • It is impossible to have a lexicon covering all possible words (because of person names, company names, product names, etc.)
  • Diachronic evolution of vocabularies (due to new topics, new persons, etc.)
• The lexicon also specifies the possible pronunciations of the words
  • It must include the usual pronunciation variants
  • But one should not include too many useless variants, as this increases possible confusions between vocabulary words
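A minimal sketch of such a lexicon as a lookup table, built from the cmudict-style entries above; out-of-vocabulary words simply have no entry:

```python
# Minimal cmudict-style lexicon: each word maps to one or more
# pronunciation variants (entries copied from the examples above).
raw = """\
soften S AA F AH N
soften(2) S AO F AH N
sorbet S AO R B EY
sorbet(2) S AO R B EH T
"""

lexicon = {}
for line in raw.splitlines():
    word, *phones = line.split()
    word = word.split("(")[0]            # strip the variant marker "(2)"
    lexicon.setdefault(word, []).append(phones)

variants = lexicon["sorbet"]             # both pronunciations of "sorbet"
```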
[Sheikh 2016] I. Sheikh. Exploiting Semantic and Topic Context to Improve Recognition of Proper Names in Diachronic Audio Documents. PhD thesis, Université de Lorraine, 2016.
Speech recognition errors
• Insertion / Deletion / Substitution
  • Ref.:  I want to go to Paris
  • Reco.: well I want to go Lannion
• WER: Word Error Rate
  WER = (Insertions + Deletions + Substitutions) / Number of reference words
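The WER can be computed with a standard word-level edit distance; a minimal sketch on the example above:

```python
# WER via a standard Levenshtein edit distance over word sequences.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

# 1 insertion ("well"), 1 deletion ("to"), 1 substitution (Paris -> Lannion)
# over 6 reference words: WER = 3/6 = 50%
rate = wer("I want to go to Paris", "well I want to go Lannion")
```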
Experimental evaluation
• Kaldi-based Transcription System (KATS) [Povey et al., 2011]
• Segmentation and diarization
  • Splits and classifies the audio signal into homogeneous segments
    • Non-speech segments (music and silence)
    • Telephone speech
    • Studio speech
• Parametrization [MFCC]
  • 13 MFCC + 13 Δ + 13 ΔΔ
    • 39-dimension observation vector
[Povey et al., 2011] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely. The Kaldi Speech Recognition Toolkit. ASRU, 2011.
[MFCC] https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
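A sketch of how the 39-dimension observation vector is assembled from the 13 MFCCs: the Δ here is a simple two-point difference (toolkits such as Kaldi use a regression over a wider window), and the MFCC values are fake.

```python
# Append delta and delta-delta coefficients to 13-dim MFCC frames.
def deltas(frames):
    """Two-point difference per coefficient, edges clamped to the frame."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

mfcc = [[float(t + c) for c in range(13)] for t in range(5)]  # fake MFCCs
d = deltas(mfcc)        # 13 delta coefficients per frame
dd = deltas(d)          # 13 delta-delta coefficients per frame
obs = [m + a + b for m, a, b in zip(mfcc, d, dd)]  # 39-dim vectors
```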
Corpus
• Training and test data from the radio broadcast news corpus (ESTER project [Gravier et al., 2004])
• Training: 250 hours of manually transcribed shows from
  • France Inter
  • Radio France International
  • TVME Morocco
• Evaluation: 4 hours of speech

[Gravier et al., 2004] G. Gravier, J.-F. Bonastre, E. Geoffrois, S. Galliano, K. Tait, K. Choukri. The ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News. 2004.
Results (Word Error Rate)
[Table of WER results not reproduced; the confidence interval is ±0.4%]
• The DNN-based system outperforms the GMM-based system
  • The WER difference is 5.3% absolute, and 24% relative
  • The improvement is statistically significant
• DNN-based acoustic models achieve better classification and have better generalization ability
Human vs. machine
• Speech corpus: telephone conversations (Microsoft, 2017)
• The results obtained with a combination of many speech recognition systems are similar to those of professional transcribers

Word Error Rates                          Switchboard   CallHome
Professional transcribers                 5.9%          11.3%
Automatic speech recognition              5.8%          11.0%
(combination of many NN-based systems,
trained on large data sets)
Conclusion
• Since 2012, excellent results of DNNs in many domains:
  • image recognition, speech recognition, language modelling, parsing, information retrieval, speech synthesis, translation, autonomous cars, gaming, etc.
• The DNN technology is now mature enough to be integrated into products.
• Nowadays, the main commercial recognition systems (Microsoft Cortana, Apple Siri, Google Now and Amazon Alexa) are based on DNNs.
Conclusion
• But performance still degrades in adverse conditions, such as:
  • High-level noise
  • Hands-free distant microphones (reverberation problems)
  • Accents (non-native speech)
• Limited vocabulary (even if very large, there is still the problem of person names, location names, etc.)
• Still far from a universal recognition system, as powerful as a human listener in all conditions
• But performance continues to improve…
Deep Neural Networks and speech recognition

Advantages
• Stunning performance
  • Revolutionized the state-of-the-art results
  • Lots of applications
• No hypothesis on the input data
• Scalability with corpus size
• Generalization
  • Good performance on unseen data
• End-to-end systems [Hadian et al., 2018]
  • No need to define features
• Lots of easy-to-use toolkits
  • With many examples

Drawbacks
• Black box
• Hyperparameter tuning
• Huge training data needed
• Supervised training
  • Labelled training data needed
• Computationally intensive
  • Training requires GPUs or a cluster
[Hadian et al., 2018] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur. End-to-end speech recognition using lattice-free MMI. Proc. Interspeech, 2018, pp. 12–16.
Continuous speech recognition
• Efficient algorithms and tools exist for building and optimizing models from data
  • Language model ← text corpora
  • Acoustic model ← speech corpora (with associated transcriptions)
• But many choices have to be made by the ASR system developer
  • Type of acoustic features, size of temporal windows, etc.
  • Acoustic model structure
    • Number of states / of densities / of Gaussian components per density
    • Or: type of neural network, number of layers, size of layers
• Some trade-offs are necessary
  • Few parameters → rough modeling but reliable estimation
  • Many parameters → detailed modeling, but estimation may be unreliable
• Training on speech data leads to good recognition performance on similar speech data (but performance degrades in different/new conditions)
LSTM (Long Short-Term Memory)
[Diagram: an LSTM network unrolled over time; at each step t the cell takes the input w(t), the previous output y(t−1) and the previous cell state c(t−1), and produces the output y(t) and cell state c(t)]
Long Short-Term Memory (LSTM)
[Diagram: a memory cell c(t) controlled by an input gate, a forget gate and an output gate; inputs w(t) and y(t−1), outputs y(t) and c(t)]
A complex structure including:
• a "forget gate", which defines how much recurrent information (from the past frame) should be kept
• an "input gate", which defines the new contribution (from the current time frame)
• an "output gate", which defines the output contribution of this cell
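A single LSTM step with these three gates can be sketched in numpy; the sizes are made up and the weights are random, untrained (biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One LSTM step on made-up sizes: 4-dim input w(t), 3-dim output y(t)
# and memory cell c(t).  Weights are random and untrained.
n_in, n_out = 4, 3
W = {g: rng.standard_normal((n_in + n_out, n_out)) * 0.1
     for g in ("input", "forget", "output", "cell")}

def lstm_step(w_t, y_prev, c_prev):
    x = np.concatenate([w_t, y_prev])
    i = sigmoid(x @ W["input"])     # how much new contribution enters
    f = sigmoid(x @ W["forget"])    # how much past memory is kept
    o = sigmoid(x @ W["output"])    # how much of the cell is output
    c_t = f * c_prev + i * np.tanh(x @ W["cell"])
    return o * np.tanh(c_t), c_t

y, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_out), np.zeros(n_out))
```

Because the cell state c(t) is carried forward additively, gradients can flow across many time steps, which is what lets the LSTM capture the long-term dependencies discussed above.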