Post on 27-Dec-2015

NGSST 2006 Winter Workshop (冬季講習會)

Automatic Language Identification

Overview & Some Experiments on OGI-TS Corpus

National Tsing Hua University

Chi-Yueh Lin

2006/1/19

Introduction to LID

Language Identification (LID) applications:
  Pre-processing for machine systems
  Pre-processing for human listeners

Some authors prefer another abbreviation, "ALI", which stands for "Automatic Language Identification".

Introduction to LID: Pre-processing for machine systems

Example: a multilingual information retrieval system in a hotel lobby or an international airport.

[Figure: speech in an unknown language enters a Language Identification System, which selects one of several recognizers (English ASR, French ASR, Spanish ASR, Mandarin ASR); the system then returns information in English, French, Spanish, or Mandarin.]

Introduction to LID: Pre-processing for human listeners

AT&T Language Line was designed for handling emergency calls.

Identifying the caller's language introduces a delay on the order of minutes.

Introduction to LID: AT&T Language Line

http://www.languageline.com

The service uses trained human interpreters to handle about 150 languages.

It takes a delay of about 3 minutes to correctly identify "Tamil".

Introduction to LID: Human Perceptual Experiment

From “Reviewing Automatic Language Identification”, IEEE Signal Processing Magazine, Oct. 1994.

Introduction to LID: Human Perceptual Experiment

Comments from the post-experiment interview:
  Listeners used phoneme-spotting and word-spotting strategies.
  Listeners used prosodic cues.
  Performance improved with increased exposure to each language.

Introduction to LID: Papers found in IEEE Xplore

Keyword: "language identification"

Years           # of papers
Before 1980     None
1980~1989       5
1990~1999       50+ (golden age of LID)
2000~present    40+

ICASSP 2006: 6 papers

Introduction to LID: Research before 1980

Research on LID before 1980 was primarily done at Texas Instruments.
  1973~1980 (4 papers)
  Reference template

House and Neuberg (1977 JASA)
  HMM trained on sequences of broad phonetic category labels
  Near-perfect discrimination
  No real speech data

Language identification cues

Phonology
  Phone and phoneme sets differ from one language to another.
  Phone and phoneme frequencies of occurrence may also differ.
  Phonotactics.

Prosody
  Duration, pitch, and stress.

Language identification cues

Morphology
  Word roots
  Lexicon

Syntax
  Sentence patterns differ among languages.

Language identification cues

Phonology and prosody: most recent LID systems use these two kinds of cues.

Morphology and syntax: these cues are seldom used.

Language Identification System

LID systems

Systems vary primarily according to their method for modeling languages:
  Spectral-similarity approaches
  Prosody-based approaches
  Phone-recognition approaches
  Using multilingual speech units
  Word-level approaches
  Continuous speech recognition

LID systems: System conditions

Content-independent
Speaker-independent

LID systems: Spectral-similarity approaches

These were the earliest automatic LID systems.
They use conventional spectral or cepstral feature vectors.

LID systems: Spectral-similarity approaches

Cimarusti and Ives (1982 ICASSP)
  Read speech
  5 speakers, 8 languages
  100-dim feature vector: 15 area functions, 15 autocorrelation coefficients, 5 bandwidths, 15 cepstral coefficients, 15 filter coefficients, 5 formant frequencies, 15 log area ratios, and 15 reflection coefficients.

LID systems: Spectral-similarity approaches

Foil (1986 ICASSP)
  Noisy radio signals (~5 dB)
  3 languages
  Information from pitch, energy, and formants
  45-dim feature vector: 23-dim from energy, 22-dim from pitch
  VQ codebook (10 clusters) for formants

LID systems: Spectral-similarity approaches

Goodman et al. (1989 ICASSP)
  Improved version of Foil's work (~9 dB)
  6 languages
  The formant-cluster algorithm used an LPC-12 autocorrelation analysis.
  The parameters used were log-amplitude values A1, A2, A3 and formant values F1, F2, F3.
  The formant-based method is superior to the LPCC-based method.

LID systems: Spectral-similarity approaches

Sugiyama (1991 ICASSP)
  20 languages
  VQ-based approach:
    Standard VQ algorithm
    VQ histogram algorithm (common codebook)
  Features: autocorrelation coefficients, LPC coefficients, delta-cepstrum coefficients.

LID systems: Spectral-similarity approaches

Zissman (1993 ICASSP) applied GMMs to the LID task.

[Table legend: C = cepstrum, D = delta-cepstrum]

LID systems: Prosody-based approaches

Savic (1991 ICASSP)
  Pitch information is useful for discriminating Spanish from Mandarin.

Humans can use prosodic features (Muthusamy, 1994 ICASSP):
  Tonal languages (Mandarin, Vietnamese)
  Speech rate (Spanish)

LID systems: Prosody-based approaches

Itahashi (1994 ICSLP) argues that pitch estimation is more robust in noisy environments.
  Based on fundamental frequency, 21 features in total.
  Polygonal line approximation of the F0 pattern.
  PCA is used to perform discriminant analysis.

LID systems: Prosody-based approaches

Thyme-Gobbel (1996 ICSLP)
  Syllable-based pitch contour
  Syllable duration
  Amplitude
  Rhythm
  Phrase location
  Pitch is the most discriminative feature.

LID systems: Prosody-based approaches

Ramus (1999 JASA)
  A study based on speech resynthesis.
  Global intonation (aaaa, sasasa)
  Syllabic rhythm (sasasa, flat sasasa)
  Broad phonotactics (saltanaj)

LID systems: Prosody-based approaches

Rouas (2003 ICASSP, 2005 Speech Comm.)
  Rhythmic parameters:
    Duration of consonant and vowel
    Complexity of the CV segment
  Fundamental frequency parameters:
    Skewness and kurtosis of F0
    Accent location

LID systems: Prosody-based approaches

Rouas (2005 Eurospeech)
  Long-term and short-term prosody modeling.
  N-gram model.
  Long-term: prosodic movements over several pseudo-syllables.
  Short-term: prosodic movements inside a pseudo-syllable.

LID systems: Prosody-based approaches

Lin (2005, 2006 ICASSP)
  Pseudo-syllable segmentation
  Pitch contours are represented by a set of Legendre polynomials
  Dynamic model instead of static model

LID systems: Prosody-based approaches

However, Hazen (1993) showed that features derived from prosodic information provided little language discriminability when compared to a phonetic system.

The performance of approaches based on prosodic information degrades in the N-way identification task as N becomes large.

LID systems: Prosody-based approaches

Advantages of a prosody-based system:
  Robust to channel effects and noise.
  Requires few transcriptions and little training data.

LID systems: Phone-recognition approaches

Different languages have different phone inventories and different phonotactics.

Zissman (1994 ICASSP)
  PRLM
  P-PRLM

LID systems: Phone-recognition approaches

Phone recognition followed by language modeling (PRLM)

N-gram probability distributions are trained from the output of the single-language phone recognizer, not from human-supplied labels.
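The PRLM idea above can be illustrated with a minimal sketch. It assumes the phone recognizer has already turned each utterance into a list of phone labels; the toy sequences, the language names `lang1`/`lang2`, and the add-k smoothing are assumptions made for illustration and are not taken from the slides.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab, k=1.0):
    """Add-k smoothed bigram log-probabilities from decoded phone sequences."""
    pair_counts, ctx_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            ctx_counts[prev] += 1
    size = len(vocab)
    return {
        (p, c): math.log((pair_counts[(p, c)] + k) / (ctx_counts[p] + k * size))
        for p in vocab for c in vocab
    }

def score(seq, lm):
    """Phonotactic log-score of one decoded phone sequence under one language model."""
    return sum(lm[(p, c)] for p, c in zip(seq, seq[1:]))

def identify(seq, models):
    """Pick the language whose bigram model gives the highest score."""
    return max(models, key=lambda lang: score(seq, models[lang]))

# Hypothetical recognizer output: one phone-label list per training utterance.
vocab = ["a", "b", "c"]
models = {
    "lang1": train_bigram([list("ababab"), list("abab")], vocab),
    "lang2": train_bigram([list("acacac"), list("caca")], vocab),
}
print(identify(list("abab"), models))
```

The same front end is shared by every language; only the n-gram tables differ, which is why no human-supplied labels in the target languages are needed.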

LID systems: Phone-recognition approaches

Parallel PRLM (PPRLM, an extension of PRLM)

LID systems: Phone-recognition approaches

PPRLM tries to incorporate phones from more than one language into a PRLM-like system.
  The only limitation is the number of languages for which labeled training speech is available.

It achieves the best performance among all methods in the LID task.

LID systems: Phone-recognition approaches

Yan (ICASSP 1995)
  Forward bigram
  Backward bigram
  Combination

LID systems: Phone-recognition approaches

Torres-Carrasquillo (2002, ICASSP & ICSLP)
  Variation of a PRLM-like system.
  Uses a GMM tokenizer instead of a phone recognizer as front-end processing.
  Language models are trained on the values of the "token index".
  Shifted delta cepstral features.
  Does not need any transcription.

LID systems: Phone-recognition approaches

[Figure: GMM tokenizer example. Feature vector Xn is represented by token index 2.]

Token sequence: 2221321113323111123213213…

Apply language model
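As a rough sketch of the tokenization step, each feature vector can be replaced by the index of the best-scoring Gaussian component. The diagonal covariances, the three toy components, and the 1-based indexing are assumptions made for illustration; the slides do not specify these details.

```python
import numpy as np

def tokenize(frames, weights, means, variances):
    """Return, per frame, the 1-based index of the highest-scoring component
    of a diagonal-covariance GMM."""
    diff = frames[:, None, :] - means[None, :, :]            # shape (T, M, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)                 # Mahalanobis term, (T, M)
        + np.sum(np.log(2 * np.pi * variances), axis=1)       # normalizer, (M,)
    )
    return np.argmax(np.log(weights) + log_gauss, axis=1) + 1

# Hypothetical 3-component tokenizer over 2-dim features.
weights = np.array([1 / 3, 1 / 3, 1 / 3])
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
variances = np.ones((3, 2))
frames = np.array([[4.8, 5.1], [0.2, -0.1], [-5.0, 4.9]])

tokens = tokenize(frames, weights, means, variances)
print(tokens)
```

The resulting integer sequence plays the role of the phone sequence in PRLM, so the same n-gram language models can be applied to it.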

LID systems: Phone-recognition approaches

To make phone-recognition-based LID systems easier to train, one can use a single-language phone recognizer as a front end to a system that uses phonotactic scores to perform LID.

Language ID can be performed successfully even when the front-end phone recognizers were not trained on speech spoken in the languages to be recognized.

LID systems: Using multilingual speech units

Focus on the problem of identifying and processing only those phones that carry the most language-discriminating information.

Mono-phonemes
  Phonemes whose acoustic realizations in one language overlap little or not at all with those in another language.

Poly-phonemes
  Phonemes whose acoustic realizations are similar enough across many languages.

LID systems: Using multilingual speech units

Dalsgaard (ICSLP 1994)
  Four European languages: Danish, English, German, Italian
  134 phoneme models

[Figure: the phoneme models are divided into language-specific mono-phoneme sets (labeled D, UK, G, I) and a shared set of poly-phonemes.]

LID systems: Using multilingual speech units

Berkling (1994 ICASSP)
  3 languages (English, German, Japanese)

Label    Ratio    Language with larger frequency of occurrence
f        (1.3)    GE

LID systems: Using multilingual speech units

Köhler (1998)
  Single multi-language (6 languages) front-end phone recognizer.
  Features: 24 mel-scaled cepstral, 12 delta cepstral, 12 delta-delta cepstral, energy, delta energy, delta-delta energy.
  Feature vectors were transformed by LDA.
  Monophones -> multilingual phones

LID systems: Word-level approaches

These systems use more sophisticated sequence modeling than the phonotactic models of the phone-level systems, but do not employ full speech-to-text systems.

LID systems: Word-level approaches

Kadambe and Hieronymus (1995)
  Trigram phonotactics & lexicon matching
  4 languages

LID systems: Word-level approaches

Ramesh and Roe (1994)
  Use of embedded word models of frequently occurring words and phrases.
  Multiple-mixture left-to-right CDHMM, LPC-cepstrum-based features.

LID systems: Word-level approaches

Lund and Gish (1995 Eurospeech)
  Pseudo-word Language Model (PWLM)
  Pseudo-words are the frequently occurring sub-sequences within the phoneme recognition output.
  Finding pseudo-word candidates is a time-consuming task.

LID systems: Word-level approaches

Gao (2005 Eurospeech)
  Applied techniques from document retrieval:
    Spoken document categorization
    Latent semantic indexing

LID systems: Continuous speech recognition

Several large-vocabulary continuous-speech recognition systems are used in parallel for language ID.
  The architecture is similar to PRLM and PPRLM.
  During testing, the recognizers run in parallel, and the one yielding the output with the highest likelihood is selected as the winning recognizer.

This is sometimes called parallel phone recognition (PPR).

LID systems: Continuous speech recognition

Biased scores problem
  Recognizer-dependent bias

LID systems: Continuous speech recognition

Lamel (1994 ICASSP)
  English & French
  46 CI phone models for English
  35 CI phone models for French
  99% for laboratory read speech on 2 s utterances.
  76% for telephone spontaneous speech on 2 s utterances.

LID systems: Continuous speech recognition

Mendoza (1996 ICASSP)
  English, Japanese, Spanish
  Bias removal via a "Score - Best Score" strategy.
    Score: score from the conventional recognizer
    Best Score: score from the raw acoustic match
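A minimal sketch of the "Score - Best Score" normalization, assuming each language's recognizer returns one log-score from constrained decoding and one from an unconstrained (raw acoustic) match. All numbers below are invented for illustration and are not results from the cited paper.

```python
def debias(scores, best_scores):
    """Subtract each recognizer's raw acoustic-match score from its decoding score."""
    return {lang: scores[lang] - best_scores[lang] for lang in scores}

# Hypothetical log-scores for one test utterance.
scores = {"EN": -1200.0, "JA": -1150.0, "SP": -1300.0}        # conventional recognizer
best_scores = {"EN": -1100.0, "JA": -1000.0, "SP": -1250.0}   # raw acoustic match

raw_pick = max(scores, key=scores.get)
normalized = debias(scores, best_scores)
debiased_pick = max(normalized, key=normalized.get)
print(raw_pick, debiased_pick)
```

In this toy case the Japanese recognizer produces systematically higher scores, so it wins on raw scores; after subtracting its own best achievable score, the Spanish recognizer is the closest to its acoustic ceiling and is selected instead.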

LID systems: Continuous speech recognition

Schultz (1996 ICASSP)
  4 language-dependent LVCSR systems run in parallel.
  German, Japanese, English, Spanish
  Acoustic, phonotactic rule, lexicon, grammar

LID systems: Continuous speech recognition

Needs language-dependent labels for each language.
More difficult to implement than any of the other systems.

LID systems: Multiple Systems Fusion

Statistical fusion strategies (very common)
  GMM
  Neural network

Parris (1995 ICASSP)
  Logistic function

Gutierrez (2003 ICASSP)
  Performance Confidence Index
  Dempster-Shafer Theory of Evidence

Corpus for Language Identification Task

Corpus for Language Identification

In the early years, no corpus was collected specifically for the language identification task.

Experiments were conducted on small amounts of data.

Things have changed since 1994.

Corpus for Language Identification

Corpora available from the Linguistic Data Consortium:
  OGI-TS: 10 languages (1994)
  CallFriend: 12 languages (1996)
  CallHome: 6 languages (1997)
  CSLU: 22 languages (2005)

OGI-TS Corpus

Oregon Graduate Institute Multi-Language Telephone Speech Corpus
  Collected by Yeshwant Muthusamy.
  10 languages: English, Farsi, French, German, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. (Hindi was added afterward.)
  90 calls for each language:
    50 calls in the Training Set
    20 calls in the Development Set
    20 calls in the Evaluation Set

OGI-TS Corpus

Corpus used for the NIST LID evaluation in 1996.

Initial label: 7 broad phonetic categories
  Vowel, Fricative, Silence or Closure, Stop, pre-vocalic sonorant, inter-vocalic sonorant, post-vocalic sonorant.

OGI-TS Corpus

Three types of utterances:
  fixed, useful vocabulary speech
  domain-specific vocabulary speech
  unrestricted vocabulary speech

Three durations of utterances:
  3 sec
  10 sec
  45 sec

OGI-TS Corpus: Contents of file

nlg - native language (3 sec)
clg - common language (3 sec)
dow - days of the week (10 sec)
num - numbers 0 through 10 (10 sec)
htl - hometown likes (10 sec)
htc - hometown climate (10 sec)
roo - room description (10 sec)
mea - description of most recent meal (10 sec)
stb - free speech before the tone (45 sec)
sta - free speech after the tone (10 sec)

OGI-TS Corpus

For more information about this corpus, refer to Muthusamy's Ph.D. dissertation:
  Y. K. Muthusamy, "A Segmental Approach to Automatic Language Identification," Ph.D. Thesis, OGI Technical Report No. CSLU 93-002, Nov. 24, 1993.

Muthusamy's work on OGI-TS

Broad phonetic categories
PLP spectral features
Neural-network-based broad phonetic segmentation algorithm

Muthusamy’s work on OGI-TS

Pair-wise LID

From Muthusamy’s dissertation

Muthusamy’s work on OGI-TS

From Muthusamy’s dissertation

Some Experiments on OGI-TS Corpus

System

Prosody-based system
  Pitch information
  Duration information
  Static modeling & dynamic modeling

Phone-recognition system
  PRLM (front-end recognizer: English)
  GMM-Tokenizer

Prosody System

Identify languages mainly based on prosodic cues.

Rhythmic categories:
  Stress-timed languages (Morse code)
  Syllable-timed languages (machine gun)
  Tonal languages
  Mora-timed languages

Prosody System: Why pitch?

In previous research, pitch has been widely investigated and found useful for the language identification task.

Prosody System: Pitch Contour Extraction

Method proposed by P. Boersma (1993)
  Autocorrelation-based
  Finds the best path through several candidates with the help of dynamic programming

Prosody System: Pitch Contour Segmentation

Uses information from the smoothed version of the energy contour.
Valley points of the energy contour are candidates for segmentation.
Duration constraint: no less than 50 ms.
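The segmentation rule above can be sketched as follows. The 10 ms frame step, the 5-frame moving-average smoother, and the synthetic energy contour are assumptions chosen only to make the example self-contained; the slides do not state how the energy contour is smoothed.

```python
import numpy as np

def segment_boundaries(energy, frame_ms=10, min_dur_ms=50, smooth=5):
    """Return frame indices of energy valleys that respect a minimum segment duration."""
    kernel = np.ones(smooth) / smooth
    e = np.convolve(energy, kernel, mode="same")       # smoothed energy contour
    # A valley is a frame lower than its left neighbor and not higher than its right one.
    valleys = [i for i in range(1, len(e) - 1) if e[i] < e[i - 1] and e[i] <= e[i + 1]]
    boundaries, last = [], 0
    for v in valleys:
        if (v - last) * frame_ms >= min_dur_ms:        # enforce "no less than 50 ms"
            boundaries.append(v)
            last = v
    return boundaries

# Synthetic energy contour with one dip every 20 frames (200 ms).
t = np.arange(60)
energy = 1.0 + np.cos(2 * np.pi * t / 20)
print(segment_boundaries(energy))
```

Candidate valleys closer than the minimum duration to the previously accepted boundary are simply skipped, which is one of several reasonable ways to apply the duration constraint.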

Prosody System: Pitch Contour Representation

In most previous work, pitch contours are approximated by polygonal lines.

In our recent work, we use Legendre polynomials instead.

Prosody System: Pitch Contour Representation with Legendre polynomials

[Figure: the first four Legendre polynomials.]

P0: pitch height
P1: pitch slope
P2: pitch curvature
P3: pitch S-curvature

Prosody System: Pitch Contour Representation

$$f(t) \approx \sum_{i=0}^{M} a_i P_i(t)$$

The right-hand side is the approximated pitch contour, where a_i is the i-th order coefficient and P_i is the i-th order Legendre polynomial.

In most cases, a small value of M is sufficient.

Prosody System: Pitch Contour Representation

Orthogonality property of Legendre polynomials:

$$\int_{-1}^{1} P_m(x)\, P_n(x)\, dx = \frac{2}{2n+1}\, \delta_{mn}$$
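The orthogonality property can be checked numerically. The sketch below uses NumPy's Legendre module with an 8-point Gauss-Legendre rule, which integrates polynomials up to degree 15 exactly; the helper name `inner` is introduced here for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre as L

# Gauss-Legendre nodes and weights on [-1, 1].
x, w = L.leggauss(8)

def inner(m, n):
    """Integral of P_m(x) * P_n(x) over [-1, 1], evaluated by quadrature."""
    p_m = L.legval(x, [0] * m + [1])   # coefficient vector selecting P_m
    p_n = L.legval(x, [0] * n + [1])
    return float(np.sum(w * p_m * p_n))

print(inner(2, 2), 2 / (2 * 2 + 1))    # both should be close to 0.4
print(inner(1, 3))                     # should be close to 0
```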

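Prosody System: Pitch Contour Representation

The coefficients are obtained by successive projection:

$$a_0 = \frac{\langle \tilde{f}, P_0 \rangle}{\langle P_0, P_0 \rangle}, \qquad a_1 = \frac{\langle \tilde{f} - a_0 P_0,\ P_1 \rangle}{\langle P_1, P_1 \rangle}$$

$$a_2 = \frac{\langle \tilde{f} - \sum_{i=0}^{1} a_i P_i,\ P_2 \rangle}{\langle P_2, P_2 \rangle}, \qquad a_3 = \frac{\langle \tilde{f} - \sum_{i=0}^{2} a_i P_i,\ P_3 \rangle}{\langle P_3, P_3 \rangle}$$

Here $\langle \cdot, \cdot \rangle$ is the inner product operator, and $\tilde{f}$ is derived from the pitch contour $f$ (its defining equation is illegible in this transcript).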
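The successive-projection procedure can be sketched as below. This is an illustration under assumptions: the contour is sampled uniformly and mapped onto [-1, 1], the inner product is a plain dot product over the samples, and no pre-normalization of the contour is applied, since that step is not recoverable from the slides.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_coeffs(contour, order=3):
    """Project a sampled pitch contour onto P_0..P_order, one order at a time."""
    x = np.linspace(-1.0, 1.0, len(contour))
    residual = np.asarray(contour, dtype=float).copy()
    coeffs = []
    for i in range(order + 1):
        p_i = L.legval(x, [0] * i + [1])                    # samples of P_i
        a_i = np.dot(residual, p_i) / np.dot(p_i, p_i)      # <residual, P_i> / <P_i, P_i>
        coeffs.append(a_i)
        residual = residual - a_i * p_i                     # remove the part already explained
    return np.array(coeffs)

# A straight-line contour: height 3, slope 2, no curvature.
x = np.linspace(-1.0, 1.0, 50)
coeffs = legendre_coeffs(3.0 + 2.0 * x)
print(np.round(coeffs, 6))
```

Subtracting the lower-order fit before each projection matters for sampled data, because the sampled polynomials are only approximately orthogonal under a discrete inner product.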

Prosody System: Pitch Contour Representation

In our previous work (ICASSP 2005), the most useful features were:
  Duration of the pitch contour
  Coefficient of the first-order Legendre polynomial
  Coefficient of the second-order Legendre polynomial

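Prosody System: Pitch Contour Representation

Each pitch contour is represented by a set of Legendre polynomial coefficients and its duration:

$$\mathbf{v}_t = \begin{bmatrix} a_{1,t} \\ a_{2,t} \\ d_t \end{bmatrix}$$

where $a_{1,t}$ is the pitch slope, $a_{2,t}$ is the pitch curvature, and $d_t$ is the pitch duration of the t-th contour.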

Prosody System: Models for LID

In ICASSP 2005, the feature vectors v_t of each language are modeled by a GMM.

In ICASSP 2006, an ergodic Markov model is used to further improve the performance.

Static model -> dynamic model

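Prosody System: Models for LID - GMM

$$\log L_l = \sum_{t=1}^{T} \log p(\mathbf{v}_t \mid l) = \sum_{t=1}^{T} \log \sum_{n=1}^{N} w_n^{l}\, \mathcal{N}\!\left(\mathbf{v}_t;\ \boldsymbol{\mu}_n^{l},\ \boldsymbol{\Sigma}_n^{l}\right), \qquad \mathbf{v}_t = \begin{bmatrix} a_{1,t} \\ a_{2,t} \\ d_t \end{bmatrix}$$

t: index of pitch contour
n: index of mixture component, N = 64 here
l: index of language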
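A minimal sketch of the GMM log-likelihood of a sequence of pitch-contour feature vectors. Diagonal covariances and the toy parameters are assumptions for illustration; the slides give only the number of mixtures (N = 64) and do not state the covariance type.

```python
import numpy as np

def gmm_loglik(V, weights, means, variances):
    """Sum over t of log sum_n w_n N(v_t; mu_n, diag(var_n))."""
    diff = V[:, None, :] - means[None, :, :]                     # (T, N, D)
    log_comp = np.log(weights) - 0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2 * np.pi * variances), axis=1)
    )                                                            # (T, N)
    peak = log_comp.max(axis=1, keepdims=True)                   # log-sum-exp for stability
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())

# Hypothetical single-component, one-dimensional model evaluated at its mean.
value = gmm_loglik(np.array([[0.0]]), np.array([1.0]),
                   np.array([[0.0]]), np.array([[1.0]]))
print(value, -0.5 * np.log(2 * np.pi))
```

For identification, the same sequence would be scored against one GMM per language and the language with the highest log-likelihood selected.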

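Prosody System: Models for LID - Ergodic Markov Model

[Figure: ergodic (fully connected) Markov model with six states D1 to D6.]

Each state corresponds to a range of the contour duration d_t:
  D1: 50 ms ~ 100 ms
  D2: 100 ms ~ 150 ms
  D3: 150 ms ~ 200 ms
  D4: 200 ms ~ 250 ms
  D5: 250 ms ~ 300 ms
  D6: 300 ms ~

$$\hat{d}_t = Q_D(d_t), \qquad \hat{d}_t \in \{D_1, D_2, D_3, D_4, D_5, D_6\}$$

where $Q_D$ is the duration quantizer and $\hat{d}_t$ is the quantized duration index.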

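Prosody System: Models for LID - Markov Model

Example with three consecutive pitch contours:

$$\mathbf{v}_1 = \begin{bmatrix} a_{1,1} \\ a_{2,1} \\ d_1 = 55\,\text{ms} \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} a_{1,2} \\ a_{2,2} \\ d_2 = 185\,\text{ms} \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix} a_{1,3} \\ a_{2,3} \\ d_3 = 130\,\text{ms} \end{bmatrix}$$

After the duration quantizer $Q_D$, the three contours fall into states D1, D3, and D2, and each state observes the reduced vector

$$\bar{\mathbf{v}}_t = \begin{bmatrix} a_{1,t} \\ a_{2,t} \end{bmatrix}, \qquad t = 1, 2, 3$$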
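The duration quantizer can be written in a few lines. The sketch assumes that each 50 ms bin includes its lower edge (so 100 ms falls in D2), which the slides do not specify, and it rejects durations under 50 ms because segmentation does not produce them.

```python
def quantize_duration(d_ms):
    """Map a contour duration in milliseconds to a state index 1..6 (D1..D6)."""
    if d_ms < 50:
        raise ValueError("durations below 50 ms are excluded by the segmentation constraint")
    return min(int((d_ms - 50) // 50) + 1, 6)   # everything from 300 ms up is D6

# Durations of 55 ms, 185 ms, and 130 ms.
states = [quantize_duration(d) for d in (55, 185, 130)]
print(states)
```

The three durations map to D1, D3, and D2 respectively.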

Prosody System: Models for LID - Markov Model

[Figure: ergodic Markov model with six states D1 to D6.]

Each state is modeled by an 8-component GMM.

Transition probabilities are estimated by the ML criterion; they can be bi-gram, tri-gram, or a mixture of bi-grams.

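Prosody System: Models for LID - Bi-gram

$$\log L_{Bi} = \sum_{t=1}^{T} \log p(\mathbf{v}_t) = \sum_{t=1}^{T} \left[ \alpha \log p(\bar{\mathbf{v}}_t;\ \hat{d}_t = D) + (1-\alpha) \log p(\hat{d}_t \mid \hat{d}_{t-1}) \right]$$

$$= \sum_{t=1}^{T} \left[ \alpha \log \sum_{n=1}^{N} w_n^{\hat{d}_t}\, \mathcal{N}\!\left(\bar{\mathbf{v}}_t;\ \boldsymbol{\mu}_n^{\hat{d}_t},\ \boldsymbol{\Sigma}_n^{\hat{d}_t}\right) + (1-\alpha) \log p(\hat{d}_t \mid \hat{d}_{t-1}) \right]$$

where $0 \le \alpha \le 1$ and $\bar{\mathbf{v}}_t = [a_{1,t},\ a_{2,t}]^{\mathsf T}$. (The weight symbol is written here as $\alpha$; the original glyph and the rest of the side condition are illegible in this transcript.)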

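Prosody System: Models for LID - Tri-gram

$$\log L_{Tri} = \sum_{t=1}^{T} \log p(\mathbf{v}_t) = \sum_{t=1}^{T} \left[ \alpha \log p(\bar{\mathbf{v}}_t;\ \hat{d}_t = D) + (1-\alpha) \log p(\hat{d}_t \mid \hat{d}_{t-1}, \hat{d}_{t-2}) \right]$$

where $\bar{\mathbf{v}}_t = [a_{1,t},\ a_{2,t}]^{\mathsf T}$. (The weight symbol is written here as $\alpha$; the original glyph is illegible in this transcript.)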

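Prosody System: Models for LID - Mixture of Bi-grams

Approximate the tri-gram with a mixture of bi-grams.
This overcomes the problem of insufficient training data when training a tri-gram model.

$$p(\hat{d}_t \mid \hat{d}_{t-1}, \hat{d}_{t-2}) \approx \sum_{n=1}^{N} \lambda_n\, p(\hat{d}_t \mid \hat{d}_{t-n})$$

where $0 \le \lambda_n \le 1$ for all $n$ and $\sum_{n} \lambda_n = 1$. (The mixture-weight symbol is written here as $\lambda_n$; the original glyph is illegible in this transcript.)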
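A small sketch of the mixture-of-bigrams idea: one transition table is estimated by counting state pairs one step apart, another by counting pairs two steps apart, and the two are interpolated. The equal mixture weights, the uniform fallback for unseen contexts, and the toy state sequence are assumptions made for illustration.

```python
import numpy as np

def lag_bigram(seqs, lag, n_states=6):
    """ML estimate of p(current state | state `lag` steps earlier) by relative frequency."""
    counts = np.zeros((n_states, n_states))
    for seq in seqs:
        for past, cur in zip(seq, seq[lag:]):
            counts[past - 1, cur - 1] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # Unseen contexts fall back to a uniform distribution (an assumption of this sketch).
    return np.divide(counts, rows, out=np.full_like(counts, 1.0 / n_states), where=rows > 0)

def mixture_prob(cur, prev1, prev2, p1, p2, lam=(0.5, 0.5)):
    """Approximate p(cur | prev1, prev2) by lam1 * p(cur | prev1) + lam2 * p(cur | prev2)."""
    return lam[0] * p1[prev1 - 1, cur - 1] + lam[1] * p2[prev2 - 1, cur - 1]

# Hypothetical quantized-duration state sequence (state indices 1..6).
seqs = [[1, 2, 3, 1, 2, 3, 1, 2, 3]]
p1 = lag_bigram(seqs, lag=1)
p2 = lag_bigram(seqs, lag=2)
print(mixture_prob(cur=3, prev1=2, prev2=1, p1=p1, p2=p2))
```

Each table has 36 parameters for six states, whereas a full tri-gram table has 216, which is the data-sparsity argument made on the slide.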

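Prosody System: Models for LID - Mixture of Bi-grams

[Equation: log-likelihood $\log L_{Mix} = \sum_{t=1}^{T} \log p(\mathbf{v}_t)$, combining the state-conditional term $\log p(\bar{\mathbf{v}}_t;\ \hat{d}_t = D)$ with the two bi-gram terms $p(\hat{d}_t \mid \hat{d}_{t-1})$ and $p(\hat{d}_t \mid \hat{d}_{t-2})$; the exact weighting is illegible in this transcript.]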

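Prosody System: Models for LID - Mixture of Bi-grams

[Figure: the current state $\hat{d}_t$ depends on the earlier states $\hat{d}_{t-1}$, $\hat{d}_{t-2}$, and $\hat{d}_{t-3}$ through separate weighted bi-gram links.]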

Prosody System: Pair-wise LID Task

45 pair-wise language identification tasks.

10-sec and 45-sec utterances:
  10-sec: HTC, HTL, ROO, MEA (domain-specific utterances)
  45-sec: STB (unrestricted-domain utterances)

Prosody System: Pair-wise LID Task (avg. 45 pairs)

          GMM       Dynamic/Bigram      Dynamic/Trigram     Dynamic/Mix
45-sec    68.91%    80.23% (16.43%)     79.62% (15.54%)     81.35% (18.05%)
10-sec    65.45%    69.83% (6.69%)      68.84% (5.18%)      70.02% (6.98%)

Values in parentheses are relative improvements over the GMM:

$$\text{Relative Improvement} = \frac{\text{Rate}_{Dynamic} - \text{Rate}_{GMM}}{\text{Rate}_{GMM}}$$
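As a worked check of the relative-improvement figures, taking the 45-sec Dynamic/Bigram and the 10-sec Dynamic/Mix entries from the table:

```latex
\frac{80.23 - 68.91}{68.91} = \frac{11.32}{68.91} \approx 0.1643 = 16.43\%
\qquad
\frac{70.02 - 65.45}{65.45} = \frac{4.57}{65.45} \approx 0.0698 = 6.98\%
```

Both values agree with the percentages reported in parentheses.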

Prosody System: Pair-wise LID Task on 45-sec (L vs {others})

45s    GMM      DMix     Rel.
EN-    67.03    81.84    22.09
FA-    74.48    85.05    14.18
FR-    61.00    71.51    17.23
GE-    63.77    84.65    32.75
JA-    79.10    86.08     8.82
KO-    67.67    82.71    22.22
MA-    76.54    83.41     8.97
SP-    61.26    73.31    19.67
TA-    63.05    76.75    21.73
VI-    74.82    88.21    17.90

Prosody System: Pair-wise LID Task on 10-sec (L vs {others})

10s    GMM      DMix     Rel.
EN-    59.31    65.99    11.26
FA-    68.05    70.98     4.30
FR-    60.00    66.91    11.52
GE-    61.32    70.02    14.19
JA-    81.60    79.48    -3.83
KO-    64.47    69.57     7.91
MA-    71.69    73.09     1.94
SP-    58.32    64.23    10.13
TA-    61.53    67.71    10.05
VI-    68.20    73.24     7.39

Prosody System: Pair-wise LID Task

Stress-timed languages, like English and German, benefit from this dynamic topology.

Syllable-timed languages, like French and Spanish, also benefit from this topology, but their identification rates are still not good enough.

Pitch-accent and tonal languages improve only a little.

PRLM System: Front-end Phone Recognizer

Design a front-end English phone recognizer.

48 phonetic units are selected from the TIMIT database.

Each phonetic unit is modeled by a 3-state left-to-right monophone HMM.

PRLM System: 48 phonetic units from TIMIT

Stops (6):                 b d g p t k
Affricates (2):            jh ch
Fricatives (8):            s sh z zh f th v dh
Nasals (6):                m n ng em en eng
Semivowels & Glides (6):   l r w y hh el
Vowels (18):               iy ih eh ey ae aa aw ay ah ao oy ow uh uw er ax ix axr
Non-speech (2):            sil, non-phonetic (pau, epi, h#)

PRLM System: Training Phase

Use the English phone recognizer described above, with a null-gram language model, to decode the utterances from the training set of the OGI-TS corpus.

For each language, its language model is trained on the decoded phone sequences of that language.

PRLM System: Training Phase

[Figure: utterances of each language (English, French, Spanish, Mandarin, ...) from the OGI-TS training set are decoded by the English phone recognizer, which is itself trained on another corpus, into phone sequences such as /a/, /m/, …; /aa/, /en/, …; /jh/, /ey/, …; /b/, /ae/, …. One language model per language (English, French, Spanish, Mandarin) is trained on these sequences.]

Do NOT use any language model while decoding.

These language models will be used in the evaluation phase.

PRLM System: Evaluation Phase

[Figure: an utterance in an unknown language is decoded by the English phone recognizer into a phone sequence (/a/, /m/, /sh/, …). The sequence is scored by every language model (English LM, French LM, Spanish LM, Mandarin LM, …), and the LM scores are passed to a pick-max rule or a back-end classifier, which outputs the decision, e.g. "It's French!".]

Do NOT use any language model while decoding.

Only the phone sequence is used; the accompanying acoustic scores contribute little.

PRLM System: Pair-wise LID Task

          PRLM Bigram    Dynamic/Mix    Muthusamy's work
Cue       phone          pitch          broad phonetic
45-sec    91.05%         81.35%         85.2%
10-sec    83.14%         70.02%         75.8%

PRLM System: Pair-wise LID Task (avg. 45 pairs) (L vs {others})

45s    PRLM      DMix
EN-    94.55%    81.84%
FA-    94.49%    85.05%
FR-    87.94%    71.51%
GE-    92.64%    84.65%
JA-    91.80%    86.08%
KO-    90.36%    82.71%
MA-    90.62%    83.41%
SP-    85.54%    73.31%
TA-    95.02%    76.75%
VI-    86.39%    88.21%

PRLM System: Pair-wise LID Task (avg. 10 pairs) (L vs {others})

10s    PRLM      DMix
EN-    85.70%    65.99%
FA-    84.05%    70.98%
FR-    83.38%    66.91%
GE-    85.32%    70.02%
JA-    86.51%    79.48%
KO-    84.50%    69.57%
MA-    87.16%    73.09%
SP-    79.13%    64.23%
TA-    87.19%    67.71%
VI-    81.57%    73.24%

Prosody vs PRLM: 45-sec utterances

[Figure: bar chart "Performance on 45s Utterances". Vertical axis: identification rate (0 to 100); horizontal axis: language (EN, FA, FR, GE, JA, KO, MA, SP, TA, VI); series: PRLM and Prosody.]

Prosody vs PRLM: 10-sec utterances

[Figure: bar chart "Performance on 10s Utterances". Vertical axis: identification rate (0 to 100); horizontal axis: language (EN, FA, FR, GE, JA, KO, MA, SP, TA, VI); series: PRLM and Prosody.]

Prosody vs PRLM: Error reduction rate when the length of the testing utterance increases from 10 s to 45 s

       PRLM      Prosody
EN-    61.89%    46.60%
FA-    65.45%    48.48%
FR-    27.43%    13.90%
GE-    49.86%    48.80%
JA-    39.21%    32.16%
KO-    37.80%    43.18%
MA-    26.95%    38.35%
SP-    30.71%    25.38%
TA-    61.12%    27.99%
VI-    26.15%    55.94%

PRLM benefits more from longer utterances.
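The error reduction rates can be reproduced from the identification rates in the 10 s and 45 s tables. The short sketch below treats the error rate as 100 minus the identification rate; the function name is introduced here for illustration.

```python
def error_reduction(rate_10s, rate_45s):
    """Relative reduction (in percent) of the error rate when moving from 10 s to 45 s."""
    err_10s = 100.0 - rate_10s
    err_45s = 100.0 - rate_45s
    return 100.0 * (err_10s - err_45s) / err_10s

# English vs. others: PRLM 85.70% -> 94.55%, prosody (DMix) 65.99% -> 81.84%.
prlm_en = round(error_reduction(85.70, 94.55), 2)
prosody_en = round(error_reduction(65.99, 81.84), 2)
print(prlm_en, prosody_en)
```

Both results match the EN row of the error reduction table (61.89 and 46.60).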

GMM-Tokenizer System: Introduction

Simplified version of PRLM.
Uses a GMM tokenizer instead of a phone recognizer as front-end processing.
Does not need any transcription in the training set.

GMM-Tokenizer System: Introduction

38-dim MFCC:
  12 cepstra
  12 delta cepstra
  12 delta-delta cepstra
  Delta energy
  Delta-delta energy

30-dim shifted-delta-cepstrum (SDC):
  (N, d, p, k) = (10, 1, 3, 3)

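GMM-Tokenizer System: Introduction

SDC parameters:
  N: number of cepstral coefficients computed for each frame
  d: frame offset used to compute the delta
  p: frame distance between the delta vectors that are concatenated
  k: number of delta vectors that are concatenated

Delta parameter:

$$\Delta c_t = c_{t+d} - c_{t-d}$$

Shifted delta parameter: k delta vectors, spaced p frames apart, are concatenated into one feature vector.

[Figure: example with N = 10, d = 2, p = 3, k = 3. Cepstral frames c_{t-2} … c_{t+2} yield the delta Δc_t; the deltas Δc_{t-2} … Δc_{t+4} are shown, from which k = 3 deltas spaced p = 3 frames apart are concatenated.]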
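A minimal sketch of shifted-delta-cepstrum computation with the (N, d, p, k) = (10, 1, 3, 3) configuration. It assumes the common convention in which the k delta blocks start at the current frame (offsets 0, p, 2p, ...) and pads the edges by repeating the first and last frame; the slide's figure may use a different block alignment.

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=3):
    """Stack k delta vectors, spaced p frames apart, for every frame.

    cepstra: array of shape (T, N). Returns an array of shape (T, N * k).
    """
    T = cepstra.shape[0]
    # Pad so that c[t - d] and c[t + (k - 1) * p + d] exist for every frame t.
    padded = np.pad(cepstra, ((d, d + (k - 1) * p), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        ahead = padded[2 * d + i * p: 2 * d + i * p + T]   # c[t + i*p + d]
        behind = padded[i * p: i * p + T]                   # c[t + i*p - d]
        blocks.append(ahead - behind)
    return np.hstack(blocks)

# Toy cepstra whose values grow by 10 per frame, so every interior delta equals 20.
cep = np.arange(20 * 10, dtype=float).reshape(20, 10)
features = sdc(cep)
print(features.shape, features[5, :3])
```

With N = 10 and k = 3 the output has 30 dimensions per frame, matching the 30-dim SDC feature listed for the tokenizer.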

GMM-Tokenizer System: Introduction

[Figure: speech is converted by the GMM tokenizer into a token sequence, e.g. 2221321113323111123213213…, which is then scored by n-gram language models with terms of the form P(w_t | w_{t-1}).]

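GMM-Tokenizer System: Introduction

[Figure: the input voice data is cut at boundaries b0, b1, b2, …, bm-1, bm into segments with center values c1, c2, …, cm; each segment yields a new token index.]

bi: segmentation boundary position
m: number of segments
ci: center value of a segment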

GMM-Tokenizer System: Results on 45s utterances

GMM Tokenizer    3 Lang    6 Lang    11 Lang
No Seg.          90.00%    62.50%    49.55%
Seg.             91.67%    65.83%    57.73%

Future Work

Try to combine these two systems in order to further improve the performance.

The Future of LID task

LID task in the future

The 1990s were the golden age of LID research. Many methods were proposed and shown to be successful during that time. However, some problems still need further investigation.

LID task in the future

[Figure: Speech -> Pre-processing -> LID System -> Post-processing -> Hypothesized Language, with Language-Dependent Knowledge feeding the LID System.]

LID task in the future: pre-processing phase

Segmentation methods applied to the language identification task.
  Does a language-independent segmentation method exist?
  If it exists, it remains to be argued whether this kind of segmentation is suitable for the LID task.

LID task in the future

More language-specific knowledge.

So far, the highest-performing LID systems rely heavily on a HUGE amount of training data.

LID task in the future

Prosody-based systems
  No suitable hierarchy for prosody exists so far.
  How should prosodic information be modeled?

LID task in the future: post-processing phase

Apply decision fusion strategies to combine several systems.
  The complexity of the sub-systems can be reduced.
  Fusion strategies are somewhat different from those in the distributed detection problem.

Thank You !
