An Initial Study on English Continuous Speech Recognition


Upload: clayton-morin

Post on 01-Jan-2016


DESCRIPTION

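An Initial Study on English Continuous Speech Recognition. Advisor: Dr. 陳柏琳. Graduate students: 許庭瑋, 陳冠宇. July 13, 2007 (ROC year 96). Outline: introduction; basic speech recognition pipeline; current developments in English speech recognition research; the English phoneme definitions and recognition lexicon used in this thesis; building acoustic models with within-word triphone state sharing; the NTNU large-vocabulary continuous speech recognizer; research content and experiments; front-end speech feature extraction; language model adaptation; acoustic model training; experimental corpora, setup, and results; conclusion and future work.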

TRANSCRIPT

  • An Initial Study on English Continuous Speech Recognition

  • Bayes' theorem: p(O)
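
The slide text is lost apart from the reference to Bayes' theorem and p(O). As a reminder, the standard decision rule for speech recognition is sketched below; the symbols W (word sequence) and O (acoustic observation sequence) are the conventional ones and are assumed here, not taken from the slide.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{p(O \mid W)\, P(W)}{p(O)}
        = \arg\max_{W} p(O \mid W)\, P(W)
```

Since p(O) does not depend on W, it can be dropped from the maximization, leaving the acoustic model score p(O | W) and the language model score P(W).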

  • 1. BBN; 2. IBM (T.J. Watson); 3. ( ); 4. ( ); 5. Dragon Systems; 6. LIMSI-CNRS; 7. SRI; 8. AT&T; 9. MsState ISIP; 10. (Microsoft)

  • ( ) 2002; International Computer Science Institute (ICSI); DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program; Rich Transcription: RT03, RT04; Linguistic Data Consortium (LDC): Switchboard, Switchboard Cellular, Callhome; EARS, LDC (Fisher Collection)

  • ( ) BBN, IBM, CU

    2004 BBN/LIMSI: 20 RT; RT 04; 13.5%; 2,300 ( ); VTLN, PLP + CMS, HLDA + MLLT; 1. ML-SI (+HLDA): I. STM, II. SCTM, III. Cross-word SCTM; 2. ML-HLDA-SAT (+MLLT)
    IBM 2004: 10 RT; RT04; 15.2%; 2,100 ( ); VTLN, PLP + CMVN + LDA, fMPE + LDA + MLLT; 1. SI.DC.PLP; 2. SA.FC.fMPE; 3. SA.DC.fMPE+MPE
    2004 CU-HTK: 10 RT; RT03; 17%; 2,180 ( ); VTLN, HLDA + CMVN; MPE + Triphone, Quinphone

  • ( ) BBN, IBM, CU

    2004 BBN/LIMSI: Witten-Bell + Interpolated LM; 1. ML-SI: I. Triphone + Bigram, II. Within-word Quinphone + Trigram, III. Cross-word Quinphone + Fourgram; 2. ML-HLDA-SAT; 3. Regression Classes
    IBM 2004: Kneser-Ney + Interpolated LM; 1. SI.DC.PLP: Quinphone + Fourgram; 2. SA.FC.fMPE: Quinphone + Fourgram; 3. SA.DC.fMPE+MPE: Septaphone + Fourgram
    2004 CU-HTK: Kneser-Ney + Good-Turing + Interpolated LM; 1. Triphone + Fourgram; 2. Quinphone + Fourgram; 3. Lattice MLLR

  • 40; 6; (silence) sil; (pause) sp

  • ( ): Festlex CMU, 105,626

    begin    b ih g ih n
    coffee   k aa f iy
    hello    hh ax l ow
    yes      y eh s

    ("begin" nil (((b ih g) 0) ((ih n) 1)))
    ("coffee" nil (((k aa f) 1) ((iy) 0)))
    ("hello" nil (((hh ax l) 0) ((ow) 1)))
    ("yes" nil (((y eh s) 1)))
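
The slide shows Festlex CMU entries next to flat phoneme strings. A minimal sketch of how such an entry could be flattened is given below; the function names `parse_festlex_entry` and `flatten` are hypothetical, and the parser only handles the entry shape shown on the slide, not the full Festival lexicon syntax.

```python
import re

def parse_festlex_entry(entry):
    """Return (word, [(phones, stress), ...]) for one entry of the form
    ("word" nil (((p1 p2) 0) ((p3) 1))) as shown on the slide."""
    word_match = re.match(r'\(\s*"([^"]+)"', entry)
    if word_match is None:
        raise ValueError(f"not a lexicon entry: {entry!r}")
    # Each syllable looks like ((b ih g) 0): a phone list, then a stress digit.
    syllables = [
        (phones.split(), int(stress))
        for phones, stress in re.findall(r'\(\(([^()]+)\)\s*(\d)\)', entry)
    ]
    return word_match.group(1), syllables

def flatten(syllables):
    """Drop syllable boundaries and stress marks, keeping the phone sequence."""
    return [phone for phones, _stress in syllables for phone in phones]

word, syllables = parse_festlex_entry('("begin" nil (((b ih g) 0) ((ih n) 1)))')
print(word, " ".join(flatten(syllables)))
```

For the "begin" entry this prints the same phone string as the slide, `begin b ih g ih n`.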

  • ax: (mean) (Covariance Matrix) ( ) 2 1 (39)

  • (Context dependence)

  • ()

  • 1. (40)

  • 2. 40*40*40 = 64,000 ( ) (Data Sparseness)

  • 3. (State) (Tying) (Tree-based Clustering) 1: (Root)

  • 3. () 2 : (Decision Tree) :

  • 3. ()

  • 4.
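
Steps 1 and 2 above (40 phonemes, 40*40*40 = 64,000 possible triphones) can be illustrated with a short sketch that expands a word's phone sequence into within-word triphone labels. The `left-center+right` label notation follows the HTK convention and is an assumption here; the slides do not show which notation was used.

```python
def within_word_triphones(phones):
    """Expand one word's phones into left-center+right labels.
    Context never crosses the word boundary, so the first and last
    phones only get a one-sided context."""
    labels = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        labels.append(f"{left}{phone}{right}")
    return labels

print(within_word_triphones("hh ax l ow".split()))
# ['hh+ax', 'hh-ax+l', 'ax-l+ow', 'l-ow']

# Upper bound on distinct triphones for a 40-phone inventory.
print(40 ** 3)  # 64000
```

Most of the 64,000 combinations are rare or unseen in training data, which is the data sparseness problem that state tying with tree-based clustering (step 3) addresses.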

  • Viterbi

    Tree-Copy Search (TC): Bigram; Word Graph Rescoring (WG): Trigram
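
The slide names Viterbi decoding as the basis of the search. Below is a minimal, generic log-domain Viterbi sketch over a small HMM; it is not the NTNU recognizer's tree-copy search, and the toy model (two states, three frames, all probabilities) is invented for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(obs_loglik, log_trans, log_init):
    """obs_loglik[t][s] = log p(o_t | state s)
    log_trans[i][j]  = log P(state j | state i)
    log_init[s]      = log P(first state is s)
    Returns (best state path, its log probability)."""
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(obs_loglik)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + obs_loglik[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1], best

# Toy left-to-right HMM with two states.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
obs_loglik = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.1), math.log(0.9)],
]
path, logp = viterbi(obs_loglik, log_trans, log_init)
print(path, round(math.exp(logp), 3))  # [0, 1, 1] 0.324
```

In a two-pass setup like the one named on the slide, a first pass of this kind would use the cheaper bigram model, and the resulting word graph would then be rescored with the trigram model.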

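  • ( ): (Channel Effects): Cepstral Mean Subtraction (CMS)

    Cepstral Mean and Variance Normalization (CMVN):

    Linear Discriminant Analysis (LDA); Heteroscedastic Linear Discriminant Analysis (HLDA); Maximum Likelihood Linear Transformation (MLLT)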
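
CMS and CMVN are listed here as remedies for channel effects. The sketch below shows per-utterance versions of both in plain Python; the function names and the toy two-dimensional frames are made up, and a real front end would operate on 39-dimensional feature vectors.

```python
from statistics import fmean, pstdev

def cms(frames):
    """Cepstral mean subtraction: remove each dimension's utterance mean."""
    means = [fmean(dim) for dim in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]

def cmvn(frames, eps=1e-8):
    """Cepstral mean and variance normalization: zero mean, unit variance
    per dimension over the utterance (eps guards against division by zero)."""
    dims = list(zip(*frames))
    means = [fmean(dim) for dim in dims]
    stds = [pstdev(dim) for dim in dims]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]

utterance = [[1.0, 10.0], [3.0, 14.0]]
# A stationary channel adds a constant offset in the cepstral domain.
shifted = [[x + 5.0 for x in frame] for frame in utterance]

print(cms(utterance))  # [[-1.0, -2.0], [1.0, 2.0]]
print(cms(shifted))    # same values: the constant channel offset is removed
```

CMVN additionally rescales each dimension, so features from recordings with different dynamic ranges become comparable.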

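  • Linear Discriminant Analysis (LDA): HMM; (B); (W)

    Heteroscedastic Linear Discriminant Analysis (HLDA)

    Maximum Likelihood Linear Transformation (MLLT)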

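
The LDA slide keeps only the symbols B and W. Assuming they denote the between-class and within-class scatter matrices (the usual reading, with HMM states as the classes), the standard LDA objective for a projection matrix θ can be written as follows; the symbol θ is chosen here and does not appear on the slide.

```latex
\hat{\theta} = \arg\max_{\theta}
  \frac{\left| \theta^{\top} B\, \theta \right|}
       {\left| \theta^{\top} W\, \theta \right|}
```

The columns of the solution are the leading eigenvectors of $W^{-1}B$. HLDA relaxes LDA's assumption that all classes share one covariance matrix, and MLLT further transforms the features so that diagonal-covariance Gaussians fit them better.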

  • ( ): MFCC; MFCC+CMS; MFCC+CMVN; LDA+MLLT+CMVN; HLDA+MLLT+CMVN

  • Language model adaptation: Count Merging; Model Interpolation

  • ( ): Count Merging: data level, C_A, C_B; Model Interpolation: model level
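
The two adaptation schemes differ in where the two sources are combined: count merging adds weighted n-gram counts (C_A, C_B) before estimating one model, while model interpolation mixes the probabilities of two separately estimated models. The sketch below uses unsmoothed unigram estimates to keep it short; the function names, the toy counts, and the convention that `lam` weights the second model are all assumptions made for illustration.

```python
from collections import Counter

def count_merging(counts_a, counts_b, weight_a=1.0, weight_b=1.0):
    """Data level: weight and add the counts of two corpora."""
    merged = Counter()
    for word, c in counts_a.items():
        merged[word] += weight_a * c
    for word, c in counts_b.items():
        merged[word] += weight_b * c
    return merged

def ml_prob(counts, word):
    """Unsmoothed maximum-likelihood unigram probability."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def model_interpolation(counts_a, counts_b, word, lam):
    """Model level: P(w) = (1 - lam) * P_A(w) + lam * P_B(w)."""
    return (1.0 - lam) * ml_prob(counts_a, word) + lam * ml_prob(counts_b, word)

background = {"the": 8, "news": 2}   # stands in for a large general corpus
in_domain = {"the": 1, "news": 1}    # stands in for a small in-domain corpus

merged = count_merging(background, in_domain, weight_a=1, weight_b=4)
print(ml_prob(merged, "news"))                                   # 6 / 18
print(model_interpolation(background, in_domain, "news", 0.5))   # 0.35
```

A setting written as "BNC+VOA*50" in the experiments plausibly corresponds to `weight_a=1, weight_b=50` in this sketch, although the slides do not spell out the exact formula.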

  • HMMHMM1128

    HMM

  • (Confusion Matrix)(Normalized) () ()(Likelihood) (Substitution) : w iy w eh : w w aw ae
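
A normalized confusion matrix, as mentioned on this slide, can be obtained by dividing each row of substitution counts by its row total. The sketch below assumes rows are reference phones and columns are recognized phones, which the slide does not confirm; the counts are invented, and the 0.2 threshold mirrors the value that appears on a later slide.

```python
def normalize_confusions(counts):
    """counts[ref][hyp] = how often reference phone ref was recognized as hyp.
    Returns row-normalized probabilities P(hyp | ref)."""
    probs = {}
    for ref, row in counts.items():
        total = sum(row.values())
        probs[ref] = {hyp: c / total for hyp, c in row.items()} if total else {}
    return probs

def confusable_pairs(probs, threshold=0.2):
    """List (ref, hyp, probability) for substitutions at or above the threshold,
    most confusable first."""
    pairs = [
        (ref, hyp, p)
        for ref, row in probs.items()
        for hyp, p in row.items()
        if ref != hyp and p >= threshold
    ]
    return sorted(pairs, key=lambda item: -item[2])

toy_counts = {
    "z": {"z": 6, "s": 4},
    "iy": {"iy": 9, "ih": 1},
}
print(confusable_pairs(normalize_confusions(toy_counts)))
# [('z', 's', 0.4)]
```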

  • EAT (16 kHz); VOA (16 kHz); BNC (102M); 90% / 10%

  • ()

    EAT
    1. grandpa
    2. for instance
    3. six five seven seven four five seven
    4. Green Mountain Energy

    VOA
    1. their workshops were long ago damaged
    2. an internet message taking responsibility for their deaths
    3. it is one of those things that i dreaded the entire time

  • EAT, VOA

    Corpus   ( )      (hr)   ( )      ( )     (hr)   ( )     ( )
    VOA      5,340    3.33   30,637   500     0.56   4,373   5,178
    EAT      20,000   7.02   53,922   1,000   0.65   2,781   2,370

  • VOA. Feature: MFCC_CMS. Language Model: BNC+VOA (1:1)

    #   Setting   Mixtures   TC (%)   WG (%)
    1   *1        76,073     46.72    54.10
    2   *2        145,318    46.51    53.01
    3   *3        217,744    45.62    52.94
    4   *4        290,505    44.91    50.86

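  • ( ) EAT. Feature: MFCC_CMS. Language Model: EAT. 40.55%, 49.53%; 4

    #   Setting   Mixtures   TC (%)   WG (%)
    1125,375 (index, setting, and mixture count are fused in the source)   30.12   40.55
    2   *1        143,735    36.41    49.53
    3   *4        549,953    36.45    49.35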

  • VOA (Count Merging). Feature: MFCC_CMS. Mixtures: 76,073

    #   BNC   VOA   Language model   TC (%)   WG (%)
    1   1     0     BNC              45.90    51.43
    2   0     1     VOA              47.70    49.46
    3   1     1     BNC+VOA          46.72    54.10
    4   1     50    BNC+VOA*50       46.28    53.78
    5   1     100   BNC+VOA*100      46.31    53.65

  • ( ) EAT (Count Merging). Feature: HLDA+MLLT+CMVN. Mixtures: 26,548

    #   BNC   EAT   Language model   TC (%)   WG (%)
    1   1     0     BNC              32.21    28.83
    2   0     1     EAT              45.22    52.01
    3   1     1     BNC+EAT          32.35    33.57
    4   1     100   BNC+EAT*100      36.92    39.86

  • ( ) VOA (Model Interpolation)

    Weight   (%)      Weight   (%)
    0.00     51.43    0.55     52.09
    0.05     52.85    0.60     51.86
    0.10     52.55    0.65     51.70
    0.15     52.94    0.70     51.43
    0.20     52.80    0.75     51.15
    0.25     52.57    0.80     50.97
    0.30     52.28    0.85     50.81
    0.35     52.14    0.90     50.29
    0.40     52.05    0.95     49.85
    0.45     52.16    1.00     48.48
    0.50     52.23    -        -

  • VOA. Language Model: BNC+VOA (1:1)*1

    #   Feature           Mixtures   TC (%)   WG (%)
    1   MFCC              78,412     45.25    52.05
    2   MFCC_CMS          76,073     46.72    54.10
    3   MFCC_CMVN         73,083     45.83    51.64
    4   LDA+MLLT_CMVN     70,672     51.54    59.89
    5   HLDA+MLLT_CMVN    71,627     49.23    54.42

  • ( ) EAT. Language Model: EAT*1. MFCC, MFCC_CMS, MFCC_CMVN; EAT (Channel Effects)

    #   Feature           Mixtures   TC (%)   WG (%)
    1   MFCC              145,319    29.69    40.04
    2   MFCC_CMS          143,735    36.41    49.53
    3   MFCC_CMVN         138,713    33.93    47.02
    4   LDA+MLLT_CMVN     138,289    47.30    59.53
    5   HLDA+MLLT_CMVN    141,333    46.48    59.71

  • ex.0~1 viterbi

  • Supervised Training; Lightly Supervised Training; Unsupervised Training. How are you; How are you

  • (True Transcription)

  • ()

    51.73; 58.20; 57.84

  • EAT

    HLDA+MLLT+CMVN

    ( )      (hr)   ( )
    20,000   7.02   53,922
    42,960   33.4   108,323
    1,000    0.65   2,781
    ( ) 4,229

  • EAT

    #   Model    Mixtures   TC (%)   WG (%)
    1   HMM(1)   141,333    50.14    57.84
    2   HMM(3)   221,820    49.78    51.73
    3   HMM(4)   191,314    50.86    58.20

  • ( ) EAT

    #   Model    Mixtures   TC (%)   WG (%)
    1   HMM(1)   141,333    50.14    57.84
    2   HMM(2)   216,318    56.29    64.74

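  • ( ) EAT; 0.2

    Phone pair   Value    Phone pair   Value
    z   s        0.38     ay  ax       0.25
    sh  s        0.38     ay  t        0.25
    jh  r        0.33     k   t        0.23
    jh  t        0.33     uh  ax       0.23
    zh  ax       0.33     m   n        0.23
    zh  l        0.33     ao  ow       0.23
    zh  sh       0.33     ch  n        0.22
    aw  l        0.30     th  s        0.22
    ng  n        0.29     b   f        0.21
    d   t        0.27     l   r        0.20
    aw  aa       0.25     iy  ih       0.20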

  • () () M EAT(General)EAT = *

    MNAMN10120.510150.512160.41021400.4:::

  • ( ) EAT

    #   ( )    ( )   TC (%)   WG (%)   TC (%)   WG (%)
    -   -      -     50.61    58.05    50.61    58.05
    1   0.80   0     45.87    52.73    46.87    55.28
    2   0.97   0     49.60    56.79    49.86    57.87
    3   0.97   0.1   51.08    58.20    50.93    58.23
    4   0.97   0.3   50.86    57.87    51.15    58.52

  • VOA, EAT

    #   VOA              EAT
    1   LDA+MLLT+CMVN    HLDA+MLLT+CMVN
    2   3.33 (5340)      40.42 (62906)
    3   0.56 (500)       0.65 (1000)
    4   5,178            4,229
    5   70,672 ( )       216,310 ( )
    6   4,373            8,850
    7   BNC+VOA (1:1)    EAT
    8   59.89 %          65.71 %

  • Minimum Phone Error (MPE); EAT