Constructing Speech Processing Systems on Universal Phonetic Codes Accompanied with Reference Acoustic Models

Kazuyo TANAKA, Hiroaki KOJIMA, Nahoko FUJIMURA, and Yoshiaki ITOH

National Institute of Advanced Industrial Science and Technology (AIST), University of Library and Information Science, Iwate Prefectural University

{kaztanaka; h.kojima}@aist.go.jp, [email protected]

Abstract

This paper proposes a novel speech processing framework, in which all of the speech data are once encoded into universal phonetic code (UPC) sequences, and speech processing systems, such as speech recognition, retrieval, and digesting, are constructed on this UPC domain. First of all, we introduce an IPA-based sub-phonetic segment (SPS) set as the UPC to deal with multilingual speech. In the UPC (SPS) domain, each UPC is accompanied by a reference acoustic model which is independent of the real acoustic models used in the encoding process. Processing in the UPC domain, such as recognition, is conducted based on the distance between UPC sequences, estimated by using the reference acoustic models. We confirm the proposed framework by constructing a speech recognition system and a vocabulary-free speech retrieval system on the SPS domain. We show several experimental results on these systems, using Japanese and English speech data.

1. Introduction

We are developing a framework and architecture for UPC (universal phonetic code)-based speech processing systems[1], in which all of the speech data contained in or entered into information systems are encoded into UPC sequences, and speech processing systems, such as recognition, retrieval, indexing, and digesting, are constructed on this code domain, as illustrated in Fig. 1. In this paper we describe a basic scheme and procedure for this framework, and present a speech recognition system and a speech data retrieval system constructed on the UPC domain as its applications.

The IPA (International Phonetic Alphabet) or XSAMPA (an ASCII encoding of the IPA) [2] would be a candidate set for the UPC set. We, however, propose a finer segment, called the sub-phonetic segment (SPS), which is derived from XSAMPA on the basis of acoustic-articulatory considerations. We have already confirmed the advantage of SPS-like units in recognition experiments[3,4]. The SPS can deal with multilingual speech at the same level as XSAMPA (or the IPA) can.

In Fig. 1, the encoding process incorporates adaptation of the UPC (SPS) models to the speech data environments, so that the coded UPC sequences approximately represent the phonemic information of the speech data.


Fig. 1 Proposed framework for speech processing: speech waves pass through feature extraction and are encoded into UPC (SPS) sequences, on which the processing systems are built.

In the UPC domain, each UPC is accompanied by a reference (standard) acoustic model which is predefined and independent of the acoustic models actually used in the encoding process. Processing in the UPC domain, such as recognition, is conducted based on the distance between UPC sequences, estimated using the reference acoustic models. Erroneous UPC sequences produced in the encoding process are compensated for by this operation. Note that the UPC domain processing is thereby separated from the acoustic domain environments, since the encoding-stage acoustic models are not used in it. This point is essentially different from conventional speech processing methods, which employ an integration of maximal probabilities estimated by acoustic models that depend on the input speech environment.

Based on the proposed framework, we can construct speech recognition systems which effectively handle multiple hypotheses, such as multilingual ones, by calculating only distances between UPC (SPS) sequences. The architecture is given in Section 4.

We also present a vocabulary-free speech retrieval system in Section 5. The system retrieves a key speech phrase from an objective speech DB. When both the key words and the objective DB are real speech, such a system usually has difficulty functioning well, because the acoustic characteristics of the two are, in usual cases, considerably different. In addition, key words are usually limited in vocabulary size because of speech recognition performance. The proposed system architecture resolves these difficulties by introducing phrase spotting in the UPC domain computation.

    mailto:kojima%[email protected]:kojima%[email protected]:kojima%[email protected]
  • 8/3/2019 01048079

    2/4

2. Sub-phonetic Segment (SPS)

The SPS labels are obtained from XSAMPA segment sequences using conversion rules. The rules are created by considering acoustic-articulatory characteristics. We basically adopt only primary XSAMPA symbols and represent minor phonetic variations by statistical distributions in the acoustic domain. The SPS sequences converted from XSAMPA sequences consist of stationary and non-stationary (transitional) segments in the speech stream, as indicated in Fig. 2.

The SPS set is extended from the one originally proposed for recognizing Japanese utterances[3,4]. A similar segment category was also used in an Italian speech recognition system[5]. Acoustic models of the XSAMPA segments and SPS are represented simply by LR-HMMs (left-to-right HMMs) with three states and three loops. We can estimate those HMMs from speech samples by an ordinary HMM training method[6]. The acoustic parameters are the same as those used in paper [4].
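To make the stationary/transitional structure concrete, here is a minimal Python sketch of such a conversion. It is not the actual rule set: it only reproduces the alternation of transitional segments (e.g. "Si") and stationary segments (e.g. "ii") visible in Fig. 2, the function name xsampa_to_sps and the simplified silence handling are our own assumptions, and the real rules additionally split plosives and affricates into closure and release parts (e.g. "dcl d").

def xsampa_to_sps(phones):
    """Toy conversion of an XSAMPA phone sequence into SPS labels.

    Emits, at each phone boundary, a transitional segment made of the
    two neighboring symbols, and, for each phone, a stationary segment
    made of the doubled symbol.
    """
    sps = [phones[0]]                      # leading silence kept as-is
    prev = phones[0]
    for cur in phones[1:]:
        left = '#' if prev.endswith('#') else prev
        sps.append(left + cur)             # transitional segment
        sps.append(cur + cur)              # stationary segment
        prev = cur
    return sps

print(xsampa_to_sps(['h#', 'S', 'i', 'h', 'E']))
# -> ['h#', '#S', 'SS', 'Si', 'ii', 'ih', 'hh', 'hE', 'EE']

The printed output matches the beginning of the sequence shown in Fig. 2 (iii)/(iv).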

3. Encoding Speech Signals into SPS Sequences

As described in Section 1, speech signals are first encoded into SPS sequences by SPS unit recognition using SPS-HMMs, where the SPS-HMMs are adapted to the environment of the corresponding speech signal data, such as the recording environment, the speakers' mother tongue, male/female/child voice, etc.; it is not necessary to adapt to individual speakers. We also use language-dependent SPS label-pair grammars in optimizing the SPS sequence[4]. An example of an SPS sequence obtained from real speech samples is shown in Fig. 2 (iv). We can see that the SPS sequence obtained by the encoding is similar to that of the original labeling.
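The following Python sketch shows one way such grammar-constrained unit recognition could look. It is a simplification we supply for illustration, not the authors' implementation: frame_loglik stands in for the per-frame acoustic scores that full SPS-HMM decoding would produce, and allowed_pairs and switch_penalty are hypothetical names for the label-pair grammar and a unit-switch cost.

import numpy as np

def decode_sps(frame_loglik, labels, allowed_pairs, switch_penalty=0.0):
    """Frame-synchronous Viterbi search over SPS units.

    frame_loglik: (T, N) array; frame_loglik[t, n] is the acoustic
        log-likelihood of frame t under the model of SPS label n.
    allowed_pairs: set of (i, j) label-index pairs permitted by the
        language-dependent SPS label-pair grammar.
    Returns the decoded SPS label sequence with frame repeats collapsed.
    """
    T, N = frame_loglik.shape
    score = frame_loglik[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        new_score = np.empty(N)
        for j in range(N):
            best_i, best = j, score[j]           # stay in the same unit
            for i in range(N):                   # or switch via an allowed pair
                if i != j and (i, j) in allowed_pairs:
                    s = score[i] - switch_penalty
                    if s > best:
                        best_i, best = i, s
            new_score[j] = best + frame_loglik[t, j]
            back[t, j] = best_i
        score = new_score
    j = int(np.argmax(score))                    # backtrace the best path
    path = [j]
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    path.reverse()
    seq = [labels[path[0]]]                      # collapse frame repeats
    for p in path[1:]:
        if labels[p] != seq[-1]:
            seq.append(labels[p])
    return seq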

Fig. 2 An example of the encoding result for a real utterance of a sentence:
(i) the original sentence from the TIMIT DB, "She had your dark suit in greasy wash water all year.";
(ii) the XSAMPA description from the TIMIT DB labeling by an expert (slightly modified by our rules), beginning "h# S i h E dcl dZ @ r dcl d A kcl k s U ...";
(iii) the SPS sequence converted from (ii) by the rules;
(iv) the SPS sequence obtained by SPS-HMM unit recognition, beginning "h# #S SS Si ii ih hh hE EE ...".

4. Speech Recognition System

4.1 System Architecture

A block diagram of the automatic speech recognition system constructed on the proposed framework is shown in Fig. 3. One feature of this system is that it uses a distance matrix over SPS pairs for the optimal matching between the input SPS sequence and the SPS networks[1] of hypothesized words.

The distance matrix is calculated using the reference SPS-HMMs, which can be considered as the standard acoustic models accompanying the SPS labels. Therefore, the optimal matching process is independent of the SPS-HMMs used in the encoding stage. For calculating the distance matrix and carrying out the optimal matching by Dynamic Programming (DP), it is necessary to define a distance between HMMs[4]. The distance could be calculated by the Kullback-Leibler divergence, but we simply approximate it as follows. Let

c_ij(k): the centroid vector of the j-th distribution in state i of category k,
v_ij(k): the diagonal variance vector corresponding to c_ij(k).

Then we define the distance between HMM-k and HMM-l as

    D(k, l) = Σ_i Σ_j min_j' || c_ij(k) - c_ij'(l) ||²,

where the norms between centroid vectors are normalized by v_ij(k).
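As a concrete illustration, a minimal NumPy sketch of this approximation follows. The nested-list data layout and the function name hmm_distance are our own assumptions; only the formula itself comes from the definition above.

import numpy as np

def hmm_distance(c_k, v_k, c_l):
    """Approximate inter-HMM distance D(k, l) defined above.

    c_k[i][j]: centroid vector of the j-th distribution in state i of HMM-k
    v_k[i][j]: diagonal variance vector corresponding to c_k[i][j]
    c_l[i][j']: centroid vectors of HMM-l (same number of states)
    For every distribution of HMM-k, the nearest distribution of HMM-l
    in the same state is found under a variance-normalized squared norm.
    """
    d = 0.0
    for i in range(len(c_k)):                  # states i
        for j in range(len(c_k[i])):           # mixtures j of HMM-k
            d += min(
                np.sum((c_k[i][j] - c) ** 2 / v_k[i][j])  # normalized by v_ij(k)
                for c in c_l[i]                # min over mixtures j' of HMM-l
            )
    return d

Note that D(k, l) as written is not symmetric in k and l; averaging D(k, l) and D(l, k) would be one way to symmetrize it if a symmetric measure were needed.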

Fig. 3 Block diagram of the speech recognition system based on the proposed framework. (Blocks: input speech wave, bottom-up SPS encoding, SPS distance matrix, and the word lexicon to be recognized.)
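The optimal matching step can be pictured with the following DP sketch. It is a plain edit-distance-style alignment against a single lexicon entry, supplied for illustration only; the actual system matches against SPS networks[1] of hypothesized words, and names such as match_cost and the (n + m) length normalization are our assumptions.

import numpy as np

def match_cost(input_sps, word_sps, dist):
    """DP alignment cost between the decoded input SPS sequence and
    one hypothesized word's SPS sequence.

    dist[a][b]: precomputed distance between SPS labels a and b,
        taken from the inter-HMM distance matrix D above.
    """
    n, m = len(input_sps), len(word_sps)
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist[input_sps[i - 1]][word_sps[j - 1]]
            g[i, j] = d + min(g[i - 1, j - 1],   # substitution / match
                              g[i - 1, j],       # extra input label
                              g[i, j - 1])       # missing input label
    return g[n, m] / (n + m)                     # length-normalized cost

# The recognizer would then pick the lexicon entry with the minimum cost:
# best_word = min(lexicon, key=lambda w: match_cost(seq, lexicon[w], dist))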



4.2 Recognition Results

To compare the performance of the proposed method with that of an ordinary recognition method, we used the same speech sample set for both recognition methods. The test sample set consisted of 492 words uttered by four different male speakers. The ordinary recognition system was implemented by phoneme-HMM-based recognition[7]. The other conditions, such as the acoustic features and HMM training, were set up identically. The number of Gaussian mixture components in each state was two in both HMMs.

The recognition rate was 91% when using the proposed method. On the other hand, the baseline recognition score by the conventional phoneme-HMM-based method was 89% for the same sample set. Therefore, the recognition rates of the two methods were almost comparable.

5. Speech Retrieval System

5.1 System Architecture

Speech retrieval systems usually function either by retrieving key words, given as spoken words, from a text-based DB, or by retrieving key words, given as text, from a speech-based DB[8]. In contrast, the proposed speech retrieval system can retrieve key phrases, given as speech, from an object speech DB. In this system, if the object speech DB has sections similar to those included in the user's key phrase, they can be extracted using only the accumulated distance between arbitrary durations of SPS sequences. This characterizes it as a vocabulary- and grammar-free system. The function is made possible by applying Shift Continuous Dynamic Programming (Shift-CDP)[9] to the optimal matching between SPS sequences. The system also works effectively even when the quality or environmental conditions of the two speech data sets differ considerably from each other, because the environment-adaptive SPS encoding plays the role of normalizing such acoustic variations.

Fig. 4 illustrates the system configuration. The procedure is given in the following, where (A), (B), ..., (H) indicate the block labels in the figure; a sketch of the spotting step (4) is given after the list.

(1) Estimate the reference SPS-HMMs from a basic speech sample set to prepare the distance matrix (H).
(2) Adapt the base-form SPS-HMMs to the environment of the speech DB (A) to create the SPS-HMMs (E). Then encode the speech DB (A) using those SPS-HMMs to obtain the SPS sequence DB (B).
(3) In the same way, adapt the base-form SPS-HMMs to the environment of the key phrase speech (C) to create the SPS-HMMs (F), then encode the key phrase using those SPS-HMMs to obtain the SPS sequence (D).
(4) Successively match the SPS sequence (D) of the key phrase against the SPS sequence DB (B) using Shift-CDP and detect the adequate parts of sentences in the object speech DB.
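The following sketch illustrates step (4) with a plain continuous-DP spotting pass over the SPS label sequences. It is a simplified stand-in for Shift-CDP[9], which is not reproduced here; the function name cdp_spot, the restart-at-every-position scheme, and the length normalization are our assumptions.

import numpy as np

def cdp_spot(key, db, dist, threshold):
    """Continuous-DP spotting of a key SPS sequence inside a long DB
    SPS sequence (a simplified stand-in for Shift-CDP [9]).

    key, db: lists of SPS labels; dist[a][b]: inter-label distance
        from the reference-HMM distance matrix (H).
    Returns (end_position, normalized_score) for every DB position
    where the accumulated distance falls under the threshold.
    """
    n = len(key)
    g_prev = np.full(n + 1, np.inf)
    hits = []
    for t, label in enumerate(db):
        g = np.full(n + 1, np.inf)
        g[0] = 0.0                            # a match may start anywhere
        for i in range(1, n + 1):
            d = dist[key[i - 1]][label]
            g[i] = d + min(g_prev[i - 1],     # advance both sequences
                           g_prev[i],         # stretch a key label over the DB
                           g[i - 1])          # compress: skip ahead in the key
        score = g[n] / n                      # length-normalized end score
        if score < threshold:
            hits.append((t, score))
        g_prev = g
    return hits

In the experiments below, a detected section is additionally accepted only if it spans enough SPS labels (see the footnotes of Table 3).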


Fig. 4 Block diagram of the proposed speech retrieval system, which retrieves the user's key phrase speech from a speech DB.

5.2 Preliminary Experiments

We conducted the following experiments to confirm the feasibility of the proposed method. The basic configuration of the experiments was to extract, from a sentence speech DB, the sentences that include a key phrase in some part. The key phrases were set to one or two parts of the sentences, as described in each experimental condition. If we extract a sentence that includes the specified key phrase, it is a "hit"; if we fail to, it is a "miss"; and a "false alarm (FA)" is the mis-extraction of a sentence that includes no key phrase. In the following experiments, the distance matrix (H) was calculated using the base-form SPS-HMM set. The results depend on the boundary condition, so the following shows the results under adequate conditions.



Fig. 5 Illustration of how the key phrases used in the experiments were composed; I, II, and III indicate the experiment numbers. (Key phrases for I and II: SPS sequences of 1.5 sec or 2.0 sec cut from a sentence; key phrases for III: sequences of 30+30 labels, about 2.0 sec.)

[Experiment-I] The object speech DB was the MOCHA-TIMIT English sentence set (460 sentences), uttered by five male speakers. The key phrases were extracted from parts of the same sentences uttered by a different set of five male speakers (see Fig. 5). Several too-short utterance samples were removed from the test data. The results are shown in Table 1.

Table 1 Result of Experiment-I.

key phrase      hitting              missing             FA
1.5 sec length  7905/9100 (86.9%)    1195/9100 (13.1%)   1368
2.0 sec length  7909/8384 (94.3%)    474/8384 (5.7%)     1650

[Experiment-II] The conditions were the same as those in Experiment-I, except that the key phrases were uttered by five female speakers. The results are shown in Table 2.

The results of Experiments I and II indicate that the performance in Experiment I is at almost the same level as that in Experiment II; therefore, the acoustic differences between male and female speakers are normalized by the adaptation scheme shown in Fig. 4.

Table 2 Result of Experiment-II.

key phrase      hitting               missing              FA
1.5 sec length  10058/11350 (88.6%)   1292/11350 (11.4%)   2169
2.0 sec length  10002/10500 (94.3%)   498/10500 (5.7%)     3140

[Experiment-III] This is a test of the effectiveness of Shift-CDP[9]. The speech sentence DB was the ATR A-set, which includes 50 Japanese sentences, uttered by ten male speakers. The key phrases were created by connecting two sections of each sentence utterance into one phrase, as shown in Fig. 5. The key phrase utterances were female voices, each section length corresponded to 30 SPS labels (about 1.0 sec), and matched sections were detected if the accumulated distance over the corresponding section length was under the threshold value. In this case, the number of test queries is 100, a complete hit gives 1000 samples, and the maximum number of possible false alarms is 49000. The results are shown in Table 3.

Table 3 Result of Experiment-III, where a complete hit gives 1000 samples.

Threshold   (short length*)      (long length**)
            hitting   FA         hitting   FA
1.0         915       78         602       0
1.1         943       183        693       0
1.2         964       440        768       2
1.3         983       1022       845       17

* Sentence detection is counted if the number of SPS labels contained in the matched section(s) is more than 25.
** The number is more than 40.

6. Concluding Remarks

We have proposed a new speech processing framework, in which speech application systems are constructed on the universal phonetic symbol domain. We showed that recognition systems can be constructed with almost the same performance as conventional systems, and that, in addition, vocabulary-free speech retrieval systems, whose function would be difficult to realize by conventional methods, can be constructed.

References

[1] K. Tanaka, H. Kojima, "Speech recognition method with a language-independent intermediate phonetic code," Proc. of ICSLP2000, Vol. 4, pp. 191-194, 2000.
[2] http://www.phon.ucl.ac.uk/home/sampa/home
[3] K. Tanaka, S. Hayamizu, K. Ohta, "A demiphoneme network representation of speech and automatic labeling techniques," Proc. of ICASSP86, pp. 309-312, 1986.
[4] K. Tanaka, H. Kojima, "A between-word distance calculation in a symbolic domain and its applications to speech recognition," Information Sciences, Vol. 123, No. 1-2, pp. 25-41, Elsevier Science, 2000.
[5] D. Albesano, R. Gemello, F. Mana, "Hybrid HMM-NN modeling of stationary-transitional units for continuous speech recognition," Proc. ICONIP-97, Vol. 2, pp. 1112-1115, 1997.
[6] S. Young, The HTK Book, Entropic Cambridge Research Lab, 1996.
[7] K. Tanaka, H. Kojima, "A method of extracting time-varying acoustic features for speech recognition," Proc. of ICASSP97, pp. 1391-1394, 1997.
[8] J.T. Foote, et al., "Unconstrained keyword spotting using phone lattices with application to spoken document retrieval," Computer Speech and Language, Vol. 11, pp. 207-224, 1997.
[9] Y. Itoh, K. Tanaka, "Automatic Labeling and Digesting for Lecture Speech Utilizing Repeated Speech by Shift CDP," Proc. EUROSPEECH2001, pp. 1805-1808, 2001.
