€¦ · web viewrecord a training database that contains several repetitions (originally...

25
A basic isolated-words ASR system (HTK V3.x) A BASIC ISOLATED-WORDS AUTOMATIC SPEECH RECOGNITION SYSTEM (INTRODUCTORY TUTORIAL TO HTK V3.X / WINDOWS) MASTER COURSE: SISTEMAS DE INTERACCIÓN NATURAL (PROF. LUIS A. HERNÁNDEZ GÓMEZ) ETSI TELECOMUNICACIONES UNIVERSIDAD POLITÉCNICA DE MADRID Master Sistemas de Interacción Natural 1

Upload: dotu

Post on 15-May-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

A BASIC ISOLATED-WORDS AUTOMATIC SPEECH RECOGNITION SYSTEM

(INTRODUCTORY TUTORIAL TO HTK V3.X / WINDOWS)

MASTER COURSE:

SISTEMAS DE INTERACCIÓN NATURAL

(PROF. LUIS A. HERNÁNDEZ GÓMEZ)

ETSI TELECOMUNICACIONES

UNIVERSIDAD POLITÉCNICA DE MADRID

Master Sistemas de Interacción Natural 1

Page 2: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

Contenido1.- What’s HTK?............................................................................................................................3

2.- Isolated-words ASR system.....................................................................................................3

2.1 Pasos Necesarios................................................................................................................3

2.2 Organización del Espacio de Trabajo..................................................................................3

2.3 Instalación/Acceso Básico HTK...........................................................................................4

3.- Creación del Corpus de Entrenamiento..................................................................................5

4.- Extracción de Características Acústicas...................................................................................7

5.- Definición de Topología de los HMMs.....................................................................................9

6.- Proceso de Entrenamiento de los HMMs..............................................................................12

6.1 Entrenamiento - Paso 1- HINIT.........................................................................................12

6.2 Entrenamiento - Paso 2 - HREST.......................................................................................14

6.2 Entrenamiento - Paso 3 - HEREST.....................................................................................15

7.- Diseño de Tarea de Reconocimiento y Pruebas....................................................................16

7.1 Diccionario........................................................................................................................16

7.2 Gramática de Reconocimiento.........................................................................................16

7.3 Pruebas de Reconocimiento............................................................................................18

7.3.1 Pruebas Off-line.........................................................................................................18

7.3.2 Pruebas On-line.........................................................................................................19

Master Sistemas de Interacción Natural 2

Page 3: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

1.- What’s HTK?The aim of an automatic speech recognition system (ASR) is to transcribe speech into to text. The state-of-the-art technology in ASR systems is based on the use of HMM (Hidden Markov Models) or Hidden Markov Models.

HTK (Hidden Markov Model Toolkit) is a tool developed by the Department Engineering Department (CUED) at the University of Cambridge (http://htk.eng.cam.ac.uk/) to create, evaluate and modify HMMs. HTK is primarily intended for speech recognition research, but can (and is) used for a wide number of applications.

HTK is a library of functions and tools in C, running on the command line; its distribution is free, including source code. From the above web address HTK can download and access their complete documentation (htkbook).

2.- Isolated-words ASR systemIn this tutorial we will use HTK to build a simple recognizer that allows the recognition of two words spoken in isolation "Yes" and "No".

Based on this basic recognizer, you can start experimenting with more complex problems that could require a more advanced use of HTK, it is advisable to use HTK manual information: htkbook.

2.1 Necessary stepsTo create the Yes / No recognizer we must follow the following steps:

1. Record a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording (see Section 3).

2. Process each recorded file to extract the acoustic features (e.g. cepstrum). These features will be used both to train and test the speech recognition system. (Section 4). The procedure of extracting parameters from the speech signal is usually referred to as the Front-End of the ASR system.

3. Define the topology of the HMMs to be used. Note that it will take at least three models: HMM_SI; HMM_NO and HMM_SIL, the latter will represent the silence or noise before and after each word. (Section 5)

4. Train the HMM models. (Section 6)

5. Test the recognition system both off-line (–audio from wav files-) and on-line (–audio from the microphone-). (Section 7)

Master Sistemas de Interacción Natural 3

Page 4: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

2.2 Working environmentIn order to follow the isolated-word ASR example in this tutorial we will use a working environment including the following directory structure:

Prueba\ : contains all the scripts (using Windows .bat files) needed to execute HTK commands with their proper related configuration files.

Prueba\sonidos : where audio files are located (we will use .wav files with : 16KHz and 16 bit/sample).

Prueba\labels : it stores label files (with .lab extension) containing the text transcription for each audio file in Prueba/sonidos ; these label files will be used to train the HMM models.

Prueba\vcar : to store feature files (i.e. mel-frequency cepstrum files) for each audio file in Prueba/sonidos.

Prueb\recono : after the recognition of a file this directory will contain the corresponding text file with the recognition (i.e. transcription) result (.rec files)

Prueba\hmms: to store the HMMs prototypes for SI, NO and RUIDO HMM models. Prueba\hmmsiniciales: where HMMs will be generated after the first training step

using HINIT HTK tool. Prueba\hmmhrest: where HMMs will be generated after the second training step

using HREST HTK tool. Prueba\hmmherest: where a single file containing ALL the HMMs after the third and

last training step using HEREST tool.

2.3 Installation/Basic Access to HTKTo have access to HTK executable tools/programs in the directory ... \ HTKV3.XX \ bin.win32 from a Windows command window (cmd), it’s necessary to include the bin.win32 directory in the path of the system.

Master Sistemas de Interacción Natural 4

Page 5: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

3.- Generating the Training CorpusTo train the models of SI, NO and SIL (silence / noise) it is necessary to define a corpus which obviously has to contain the words "si" and "no", and in this case pronounced in isolation. Thus the corpus contains a repetition of each word in isolation.

The training database must also have a number of repetitions for each word; the more repetitions the better (and different speakers, if for speaker independent recognition ....).

For this example, we only use 5 repetitions for each word. So we have to record five times "si" and five times "no".

But apart from the audio, you need to generate a file (tags / labels) that contains a label (label) for each "linguistic / acoustic" which will involve an HMM. That is, if we record the first sentence by saying "si" in an audio file si_1.wav, the "units" that are in that audio file s1_1.wav will be sil - si - sil, and that information should be manually included (annotations) in a file called s1_1.lab.

Following this recording plus manual annotation strategy, the process of training is more "efficient" (discussion on this issue ...) if it includes not only what is within the audio file s1_1.wav, but also what ranges of time span are associated to each unit.

These actions to manually include labels and mark the start and end times can be done with the tool HSLAB (for HTK), or using any other HTK compatible audio editor, for example, WaveSurfer www.speech.kth.se/wavesurfer/, which is more powerful and easy to use.

IMPORTANT: When starting Wavesurfer ensure that:

- Properties of Sound: record at 16Khz and 16 bin (mono)

- Apply Configuration: indicating that we will use the HTK Transcription

(see figure below)

Master Sistemas de Interacción Natural 5

Page 6: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

Using this configuration we will record and annotate 5 utterances for “si” and 5 for “no”, generating in:

Prueba\sonidos : si_1.wav, .... si_5.wav , no_1.wav ,.... no_5.wav ( .wav : a 16KHz y 16 bit/sample).

Prueba\labels : si_1.lab, .... si_5.lab , no_1.lab ,.... no_5.lab

The format for transcriptions (lab) in HTK, is shown in the example below, time marks in HTK are expressed in 10-7 secs. (si_1.lab):

(start end label)0 3564899 sil3564899 6288848 si6288848 9835466 sil

Master Sistemas de Interacción Natural 6

Page 7: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

The Figure below shows an example of the Wavesurfer screen corresponding to the previous labels for the word “si” (si_1.wav audio file):

4.- Acoustic Features Extraction (Front-End)The next step will be processing all the recorded audio files si_n.wav, no_n.wav to extract the features that will use the Speech Recognizer.

HTK implements a wide range of different features that can be used in ASR systems: ceptrum, LPC, MEL, FilterBank, PLP,... The Front-End is implemented in the HCopy tool. The simplest way of using HCopy is:

HCopy -C conf si_1.wav si_1.mfc

OPTIONS: -C conf: Configuration file to include details on the feature extraction process (as the type and no. parameters)si_1.wav: input speech audio file si_1.mfc: output file (parameters)

Master Sistemas de Interacción Natural 7

Page 8: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

In this tutorial, to process ALL the files using a single command we will use:

HCopy -C conf -S flist.scp

(This is included in file SacaCep.bat)

OPTIONS:flist.scp : contains a list of pairs: audio_wav_file output_mfc_file.\sonidos\si_1.wav .\vcar\si_1.mfc.\sonidos\si_2.wav .\vcar\si_2.mfc.....\sonidos\no_5.wav .\vcar\no_5.mfc

The configuration file conf in this example has been define to make it possible to use the training HMM models in on-line tests (i.e using the microphone). The selected parameters are:

SOURCEKIND = WAVEFORM // Type on input filesSOURCEFORMAT = WAV // Extension for input filesSOURCERATE = 625 // Sampling rate T in 100 nano-sec. HTK time units (sampling rate 1/T = 16 KHz) ZMEANSOURCE = TUSESILDET = TRUE

// USES A Voice Activity Detector (necessary for on-line tests; BUT sometimes produces error for off-line use and it is necessary to use FALSE)

MEASURESIL = FALSEOUTSILWARN = TRUE

TARGETKIND = MFCC_0_D_A // Type of parameterization MelFFT + C0 (Energy) + Delta y Acceleration (Doble Delta)

TARGETFORMAT = HTKTARGETRATE = 100000.0 // frame rate 10 msec. (sliding window each 10 msec.) SAVEWITHCRC = FALSEWINDOWSIZE = 250000.0 // window size 25 msec.USEHAMMING = TRUEPREEMCOEF = 0.97 // Pre-emphasisNUMCHANS = 12 // Number of channels for the FFTNUMCEPS = 10 // Number of cesptrum per window

// This field together with MFCC_0_D_A : give us the total number of// coefficients we are going to use 10 ceps + 10 Deltaceps + 10 DDceps + 1C0 // + 1DC0 +1 DDCep = 33 coeficients per frame (window)

ENORMALISE = FALSE // No energy normalization (COULD BE USED TO COMPENSATE DIFFERENT RECORDING LEVELS)CEPLIFTER = 22 // Ponderation to give more relevance to more discriminative cepstrum LIFTERMaster Sistemas de Interacción Natural 8

Page 9: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

USEPOWER = TRUE // it could also be tested using abs (FFT) instead of power.The execution of SacaCep.bat (it is advisable to do this in the Windows command line to check for possible errors...). The next figure illustrate an error that in some cases appears when using USESILDET = TRUE ; in this you should try: USESILDET = FALSE

Once SacaCep.bat script is correctly executed you could see a set of mfc files \Prueba\vcar : si_1.mfc, si_2.mfc,... no_5.mfc

HTK includes the HList tool to display the content of files with parameters (i.e. mfc files); for example:

Hlist -h -e 1 si_1.mfc Displays in the command (cmd) screen:----------------------------------- Source: si_1.mfc ----------------------------------- Sample Bytes: 132 Sample Kind: MFCC_D_A_0 Num Comps: 33 Sample Period: 10000.0 us Num Samples: 98 File Format: HTK------------------------------------ Samples: 0->1 -------------------------------------0: -5.769 1.360 -1.041 -1.606 -0.643 -1.303 -1.380 0.017 1.731 1.495 58.589 0.279 2.649 2.098 2.927 3.047 1.078 2.394 -0.201 -1.034 -0.393 -1.811 0.360 0.164 0.189 0.027 -0.204 -0.258 -0.458 -0.676 -0.360 0.012 -0.0531: -3.598 9.894 5.159 5.613 7.360 2.961 5.144 2.312 5.040 2.044 55.530 0.969 3.575 2.640 3.051 3.669 0.158 2.005 -2.530 -1.962 0.340 -2.089 0.343 -0.380 -0.218 -0.505 -0.743 0.055 -1.005 -0.319 -0.367 -0.473 0.120----------------------------------------- END ------------------------------------------

Showing the file header, and the first 33 observation vectors for frames 0 and 1. In the example you can identify the C0 parameter (related to energy) as the highest values in position 11 of each window (58.589 for frame 0: ; and 55.530 in frame 1:)

5.- Topology for the HMMsUp to this point we have all the data and information required to start training our Hidden Markov Models HMMs. These HMMs will be the statistical patterns that will represent the Master Sistemas de Interacción Natural 9

Page 10: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

“acoustic realization” of the linguistic units we have decided to model; in our case: si, no and silence.

But before that we have to DEFINE the HMM topology we are going to use (see more details in Introduction to Digital Speech Processing, Lawrence R. Rabiner1 and Ronald W. Schafer)

In this tutorial we will use the same topology for all the linguistic units, that is for “si”, “no” and “silencio”. We will use:

12 states ( 14 in HTK: 12 + 2 start/end) and two gaussians (mixtures) per state.

To this end we will have to create a text file with a PROTOTYPE of this HMM topology using HTK syntax format:

Each file name MUST be equal to the unit: so far, for the file names and linguistic unit names we will use will be: si , no and sil

So, for example, the file named si will be:

~o <vecsize> 33 <MFCC_0_D_A> ~h "si"

<BeginHMM> <Numstates> 14

<State> 2 <NumMixes> 2<Stream> 1 <Mixture> 1 0.6

<Mean> 33 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Master Sistemas de Interacción Natural 10

Page 11: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

<Variance> 33 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

<Mixture> 2 0.4<Mean> 330.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<Variance> 331.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

<State> 3 <NumMixes> 2<Stream> 1

......................................

<Transp> 140.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.00.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.40.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<EndHMM>

Comments of the different fields inthis data structure:

~o <vecsize> 33 <MFCC_0_D_A> // MUST be the same parameterization we used in HCopy

~h "si" // Name for the HMM model (MUST be EQUAL TO THE FILE NAME!!!)

<BeginHMM> // Initial HMM tag <Numstates> 14 // 14 states : 12 “real” states + beginning + end

<State> 2 <NumMixes> 2<Stream> 1 // HTK allows to divide the input vector

// 33 components in several streams <Mixture> 1 0.6 // First gaussian and initial weight

<Mean> 33 // Means vector (initialized to zero)................

<Variance> 33 // Variances vector (initialized to one)

(at the end the Matrix with transition probabilities is included a ij, in this case only transition to the next state is allowed; note also the special definition for the initial and final HTK states)

<Transp> 14

Master Sistemas de Interacción Natural 11

Page 12: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<EndHMM> // Final HMM tag

Once the three HMM PROTOTYPES have been defined we can start the training procedure.

6.- HMMs Training ProcessThe training of the HMMs is made from the files of features, xxx.mfc in the directory. \ Prueba \ vcar, tag files, xxx.lab, in the \ Prueba \ labels directory, and HMMs prototypes in the .\ Prueba \ hmms directory.

The HTK training process can be raised in different ways, here we use the more standard and simple three-steps procedure that will sequentially generate better and more accurate HMMs. The HMMs in each step will be stored in different directories. These directories together with the HTK tools for each step: HInit (for step 1), HREST (step 2) and HEREST (Step 3) are illustrated in the following diagram:

6.1 Training - STEP 1- HINITThe first step provides initial values for the mean and variance vectors in the HMM prototypes.

HInit uses the files with the feature vectors (mfc files) together with the time labels (in .lab files) to do a uniform initialization of the mean and variances of each state in each model. (We will discuss this in class).

Master Sistemas de Interacción Natural 12

hmmsHInit(1)

HRest(2)

HERest(3)

hmmsiniciales

hmmrest

hmmerest

Page 13: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

There are other ways to initialize the HMMs using HCompv, where a broad initialization is done and a file is generated (varfloor) with the smallest values for the Gaussian variancs.

~v varFloor1<Variance> 332.23232e-001 1.03872e-001 (...) 4.00123e-003

However in this tutorial we will only use HInit as it is executed using the EntrenaListaHInit.bat script.

NOTE that HInit is executed THREE TIMES one for each HMM: si, no, sil

HInit -T 1 -l si -L .\labels -M .\hmmsiniciales .\hmms\si -S listatrainHInit -T 1 -l no -L .\labels -M .\hmmsiniciales .\hmms\no -S listatrainHInit -T 1 -l sil -L .\labels -M .\hmmsiniciales .\hmms\sil -S listatrain

-T 1 : tracing level (log information)

-l si : training label (HInit will look for this label in the *.lab files)

-L .\labels : label directory (where *.lab files are located)

- M .\hmmsiniciales : output directory for the new HMMs

.\hmms : input directory where HMM prototypes (Section 5) are located.

listatrain, contains the list of all the fles to be used in the training process (the *.mfc files in the .\vcar directory)

.\vcar\si_1.mfc

.\vcar\si_2.mfc

.\vcar\si_3.mfc

.\vcar\si_4.mfc

.\vcar\si_5.mfc

.\vcar\no_1.mfc

.\vcar\no_2.mfc

.\vcar\no_3.mfc

.\vcar\no_4.mfc

.\vcar\no_5.mfc

The next figure shows output information generated during HInit execution for each HMM. It can be seen that it is an iterative process to fit the training data in terms of logProb (comments in class). HInit has several parameters to define the convergence of this process, in the example, as no information is specified, default values are used (i.e. convergence stopped at 20 iterations, or logProb variation below epdsilon 0.0001)

Master Sistemas de Interacción Natural 13

Page 14: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

The result of this step will generate new si, no and sil HMM models in the .\Prueba\hmmsiniciales directory.

6.2 Training - STEP 2 - HRESTThis second training step re-train the HMM models of the previous step. HRest also keeps the time marks in the lab files, but now the state segmentation inside each label (unit) is not uniform as it was in HInit. Now the state segmentation inside each label is done using the HMM models.

The use of the HTK HRest tool is defined in the script file: EntrenaListaHRest.bat. This script executes HRest THREE times, one for each HMM model: si, no, sil

HRest -T 1 -l si -L .\labels -M .\hmmhrest .\hmmsiniciales\si -S listatrainHRest -T 1 -l no -L .\labels -M .\hmmhrest .\hmmsiniciales\no -S listatrainHRest -T 1 -l sil -L .\labels -M .\hmmhrest .\hmmsiniciales\sil -S listatrain

Master Sistemas de Interacción Natural 14

Page 15: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

The difference of this script compared to the one for HInit are the input and output directories for the HMM models. Now the inputs are the HInit-generated models, in hmmsiniciales, and the output is stored in the hmmhrest directory.

The information HRest displays on the screen is similar to the one produced by HInit.

6.2 Training - Step 3 - HERESTThe last training step is the most powerful one as it trains in a join way ALL the HMM models. This step only considers the transcriptions in the lab files BUT NOT the time marks. It could be said that in some sense in the HERest HTK tool ALL the time marks are obtained using the HMM models.

The script file to execute HERest is EntrenaListaHERest.bat. It can be seen that a single execution of HERest re-trains all the three models (Note that more tan one use of HERest could improve the trained HMM models)..

HERest -T 1 -L .\labels -M .\hmmherest -d .\hmmhrest -S listatrain listahmms

To use HERest is necessary to use a file listahmms containing the list of ALL the HMM models to be trained (in this example: si, no y sil).

After the execution of EntrenaListaHERest.bat , the directory .\Prueba\hmmherest will contain a single file called newMacros. This file contains ALL the training models:

~o<STREAMINFO> 1 33<VECSIZE> 33<NULLD><MFCC_D_A_0><DIAGC>~h "no"<BEGINHMM><NUMSTATES> 14

.....

<ENDHMM>~h "si"<BEGINHMM><NUMSTATES> 14.....

<ENDHMM>~h "sil"<BEGINHMM><NUMSTATES> 14

.....

<ENDHMM>

Master Sistemas de Interacción Natural 15

Page 16: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

This concludes the training process, so the next step will be to use the trained HMM models for speech recognition.

7.- Define the Speech Recognition Task and Test-itThe use of a set of trained HMM models in a Speech Recognition systems is generally much more complex that what we will see in this tutorial. But we will see all the necessary components in a simplified way.

The two main components to define a Speech Recognition task are: The Dictionary (Word Lexicon) and the Language Model. (again, see more details can be seen in Introduction to Digital Speech Processing, Lawrence R. Rabiner1 and Ronald W. Schafer)

7.1 Dictionary(Word Lexicon)The dictionary or word lexicón contains as information the list of words the system i sable to recognize and HOW these words are defined in terms of the HMM models.

In this example the mapping from word to HMM model is straight, so the dictionary definition is as simple as:

sil silsi sino no

BUT in more general cases where HMM of phonemes are used, each word should be described in terms of the sequence of phonemes that represent it.

Master Sistemas de Interacción Natural 16

Page 17: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

7.2 Language Models - GrammarTogether with the dictionary it is necessary to specify the sequence of words that the recognizer will be able to recognize. The ways to define these possible sequences of words are mainly two: 1) probabilistic models, or language models; 2) deterministic graphs of words, or recognition grammars.

In our example we will use a simple grammar corresponding to a graph indicating that the recognizer will optionally recognize a silence at the beginning and/or end of the pronunciation. The grammar, as it is illustrated below, will be able to recognize “si”, “no” or “nothing” (i.e just silence; we will discuss in class out-of-vocabulary OOV words)

de reconocimiento indica, principalmente, qué palabras o vocabulario va a ser capaz de reconocer el sistema, y cómo quedan definidas esas palabras en función de los HMMs que tiene el mismo.

En nuestro caso la relación palabra del vocabulario HMM es directa, por lo que el fichero dic es muy simple:

sil silsi sino no

To design grammars HTK allows to use a human-readable syntax that needs to be compiled in a more complex-to-read grapsh format (nodes and arcs).

The human-readable description of the previous grammar can be:

$comando = si | no;

( {sil} [ $comando ] {sil} )

The label “comando” represents “si” OR “no” (i.e. in parallel);

{} defines zero or more repetitions (the sil model can be skipped or repeated);

[] defines zero or one repetition (so “si” or “no” can be recognized only one time but can also be skipped)

Once the grammar is saved in a text file, in our example the text file “gram” it MUST be compiled or parsed using the HParse HTK tool, as:

HParse gram red

Master Sistemas de Interacción Natural 17

SILSI

NO

SIL

Page 18: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

This generates the file “red” containing the format needed by HVite, the HTK tool for performing recognition (as we will see in the next section).

File “red” :

VERSION=1.0N=8 L=12 I=0 W=sil I=1 W=!NULL I=2 W=no I=3 W=si I=4 W=sil I=5 W=!NULL I=6 W=!NULL I=7 W=!NULL J=0 S=1 E=0 J=1 S=5 E=0

.......

You can try different variations in the grammar to test the recognition of different utterances containing “si” and “no”.

7.3 Recognition TestsOnce we have the acoustic HMM models:

the list of models are in: listahmms ;

the HMM models are in ./hmmherest\newMacros

the dictionary (file dic), and the recognition network (file red)

we can perform some recognition tests, both over wav files(off-line) and over the microphone audio input (on-line).

7.3.1 Off-line TestsThe HTK tool for recognition is HVite. HVite implements the Viterbi algorithm for an efficient recognition process.

The script file Reconoce_auto.bat makes use of HVite to recognize two files already in *.mfc format already used for training. As si_1.mfc y no_1.mfc have been used for training this is an “auto test”, so the recognition should be without errors. The two executions for HVite in the Reconoce_auto.bat script are:

HVite -w red -l .\recono -T 1 -H .\hmmherest\newMacros dic listahmms .\vcar\si_1.mfcHVite -w red -l .\recono -T 1 -H .\hmmherest\newMacros dic listahmms .\vcar\no_1.mfc

Master Sistemas de Interacción Natural 18

Page 19: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

Note that together with the mfc files, the use of HVite includes, –l .\recono , to set the directory name where the recognition results will be stored. For each recognized file a .rec file is generated with a format similar to .lab files but including the logProbability obtained by HVite.

For example, the file .\recono\si_1.rec

0 4000000 sil -2632.9824224000000 7200000 si -1941.1038827200000 9800000 sil -1721.177368

NOTE that in order to test other recordings not used for training we must:

1. Record an audio file (.wav 16 KHz y 16 bit): for example: prueba.wav

2. Process this audio file using HCopy: prueba.mfc, and the same conf file used for training.

3. Run HVite as above but on prueba.mfc

The script file ReconocePr.bat process a wav file called prueba.wav stored in .\sonidos and generates both the .\vcar\prueba.mfc parameters file and the .\recono\prueba.rec recognized file.

To do a more “formal” evaluation of a recognizer it should be necessary to record a bigger Data Base and then to perform an analysis of the recognition errors. HTK includes an evaluation tool called HResults (see the htkbook).

7.3.2 On-line TestsAnother possibility HTK offers is to make on-line tests directly from the audio acquired through the PC microphone.

To that end also the HVite tool is used, but instead of specifying an input mfc file, you must use –C trealconf , being trealconf a text file similar to the conf file used in HCopy, as the parameterization of the input audio must be the same that we used for training.

HVite -w red -l .\recono -T 1 -H .\hmmherest\newMacros -C trealconf dic listahmms

The file trealconf is, thus, similar to conf but including an indication of real time input and the need for using a voice activity detector.

For these real-time tests it is very important to have the recording conditions and voice activity detector under control. To listen to the speech signal that is been received by the speech recognizer it is very useful to activate the “echo” feature in HVite (–g switch in HVite).

HVite -g -w red -l .\recono -T 1 -H .\hmmherest\newMacros -C trealconf dic listahmms

Master Sistemas de Interacción Natural 19

Page 20: €¦ · Web viewRecord a training database that contains several repetitions (originally pronounced in isolation) of each word. We will have to record and LABEL each recording

A basic isolated-words ASR system (HTK V3.x)

There is also a framework over HTK to build simple interactive applications combining speech recognition, text-to-speech and dialog management. This framework is the ATK Real-Time API for HTK (http://htk.eng.cam.ac.uk/develop/atk.shtml).

Master Sistemas de Interacción Natural 20