NGASR 2011 Summer Workshop, Speaker: 林奇嶽

Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate


2011/07/12 2

Outline
Burst Onset Detection
  Burst onset
  Feature representation
  Random forest (RF)
  Experimental results
Voice Onset Time Estimate
  Voice onset time (VOT)
  Proposed HMM+RF system
  Experimental results
Conclusion


Section I Burst Onset Detection

Burst Onset: Fundamental phonetics
A stop or an affricate consonant consists of the following speech events:
1. Closure: the air flow is completely blocked by certain articulators in the vocal tract (voice bar or silence).
2. Release: the blockage is suddenly released, resulting in a puff of air rushing out of the mouth.
3. Aspiration (stop) or frication (affricate).
The most salient event is the onset of the release, which is commonly termed the burst onset.
[Figure: two examples with the burst onset marked]

Burst Onset: Fundamental phonetics
The burst onset can be the shortest event in a speech signal.
A sudden increase of all-band energy appears as a stripe pattern in a Fourier-based spectrogram.
This all-band energy dies out immediately.
[Figure: example utterance "don’t carry"]

Burst Onset: Fundamental phonetics
To detect burst onsets in continuous speech, we focus on a small spectro-temporal patch containing a “closure-burst transition”.
[Figure: example utterance "don’t carry"]

Feature representation: Two-dimensional cepstral coefficients
Two-dimensional cepstral coefficients (TDCC) are used to encode such a “closure-burst transition”.
In deriving the TDCC for each spectro-temporal patch, we perform two discrete cosine transforms (DCTs) to compact the transition information into a small set of coefficients.
  1st DCT: cepstral analysis (along the frequency axis)
  2nd DCT: dynamic behavior of the coefficients from the first DCT (along the time axis)
  Between the two DCTs is a cepstral mean subtraction (CMS).
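The DCT, CMS, DCT chain described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the function names `dct2` and `tdcc`, the unnormalized DCT-II variant, and the toy 4x3 patch are all hypothetical, and the input is assumed to be a patch of log-spectral values with one row per frame.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence (the DCT variant is an assumption)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(n)]

def tdcc(patch):
    """Map a spectro-temporal patch to two-dimensional cepstral coefficients.

    patch: list of frames, each a list of log-spectral values.
    Returns a matrix indexed as [cepstral index][temporal index].
    """
    # 1st DCT along the frequency axis: one cepstrum per frame.
    cep = [dct2(frame) for frame in patch]
    n_frames, n_cep = len(cep), len(cep[0])
    # Cepstral mean subtraction: remove each coefficient's mean over time.
    means = [sum(c[q] for c in cep) / n_frames for q in range(n_cep)]
    cep = [[c[q] - means[q] for q in range(n_cep)] for c in cep]
    # 2nd DCT along the time axis: dynamics of each cepstral trajectory.
    return [dct2([cep[t][q] for t in range(n_frames)]) for q in range(n_cep)]

# Toy patch: 4 frames x 3 frequency bins (synthetic values).
patch = [[1.0, 2.0, 3.0], [1.5, 2.5, 2.0], [4.0, 1.0, 0.5], [5.0, 0.5, 0.2]]
coeffs = tdcc(patch)
```

In this sketch the zeroth temporal coefficient of every cepstral trajectory is zero after CMS, so only the higher temporal indices carry transition information.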

Feature representation: Two-dimensional cepstral coefficients
The derivation of dynamic features is similar between the conventional regression formula and TDCC.
[Figure: two panels, "Derivative coeff." and "Accelerative coeff."; x-axis: relative frame distance; y-axis: coefficient value]
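For reference, the conventional regression formula for derivative coefficients is commonly written as d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2). The sketch below is a minimal illustration of that formula; the window size N = 2, the edge replication, and the function name `delta` are assumptions, and accelerative coefficients are obtained here by applying the same regression twice.

```python
def delta(seq, window=2):
    """Regression-based derivative of a 1-D coefficient trajectory.

    Frames beyond either end are replaced by the nearest edge frame
    (an assumed boundary convention).
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(seq) - 1
    out = []
    for t in range(len(seq)):
        num = sum(n * (seq[min(t + n, last)] - seq[max(t - n, 0)])
                  for n in range(1, window + 1))
        out.append(num / denom)
    return out

ramp = [0, 1, 2, 3, 4, 5, 6]   # synthetic trajectory with unit slope
d = delta(ramp)                # derivative coefficients
a = delta(d)                   # accelerative coefficients
```

Away from the edges, a linear ramp yields a derivative of 1 and an acceleration of 0, which is the expected behavior of the regression weights.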

Feature representation: Deriving TDCC from a spectro-temporal patch
Each frame in a patch is an LPC-derived spectrum.
  Frame length: 10 ms (160 samples)
  Frame shift: 2 ms (32 samples)
  LP analysis with an order of 24. The LPC-derived spectrum is obtained with a 512-point DFT.
55 coefficients are extracted in a row-major fashion to form a 55x1 vector.
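The frame arithmetic on this slide can be checked with a short sketch. It assumes a 16 kHz sampling rate, which is what 160 samples per 10 ms implies; the helper names `ms_to_samples` and `frame_starts` and the 100 ms example length are hypothetical.

```python
SAMPLE_RATE = 16000  # Hz; assumed, consistent with 160 samples per 10 ms

def ms_to_samples(ms, rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a sample count."""
    return int(round(ms * rate / 1000))

def frame_starts(n_samples, frame_len, frame_shift):
    """Start indices of all frames that fit entirely inside the signal."""
    if n_samples < frame_len:
        return []
    return list(range(0, n_samples - frame_len + 1, frame_shift))

frame_len = ms_to_samples(10)    # frame length in samples
frame_shift = ms_to_samples(2)   # frame shift in samples
starts = frame_starts(1600, frame_len, frame_shift)  # 100 ms of signal
```

With a 2 ms shift, consecutive frames overlap by 8 ms, which gives the fine temporal resolution needed for an event as short as a burst onset.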


Feature representation Waveform and Feature Plane

Closure-burst transition patterns for detecting burst onsets

Random forest: Fundamentals
A random forest (RF) combines the following techniques:
  An ensemble of classifiers
    An RF is an ensemble of tree classifiers.
  Bootstrapping and aggregating (bagging)
    Multiple training sets are generated for the tree classifiers.
    The final decision is made by a plurality vote (majority vote).
  Random subspace
    Randomness is introduced during node splitting.

RF construction procedure
1. Bootstrap a training set for each tree classifier.
2. Grow one tree and add it to the forest. This step terminates when the specified number of trees is reached.
   a. While searching for an optimal cut, consider only a few dimensions. Repeat this whenever a node needs a split.
   b. Grow the tree to its maximal size without any posterior pruning (highest purity).
3. During testing, each tree in the forest hypothesizes a class for the input vector. The final decision is then made by a plurality vote.

Bootstrapping training data
D-dimensional vector
Randomly select d dimensions to search for an optimal split, where d ≈ sqrt(D).
Each node achieves the highest purity. There is no posterior pruning.
Each tree classifier is fully grown and then added to the ensemble.
Repeat the procedure several times to construct more tree classifiers.
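The construction procedure (bootstrap, random-subspace split search, unpruned growth, plurality vote) can be sketched in plain Python. This is a toy illustration, not the detector used in the talk: the Gini impurity criterion, the midpoint thresholds, the fallback to a majority leaf when the sampled dimensions offer no cut, and all function names are assumptions, and the data are synthetic.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, dims):
    """Lowest-impurity (score, dim, threshold) over the candidate dims, or None."""
    best = None
    for d in dims:
        values = sorted(set(row[d] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[d] <= thr]
            right = [lab for row, lab in zip(X, y) if row[d] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, d, thr)
    return best

def grow_tree(X, y, d_sub, rng):
    """Grow an unpruned tree; every node searches only d_sub random dims."""
    if len(set(y)) == 1:
        return y[0]                                   # pure leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), d_sub))
    if split is None:                                 # no cut in this subspace
        return Counter(y).most_common(1)[0][0]
    _, d, thr = split
    left = [(r, l) for r, l in zip(X, y) if r[d] <= thr]
    right = [(r, l) for r, l in zip(X, y) if r[d] > thr]
    return (d, thr,
            grow_tree([r for r, _ in left], [l for _, l in left], d_sub, rng),
            grow_tree([r for r, _ in right], [l for _, l in right], d_sub, rng))

def classify(tree, x):
    while isinstance(tree, tuple):                    # internal node
        d, thr, left, right = tree
        tree = left if x[d] <= thr else right
    return tree

def train_forest(X, y, n_trees, rng):
    d_sub = max(1, round(math.sqrt(len(X[0]))))       # d ~ sqrt(D)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], d_sub, rng))
    return forest

def predict(forest, x):
    """Plurality vote over the trees."""
    return Counter(classify(t, x) for t in forest).most_common(1)[0][0]

# Synthetic 4-D data: "burst" vectors near 5, "non-burst" vectors near 0.
rng = random.Random(0)
X = [[5 + rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(20)]
X += [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(20)]
y = ["burst"] * 20 + ["non-burst"] * 20
forest = train_forest(X, y, n_trees=30, rng=rng)
```

Calling `predict(forest, [5, 5, 5, 5])` returns the class that receives the most votes among the 30 trees.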

Random forest: Broad phonetic categories of manner
Articulatory manners: stop, affricate, fricative, nasal, semivowel, vowel, non-speech
“Stop” is further divided into:
  Voiced-stop burst
  Voiceless-stop burst
  Stop-aspiration
“burst”: voiced-stop burst, voiceless-stop burst
“non-burst”: all other classes

Random forest: Imbalanced training data
The problem of imbalanced training data
  The numbers of training vectors from different manners are highly imbalanced.
  #Vowel >> #Fricative > … > #Stop (#Burst)
Conventional bootstrap causes problems.
  Most training vectors are selected from the majority classes such as “Vowel” and “Fricative”.
  The target class “Burst”, however, may not be sampled sufficiently. The resulting tree classifier therefore lacks the discriminative power to detect burst onsets.

Random forest: Asymmetric bootstrap
Generate balanced training data
[Figure: bootstrapped training data drawn from the “burst”, “fricative”, and “vowel” classes]
The procedure repeats several times:
• Over-sampling the “burst” class
• Down-sampling the other classes
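One way to realize such a balanced bootstrap is to draw the same number of vectors, with replacement, from every class. The sketch below assumes exactly that; the equal per-class quota, the function name `asymmetric_bootstrap`, and the class counts are hypothetical and only illustrate the over-sampling and down-sampling effect.

```python
import random
from collections import defaultdict

def asymmetric_bootstrap(labels, per_class, rng):
    """Return indices of a balanced bootstrap sample.

    per_class indices are drawn with replacement from each class, so a
    minority class is over-sampled and majority classes are down-sampled.
    """
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    sample = []
    for lab in sorted(by_class):
        sample.extend(rng.choices(by_class[lab], k=per_class))
    return sample

# Synthetic, highly imbalanced label set.
labels = ["vowel"] * 1000 + ["fricative"] * 300 + ["burst"] * 20
rng = random.Random(0)
idx = asymmetric_bootstrap(labels, per_class=100, rng=rng)
counts = {lab: sum(labels[i] == lab for i in idx) for lab in set(labels)}
```

Repeating the call once per tree gives each tree classifier its own balanced training set.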

Random forest: Detecting burst onsets
For each input vector, the forest votes for its class.

Random forest: Detecting burst onsets
[Figure: frame axis with labels b1s to b5s; values shown: 0, 3.68, 0, 0, 3.90]


Experimental Results: Speech materials
TIMIT corpus (English read speech)
  Microphone speech, 16 kHz sampling rate, 16-bit PCM format
  630 speakers: 438 males and 192 females
  8 dialect regions in the US (DR1 to DR8)
  Training set: 462 speakers (326 M, 136 F)
  Testing set: 168 speakers (112 M, 56 F)
  Each speaker spoke 10 sentences:
    2 SA sentences: fixed contexts
    5 SX sentences: phonetically compact
    3 SI sentences: phonetically diverse

Experimental Results: Speech materials
TIMIT corpus
  Training data are from four speakers in DR1.
  Training data for the “burst” class are exclusively from stops.
  Testing data are all utterances from the TIMIT TEST set.
    6991 stops
    631 affricates

Experimental Results: RF-based burst onset detector
Random forest settings
  Training dataset: 4 speakers from TIMIT DR1
  Broad phonetic categories of articulatory manner: nine classes
  Asymmetric bootstrap is applied to balance the training data.
  56-dim feature vector (D = 56): 55 TDCCs and 1 average log-energy of the patch
  The detector consists of 30 trees.
  The dimension of the random subspace during node splitting is d = 8.
  No posterior tree pruning

Experimental Results: Precision of detection
  Median: 3.1 ms
  Interdecile range: 12.6 ms
  Precision: voiceless > voiced
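The two summary statistics reported here can be computed as sketched below. The interdecile range is the difference between the 90th and the 10th percentile. The function name and the input values are hypothetical (the input is a synthetic list, not the deviations measured in the experiment), and the percentile interpolation follows the default of Python's `statistics.quantiles`, which may differ from the method used in the talk.

```python
import statistics

def median_and_interdecile(deviations_ms):
    """Median and interdecile range (90th minus 10th percentile)."""
    deciles = statistics.quantiles(deviations_ms, n=10)  # 9 cut points
    return statistics.median(deviations_ms), deciles[-1] - deciles[0]

# Synthetic deviations in ms, for illustration only.
med, idr = median_and_interdecile(list(range(1, 100)))
```

Unlike the full range, the interdecile range ignores the extreme 10% at each end, so a few gross detection errors do not dominate the measure.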

Experimental Results: Sources of false alarm
[Figure: two examples of dental fricatives]
Most onsets of dental fricatives are detected as burst onsets, and they are hard to reject.
Other sources are fricatives and pause segments.

Experimental Results: Missed detection rate
The missed detection rate increases as the confidence threshold increases.
  Stops: 5.1% to 6.5%
  Affricates: 13.6% to 15.8%

Experimental Results: Comparison of different RF settings
  D: number of feature dimensions
  d: number of randomly selected dimensions in node splitting

Experimental Results: Comparison of various learning machines
  Accuracy: RF SVM > GMM
  Execution time: RF GMM >> SVM
  SVM kernel: RBF LIN

Experimental Results: Comparison of various amounts of training data
  Training data are from dialect region one (DR1).
  SVM-RBF starts to surpass RF as more data are included.
  SVM-RBF takes far more time in training and testing.

Summary
  The proposed RF-based detector can efficiently detect burst onsets in continuous speech.
  The detector needs only a small amount of training data.
  Experimental results demonstrate its applicability.
  The proposed asymmetric bootstrap technique can resolve the problem of imbalanced training data.

Section II Voice Onset Time Estimate

Voice Onset Time
Voice onset time (VOT) was proposed in the 1960s.
  It was expected to effectively distinguish between English /b, d, g/ and /p, t, k/.
  Other cues are “voicing”, “articulatory force”, and “aspiration”.
VOT is defined as the time difference between the burst onset and the voicing onset:
  $\mathrm{VOT} = t_v - t_b$, where $t_b$ is the burst onset and $t_v$ is the voicing onset.

Voice Onset Time: Two examples of VOT
[Figure: VOT in two example words, "borrow" and "tim"]

Voice Onset Time
VOT can be classified into several categories:
  Voicing lead: VOT is negative-valued
  Voicing coincide: VOT is about zero
  Voicing lag: VOT is positive-valued
Distributions of VOT differ from language to language:
  Two-modal: English, Spanish, Mandarin, Dutch
  Three-modal: Korean, Thai
  Four-modal: Hindi
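The three categories follow directly from the sign of the VOT, taken as the voicing onset time minus the burst onset time. The sketch below is a minimal illustration; the width of the "about zero" band (`zero_band_ms`, set to 5 ms) is an assumption made only for this example, as are the function names and onset times.

```python
def vot_ms(t_burst_ms, t_voicing_ms):
    """VOT as voicing onset time minus burst onset time, in ms."""
    return t_voicing_ms - t_burst_ms

def vot_category(vot, zero_band_ms=5.0):
    """Classify a VOT value; zero_band_ms is an assumed tolerance around zero."""
    if vot < -zero_band_ms:
        return "voicing lead"
    if vot > zero_band_ms:
        return "voicing lag"
    return "voicing coincide"

# Hypothetical onset times in ms.
lag = vot_category(vot_ms(100, 160))       # voicing starts after the burst
lead = vot_category(vot_ms(100, 40))       # voicing starts before the burst
coincide = vot_category(vot_ms(100, 102))  # voicing starts near the burst
```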

Voice Onset Time: Existing automatic methods to estimate VOT
Automatic VOT estimation methods include:
  Forced alignment performed by an HMM phone recognizer (HMM-FA)
    Pros: efficient, suitable for large corpora
    Cons: aligned boundaries normally do not coincide with the onsets
  Onset detector for burst and voicing onsets (OD)
    Pros: estimated onset locations are more accurate
    Cons: only suitable for isolated words
  Combination of the two (HMM-FA+OD)
    Has the pros of both previous methods

Proposed HMM+RF System: Flowchart of the system

Proposed HMM+RF System: System overview
The proposed system consists of two parts:
  Forced alignment based on HMM
    Roughly locates stop consonants in continuous speech.
    The aligned boundaries typically do not align with the true onset locations.
  Onset detection based on random forest
    For each aligned stop consonant, the detector searches its neighborhood for its burst and voicing onsets.

Proposed HMM+RF System: HMM-based phone recognizer
  Training dataset: the whole TIMIT training set
  48 context-independent English phones
  HMM topology: three-state left-to-right HMM; each state has eight Gaussian components.
  ML training with the EM algorithm
  Embedded training is run five times every time the number of Gaussian components is doubled.
  Features: 13-dim MFCC + 1-dim log-energy, plus their derivative and accelerative coefficients

Proposed HMM+RF System: RF-based onset detector
  Training dataset: 4 speakers from the TIMIT training set
  Broad phonetic categories of articulatory manner
    Burst: burst onset
    Vocalic: voicing onset
  56-dim TDCC vector
  The detector consists of 30 trees.
  The dimension of the random subspace during node splitting is 8.
  No posterior tree pruning
  Asymmetric bootstrap is applied to balance the training data from the broad phonetic categories.

Proposed HMM+RF System: More details about the onset detector
  Burst onset detection
    The procedure is the same as described in Section I.
  Voicing onset detection
    The first frame of a detected ‘vocalic’ segment following a detected burst onset is regarded as the voicing onset.
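The voicing onset rule can be sketched over a sequence of per-frame class labels. This is a simplified illustration: the label strings, the function name `find_voicing_onset`, and the example sequence are hypothetical, no minimum segment duration is enforced, and the 2 ms frame shift from Section I is assumed to apply here as well.

```python
FRAME_SHIFT_MS = 2  # assumed to match the frame shift given in Section I

def find_voicing_onset(frame_labels, burst_frame, vocalic="vocalic"):
    """Index of the first vocalic frame after the burst onset, or None."""
    for i in range(burst_frame + 1, len(frame_labels)):
        if frame_labels[i] == vocalic:
            return i
    return None

# Hypothetical per-frame decisions around one stop consonant.
labels = ["closure"] * 3 + ["burst"] + ["aspiration"] * 4 + ["vocalic"] * 5
burst_frame = labels.index("burst")
voicing_frame = find_voicing_onset(labels, burst_frame)
vot = (voicing_frame - burst_frame) * FRAME_SHIFT_MS  # VOT estimate in ms
```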

Voicing onset adjustment procedure
(a) Aspiration or release portion: is large
(b) Vocalic portion: is small
in the region between (a) and (b) will be large



An example of voicing onset adjustment

Experimental Results: Evaluation dataset
Subset of the TIMIT testing set
  3,784 stop consonants in 968 distinct words
  2,344 word-initial stop consonants and 1,440 word-medial stop consonants
  The selected stop consonants are left-context independent but right-context dependent.

Experimental Results: Evaluation dataset
The list of eligible succeeding vowels in the experiment
Ht. (vowel height): Low, Mid-Low, Mid-High, High
Bk. (vowel backness): Front, Central, Back

Experimental Results: Performance evaluation
Four systems are compared:
  HMM-FA-PL: HMM forced alignment at phone level
  HMM-FA-PL+OD: HMM-FA-PL with onset detection
  HMM-FA-SL: HMM forced alignment at state level
  HMM-FA-SL+OD: HMM-FA-SL with onset detection

Experimental Results: Performance evaluation
The measure is the absolute temporal deviation between an estimated VOT and its true value.
The deviations are presented as cumulative relative frequency distributions.
Four tolerances: 5 ms, 10 ms, 15 ms, and 20 ms
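The evaluation measure can be sketched as the percentage of estimates whose absolute deviation falls below each tolerance. The function name and the VOT values below are hypothetical and serve only to show the computation; the strict "less than" comparison is an assumption.

```python
def within_tolerance(estimated, reference, tolerances=(5, 10, 15, 20)):
    """Percentage of VOT estimates with absolute deviation below each tolerance (ms)."""
    devs = [abs(e - r) for e, r in zip(estimated, reference)]
    return {tol: 100.0 * sum(d < tol for d in devs) / len(devs)
            for tol in tolerances}

# Synthetic VOT values in ms (absolute deviations: 2, 8, 17, 1).
estimated = [30, 42, 18, 75]
reference = [28, 50, 35, 74]
result = within_tolerance(estimated, reference)
```

Because the measure is cumulative, the percentages never decrease as the tolerance grows.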

Experimental Results: VOT estimates in voiced and voiceless stops
  VOT estimates of voiced stops with HMM-FA-PL are very poor.
    HMM topology limitation
  HMM-FA-SL significantly improves the estimates.
  The effect of an additional onset detection is remarkable.

Experimental Results: 3D histograms of estimate deviations
  HMM-FA-SL corrects the estimate deviation of the burst onset in HMM-FA-PL.
  With additional OD, the estimates of both burst and voicing onsets are enhanced.

Experimental Results: Performance comparison
Absolute deviation of estimation

Method                    < 5 ms   < 10 ms   < 15 ms   < 20 ms
HMM-FA-SL+OD              57.2     83.4      93.4      96.5
HMM-FA-PL+OD              55.5     81.2      91.5      95.7
HMM-FA-SL+RS*             56.1     80.6      90.9      94.1
Stouten & Van hamme**     --       76.1      --        91.4

* RS: Reassigned Spectrum
** Stouten & Van hamme (2009) employed the RS technique to estimate VOT.

Experimental Results: Performance in detail
  VOTs of the voiced velar stop /g/ are estimated less accurately than those of the other five stops.
  On average, VOTs of velar stops (/g/, /k/) are estimated less accurately.
  VOTs of word-medial voiced stops are estimated less accurately than those of their word-initial counterparts.
    This is caused by failed detection of the burst onset.
  In contrast, the estimates for voiceless stops in word-medial and word-initial positions are statistically the same.

Experimental Results: Failed onset detection in word-medial stops
Example of failed burst onset detection: no noticeable burst onset

Experimental Results: Failed onset detection in word-medial stops
Example of failed burst onset detection: the stop is surrounded by strong vocalic pulses

Summary
  HMM-based forced alignment alone provides less accurate VOT estimates; applying an additional onset detection significantly improves the accuracy.
  The accuracy of VOT estimation varies with a stop’s position in a word and its place of articulation.

Conclusion
  The proposed RF-based burst onset detector employs the spectro-temporal patterns of the closure-burst transition to efficiently detect burst onsets in continuous speech.
  Burst onset detection is combined with voicing onset detection to significantly enhance the VOT estimates initially made by HMM-based forced alignment.
  The method could be useful for speech event annotation and speech assessment.


Thank You
