NGASR 2011 Summer Workshop. Speaker: Chi-Yueh Lin (林奇嶽)




Page 1

Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate

NGASR 2011 Summer Workshop
Speaker: Chi-Yueh Lin (林奇嶽)

Page 2

2011/07/12

Outline

- Burst Onset Detection
  - Burst onset
  - Feature representation
  - Random forest (RF)
  - Experimental results
- Voice Onset Time Estimate
  - Voice onset time (VOT)
  - Proposed HMM+RF system
  - Experimental results
- Conclusion

Page 3

Section I: Burst Onset Detection

Page 4

Burst Onset: Fundamental phonetics

A stop or an affricate consonant consists of the following speech events:

1. Closure: the air flow is completely blocked by certain articulators in the vocal tract (voice bar or silence).
2. Release: the blockage is suddenly released, resulting in a puff of air rushing out of the mouth.
3. Aspiration (stop) or frication (affricate).

The most salient event is the onset of the release, which is commonly termed the burst onset.

[Figure: two panels, each labeled "Burst onset"]

Page 5

Burst Onset: Fundamental phonetics

- The burst onset could be the shortest event in a speech signal.
- A sudden increase of all-band energy exhibits a stripe pattern in a Fourier-based spectrogram.
- Such all-band energy dies out immediately.

Page 6

Burst Onset: Fundamental phonetics

- To detect burst onsets in continuous speech, we focus on a small spectro-temporal patch containing a “closure-burst transition”.

Page 7

Feature representation: Two-dimensional cepstral coefficients

- Two-dimensional cepstral coefficients (TDCC) are used to encode such a “closure-burst transition”.
- In deriving the TDCC for each spectro-temporal patch, we perform two discrete cosine transforms (DCTs) to compact the transition information into a small set of coefficients.
  - 1st DCT: cepstral analysis (along the frequency axis)
  - 2nd DCT: dynamic behavior of the coefficients from the first DCT (along the time axis)
  - Between the two DCTs, a cepstral mean subtraction (CMS) is applied.
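The DCT, CMS, DCT chain described above can be sketched in a few lines. This is only an illustration under assumptions: the input patch is taken to be a list of frames of log-spectral values, the DCT is an unnormalized DCT-II, and the function names `dct2` and `tdcc` are hypothetical. The slides do not state the normalization or which coefficients are retained.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence (assumed DCT variant)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def tdcc(patch):
    """patch[t][f]: log-spectral value of frame t at frequency bin f.
    Returns a matrix indexed as [cepstral index][time-DCT index]."""
    # 1st DCT: cepstral analysis along the frequency axis of each frame
    ceps = [dct2(frame) for frame in patch]
    n_frames, n_ceps = len(ceps), len(ceps[0])
    # CMS: subtract the mean over time from every cepstral trajectory
    means = [sum(c[q] for c in ceps) / n_frames for q in range(n_ceps)]
    ceps = [[c[q] - means[q] for q in range(n_ceps)] for c in ceps]
    # 2nd DCT: along the time axis of every cepstral trajectory
    return [dct2([ceps[t][q] for t in range(n_frames)]) for q in range(n_ceps)]

patch = [[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 5.0], [0.5, 0.5, 2.0, 1.0]]
coeffs = tdcc(patch)
```

With this ordering, the CMS step forces the zeroth time-DCT coefficient of every cepstral trajectory to zero, so only the dynamic part of the transition remains in the output.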

Page 8

Feature representation: Two-dimensional cepstral coefficients

- The derivation of dynamic features with the conventional regression formula is similar to that with TDCC.

[Figure: two panels, "Derivative coeff." and "Accelerative coeff."; x-axis: relative frame distance; y-axis: coefficient value]

Page 9

Feature representation: Deriving TDCC from a spectro-temporal patch

- Each frame in a patch is an LPC-derived spectrum.
  - Frame length: 10 ms (160 samples)
  - Frame shift: 2 ms (32 samples)
  - LP analysis with an order of 24. The LPC-derived spectrum is obtained with a 512-point DFT.
- 55 coefficients are extracted in a row-major fashion to form a 55x1 vector.
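The sample counts quoted for the frame length and frame shift follow from the 16 kHz sampling rate of the TIMIT recordings used in the experiments. A small sketch of the arithmetic and of the resulting frame positions is given below; the helper names `ms_to_samples` and `frame_starts` are hypothetical.

```python
SAMPLE_RATE = 16000  # Hz, the TIMIT sampling rate

def ms_to_samples(ms):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(SAMPLE_RATE * ms / 1000))

frame_len = ms_to_samples(10)   # 10 ms frame length
frame_shift = ms_to_samples(2)  # 2 ms frame shift

def frame_starts(n_samples):
    """Start indices of all full-length frames that fit in the signal."""
    return list(range(0, n_samples - frame_len + 1, frame_shift))

starts = frame_starts(320)  # a 20 ms signal
```

A 2 ms shift means consecutive frames overlap by 8 ms, which gives the fine time resolution needed for an event as short as a burst onset.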

Page 10

Feature representation: Waveform and feature plane

[Figure: closure-burst transition patterns for detecting burst onsets]

Page 11

Random forest: Fundamentals

A random forest (RF) combines the following techniques:

- An ensemble of classifiers
  - An RF is an ensemble of tree classifiers.
- Bootstrapping and aggregating (bagging)
  - Generate multiple training sets for the tree classifiers.
  - The final decision is made by a plurality vote (majority vote).
- Random subspace
  - Introduce randomness during node splitting.

Page 12

Random forest: Fundamentals

RF construction procedure:

1. Bootstrap a training set for each tree classifier.
2. Grow one tree and add it to the forest. This step terminates when a specified number of trees is reached.
   a. While searching for an optimal cut, consider only a few dimensions. Repeat this whenever a node needs a split.
   b. Grow the tree to its maximal size without any posterior pruning (highest purity).
3. During testing, each tree in the forest hypothesizes a class for the input vector. The final decision is then made by a plurality vote.

Page 13

Random forest: Fundamentals

- Bootstrap the training data (D-dimensional vectors).
- At each node, randomly select d dimensions to search for an optimal split, where d ≈ sqrt(D).
- Each node achieves the highest purity. There is no posterior pruning.
- Each tree classifier is fully grown and then added to the ensemble.
- Repeat the procedure several times to construct more tree classifiers.
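The two sources of randomness in this procedure, the bootstrap sample per tree and the random subspace per node split, can be sketched as follows. The function names are hypothetical and the tree-growing step itself is omitted; rounding sqrt(D) to the nearest integer is an assumption, since the slides only give d ≈ sqrt(D).

```python
import math
import random

def bootstrap_indices(n, rng):
    """Draw n training indices with replacement (one bootstrap set per tree)."""
    return [rng.randrange(n) for _ in range(n)]

def random_subspace(D, rng, d=None):
    """Pick d distinct feature dimensions to examine at one node split."""
    if d is None:
        d = max(1, round(math.sqrt(D)))  # assumed rounding of d ~ sqrt(D)
    return sorted(rng.sample(range(D), d))

rng = random.Random(0)            # fixed seed for a repeatable illustration
tree_set = bootstrap_indices(10, rng)
split_dims = random_subspace(56, rng)
```

For D = 56 the square root is about 7.5, so the default here gives d = 7; passing `d=8` selects eight dimensions instead.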

Page 14

Random forest: Broad phonetic categories of manner

- Articulatory manners: stop, affricate, fricative, nasal, semivowel, vowel, non-speech
- “Stop” is further divided into
  - Voiced-stop burst
  - Voiceless-stop burst
  - Stop-aspiration
- “burst”: voiced-stop burst, voiceless-stop burst
- “non-burst”: all other classes

Page 15

Random forest: Imbalanced training data

- The problem of imbalanced training data
  - The numbers of training vectors from different manners are highly imbalanced.
  - #Vowel >> #Fricative > … > #Stop (#Burst)
- Conventional bootstrap causes problems.
  - Most training vectors are selected from the majority classes such as “Vowel” and “Fricative”.
  - The target class “Burst”, however, may not be sampled sufficiently. The resulting tree classifier thus lacks the discriminative power to detect burst onsets.

Page 16

Random forest: Asymmetric bootstrap

Generate balanced training data:

[Figure: the "burst", "fricative", and "vowel" classes are each resampled into the bootstrapped training data]

- The procedure repeats several times.
- Over-sample the “burst” class.
- Down-sample the other classes.
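One way to realize such a balanced resampling is to draw the same number of vectors, with replacement, from every class. The sketch below assumes exactly that; the per-class count, the function name `asymmetric_bootstrap`, and the toy class sizes are illustrative choices, not values from the slides.

```python
import random
from collections import Counter

def asymmetric_bootstrap(labels, per_class, rng):
    """Return training indices with per_class draws (with replacement) per class."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    sample = []
    for lab in sorted(by_class):
        pool = by_class[lab]
        # Small classes are over-sampled, large classes are down-sampled
        sample.extend(rng.choice(pool) for _ in range(per_class))
    return sample

labels = ["vowel"] * 50 + ["fricative"] * 20 + ["burst"] * 3
rng = random.Random(0)
idx = asymmetric_bootstrap(labels, per_class=10, rng=rng)
counts = Counter(labels[i] for i in idx)
```

Calling the function once per tree gives every tree its own balanced training set, so the rare "burst" class is represented equally in each.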

Page 17

Random forest: Detecting burst onsets

- For each input vector, the forest votes for its class.
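The voting step can be sketched as a plurality count over the per-tree hypotheses. The vote share returned below is only one plausible confidence measure and is an assumption; the slides do not define how the detector's score is computed. The function name `forest_vote` is hypothetical.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the plurality class and its share of the votes."""
    counts = Counter(tree_predictions)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(tree_predictions)

# Hypothetical outputs of a 30-tree forest for one input vector
votes = ["burst"] * 21 + ["fricative"] * 6 + ["vowel"] * 3
label, share = forest_vote(votes)
```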

Page 18

Random forest: Detecting burst onsets

[Figure: detection example with five tracks labeled b1 to b5 along the frame axis; values shown: 0, 3.68, 0, 0, 3.90]

Page 19

Random forest: Detecting burst onsets

[Figure: detection example with five tracks labeled b1 to b5 along the frame axis; values shown: 0, 3.68, 0, 0, 3.90]

Page 20

Experimental Results: Speech materials

- TIMIT corpus (English read speech)
  - Microphone speech, 16 kHz sampling rate, 16-bit PCM format
  - 630 speakers: 438 males and 192 females
  - 8 dialect regions in the US (DR1 to DR8)
  - Training set: 462 speakers (326M, 136F)
  - Testing set: 168 speakers (112M, 56F)
- Each speaker spoke 10 sentences:
  - 2 SA sentences: fixed contexts
  - 5 SX sentences: phonetically compact
  - 3 SI sentences: phonetically diverse

Page 21

Experimental Results: Speech materials

- TIMIT corpus
  - Training data are from four speakers in DR1.
  - Training data for the “burst” class are exclusively from stops.
  - Testing data are all utterances from the TIMIT TEST set: 6991 stops and 631 affricates.

Page 22

Experimental Results: RF-based burst onset detector

Random forest settings:

- Training dataset: 4 speakers from TIMIT DR1
- Broad phonetic categories of articulatory manner: nine classes
- Asymmetric bootstrap is applied to balance the training data.
- 56-dim feature vector (D = 56), comprising 55 TDCCs and 1 average log-energy of the patch
- The detector consists of 30 trees.
- The dimension of the random subspace during node splitting is d = 8.
- No posterior tree pruning

Page 23

Experimental Results: Detection examples

[Figure: detection example on the utterance "Put the butcher block table"; annotated segments: stops, dental fricative, affricate]

Page 24

Experimental Results: Precision of detection

- Median: 3.1 ms
- Interdecile range: 12.6 ms
- Precision: voiceless > voiced
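For reference, the two summary statistics above can be computed from a list of detection deviations as sketched below, where the interdecile range is the distance between the 10th and 90th percentiles. The deviation values are made up for illustration, and the quantile interpolation method used in the original study is not stated, so Python's default is assumed.

```python
import statistics

def median_and_interdecile_range(deviations_ms):
    """Median and (90th percentile - 10th percentile) of the deviations."""
    deciles = statistics.quantiles(deviations_ms, n=10)  # nine cut points
    return statistics.median(deviations_ms), deciles[-1] - deciles[0]

deviations_ms = list(range(1, 100))  # made-up deviations: 1, 2, ..., 99 ms
med, idr = median_and_interdecile_range(deviations_ms)
```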

Page 25

Experimental Results: Sources of false alarms

[Figure: two examples labeled "Dental fricative"]

- Most onsets of dental fricatives are detected as burst onsets, and they are hard to reject.
- Other sources are fricatives and pause segments.

Page 26

Experimental Results: Missed detection rate

- The missed detection rate increases as the confidence threshold increases.
  - Stops: 5.1% to 6.5%
  - Affricates: 13.6% to 15.8%

Page 27

Experimental Results: Comparison of different RF settings

- D: number of feature dimensions
- d: number of randomly selected dimensions in node splitting

Page 28

Experimental Results: Comparison of various learning machines

- Accuracy: RF, SVM > GMM
- Execution time: RF, GMM >> SVM
- SVM kernels: RBF, LIN

Page 29

Experimental Results: Comparison of various amounts of training data

- Training data are from dialect region one (DR1).
- SVM-RBF starts to surpass RF as more data are included.
- SVM-RBF takes far more time in training and testing.

Page 30

Summary

- The proposed RF-based detector is able to efficiently detect burst onsets in continuous speech.
  - The detector needs only a small amount of training data.
  - Experimental results demonstrate its applicability.
- The proposed asymmetric bootstrap technique can resolve the problem of imbalanced training data.

Page 31

Section II: Voice Onset Time Estimate

Page 32

Voice Onset Time

- Voice onset time (VOT) was proposed in the 1960s.
  - It was expected to effectively distinguish between English /b, d, g/ and /p, t, k/.
  - Other cues are “voicing”, “articulatory force”, and “aspiration”.
- VOT is defined as the time difference between the burst onset and the voicing onset:

  $$\mathrm{VOT} = t_v - t_b, \qquad t_v:\ \text{voicing onset}, \quad t_b:\ \text{burst onset}$$

Page 33

Voice Onset Time: Two examples of VOT

[Figure: two VOT examples, labeled "borrow" and "tim"]

Page 34

Voice Onset Time

- VOT can be classified into several categories:
  - Voicing lead: VOT is negative
  - Voicing coincide: VOT is about zero
  - Voicing lag: VOT is positive
- Distributions of VOT differ from language to language.
  - Two-modal: English, Spanish, Mandarin, Dutch
  - Three-modal: Korean, Thai
  - Four-modal: Hindi
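The three categories can be expressed as a simple rule on the sign of the VOT, taken here as the voicing onset time minus the burst onset time so that voicing lead comes out negative. The width of the "about zero" band is not given in the slides; the 5 ms tolerance and the function names below are assumptions for illustration only.

```python
def vot_ms(t_burst_ms, t_voicing_ms):
    """VOT as voicing onset minus burst onset (negative means voicing lead)."""
    return t_voicing_ms - t_burst_ms

def vot_category(vot, zero_tolerance_ms=5.0):
    """Map a VOT value in ms to one of the three categories.
    zero_tolerance_ms is an assumed width for 'about zero'."""
    if vot < -zero_tolerance_ms:
        return "voicing lead"
    if vot > zero_tolerance_ms:
        return "voicing lag"
    return "voicing coincide"

examples = [vot_category(vot_ms(100, t_v)) for t_v in (40, 102, 170)]
```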

Page 35

Voice Onset Time: Existing automatic methods to estimate VOT

Automatic VOT estimation methods include:

- Forced alignment performed by an HMM phone recognizer (HMM-FA)
  - Pros: efficient, suitable for large corpora
  - Cons: aligned boundaries normally do not coincide with the onsets
- Onset detector for burst and voicing onsets (OD)
  - Pros: estimated onset locations are more accurate
  - Cons: only suitable for isolated words
- Combination of the two (HMM-FA+OD)
  - Has the pros of both previous methods

Page 36

Proposed HMM+RF System: Flowchart of the system

[Figure: flowchart of the system]

Page 37

Proposed HMM+RF System: System overview

The proposed system consists of two parts:

- Forced alignment based on HMM
  - Roughly locates stop consonants in continuous speech.
  - The aligned boundaries typically do not align with the true onset locations.
- Onset detection based on random forest
  - For each aligned stop consonant, the detector searches its neighborhood for the burst and voicing onsets.

Page 38

Proposed HMM+RF System: HMM-based phone recognizer

- Training dataset: the whole TIMIT training set
- 48 context-independent English phones
- HMM topology: three-state left-to-right HMM; each state has eight Gaussian components.
- ML training with the EM algorithm
  - Five iterations of embedded training are run every time the number of Gaussian components is doubled.
- Features: 13-dim MFCC + 1-dim log-energy, plus their derivative and accelerative coefficients

Page 39

Proposed HMM+RF System: RF-based onset detector

- Training dataset: 4 speakers from the TIMIT training set
- Broad phonetic categories of articulatory manner
  - Burst: burst onset
  - Vocalic: voicing onset
- 56-dim TDCC vector
- The detector consists of 30 trees.
- The dimension of the random subspace during node splitting is 8.
- No posterior tree pruning
- Asymmetric bootstrap is applied to balance the training data from the broad phonetic categories.

Page 40

Proposed HMM+RF System: More details about the onset detector

- Burst onset detection
  - The procedure is the same as described in Section I.
- Voicing onset detection
  - The first frame of a detected ‘vocalic’ segment following a detected burst onset is regarded as the voicing onset.
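The voicing-onset rule can be sketched on a sequence of per-frame class decisions. The label strings, the helper name `voicing_onset`, and the use of the 2 ms frame shift from Section I to convert frames to milliseconds are assumptions for illustration.

```python
FRAME_SHIFT_MS = 2.0  # assumed: same frame shift as in Section I

def voicing_onset(frame_labels, burst_frame):
    """Index of the first 'vocalic' frame after the burst onset, or None."""
    for t in range(burst_frame + 1, len(frame_labels)):
        if frame_labels[t] == "vocalic":
            return t
    return None

# Hypothetical per-frame decisions around one stop consonant
labels = ["closure"] * 5 + ["burst"] * 2 + ["aspiration"] * 10 + ["vocalic"] * 8
burst = labels.index("burst")
voiced = voicing_onset(labels, burst)
vot_estimate_ms = (voiced - burst) * FRAME_SHIFT_MS
```

Returning `None` when no vocalic frame follows the burst lets a caller fall back to the forced-alignment boundary instead of reporting a spurious onset.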

Page 41

Proposed HMM+RF System: More details about the onset detector

Voicing onset adjustment procedure:

(a) Aspiration or release portion: is large
(b) Vocalic portion: is small

in the region between (a) and (b) will be large

Page 42

Proposed HMM+RF System: More details about the onset detector

[Figure: an example of voicing onset adjustment]

Page 43

Experimental Results: Evaluation dataset

- Subset of the TIMIT testing set
- 3,784 stop consonants in 968 distinct words
- 2,344 word-initial stop consonants and 1,440 word-medial stop consonants
- The selected stop consonants are left-context independent but right-context dependent.

Page 44

Experimental Results: Evaluation dataset

[Table: the list of eligible succeeding vowels in the experiment. Ht. (vowel height): Low, Mid-Low, Mid-High, High. Bk. (vowel backness): Front, Central, Back.]

Page 45

Experimental Results: Performance evaluation

Four systems are compared:

- HMM-FA-PL: HMM forced alignment at the phone level
- HMM-FA-PL+OD: HMM-FA-PL with onset detection
- HMM-FA-SL: HMM forced alignment at the state level
- HMM-FA-SL+OD: HMM-FA-SL with onset detection

Page 46

Experimental Results: Performance evaluation

- The measure is the absolute temporal deviation between an estimated VOT and its true value.
- The deviations are presented in terms of cumulative relative frequency distributions.
- Four tolerances: 5 ms, 10 ms, 15 ms, and 20 ms
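This evaluation measure amounts to counting, for each tolerance, the share of stops whose absolute VOT deviation falls below it. A minimal sketch follows; the function name and the toy VOT values are hypothetical, and the strict "less than" comparison is an assumption.

```python
def within_tolerance(estimated_ms, reference_ms, tolerances_ms=(5, 10, 15, 20)):
    """Percentage of estimates whose absolute deviation is below each tolerance."""
    deviations = [abs(e - r) for e, r in zip(estimated_ms, reference_ms)]
    return {tol: 100.0 * sum(d < tol for d in deviations) / len(deviations)
            for tol in tolerances_ms}

# Made-up VOT values in ms; the deviations are 2, 8, 15, and 1 ms
estimated = [10, 22, 35, 60]
reference = [12, 30, 20, 61]
rates = within_tolerance(estimated, reference)
```

Because the tolerances are nested, the resulting percentages are non-decreasing from 5 ms to 20 ms, which is what makes them points on a cumulative distribution.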

Page 47

Experimental Results: VOT estimates in voiced and voiceless stops

- VOT estimates for voiced stops with HMM-FA-PL are very poor.
  - HMM topology limitation
  - HMM-FA-SL significantly improves the estimates.
- The effect of an additional onset detection is remarkable.

Page 48

Experimental Results: 3D histograms of estimate deviations

- HMM-FA-SL corrects the deviation of the burst onset estimate in HMM-FA-PL.
- With additional OD, the estimates of both burst and voicing onsets are improved.

Page 49

Experimental Results: Performance comparison

Absolute deviation of estimation (cumulative relative frequency, %):

Method                    < 5 ms   < 10 ms   < 15 ms   < 20 ms
HMM-FA-SL+OD              57.2     83.4      93.4      96.5
HMM-FA-PL+OD              55.5     81.2      91.5      95.7
HMM-FA-SL+RS*             56.1     80.6      90.9      94.1
Stouten & Van hamme**     --       76.1      --        91.4

* RS: Reassigned Spectrum
** Stouten & Van hamme (2009) employed the RS technique to estimate VOT.

Page 50

Experimental Results: Performance in detail

- VOTs of the voiced velar stop /g/ are less accurately estimated than those of the other five stops.
  - On average, VOTs of velar stops (/g/, /k/) are less accurately estimated.
- VOTs of word-medial voiced stops are less accurately estimated than their word-initial counterparts.
  - This is caused by failed detection of the burst onset.
  - In contrast, the estimates for voiceless stops in word-medial and word-initial positions are statistically the same.

Page 51

Experimental Results: Failed onset detection in word-medial stops

[Figure: example of failed burst onset detection; no noticeable burst onset]

Page 52

Experimental Results: Failed onset detection in word-medial stops

[Figure: example of failed burst onset detection; the stop is surrounded by strong vocalic pulses]

Page 53

Summary

- HMM-based forced alignment provides less accurate VOT estimates; however, applying an additional onset detection can significantly improve the accuracy.
- The accuracy of VOT estimation varies with a stop’s position in a word and its place of articulation.

Page 54

Conclusion

- The proposed RF-based burst onset detector employs the spectro-temporal patterns of the closure-burst transition to efficiently detect burst onsets in continuous speech.
- Burst onset detection is combined with voicing onset detection to significantly improve the VOT estimates initially made by HMM-based forced alignment.
- The method could be useful for speech event annotation and speech assessment.

Page 55

Thank You