Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate
NGASR 2011 Summer Workshop
Speaker: 林奇嶽
2011/07/12 2
Outline
• Burst Onset Detection
    • Burst onset
    • Feature representation
    • Random forest (RF)
    • Experimental results
• Voice Onset Time Estimate
    • Voice onset time (VOT)
    • Proposed HMM+RF system
    • Experimental results
• Conclusion
Section I Burst Onset Detection
Burst Onset: Fundamental phonetics
• A stop or an affricate consonant consists of the following speech events:
    1. Closure: the air flow is completely blocked by certain articulators in the vocal tract (voice bar or silence).
    2. Release: the blockage is suddenly released, resulting in a puff of air rushing out of the mouth.
    3. Aspiration (stop) or frication (affricate).
• The most salient event is the onset of the release, which is commonly termed the burst onset.
[Figure: two examples, each with the burst onset labeled]
Burst Onset: Fundamental phonetics
• The burst onset could be the shortest event in a speech signal.
    • A sudden increase of all-band energy appears as a stripe pattern in a Fourier-based spectrogram.
    • This all-band energy dies out immediately.
[Figure: example utterance “don’t carry”]
Burst Onset: Fundamental phonetics
• To detect burst onsets in continuous speech, we focus on a small spectro-temporal patch containing a “closure-burst transition”.
[Figure: example utterance “don’t carry”]
Feature representation: Two-dimensional cepstral coefficients
• Two-dimensional cepstral coefficients (TDCC) are used to encode such a “closure-burst transition”.
• To derive the TDCC of a spectro-temporal patch, we apply two discrete cosine transforms (DCTs) that compact the transition information into a small set of coefficients.
    • 1st DCT: cepstral analysis (along the frequency axis)
    • 2nd DCT: dynamic behavior of the coefficients from the first DCT (along the time axis)
    • Between the two DCTs, a cepstral mean subtraction (CMS) is applied.
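The two-DCT derivation above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names `dct_ii` and `tdcc`, the unnormalized DCT-II, the log-magnitude input, and the 5×11 block of retained coefficients are all hypothetical choices. Because CMS removes the mean over frames, the zeroth temporal coefficient is zero for every cepstral index, so the sketch skips that row.

```python
import numpy as np

def dct_ii(x, axis):
    """Unnormalized DCT-II along one axis, written as a matrix product."""
    n = x.shape[axis]
    k = np.arange(n)[:, None]                      # output index
    m = np.arange(n)[None, :]                      # input index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    moved = np.moveaxis(x, axis, 0)
    out = np.tensordot(basis, moved, axes=(1, 0))
    return np.moveaxis(out, 0, axis)

def tdcc(log_spec_patch, n_time=5, n_quef=11):
    """log_spec_patch: (n_frames, n_bins) log spectra of one patch.

    n_time and n_quef are assumed block sizes (5 * 11 = 55 coefficients).
    """
    # 1st DCT along frequency: one cepstral vector per frame.
    ceps = dct_ii(log_spec_patch, axis=1)
    # CMS: subtract the mean of each cepstral coefficient over the patch.
    ceps = ceps - ceps.mean(axis=0, keepdims=True)
    # 2nd DCT along time: dynamics of each cepstral coefficient.
    plane = dct_ii(ceps, axis=0)                   # (temporal index, cepstral index)
    # Row 0 is zero after CMS, so keep temporal rows 1..n_time (assumption).
    block = plane[1:n_time + 1, :n_quef]
    return block.reshape(-1)                       # row-major flattening
```

A patch of 20 frames with 257 bins each yields a 55-element vector with these defaults.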
Feature representation: Two-dimensional cepstral coefficients
• Dynamic-feature derivation with the conventional regression formula is similar to that with TDCC.
[Figure: two panels, “Derivative coeff.” and “Accelerative coeff.”; x-axis: relative frame distance; y-axis: coefficient value]
Feature representation: Deriving TDCC from a spectro-temporal patch
• Each frame in a patch is an LPC-derived spectrum.
    • Frame length: 10 ms (160 samples)
    • Frame shift: 2 ms (32 samples)
    • LP analysis with an order of 24. The LPC-derived spectrum is obtained with a 512-point DFT.
• 55 coefficients are extracted in a row-major fashion, giving a 55×1 vector.
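The frame settings above translate directly into sample counts at the 16 kHz TIMIT sampling rate, and each frame can be turned into an LPC-derived spectrum. The sketch below is an assumption-laden illustration: the Hamming window, the autocorrelation method with the Levinson-Durbin recursion, and the names `split_frames`, `lpc`, and `lpc_log_spectrum` are not taken from the slides.

```python
import numpy as np

FS = 16000                        # sampling rate in Hz
FRAME_LEN = FS * 10 // 1000       # 10 ms -> 160 samples
FRAME_SHIFT = FS * 2 // 1000      # 2 ms  -> 32 samples

def split_frames(signal):
    """Cut a 1-D signal into overlapping frames, shape (n_frames, FRAME_LEN)."""
    n = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    return np.stack([signal[i * FRAME_SHIFT:i * FRAME_SHIFT + FRAME_LEN]
                     for i in range(n)])

def lpc(frame, order=24):
    """Autocorrelation-method LPC; returns (a, err) with a[0] = 1."""
    w = frame * np.hamming(len(frame))            # window choice is an assumption
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                            # floor avoids division by zero
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                            # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_log_spectrum(frame, order=24, n_fft=512):
    """Log power spectrum of the all-pole model, n_fft // 2 + 1 bins."""
    a, err = lpc(frame, order)
    denom = np.abs(np.fft.rfft(a, n_fft)) ** 2 + 1e-12
    return np.log(max(err, 1e-12) / denom)
```

With a 512-point DFT each frame gives 257 non-redundant spectral bins.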
Feature representation: Waveform and feature plane
• Closure-burst transition patterns for detecting burst onsets
Random forest: Fundamentals
• A random forest (RF) combines the following techniques:
    • An ensemble of classifiers
        • An RF is an ensemble of tree classifiers.
    • Bootstrapping and aggregating (bagging)
        • Generate multiple training sets for the tree classifiers.
        • The final decision is made by a plurality vote (majority vote).
    • Random subspace
        • Introduce randomness during node splitting.
Random forest: Fundamentals
• RF construction procedure
    1. Bootstrap a training set for each tree classifier.
    2. Grow one tree and add it to the forest. This step terminates when the specified number of trees is reached.
        a. When searching for an optimal cut, consider only a few dimensions. Repeat this whenever a node needs a split.
        b. Grow the tree to its maximal size without any posterior pruning (highest purity).
    3. During testing, each tree in the forest hypothesizes a class for the input vector. The final decision is then made by a plurality vote.
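The construction procedure above can be sketched with NumPy alone. This is a toy illustration under stated assumptions, not the detector used in the talk: Gini impurity as the purity measure, thresholds taken at observed feature values, and all function names (`grow_tree`, `train_forest`, `predict_forest`) are hypothetical. Labels are assumed to be non-negative integers.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, dims):
    """Search only the given dimensions for the lowest weighted impurity."""
    best = None                                   # (score, dim, threshold)
    for d in dims:
        for t in np.unique(X[:, d])[:-1]:         # both sides stay non-empty
            left = X[:, d] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, d, t)
    return best

def grow_tree(X, y, n_sub, rng):
    """Grow a tree to full size with no pruning; a leaf is a class label."""
    if len(np.unique(y)) == 1:
        return int(y[0])
    dims = rng.choice(X.shape[1], size=n_sub, replace=False)  # random subspace
    # If the sampled dimensions are constant here, fall back to all dimensions.
    split = best_split(X, y, dims) or best_split(X, y, range(X.shape[1]))
    if split is None:                             # identical vectors, mixed labels
        return int(np.bincount(y).argmax())
    _, d, t = split
    left = X[:, d] <= t
    return (d, t,
            grow_tree(X[left], y[left], n_sub, rng),
            grow_tree(X[~left], y[~left], n_sub, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):
        d, t, left, right = node
        node = left if x[d] <= t else right
    return node

def train_forest(X, y, n_trees=30, n_sub=None, seed=0):
    rng = np.random.default_rng(seed)
    n_sub = n_sub or max(1, int(round(np.sqrt(X.shape[1]))))
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap with replacement
        forest.append(grow_tree(X[idx], y[idx], n_sub, rng))
    return forest

def predict_forest(forest, x):
    """Plurality vote over the trees."""
    votes = np.bincount([predict_tree(tree, x) for tree in forest])
    return int(votes.argmax())
```

Each tree sees its own bootstrap sample and re-draws the candidate dimensions at every node, which is what decorrelates the trees before the vote.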
Random forest: Fundamentals
• Bootstrap the training data (D-dimensional vectors).
• At each node, randomly select d dimensions to search for an optimal split, where d ≈ sqrt(D).
• Each node achieves the highest purity; there is no posterior pruning.
• Each tree classifier is fully grown and then added to the ensemble.
• Repeat the procedure several times to construct more tree classifiers.
Random forest: Broad phonetic categories of manner
• Articulatory manners: stop, affricate, fricative, nasal, semivowel, vowel, non-speech
• “Stop” is further divided into:
    • Voiced-stop burst
    • Voiceless-stop burst
    • Stop-aspiration
• “burst”: voiced-stop burst, voiceless-stop burst
• “non-burst”: all other classes
Random forest: Imbalanced training data
• The problem of imbalanced training data
    • The numbers of training vectors from different manners are highly imbalanced.
    • #Vowel >> #Fricative > … > #Stop (#Burst)
• The conventional bootstrap causes problems.
    • Most training vectors are drawn from the majority classes such as “Vowel” and “Fricative”.
    • The target class “Burst”, however, may not be sampled sufficiently, so the resulting tree classifier lacks the discriminative power to detect burst onsets.
Random forest: Asymmetric bootstrap
• Generate balanced training data
[Figure: vectors from the “burst”, “fricative”, and “vowel” classes are drawn into each bootstrapped training set]
• The procedure repeats several times:
    • Over-sampling the “burst” class
    • Down-sampling the other classes
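One way to realize the balancing idea above is to draw the same number of vectors from every class, with replacement, each time a tree's training set is built. The slides do not state the exact sampling scheme, so the function name `asymmetric_bootstrap` and the `per_class` parameter below are assumptions made for illustration.

```python
import numpy as np

def asymmetric_bootstrap(labels, per_class, rng):
    """Return indices that contain `per_class` draws from every class.

    Small classes (such as "burst") are over-sampled and large classes
    (such as "vowel") are down-sampled, because all draws use replacement.
    """
    picks = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        picks.append(rng.choice(members, size=per_class, replace=True))
    return np.concatenate(picks)

# Toy label set with the kind of imbalance described above.
labels = np.array(["vowel"] * 1000 + ["fricative"] * 200 + ["burst"] * 20)
idx = asymmetric_bootstrap(labels, per_class=100, rng=np.random.default_rng(0))
balanced_labels = labels[idx]       # 100 vectors per class
```

Calling the function once per tree gives every tree a different balanced training set.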
Random forest: Detect burst onsets
• For each input vector, the forest votes for its class.
Random forest: Detect burst onsets
[Figure: per-frame detection output; labels b1s to b5s, with example values 0, 3.68, 0, 0, 3.90 along the frame axis]
Experimental Results: Speech materials
• TIMIT corpus (English read speech)
    • Microphone speech, 16 kHz sampling rate, 16-bit PCM format
    • 630 speakers: 438 males and 192 females
    • 8 dialect regions in the US (DR1 to DR8)
    • Training set: 462 speakers (326 M, 136 F)
    • Testing set: 168 speakers (112 M, 56 F)
• Each speaker read 10 sentences:
    • 2 SA sentences: fixed contexts
    • 5 SX sentences: phonetically compact
    • 3 SI sentences: phonetically diverse
Experimental Results: Speech materials
• TIMIT corpus
    • Training data are from four speakers in DR1.
    • Training data for the “burst” class are exclusively from stops.
    • Testing data are all utterances from the TIMIT TEST set: 6991 stops and 631 affricates.
Experimental Results: RF-based burst onset detector
• Random forest settings
    • Training dataset: 4 speakers from TIMIT DR1
    • Broad phonetic categories of articulatory manners: nine classes
    • Asymmetric bootstrap is applied to balance the training data.
    • 56-dim feature vector (D = 56): 55 TDCCs and 1 average log-energy of the patch
    • The detector consists of 30 trees.
    • The dimension of the random subspace during node splitting is d = 8.
    • No posterior tree pruning
Experimental Results: Detection examples
[Figure: detection results on the utterance “Put the butcher block table”, with stops, a dental fricative, and an affricate marked]
Experimental Results: Precision of detection
• Median: 3.1 ms
• Interdecile range: 12.6 ms
• Precision: voiceless > voiced
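The two statistics above summarize the deviations between detected and reference burst onsets. A minimal sketch of how such a summary could be computed follows; the function name `median_and_idr` is hypothetical, and the slides do not say which percentile interpolation was used, so NumPy's default is assumed.

```python
import numpy as np

def median_and_idr(deviations_ms):
    """Median and interdecile range (90th minus 10th percentile)."""
    d = np.asarray(deviations_ms, dtype=float)
    p10, p50, p90 = np.percentile(d, [10, 50, 90])
    return p50, p90 - p10
```

The interdecile range ignores the most extreme 10% at each end, which makes it less sensitive to a few gross detection errors than the full range.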
Experimental Results: Sources of false alarms
[Figure: two dental fricative examples]
• Most onsets of dental fricatives are detected as burst onsets, and they are hard to reject.
• Other sources are fricatives and pause segments.
Experimental Results: Missed detection rate
• The missed detection rate increases as the confidence threshold increases.
    • Stops: 5.1% to 6.5%
    • Affricates: 13.6% to 15.8%
Experimental Results: Comparison of different RF settings
• D: number of feature dimensions
• d: number of randomly selected dimensions in node splitting
Experimental Results: Comparison of various learning machines
• Accuracy: RF, SVM > GMM
• Execution time: RF, GMM >> SVM
• SVM kernels: RBF, LIN
Experimental Results: Comparison of various amounts of training data
• Training data are from dialect region one (DR1).
• SVM-RBF starts to surpass RF as more data are included.
• SVM-RBF takes far more time in training and testing.
Summary
• The proposed RF-based detector can efficiently detect burst onsets in continuous speech.
    • The detector needs only a small amount of training data.
    • Experimental results demonstrate its applicability.
• The proposed asymmetric bootstrap technique can resolve the problem of imbalanced training data.
Section II Voice Onset Time Estimate
Voice Onset Time
• Voice onset time (VOT) was proposed in the 1960s.
    • It was expected to effectively distinguish English /b, d, g/ from /p, t, k/.
    • Other cues are “voicing”, “articulatory force”, and “aspiration”.
• VOT is defined as the time difference between the burst onset and the voicing onset:
    $\mathrm{VOT} = t_v - t_b$, where $t_b$ is the burst onset and $t_v$ is the voicing onset.
Voice Onset Time: Two examples of VOT
[Figure: VOT examples labeled “borrow” and “tim”]
Voice Onset Time
• VOT can be classified into several categories:
    • Voicing lead: VOT is negative
    • Voicing coincide: VOT is about zero
    • Voicing lag: VOT is positive
• VOT distributions differ from language to language:
    • Two-modal: English, Spanish, Mandarin, Dutch
    • Three-modal: Korean, Thai
    • Four-modal: Hindi
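The categories above follow from the sign of the VOT when it is taken as the voicing onset time minus the burst onset time, so that voicing before the burst gives a negative value. The sketch below illustrates this; the function names and the `zero_band_ms` width that decides what counts as "about zero" are assumptions, since the slides give no numeric boundary.

```python
def vot_ms(t_burst_ms, t_voicing_ms):
    """VOT as voicing onset minus burst onset, in milliseconds."""
    return t_voicing_ms - t_burst_ms

def vot_category(vot, zero_band_ms=5.0):
    """Map a VOT value to a category; zero_band_ms is a hypothetical tolerance."""
    if vot < -zero_band_ms:
        return "voicing lead"
    if vot > zero_band_ms:
        return "voicing lag"
    return "voicing coincide"
```

For example, a burst at 100 ms followed by voicing at 160 ms gives a VOT of 60 ms, which falls in the voicing lag category.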
Voice Onset Time: Existing automatic methods to estimate VOT
• Automatic VOT estimation methods include:
    • Forced alignment performed by an HMM phone recognizer (HMM-FA)
        • Pros: efficient, suitable for large corpora
        • Cons: aligned boundaries normally do not coincide with the onsets
    • Onset detector for burst and voicing onsets (OD)
        • Pros: estimated onset locations are more accurate
        • Cons: only suitable for isolated words
    • Combination of the two (HMM-FA+OD)
        • Has the pros of both previous methods
Proposed HMM+RF System: Flowchart of the system
Proposed HMM+RF System: System overview
• The proposed system consists of two parts:
    • Forced alignment based on HMM
        • Roughly locates stop consonants in continuous speech.
        • The aligned boundaries typically do not coincide with the true onset locations.
    • Onset detection based on random forest
        • For each aligned stop consonant, the detector searches its neighborhood for the burst and voicing onsets.
Proposed HMM+RF System: HMM-based phone recognizer
• HMM-based phone recognizer
    • Training dataset: the whole TIMIT training set
    • 48 context-independent English phones
    • HMM topology: three-state left-to-right HMM; each state has eight Gaussian components.
    • ML training with the EM algorithm
    • Five iterations of embedded training are run every time the number of Gaussian components is doubled.
    • Features: 13-dim MFCC + 1-dim log-energy, plus their derivative and accelerative coefficients.
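The settings above imply a feature dimension and a mixture-growing schedule that are easy to work out. The short sketch below does the arithmetic; it assumes that training starts from a single Gaussian per state, which the slides do not state, and the name `mixture_schedule` is hypothetical.

```python
# Static features: 13 MFCCs + 1 log-energy; derivative and accelerative
# coefficients triple the size.
static_dim = 13 + 1
feature_dim = static_dim * 3          # 42

def mixture_schedule(target=8, start=1, passes_per_stage=5):
    """List (components, embedded-training passes) after each doubling.

    start=1 is an assumption; the slides only give the final count of eight.
    """
    stages = []
    n = start
    while n < target:
        n *= 2
        stages.append((n, passes_per_stage))
    return stages

schedule = mixture_schedule()         # [(2, 5), (4, 5), (8, 5)]
```

Under that assumption, reaching eight components takes three doublings and fifteen embedded-training passes in total.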
Proposed HMM+RF System: RF-based onset detector
• Random forest based onset detector
    • Training dataset: 4 speakers from the TIMIT training set
    • Broad phonetic categories of articulatory manners
        • Burst → burst onset
        • Vocalic → voicing onset
    • 56-dim TDCC vector
    • The detector consists of 30 trees.
    • The dimension of the random subspace during node splitting is 8.
    • No posterior tree pruning
    • Asymmetric bootstrap is applied to balance the training data from the broad phonetic categories.
Proposed HMM+RF System: More details about the onset detector
• Burst onset detection
    • The procedure is the same as described in Section I.
• Voicing onset detection
    • The first frame of a detected ‘vocalic’ segment following a detected burst onset is regarded as the voicing onset.
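The voicing onset rule above amounts to a forward scan over per-frame class decisions. A minimal sketch follows, assuming the detector outputs one label string per frame and that frames advance by the 2 ms shift given in Section I; the function name and the label strings are hypothetical.

```python
def first_vocalic_after(frame_labels, burst_frame, frame_shift_ms=2.0):
    """Return (frame index, time in ms) of the first 'vocalic' frame after the burst.

    Returns None when no vocalic frame follows the burst onset.
    """
    for i in range(burst_frame + 1, len(frame_labels)):
        if frame_labels[i] == "vocalic":
            return i, i * frame_shift_ms
    return None

# Toy label sequence: closure, burst at frame 5, aspiration, then voicing.
labels = ["closure"] * 5 + ["burst"] + ["aspiration"] * 10 + ["vocalic"] * 4
voicing = first_vocalic_after(labels, burst_frame=5)
```

In this toy sequence the voicing onset is 11 frames after the burst, which corresponds to a VOT of 22 ms at a 2 ms frame shift.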
Proposed HMM+RF System: More details about the onset detector
• Voicing onset adjustment procedure
    (a) Aspiration or release portion: [...] is large
    (b) Vocalic portion: [...] is small
    • [...] in the region between (a) and (b) will be large
Proposed HMM+RF System: More details about the onset detector
• An example of voicing onset adjustment
Experimental Results: Evaluation dataset
• Subset of the TIMIT testing set
• 3,784 stop consonants in 968 distinct words
• 2,344 word-initial stop consonants and 1,440 word-medial stop consonants
• The selected stop consonants are left-context independent but right-context dependent.
Experimental Results: Evaluation dataset
• The list of eligible succeeding vowels in the experiment
    • Ht. (vowel height): Low, Mid-Low, Mid-High, High
    • Bk. (vowel backness): Front, Central, Back
Experimental Results: Performance evaluation
• Four systems are compared:
    • HMM-FA-PL: HMM forced alignment at phone level
    • HMM-FA-PL+OD: HMM-FA-PL with onset detection
    • HMM-FA-SL: HMM forced alignment at state level
    • HMM-FA-SL+OD: HMM-FA-SL with onset detection
Experimental Results: Performance evaluation
• The measure is the absolute temporal deviation between an estimated VOT and its true value.
• The deviations are presented as cumulative relative frequency distributions.
• Four tolerances: 5 ms, 10 ms, 15 ms, and 20 ms
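The evaluation measure above can be computed in a few lines: take the absolute deviation of each estimated VOT from its reference, then report the percentage of estimates that fall below each tolerance. The function name `within_tolerance` is hypothetical, and the strict inequality is an assumption about how the tolerance is applied.

```python
import numpy as np

def within_tolerance(est_vot_ms, true_vot_ms, tolerances_ms=(5, 10, 15, 20)):
    """Percentage of estimates whose absolute deviation is below each tolerance."""
    est = np.asarray(est_vot_ms, dtype=float)
    ref = np.asarray(true_vot_ms, dtype=float)
    dev = np.abs(est - ref)
    return {tol: 100.0 * np.mean(dev < tol) for tol in tolerances_ms}

# Toy data: five estimates with deviations of 0, 4, 9, 14, and 30 ms.
scores = within_tolerance([0, 4, 9, 14, 30], [0, 0, 0, 0, 0])
```

Because the distribution is cumulative, the percentages can only stay equal or grow as the tolerance widens.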
Experimental Results: VOT estimates in voiced and voiceless stops
• VOT estimates of voiced stops with HMM-FA-PL are very poor.
    • HMM topology limitation
• HMM-FA-SL significantly improves the estimates.
• The effect of an additional onset detection is remarkable.
Experimental Results: 3D histograms of estimate deviations
• HMM-FA-SL corrects the burst-onset estimate deviation of HMM-FA-PL.
• With additional OD, the estimates of both burst and voicing onsets are improved.
Experimental Results: Performance comparison
Absolute deviation of estimation (cumulative relative frequency, %)
Method                    < 5 ms   < 10 ms   < 15 ms   < 20 ms
HMM-FA-SL+OD              57.2     83.4      93.4      96.5
HMM-FA-PL+OD              55.5     81.2      91.5      95.7
HMM-FA-SL+RS*             56.1     80.6      90.9      94.1
Stouten & Van hamme**     --       76.1      --        91.4
* RS: Reassigned Spectrum
** Stouten & Van hamme (2009) employed the RS technique to estimate VOT.
Experimental Results: Performance in detail
• VOTs of the voiced velar stop /g/ are less accurately estimated than those of the other five stops.
    • On average, VOTs of velar stops (/g/, /k/) are less accurately estimated.
• VOTs of word-medial voiced stops are less accurately estimated than those of their word-initial counterparts.
    • This is caused by failed detection of the burst onset.
    • In contrast, the estimates for voiceless stops in word-medial and word-initial positions are statistically the same.
Experimental Results: Failed onset detection in word-medial stops
• Example of failed burst onset detection: no noticeable burst onset
Experimental Results: Failed onset detection in word-medial stops
• Example of failed burst onset detection: the stop is surrounded by strong vocalic pulses
Summary
• HMM-based forced alignment provides less accurate VOT estimates; however, applying an additional onset detection can significantly improve the accuracy.
• The accuracy of VOT estimation varies with a stop’s position in a word and its place of articulation.
Conclusion
• The proposed RF-based burst onset detector uses the spectro-temporal patterns of the closure-burst transition to efficiently detect burst onsets in continuous speech.
• Burst onset detection is combined with voicing onset detection to significantly improve the VOT estimates initially made by HMM-based forced alignment.
• The method could be useful for speech event annotation and speech assessment.
Thank You