frequency shifts and vowel identification - page not …tnearey/posters/an_92_fshift... · ·...

Frequency shifts and vowel identification

Peter F. Assmann (School of Behavioral and Brain Sciences, Univ. of Texas at Dallas, Box 830688,

Richardson TX 75083)Terrance M. Nearey (Dept. of Linguistics, University

of Alberta, Edmonton, Alberta, Canada T6E 2G2).

Introduction• Listeners can understand frequency-shifted speech

across a wide frequency range (Fu & Shannon, 1999). • We hypothesize that this ability can be explained in

terms of listeners’ sensitivity to statistical variation across talkers in natural speech.

• The aims of the present study were: 1. To study the effects of frequency shifts on the identification

of vowels spoken by 2 men, 2 women and 2 children (age 7).2. To test the predictions of a model of vowel perception that

incorporates measures of fundamental frequency (F0) and formant frequencies (FF) associated with size differences in larynx and vocal tract across talkers

Mean log FF: Geometric mean of formant frequencies: F1,F2,F3 >3000 vowels in hVd words (Assmann & Katz, 2000)

Co-variation of formant frequencies and F0 in natural speech

Pattern recognition model

• Hillenbrand & Nearey (1999) dual-target model

• Parameters: duration, mean F0, and F1, F2, F3 sampled at 20% and 80% points

• Training data: 3000+ vowels spoken by 10 men, 10 women and 30 children from the N. Texas region (Assmann & Katz, 2000)

• A posteriori probabilities derived from linear discriminant analysis for each stimulus vowel

Frequency shifts and vowel identification

• In a previous study (Assmann, Nearey & Scott, 2002) we confirmed that upward shifts in F0 or formant frequencies (FF) resulted in lower vowel identification accuracy.

• However, combining upward shifts in F0 with upward shifts in FF led to improved identification accuracy.

• The finding that vowel identification accuracy is higher with coordinated shifts in F0 and FF is well predicted by the model of vowel identification outlined below, and supports the idea that listeners are sensitive to the pattern of co-variation of F0 and FF in natural speech.

1.00 1.25 1.50 1.75 2.000

20

40

60

80

100Means and standard errors of 11 listeners

Spectrum envelope scale factor

Iden

tific

atio

n ac

cura

cy (%

)

F0*1F0*2F0*4

Vowel Identification Accuracy

(Assmann, Nearey, and Scott, ICSLP 2002).

1.00 1.25 1.50 1.75 2.00Spectrum envelope scale factor

F0*1F0*2F0*4

Predicted means

Vowel Identification Experiment

• The present study examined effects of upward and downward frequency shifts on vowel identification.

• 11 vowels (/i/, //, /e/, //, /æ/, //, / /, //, /o/, //, /u/)in hVd context spoken by 3 men, 3 women, and 3 children from the N. Texas region.

• Upward and downward frequency shifts were introduced by means of the STRAIGHT vocoder (Kawahara, 1997).

STRAIGHT vocoder

• High-resolution analysis of time-varying spectrum envelope

• Wavelet-based instantaneous frequency F0

extraction– Spectrum envelope (FF) scaling– Fundamental frequency (F0) scaling

Scale Factors

• For females and children, downward shifts tend to produce male-like voices; for adults, upward shifts heard as child-like voices.

1.00.80.6 2.01.5FF scale factors

4.01.00.5F0 scale factors

Method• Listeners were 14 Psychology undergraduates

participating for partial course credit. Since the majority had no phonetics training, they first completed 3 practice sets:Set 1: passive listening with feedback (24 resynthesized but

not frequency-shifted vowels; no response required).Set 2: practice identification (a different set of 24 vowels

presented for identification; repeated until a score of 21/24 or better was obtained).

Set 3: passive listening with feedback (24 frequency-shifted vowels; shift factors randomly chosen from the 15 conditions of the experiment; no response required)

Method

• Main experiment: 990 syllables (11 vowels x 2 talkers per group x 3 talker groups x 3 F0 scale factors x 5 FF scale factors). All conditions randomly interspersed.

• Vowels were presented diotically over headphones in a double-walled sound booth. Listeners identified the vowels using an 11-button response box drawn on computer screen labeled with keywords for the vowel category.

0.6 0.8 1 1.5 20

20

40

60

80

100

Spectrum envelope scale factor

Iden

tific

atio

n ac

cura

cy (%

)

MenWomenChildren

Effects of FF shifts

0

50

100Men

0

50

100Women

0.6 0.8 1.0 1.5 2.0 0

50

100Children

Spectrum envelope (FF) scale factor

Iden

tific

atio

n ac

cura

cy (%

)

Interaction of FF and F0 shifts

0.6 0.8 1.0 1.5 2.0 0.6 0.8 1.0 1.5 2.0

• For men’s vowels, accuracy is higher when upward shift in FF is accompanied by upward shift in F0

• For women and children, there is a recovery from downward shifts in FF when F0 is also shifted down

F0 x 0.5F0 x 1.0F0 x 4.0

Conclusions• Identification accuracy drops significantly when vowels are

shifted upward in formant frequency by a factor of 1.5 or more, or downward by a factor of 0.6 or less. Adult males are less susceptible to upward shifts than females and children, while children are less affected by downward shifts.

• In several conditions, the drop in intelligibility was reduced by combining formant shifts with corresponding changes in F0.

• Pattern recognition models predicted the effects of frequency shifts on vowel identification, including the synergistic link between F0 and formant frequency. A plausible account is that learned relationships between F0 and spectral envelope cues are responsible for this interaction.

References1. Assmann PF, Katz WF. (2000) Time-varying spectral change in the

vowels of children and adults. J Acoust Soc Am. 108(4): 1856-1866.2. Assmann, P.F., Nearey, T.M., and Scott, J.M. (2002) Modeling the

perception of frequency-shifted vowels. Proceedings of the 7th International Conference on Spoken Language Processing, pp. 425-428.

3. Fu, Q-J. & Shannon, R.V. (1999). Recognition of spectrally degraded and frequency-shifted vowels in acoustic and electric hearing. J Acoust Soc Am. 105: 1889-1900.

4. Hillenbrand JM, Nearey TM. (1999) Identification of resynthesized /hVd/ utterances: effects of formant contour. J Acoust Soc Am. 105(6): 3509-3523.

5. Kawahara, H. (1997) Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited. Proc. IEEE Int. Conf. on Acoustics, Speech & Signal Processing (ICASSP '97), vol.2, pp.1303-1306.

Average correct ID per synthetic voice

LoweredMale

BaseMale

RaisedMale

RaisedF&C

LoweredF&C

BaseF&C

Basketballs w legend

Disk and spoke plotDisks = Observed ID

• The colored disks represent listeners correct identification rate– Blue: male speakers’ synthesized voices (scaled and

unscaled) ; Red: female speakers; Green: child speakers;

– The position of the center of the disk indicates the average F0 and formant frequencies of the voice

– The area of each disk is proportional to the average % corrected identification by listeners of the voice

– The circles in the legend box indicate the correct identification rate

Disk and spoke plotSpokes = Predicted ID

• Length of the spokes indicate predicted ID rate by LDFA– Trained on natural measurements of Assmann and

Katz , predictions on scaled values of this experiment– Patterns:

• Accurate predictions: ‘Basketball’ spoke length matches disk radius

• Under predictions: ‘Asterisks’ in disks, listeners do better• Over predictions: ‘Spiked disks’, model does better than

listeners

% ID well predicted by smooth function of mean F0 and mean FFAccounts for 88% of variance

‘bad’ voice ‘good’ voice

frequency shifts and vowel identification - page not …tnearey/posters/an_92_fshift... · ·...

Documents