Fundamentals of Speech Signal Processing
1.0 Speech Signals
Waveform plots of typical vowel sounds - Voiced (濁音)
[Figure: vowel waveforms plotted against time t for tone 1, tone 2, and tone 4]
Speech Production and Source Model
• Human vocal mechanism
• Speech source model
[Figure: an excitation u(t) drives the vocal tract to produce the speech signal x(t)]
Voiced and Unvoiced Speech
[Figure: excitation u(t) and output x(t); voiced speech shows a periodic pitch structure, while unvoiced speech is noise-like]
Waveform plots of typical consonant sounds
[Figure: unvoiced (清音) and voiced (濁音) consonant waveforms]
Waveform plot of a sentence
Frequency domain spectra of speech signals
[Figure: frequency-domain spectra of voiced and unvoiced segments; voiced spectra show peaks at the formant frequencies, while unvoiced spectra are broadband and noise-like]
Spectrogram
Formant Frequencies
[Figure: formant frequency contours for the sentence "He will allow a rare lie."]
Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang
2.0 Speech Signal Processing
Speech Signal Processing
• Major Application Areas
1. Speech Coding: digitization and compression
   Considerations: 1) bit rate (bps), 2) recovered quality, 3) computational complexity/feasibility
2. Voice-based Network Access:
   user interface, content analysis, user-content interaction
[Figure: x(t) → LPF → sampling → x[n] → processing algorithms → bit stream 110101… → storage/transmission → inverse processing → x̂[n]]
• Speech Signals
– carrying linguistic knowledge and human information: characters, words, phrases, sentences, concepts, etc.
– double levels of information: acoustic signal level / symbolic or linguistic level
– processing and interaction of the double-level information
Sampling of Signals
[Figure: the continuous signal x(t) plotted against time t, and its sampled version x[n] against sample index n]
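The sampling operation above can be sketched in a few lines; a minimal illustration, where the 1 kHz tone and 8 kHz sampling rate are made-up example values:

```python
import numpy as np

def sample_signal(x, duration, fs):
    """Sample a continuous-time signal x(t) at rate fs: x[n] = x(n / fs)."""
    n = np.arange(int(duration * fs))   # sample indices n = 0, 1, 2, ...
    return x(n / fs)                    # evaluate x(t) at t = n / fs

# Example: a 1 kHz tone sampled at 8 kHz for 10 ms gives 80 samples
tone = lambda t: np.sin(2 * np.pi * 1000 * t)
xn = sample_signal(tone, duration=0.010, fs=8000)
print(len(xn))  # 80
```

Note that 8 kHz comfortably satisfies the sampling theorem for a 1 kHz tone; telephone-band speech coders conventionally use this rate.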
Double Levels of Information
字 (character), 詞 (word), 句 (sentence)
Example: 人 用 電腦 ("people use computers")
Speech Signal Processing – Processing of Double-Level Information
• Speech signal → sampling → processing, with algorithms running on chips or computers
• Linguistic structure and linguistic knowledge: lexicon and grammar
Example: the character string 今 天 的 天 氣 非 常 好 is segmented into the words 今天 | 的 | 天氣 | 非常 | 好 ("the weather today is very good")
Voice-based Network Access
[Figure: user interface, content analysis, and user-content interaction built around the Internet]
• User Interface - when keyboards/mice are inadequate
• Content Analysis - helps in the browsing/retrieval of multimedia content
• User-Content Interaction - all text-based interaction can be accomplished by spoken language
User Interface - Wireless Communication Technologies are Creating a Whole Variety of User Terminals
• used at any time, from anywhere: smart phones, hand-held devices, notebooks, vehicular electronics, hands-free interfaces, home appliances, wearable devices…
• small in size, light in weight, ubiquitous, invisible… the post-PC era
• the keyboard/mouse, most convenient for PCs, are no longer convenient: human fingers never shrink, while the application environment has changed
• service requirements are growing exponentially
• voice is the only interface convenient for all user terminals, at any time, from anywhere, and to the point in one utterance
• speech processing is the only less mature part of the technology chain
Content Analysis - Multimedia Technologies are Creating a New World of Multimedia Content
• the most attractive form of network content will be multimedia, which usually includes speech information (but probably not text)
• multimedia content is difficult to summarize and show on a screen, and thus difficult to browse
• the speech information, if included, usually tells the subjects, topics, and concepts of the multimedia content, and thus becomes the key for browsing and retrieval
• hence multimedia content analysis based on speech information
Future Integrated Networks
• Real-time information: weather, traffic, flight schedules, stock prices, sports scores
• Special services: Google, Facebook, YouTube, Amazon
• Knowledge archives: digital libraries, virtual museums
• Intelligent working environment: e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce
• Private services: personal notebooks, business databases, home appliances, network entertainment
User-Content Interaction - Wireless and Multimedia Technologies are Creating an Era of Network Access by Spoken Language Processing
[Figure: voice input/output carrying voice and text information between users and multimedia content over the Internet]
• network access is primarily text-based today, but almost all roles of text can be accomplished by speech
• user-content interaction can be accomplished by spoken and multi-modal dialogues
• hand-held devices with multimedia functionalities are commonly used today
• speech instructions can be used to access multimedia content whose key concepts are specified by the speech information
Related technologies: multimedia content analysis, text information retrieval, voice-based information retrieval, text-to-speech synthesis, and spoken and multi-modal dialogue
3.0 Speech Coding
Waveform-based Approaches
• Pulse-Code Modulation (PCM)
– binary representation of each sample x[n] by quantization
• Differential PCM (DPCM)
– encoding the differences
  d[n] = x[n] - x[n-1]
  or, with linear prediction,
  d[n] = x[n] - Σ_{k=1..P} a_k x[n-k]
• Adaptive DPCM (ADPCM)
– with adaptive algorithms
Ref: Haykin, "Communication Systems", 4th ed., Sections 3.7, 3.13, 3.14, 3.15
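A minimal sketch of first-order DPCM, assuming a uniform quantizer with a made-up step size; real codecs add higher-order prediction and adaptive step sizes (ADPCM):

```python
import numpy as np

STEP = 0.05  # quantizer step size (an illustrative value)

def dpcm_encode(x, step=STEP):
    """First-order DPCM: quantize d[n] = x[n] - x_hat[n-1], tracking the
    decoder's reconstruction x_hat so quantization errors do not accumulate."""
    codes, x_hat = [], 0.0
    for sample in x:
        code = int(round((sample - x_hat) / step))  # quantized difference
        codes.append(code)
        x_hat += code * step                        # decoder's next reference
    return codes

def dpcm_decode(codes, step=STEP):
    """Accumulate the quantized differences to reconstruct x_hat[n]."""
    x_hat, prev = [], 0.0
    for code in codes:
        prev += code * step
        x_hat.append(prev)
    return np.array(x_hat)

x = np.sin(2 * np.pi * np.arange(40) / 40)  # a slowly varying signal
x_rec = dpcm_decode(dpcm_encode(x))
err = np.max(np.abs(x - x_rec))             # stays within about step/2
```

Because the encoder quantizes against the decoder's own reconstruction (closed-loop prediction), the reconstruction error is bounded by half a quantizer step rather than growing over time.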
Speech Source Model and Source Coding
• Speech Source Model
– digitization and transmission of the model parameters will be adequate
– at the receiver the parameters can reproduce x[n] with the model
– far fewer parameters with much slower variation in time lead to far fewer bits required
– the key to low-bit-rate speech coding
[Figure: an excitation generator produces u[n] (with transforms U(ω), U(z)), which drives the vocal tract model g[n] (with transforms G(ω), G(z)) to give
  x[n] = u[n] * g[n],  X(ω) = U(ω)G(ω),  X(z) = U(z)G(z)]
Speech Source Model and Source Coding
• Analysis and Synthesis
– high computation requirements are the price for low bit rate
• Simplified Speech Source Model
– excitation parameters: v/u (voiced/unvoiced), N (pitch period for voiced speech), G (signal gain)
– the excitation signal u[n] comes from a random sequence generator (unvoiced) or a periodic pulse train generator (voiced)
– vocal tract model: G(z) = 1 / (1 - Σ_{k=1..P} a_k z^-k), equivalently G(ω) or g[n]
– vocal tract parameters {a_k}: LPC coefficients, which give the formant structure of the speech signal
– a good approximation, though not precise enough
Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang
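The simplified source model above can be sketched as an excitation generator followed by the all-pole filter; the filter coefficients and pitch period below are illustrative values, not estimated from real speech:

```python
import numpy as np

def synthesize(voiced, N, G, a, length=200, seed=0):
    """Simplified source model: the excitation u[n] is a periodic pulse train
    (period N) if voiced, or white noise if unvoiced; it is scaled by the
    gain G and filtered by the all-pole vocal tract model
    G(z) = 1 / (1 - sum_{k=1..P} a_k z^-k)."""
    if voiced:
        u = np.zeros(length)
        u[::N] = 1.0                            # pulse train with pitch period N
    else:
        u = np.random.default_rng(seed).standard_normal(length)
    x = np.zeros(length)
    for n in range(length):                     # difference equation form
        x[n] = G * u[n] + sum(a[k] * x[n - 1 - k]
                              for k in range(len(a)) if n - 1 - k >= 0)
    return x

# Toy 2nd-order vocal tract; a_1, a_2 are made-up values chosen so the
# poles lie inside the unit circle (stable filter)
x = synthesize(voiced=True, N=80, G=1.0, a=[1.3, -0.8])
```

A transmitter in this scheme would only need to send (v/u, N, G, {a_k}) per frame, which is the point made above about low bit rates.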
LPC Vocoder (Voice Coder)
• N obtained by pitch detection, v/u by voicing detection
• {a_k} can be non-uniformly or vector quantized to reduce the bit rate further
Ref: 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993
Multipulse LPC
• poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
• u[n] is replaced by a sequence of pulses:
  u[n] = Σ_k b_k δ[n - n_k]
– roughly 8 pulses per pitch period
– u[n] close to periodic for voiced speech
– u[n] close to random for unvoiced speech
• estimating (b_k, n_k) is a difficult problem
Multipulse LPC
• Estimating (b_k, n_k) is a difficult problem
– analysis by synthesis
– a large amount of computation is the price paid for better speech quality
• Perceptual Weighting
  W(z) = (1 - Σ_{k=1..P} a_k z^-k) / (1 - Σ_{k=1..P} a_k c^k z^-k),  0 < c < 1
  chosen for perceptual sensitivity:
  W(z) = 1 if c = 1
  W(z) = 1 - Σ_{k=1..P} a_k z^-k if c = 0
  practically c ≈ 0.8
• Error Evaluation
  E = ∫_0^2π |X(ω) - X̂(ω)|² W(ω) dω
Multipulse LPC
• Error Minimization and Pulse Search
  u[n] = Σ_k b_k δ[n - n_k]
  x̂[n] = Σ_k b_k g[n - n_k]
  E = E(b_1, n_1, b_2, n_2, …)
– a sub-optimal solution: finding one pulse at a time
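The one-pulse-at-a-time search can be sketched as a greedy matched-filter loop; a toy illustration that omits the perceptual weighting W(z), with a made-up impulse response g:

```python
import numpy as np

def multipulse_search(x, g, num_pulses):
    """Greedy one-pulse-at-a-time search (the sub-optimal approach above):
    each step picks the position n_k and amplitude b_k of a single new pulse
    that maximize the reduction in residual energy E. Assumes g[0] != 0;
    the perceptual weighting W(z) is omitted for clarity."""
    L = len(x)
    e = np.asarray(x, dtype=float).copy()     # current residual x - x_hat
    pulses = []
    for _ in range(num_pulses):
        best = None
        for n0 in range(L):
            gk = np.zeros(L)
            m = min(len(g), L - n0)
            gk[n0:n0 + m] = g[:m]             # shifted impulse response g[n - n0]
            corr, denom = e @ gk, gk @ gk
            gain = corr ** 2 / denom          # energy reduction at this position
            if best is None or gain > best[0]:
                best = (gain, corr / denom, n0, gk)
        _, b, n0, gk = best
        e -= b * gk                           # subtract the new pulse's contribution
        pulses.append((b, n0))
    return pulses, e

# Toy target: two scaled, shifted copies of g, so two pulses reconstruct it exactly
g = np.array([1.0, 0.5, 0.25, 0.125])
x = np.zeros(12)
x[2:6] += 2 * g
x[7:11] -= g
pulses, e = multipulse_search(x, g, num_pulses=2)
```

At each step the optimal amplitude for a candidate position is the projection of the residual onto the shifted impulse response, which is why only the position needs to be searched.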
Code-Excited Linear Prediction (CELP)
• Use of VQ to construct a codebook of excitation sequences
– a sequence consists of roughly 40 samples
– a codebook of 512-1024 patterns is constructed with VQ
– roughly 512-1024 excitation patterns are perceptually adequate
• Excitation search: analysis by synthesis
– 9-10 bits are needed for the excitation of 40 samples, while the {a_k} parameters in G(z) are also vector quantized
• Receiver
– {a_k} codewords can be transmitted less frequently than excitation codewords
Ref: Gold and Morgan, "Speech and Audio Signal Processing", John Wiley & Sons, 2000, Chap. 33
4.0 Speech Recognition and Voice-based Network Access
[Figure: an unknown speech signal x(t) undergoes feature extraction to give the feature vector sequence X, which is pattern-matched against reference patterns and passed to decision making to produce the output word W; the reference patterns are built by feature extraction from training speech y(t) → Y]
Speech Recognition as a Pattern Recognition Problem
• A simplified block diagram
• Example input sentence: "this is speech"
• Acoustic models (聲學模型): (th-ih-s) (ih-z) (s-p-iy-ch)
• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
• Language model (語言模型): (this) - (is) - (speech)
  P(this) P(is | this) P(speech | this is)
  P(w_i | w_i-1): bi-gram language model
  P(w_i | w_i-2, w_i-1): tri-gram language model, etc.
Basic Approach for Large Vocabulary Speech Recognition
[Figure: input speech → front-end signal processing → feature vectors → linguistic decoding and search algorithm → output sentence; the search draws on acoustic models trained from speech corpora, a lexicon, and a language model constructed from text corpora]
2.0 Fundamentals of Speech Recognition
Hidden Markov Models (HMM)
• Formulation
  o_t = [x_1, x_2, … x_D]^T : feature vector for the frame at time t
  q_t ∈ {1, 2, … N} : state number for feature vector o_t
  A = [a_ij], a_ij = Prob[q_t = j | q_t-1 = i] : state transition probabilities
  B = [b_j(o), j = 1, 2, … N] : observation probabilities
  b_j(o) = Σ_{k=1..M} c_jk b_jk(o)
  b_jk(o): multivariate Gaussian distribution for the k-th mixture of the j-th state
  M: total number of mixtures, with Σ_{k=1..M} c_jk = 1
  π = [π_1, π_2, … π_N], π_i = Prob[q_1 = i] : initial probabilities
  HMM: λ = (A, B, π)
[Figure: an observation sequence o_1 o_2 o_3 o_4 … aligned with a state sequence q_1 q_2 q_3 q_4 …; a left-to-right model whose states have transition probabilities a_11, a_12, a_22, a_13, a_23, a_24 and observation distributions b_1(o), b_2(o), b_3(o) drawn as 1-dim Gaussian mixtures]
Hidden Markov Models (HMM)
• Double Layers of Stochastic Processes
– hidden states with random transitions, for time warping
– random output given the state, for random acoustic characteristics
• Three Basic Problems
(1) Evaluation Problem:
  Given O = (o_1, o_2, … o_t, … o_T) and λ = (A, B, π), find Prob[O | λ]
(2) Decoding Problem:
  Given O = (o_1, o_2, … o_t, … o_T) and λ = (A, B, π), find the best state sequence q = (q_1, q_2, … q_t, … q_T)
(3) Learning Problem:
  Given O, find the best values for the parameters in λ such that Prob[O | λ] is maximized
• Simplified HMM example: an observation sequence of colored balls, R G B G G B B G R R R …
Peripheral Processing for Human Perception (P.34 of 7.0)
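Problem (1) is classically solved by the forward algorithm; a minimal sketch using a discrete observation distribution as a stand-in for the Gaussian-mixture b_j(o), with made-up model parameters:

```python
import numpy as np

def forward(A, B, pi, O):
    """Forward algorithm for the evaluation problem: compute Prob[O | lambda].
    alpha[j] holds Prob[o_1 .. o_t, q_t = j | lambda] after processing o_t.
    B is a discrete N x K observation matrix, a stand-in for the
    Gaussian-mixture b_j(o) above."""
    alpha = pi * B[:, O[0]]               # initialization, t = 1
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction: sum over previous states
    return alpha.sum()                    # termination: sum over final states

# Toy 2-state model with made-up parameters (each row of A sums to 1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition probabilities a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation probabilities b_j(o)
pi = np.array([0.5, 0.5])                # initial probabilities
p = forward(A, B, pi, [0, 1, 0])         # Prob[O | lambda] for O = (0, 1, 0)
```

The recursion costs O(N²T) rather than the O(N^T) of summing over all state sequences, which is what makes the evaluation problem tractable.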
Feature Extraction (Front-end Signal Processing)
• Mel-Frequency Cepstral Coefficients (MFCC)
– Mel-scale filter bank: triangular, overlapped filters in frequency, uniformly spaced below 1 kHz and logarithmically spaced above 1 kHz
• Delta Coefficients
– 1st/2nd-order differences
[Pipeline: windowed speech samples → Discrete Fourier Transform → Mel-scale filter bank → log(|·|²) → Inverse Discrete Fourier Transform → MFCC]
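The pipeline can be sketched as follows; a minimal illustration with assumed parameter values (20 filters, 12 coefficients, 8 kHz sampling), using the standard mel formula to approximate the linear-below-1-kHz / logarithmic-above-1-kHz spacing and a DCT for the inverse transform step:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(frame, fs=8000, n_filters=20, n_ceps=12):
    """Minimal MFCC following the pipeline above: window -> DFT ->
    mel-scale triangular filter bank on the power spectrum -> log ->
    inverse transform (a type-II DCT, as used in practice)."""
    frame = frame * np.hamming(len(frame))        # windowing
    spec = np.abs(np.fft.rfft(frame)) ** 2        # |DFT|^2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    # triangular filters, uniformly spaced on the mel scale, overlapping
    hz_pts = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    log_e = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        tri = np.minimum(np.clip((freqs - lo) / (mid - lo), 0, None),
                         np.clip((hi - freqs) / (hi - mid), 0, None))
        log_e[i] = np.log(tri @ spec + 1e-10)     # log filter-bank energy
    # inverse transform of the log energies (DCT-II)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ log_e

c = mfcc(np.sin(2 * np.pi * 1000 * np.arange(256) / 8000))
```

Delta coefficients would then be computed as first and second differences of these vectors across successive frames.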
Language Modeling: N-gram
W = (w_1, w_2, w_3, …, w_i, … w_R) : a word sequence
• Evaluation of P(W):
  P(W) = P(w_1) Π_{i=2..R} P(w_i | w_1, w_2, … w_i-1)
• Assumption:
  P(w_i | w_1, w_2, … w_i-1) ≈ P(w_i | w_i-N+1, w_i-N+2, … w_i-1)
  i.e. the occurrence of a word depends on the previous N - 1 words only
• N-gram language models
  N = 1: unigram P(w_i)
  N = 2: bigram P(w_i | w_i-1)
  N = 3: tri-gram P(w_i | w_i-2, w_i-1)
  N = 4: four-gram P(w_i | w_i-3, w_i-2, w_i-1)
– probabilities estimated from a training text database
• Example: tri-gram model
  P(W) = P(w_1) P(w_2 | w_1) Π_{i=3..R} P(w_i | w_i-2, w_i-1)
Language Modeling - Estimation of N-gram Model Parameters
• unigram:
  P(w_i) = N(w_i) / Σ_{j=1..V} N(w_j)
  w_i: a word in the vocabulary; V: total number of different words in the vocabulary; N(·): number of counts in the training text database
• bigram:
  P(w_j | w_k) = N(<w_k, w_j>) / N(w_k)
  <w_k, w_j>: a word pair
• trigram:
  P(w_j | w_k, w_m) = N(<w_k, w_m, w_j>) / N(<w_k, w_m>)
• smoothing: estimation of the probabilities of rare events by statistical approaches
• Example: if "this" appears 50000 times, "this is" 500 times, and "this is a" 5 times in the training text, then
  Prob[is | this] = 500 / 50000
  Prob[a | this is] = 5 / 500
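The count-based estimates above can be sketched directly; a minimal illustration on a made-up nine-word training text, with no smoothing:

```python
from collections import Counter

def ngram_counts(words, n):
    """Count n-gram occurrences N(<w_1, ..., w_n>) in a training text."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def trigram_prob(w3, w1, w2, counts2, counts3):
    """P(w3 | w1, w2) = N(<w1, w2, w3>) / N(<w1, w2>), as in the formulas above."""
    denom = counts2[(w1, w2)]
    return counts3[(w1, w2, w3)] / denom if denom else 0.0

words = "this is a test and this is another test".split()
c2, c3 = ngram_counts(words, 2), ngram_counts(words, 3)
p = trigram_prob("a", "this", "is", c2, c3)  # N(<this,is,a>) / N(<this,is>) = 1/2
```

A real model would apply smoothing here, since any trigram unseen in training would otherwise receive probability zero and veto the whole sentence.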
Large Vocabulary Continuous Speech Recognition
W = (w_1, w_2, … w_R) : a word sequence
X = (x_1, x_2, … x_T) : feature vectors for a speech utterance
• MAP principle:
  W* = arg max_W Prob(W | X)
  Prob(W | X) = Prob(X | W) P(W) / P(X)
  W* = arg max_W Prob(X | W) P(W)
  where Prob(X | W) is given by the HMMs and P(W) by the language model
• A Search Process Based on Three Knowledge Sources
– Acoustic models: HMMs for basic voice units (e.g. phonemes)
– Lexicon: a database of all possible words in the vocabulary, each word including its pronunciation in terms of the component basic voice units
– Language models: based on the words in the lexicon
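The MAP decision rule can be sketched as a log-domain score combination; the candidate strings and scores below are made up, and the language-model weight is a common practical addition not shown in the formulas above:

```python
def map_decode(candidates, acoustic_logprob, lm_logprob, lm_weight=1.0):
    """MAP decoding: W* = argmax_W Prob(X | W) P(W), done in the log domain.
    Both scoring functions here are stand-ins for the HMM likelihood
    log Prob(X | W) and the N-gram score log P(W)."""
    return max(candidates,
               key=lambda W: acoustic_logprob(W) + lm_weight * lm_logprob(W))

# Toy example with made-up scores: the language model prefers the
# grammatical hypothesis even though its acoustic score is slightly worse
scores_ac = {"this is speech": -10.2, "this his speech": -10.0}
scores_lm = {"this is speech": -3.0, "this his speech": -8.0}
best = map_decode(scores_ac, scores_ac.get, scores_lm.get)  # "this is speech"
```

In a real recognizer the maximization runs over an enormous hypothesis space, so it is implemented as a search (e.g. Viterbi beam search) rather than an explicit enumeration of candidates.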
Text-to-Speech Synthesis
[Figure: input text → text analysis and letter-to-sound conversion (lexicon and rules) → prosody generation (prosodic model) → signal processing and concatenation (voice unit database) → output speech signal]
• Transforming any input text into the corresponding speech signals
• E-mail/Web page reading
• Prosodic modeling
• Basic voice units/rule-based, non-uniform units/corpus-based, model-based
• Three major steps: text analysis, prosody generation, speech synthesis
Example (text analysis): 假如今天胡適先生還活著,又會怎樣呢? ("If Mr. Hu Shih were still alive today, what would it be like?")
– segmentation: 假如 | 今天 | 胡適 | 先生 | 還 | 活 | 著 | , | 又 | 會 | 怎樣 | 呢 | ?
– part-of-speech tags: Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T
– break indices: 5 1 3 1 2 1 1 4 1 2 1 5
Prosody generation then assigns energy, pauses, intonation, tone, and duration before speech synthesis.
Text-to-Speech Synthesis
• Automatic Prosodic Analysis for an Arbitrary Text Sentence
[Figure: a lattice from start to end over the tag sequence Cbb Nd Nb Na Dfa VH Di, with candidate minor-phrase groupings such as Cbb+Nd, Nd+Nb, Nb+Na, Na+Dfa, Dfa+VH, VH+Di, Cbb+Nd+Nb, Dfa+VH+Di]
• Predict break indices B1, B2, B3 (B4, B5 are determined by punctuation marks)
• Minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)
Example: 假如 | 今天 胡適 | 先生 還 | 活 | 著 | , 又 | 會 怎樣 | 呢 | ? with tags Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T
Prosodic hierarchy: words → minor phrases → major phrases → prosodic groups → breath groups, with break indices 5 1 3 1 2 1 1 4 1 2 1 5
Speech Understanding
• Understanding the speaker's intention rather than transcribing into word strings
• Limited domains/finite tasks
[Figure: the input utterance goes through syllable recognition (acoustic models) to a syllable lattice, then key phrase matching (phrase lexicon) to a phrase graph, then semantic decoding (phrase/concept language model, concept set) to a concept graph and the understanding results]
• An example
  utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號? ("Please help me look up the phone number of the Bank of Taiwan")
  key phrases: (查一下) - (台灣銀行) - (電話號碼)
  concepts: (inquiry) - (target) - (phone number)
Speaker Verification
[Figure: input speech → feature extraction → verification against speaker models → yes/no]
• Verifying that the speaker is who he or she claims to be
• Applications requiring verification
• Text dependent/independent
• Can be integrated with other verification schemes
Voice-based Information Retrieval
• Speech instructions
• Speech documents (or multimedia documents including speech information)
[Figure: a speech or text instruction, e.g. 我想找有關新政府組成的新聞 ("I want to find news about the formation of the new government"), retrieving text documents d1, d2, d3 or speech documents such as 總統當選人陳水扁今天早上… ("President-elect Chen Shui-bian this morning…")]
Spoken Dialogue Systems
• Almost all human-network interactions can be accomplished by spoken dialogue
• Speech understanding, speech synthesis, dialogue management
• System/user/mixed initiatives
• Reliability/efficiency, dialogue modeling/flow control
• Transaction success rate/average number of dialogue turns
[Figure: users connect over the Internet to a dialogue server; input speech goes through speech recognition and understanding to obtain the user's intention, the dialogue manager (drawing on the discourse context and databases) forms the response to the user, and sentence generation and speech synthesis produce the output speech]
Reference: "Speech and Language Processing over the Web", IEEE Signal Processing Magazine, May 2008