8/7/2019 SpeechAndAudioCoding
April 2006 Copyright (c) 2006 - Andreas Spanias II-1
Signal Processing for Communications
An Introduction to Advanced Technology and Research for
Undergraduates
Related Technologies and Applications:
Digital Cell Phones
Technologies for Cable Modems and Wi-Fi
Secure Military Communications
April 14, 2006, 9:45am-12pm, SCOB 101
Lectures and Modules for Undergraduates on:
Speech and Audio Coders, Andreas Spanias
Channel Coders, Tolga Duman
Time-Varying Signal Processing, Antonia Papandreou-Suppappola
Multicarrier and OFDM Systems, Cihan Tepedelenlioglu
Sponsored by the NSF Combined Research and Curriculum Development Grant 0417604
[Figure: curriculum structure diagram]
Courses: SS (EEE 303), RDA (EEE 350), DSP (EEE 407), CS (EEE 455)
Summer Freshman and Sophomore Research Camps
EEE 498: Intro to SP-COM Research
LARGE 6-LECTURE 498 MODULES (LM)
Source Coding (6 lect/1 lab)
Channel Coding (6 lect/1 lab)
Multi-carrier (6 lect/1 lab)
Time-varying signaling (6 lect/1 lab)
SMALL 1-LECTURE/LAB MODULES (SM): 1 lecture/1 exercise
4 Module Summaries to inject in 303, 350, 407, 455
DEMO MODULES (DM)
ASU J-DSP Technology for on-line Java Computer Labs
SP-COM Research: drawn from ASU SP-COM research activities and from published research work from other universities
Feedback/Improvement
Pedagogies for transition of research to UG curriculum
Wireless Communications (cell phone appl.)
[Figure: block diagram] Input Speech -> Source Coder -> Channel Coding -> Modulator -> Channel -> Demodulator -> Channel Decoding -> Source Decoding -> Output Speech
Speech and Audio Coding for Mobile and Multimedia Applications
CRCD Activity, April 14, 2006
by
Andreas Spanias, Professor
DSP and Speech Processing Labs.
Dept. of Electrical Engineering
Arizona State University
Tempe, AZ 85287-5706
email: [email protected]
http://www.eas.asu.edu/~spanias
Topics
1. The Speech Coding Problem
2. Speech Processing Analysis-Synthesis Algorithms
3. Historical Perspective on Algorithmic Research
4. The Standards on Speech Coding
5. Algorithm Examples
6. Research / Remarks
Digital Speech
$s(n) = s(nT) = s(t)\big|_{t=nT}$
Why Digital Speech?
- Can be Manipulated with Software
- Opportunities for Encryption and Enhanced Privacy
- Stored with High Fidelity
- Error Control
- Mixing Voice/Data/Video - Multimedia
Continuous vs Discrete-time (digital) Speech
[Figure: a continuous-time (analog) signal x(t) plotted against t, and the corresponding discrete-time (digital) signal x(n) plotted against n with samples at 0, T, 2T, ...; a block Q maps x(t) to x(n).]
A signal that is bandlimited to B must be sampled at a rate $f_s \geq 2B$.
Telephone speech is typically bandlimited to 3.2 kHz and sampled at 8 kHz.
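As a hedged illustration of the sampling relation on this slide, the following Python sketch checks the condition that the sampling rate is at least twice the bandwidth and samples a sinusoid at the instants t = nT. The function names `min_sampling_rate` and `sample_sinusoid` and the 1 kHz test tone are assumptions chosen for illustration; they do not come from the slides.

```python
import math

def min_sampling_rate(bandwidth_hz):
    """Smallest sampling rate (Hz) allowed by the sampling theorem: fs >= 2B."""
    return 2.0 * bandwidth_hz

def sample_sinusoid(freq_hz, fs_hz, num_samples):
    """Return x(n) = x(nT) for x(t) = sin(2*pi*f*t), with T = 1/fs."""
    T = 1.0 / fs_hz
    return [math.sin(2.0 * math.pi * freq_hz * n * T) for n in range(num_samples)]

# Telephone speech figures from the slide: bandlimited to 3.2 kHz, sampled at 8 kHz.
B = 3200.0
fs = 8000.0
print(min_sampling_rate(B))        # 6400.0
print(fs >= min_sampling_rate(B))  # True

# Hypothetical 1 kHz test tone, sampled at 8 kHz (8 samples per period).
x = sample_sinusoid(1000.0, fs, 8)
```

With these numbers the 8 kHz rate leaves a margin above the 6.4 kHz minimum.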
Quantization Considerations
For uncompressed telephone speech: 8 bits per sample at 8000 samples per second, for a total of 8000 x 8 = 64 kilobits per second (kbits/s).
64 kbits/s PCM is often used as a reference for comparison.
To transmit this signal using a basic binary signaling scheme we need at least 32 kHz of bandwidth.
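The arithmetic on this slide can be sketched in a few lines of Python. The slide does not state how the 32 kHz figure is derived; the sketch assumes it is the Nyquist minimum for binary signaling, which is half the bit rate. Both function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample

def min_binary_bandwidth_hz(bit_rate_bps):
    """Assumed Nyquist minimum bandwidth for binary signaling: bit rate / 2."""
    return bit_rate_bps / 2.0

rate = pcm_bit_rate(8000, 8)
print(rate)                           # 64000
print(min_binary_bandwidth_hz(rate))  # 32000.0
```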
Speech Coding
Speech coding or Speech compression is the field concerned
with obtaining compact digital representations of voice
signals for the purpose of efficient transmission or storage.
Speech coding involves sampling and amplitude
quantization.
The objective of speech coding is to represent speech with a minimum number of bits while maintaining its perceptual quality.
Medium, Low, and Very-low Rate Speech Coding
The speech methods discussed in this course are those intended
for digital speech communications where speech is generally
bandlimited to 4 kHz ( or 3.2 kHz ) and sampled at 8 kHz.
medium-rate coding - the range of 8 - 16 kbits/s
low-rate coding - the range below 8 kbits/s and down to 2.4 kbits/s
very-low-rate coding - the range below 2.4 kbits/s
Remark: Cellular, Voice-Over-IP and speech streaming
applications typically use low-rate coders
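The rate ranges on this slide can be written as a small classifier. This is only a sketch: the function name `classify_rate` is hypothetical, the label for rates above 16 kbits/s is an assumption (the slide does not name that range), and the treatment of the exact boundary values follows the wording "below 8" and "down to 2.4".

```python
def classify_rate(kbps):
    """Map a bit rate in kbits/s to the coding class named on the slide."""
    if kbps > 16:
        return "above medium-rate"  # assumed label; not named on the slide
    if kbps >= 8:
        return "medium-rate"        # 8 - 16 kbits/s
    if kbps >= 2.4:
        return "low-rate"           # below 8 and down to 2.4 kbits/s
    return "very-low-rate"          # below 2.4 kbits/s

for rate in (64, 12.2, 4.75, 0.8):
    print(rate, classify_rate(rate))
```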
Historical Perspective
The First Vocoder - Dudley's Channel Vocoder
[Figure: block diagram of the channel vocoder with Analysis and Synthesis sides. Labels: Frequency Discriminator, Frequency Meter, Filter 0-25~, Pitch Channel, Oscillator, Noise, Pitch, Spectrum Channels, Filter 0-300~, Modulator, EQLZR. A total of ten channels.]
H. Dudley, "Remaking Speech," J. Acoust. Soc. Am., Vol. 11, p. 169, 1939.
H. Dudley, "The Vocoder," Bell Labs. Record, 17, p. 122, 1939.
Voiced and Unvoiced Speech
[Figure: two speech segments (TAPE TIME: 3840 and TAPE TIME: 8014). Each is shown as a time domain speech segment (amplitude from -1.0 to 1.0 over 0 to 32 ms) and as a magnitude spectrum in dB over 0 to 4 kHz. Annotations mark the fundamental frequency and the formant structure.]
Fine (Pitch) and Formant Structure of the
Short-time Speech Spectrum
Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords. Note the narrow peaks in the spectrum.
Formant Structure (Spectral Envelope): is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. Note the envelope peaks in the spectrum.
Simple Speech Synthesis Model (2)
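[Figure: block diagram of the synthesis model. The excitation, scaled by a gain, drives the VOCAL TRACT FILTER to produce SYNTHETIC SPEECH. Labels: gain, V/UV, Pitch. Requires hard (binary) voicing info.]
Vocal tract filter:
$$H(z) = \frac{b_0}{1 + \sum_{i=1}^{M} a_i z^{-i}}$$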
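A minimal Python sketch of this synthesis model follows, assuming the common reading of the diagram: a binary voiced/unvoiced decision selects either a periodic impulse train (at the pitch period) or white noise, the excitation is scaled by a gain, and the result is passed through the all-pole vocal tract filter H(z) = b0 / (1 + sum of a_i z^-i). All function names, the default pitch period of 80 samples, and the example coefficient are assumptions for illustration only.

```python
import random

def all_pole_filter(excitation, a, b0=1.0):
    """Filter with H(z) = b0 / (1 + sum_{i=1..M} a_i z^-i).

    Difference equation: y[n] = b0*x[n] - sum_i a_i * y[n-i], with a[0] = a_1.
    """
    y = []
    for n, x in enumerate(excitation):
        acc = b0 * x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y.append(acc)
    return y

def make_excitation(voiced, num_samples, pitch_period=80, seed=0):
    """Impulse train for voiced speech, white Gaussian noise for unvoiced."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def synthesize(voiced, gain, a, num_samples, pitch_period=80):
    """Scale the selected excitation by the gain and apply the vocal tract filter."""
    exc = make_excitation(voiced, num_samples, pitch_period)
    return all_pole_filter([gain * e for e in exc], a)

# Hypothetical one-pole "vocal tract" (a_1 = -0.5) with a short pitch period.
print(synthesize(True, 2.0, [-0.5], 4, pitch_period=2))  # [2.0, 1.0, 2.5, 1.25]
```

A real coder would use a filter order around 10 and update the coefficients, gain, pitch, and voicing decision every frame.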
Code Excited Linear Prediction (2)
The Nx1 error vector:
$$e_c(k) = s_w - \hat{s}_w^0 - g_k\, \hat{s}_w(k)$$
where $\hat{s}_w^0$ is the output due to the initial filter state.
Minimizing $\varepsilon_c(k) = e_c^T(k)\, e_c(k)$ w.r.t. $g_k$ we get
$$g_k = \frac{\bar{s}_w^T\, \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}, \qquad \bar{s}_w = s_w - \hat{s}_w^0$$
Code Excited Linear Prediction (3)
$$\varepsilon_c(k) = \bar{s}_w^T \bar{s}_w - \frac{\left(\bar{s}_w^T\, \hat{s}_w(k)\right)^2}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$$
The k-th excitation vector, $x_c(k)$, that minimizes $\varepsilon_c(k)$ is selected.
Closed-loop analysis is used for the LTP parameters; the lag $\tau$ takes integer values from 20 to 147.
M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at
Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
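The codebook search on the CELP slides can be sketched as follows: for every candidate, compute the gain that minimizes the error energy, evaluate the resulting error, and keep the candidate with the smallest error. This is a simplified sketch. It assumes the filtered codevectors (each excitation vector already passed through the weighted synthesis filter) are given as plain lists, and that `target` is the weighted speech with the zero-input response already subtracted. The names `dot` and `celp_search` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def celp_search(target, filtered_codebook):
    """Return (index, gain, error) of the best codevector.

    For candidate k with filtered vector s_hat:
        gain  = (target . s_hat) / (s_hat . s_hat)
        error = (target . target) - (target . s_hat)^2 / (s_hat . s_hat)
    """
    best = None
    target_energy = dot(target, target)
    for k, s_hat in enumerate(filtered_codebook):
        energy = dot(s_hat, s_hat)
        if energy == 0.0:
            continue  # an all-zero candidate cannot reduce the error
        corr = dot(target, s_hat)
        gain = corr / energy
        error = target_energy - corr * corr / energy
        if best is None or error < best[2]:
            best = (k, gain, error)
    return best

# Toy example with a hypothetical 3-entry codebook of length-2 vectors.
target = [1.0, 2.0]
codebook = [[1.0, 0.0], [2.0, 4.0], [0.0, 1.0]]
print(celp_search(target, codebook))  # (1, 0.5, 0.0)
```

In the toy example the second codevector is a scaled copy of the target, so a gain of 0.5 drives the error to zero.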
LTP synthesis filter:
$$H(z) = \frac{1}{1 - 0.95\, z^{-30}}$$
[Figure: impulse response of the filter.]
LTP excited by a random signal creates pseudo-periodicity
[Figure: frequency response; magnitude response (dB) versus normalized frequency (Nyquist = 1).]
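To make the pseudo-periodicity concrete, here is a small Python sketch of a one-tap long-term predictor synthesis filter. It assumes the filter on this slide has gain 0.95 and a lag of 30 samples, with the usual sign convention y[n] = x[n] + 0.95 y[n-30]; the function name `ltp_synthesis` and the noise seed are hypothetical.

```python
import random

def ltp_synthesis(excitation, gain=0.95, lag=30):
    """One-tap LTP synthesis: y[n] = x[n] + gain * y[n - lag].

    Corresponds to H(z) = 1 / (1 - gain * z^-lag).
    """
    y = []
    for n, x in enumerate(excitation):
        past = y[n - lag] if n >= lag else 0.0
        y.append(x + gain * past)
    return y

# Impulse response: a decaying pulse every `lag` samples.
impulse = [1.0] + [0.0] * 90
h = ltp_synthesis(impulse)
print([n for n, v in enumerate(h) if v != 0.0])  # [0, 30, 60, 90]

# Random excitation: the output repeats approximately every 30 samples.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
pseudo_periodic = ltp_synthesis(noise)
```

Each pulse in the impulse response is 0.95 times the previous one, which is why a random input acquires a slowly decaying repetition at the lag.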
Perceptual Weighting Filter (2)
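[Figure: magnitude response (dB) of the perceptual weighting filter for $\gamma = 0.9$.]
Perceptual Filter, $\gamma = 0.9$:
$$W(z) = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} \gamma^i a_i z^{-i}}$$
Short Term Predictor:
$$H(z) = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$$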
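A hedged Python sketch of the perceptual weighting filter follows. It assumes the standard form W(z) = A(z) / A(z/gamma) with A(z) = 1 - sum of a_i z^-i, so the denominator coefficients are the predictor coefficients scaled by gamma^i. The sign convention is an assumption, since the signs are not legible in the slide text. The names `bandwidth_expand` and `weighting_filter` are hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Return gamma**i * a_i for i = 1..p, where a[0] holds a_1."""
    return [(gamma ** i) * ai for i, ai in enumerate(a, start=1)]

def weighting_filter(x, a, gamma=0.9):
    """Apply W(z) = (1 - sum a_i z^-i) / (1 - sum gamma^i a_i z^-i).

    Difference equation:
        y[n] = x[n] - sum_i a_i * x[n-i] + sum_i gamma^i * a_i * y[n-i]
    """
    ag = bandwidth_expand(a, gamma)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= a[i - 1] * x[n - i]
                acc += ag[i - 1] * y[n - i]
        y.append(acc)
    return y

# With gamma = 1 the numerator and denominator cancel and the input is unchanged.
print(weighting_filter([1.0, 2.0, 3.0], [0.5], gamma=1.0))  # [1.0, 2.0, 3.0]
```

Values of gamma below 1 pull the poles of the denominator toward the origin, which broadens the formant peaks in the denominator response.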
Performance and Computational Complexity
A speech coding algorithm is designed and evaluated
based on:
1. Bit rate
2. The quality of reconstructed (coded) speech
3. The complexity of the algorithm
4. The end-to-end delay
Subjective Speech Quality
Broadcast
Broadcast wideband speech refers to high quality
commentary speech at rates above 64 kbits/s.
Network or toll
Toll or Network quality refers to quality comparable
to the classical analog speech (200-3200 Hz)
Communications
Communications quality implies somewhat degraded
speech quality but adequate for cellular communications.
Synthetic
Synthetic speech is usually intelligible but can be
unnatural and associated with a loss of speaker recognizability.
Wideband CDMA
Objective: to meet IMT 2000 requirements (at least 144 kb/s in a vehicular environment, 384 kb/s in a pedestrian environment, and 2048 kb/s in an indoor office environment)
To support next generation data services envisioned up to 2 Mb/s (full coverage and mobility for 144 kb/s, preferably 384 kb/s; limited coverage and mobility for 2 Mb/s)
Enhanced Voice Services (audioconferencing & voice mail)
Concurrent high-quality video/audio
Backward compatible with IS-95B
High security & low power
Significantly enhanced version of EVRC for voice services
- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html
- D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998.
- Vijay K. Garg (University of Illinois, Chicago), IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR (ECS Professional), December 1999.
GSM Adaptive Multirate Coder
Adjusts its bit-rate according to network load
Rates 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75kb/s
Based on CELP with 20 ms frame and 5 ms subframe
Multirate-ACELP with 10th order short-term LPC and perceptual weighting (uses the Levinson recursion)
Encodes LSPs using split VQ
An open-loop LTP estimate is first obtained and then refined by closed-loop search
The highest bit rate provides toll quality; half rate provides communications quality
- ETSI TS 126 090 V3.1.0 (2000-01), AMR Speech Codec; Transcoding Functions (3G TS 26.090), Technical Specification.
- E. Ekudden, R. Hagen, I. Johansson, and J. Svedberg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999.
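The slide mentions Levinson only by name. As a hedged sketch, the following Python function implements the textbook Levinson-Durbin recursion, which solves for linear prediction coefficients from autocorrelation lags. It is not taken from the AMR specification; the function name, the predictor convention (x[n] is approximated by the sum of a_i x[n-i]), and the example autocorrelation values are assumptions for illustration.

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation lags r[0..order].

    Returns (a, err), where a[i-1] is the predictor coefficient a_i in
    x[n] ~= sum_{i=1..order} a_i * x[n-i], and err is the final
    prediction error energy. Assumes r[0] > 0 and a nonsingular problem.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err
        # Update the coefficients of the order-m predictor.
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Hypothetical autocorrelation of a first-order process with coefficient 0.5.
coeffs, energy = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, energy)  # [0.5, 0.0] 0.75
```

For a 10th order short-term predictor as on the slide, the same function would be called with 11 autocorrelation lags and order 10.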
The Selectable Mode Vocoder
Algorithm to provide higher quality, flexibility, and capacity over the existing IS-96C, IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)
The Conexant SMV algorithm became the core technology for 3G CDMA (the core SMV algorithm to be refined in the interim by participating companies, according to the publication below)
Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and eighth rate at 800 bps
Pre-processing includes noise suppression similar to IS-127 EVRC
Full rate and half rate are based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the Conexant submission to the ITU G.4kbps standardization
Performed better than IS-733 and IS-127 in tests with and without background noise
Scored as high as 4.1 MOS at full rate with clean speech; performed very well with background noise
REFERENCES:
[1] Y. Gao, E. Shlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, "The SMV algorithm selected for TIA and 3GPP2 for CDMA applications," conference paper by Conexant Systems (portions published at ICASSP-2001).
STANDARDS AT A GLANCE
ITU Wideband Coding
  G.722    Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
  G.WB1    Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
  G.WB2    Coding of 7 kHz speech at 16 kbps or less (ongoing)
ITU Telephony
  G.711    PCM (64 kbps), late 60s
  G.726    ADPCM (32/40/24/16 kbps), 1988
  G.728    LD-CELP coding (16 kbps), 1992
  G.723.1  True Speech (5.3/6.3 kbps), 1995
  G.729    CS-ACELP (8/12.8/6.4 kbps), 1996 and Annex in 1998
  G.4kbps  Toll quality at 4 kbps (ongoing)
Non-ITU
  MPEG1/Audio (includes MP3), 1991
  MPEG2/Audio: 64 kbps (1992)
  MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
  MPEG7/Audio: audio/speech/MIDI coding (ongoing)
STANDARDS AT A GLANCE (2)
TIA
  CDMA
    IS96       8, 4, 2 kbps Q-CELP (Qualcomm CELP, 1992)
    IS127      8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)
    IS733      13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)
    3GPP2      0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)
  TDMA
    IS54       7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)
    IS641      7.4 kbps CELP (similar to EFR but at a lower rate, 1997)
    PCS1800    (GSM variant working at 1800 MHz)
    IS136-410  12.2 kbps US1 (1999)
ETSI (GSM)
  13 kbps RPE-LTP (Full-rate GSM, 1988)
  6.5 kbps VSELP (Half-rate GSM, 1993)
  12.2 kbps EFR (Enhanced full-rate GSM, 1996)
  12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)
ARIB Japan
  Full-rate PDC (Personal Digital Communication): 6.7 kbps VSELP
  Half-rate PDC: 3.45 kbps Multimode CELP
Vocoder/Waveform/Hybrid
[Figure: MOS (1-5) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64) for Vocoders, Hybrid Coders, and Waveform Coders. Labeled coders: LPC10e, MELP, CELP, SMV, ADPCM, PCM.]