
April 2006 Copyright (c) 2006 - Andreas Spanias II-1

Signal Processing for Communications

An Introduction to Advanced Technology and Research for

Undergraduates

Related Technologies and Applications:

Digital Cell Phones

Technologies for Cable Modems and Wi-Fi

Secure Military Communications

April 14, 2006, 9:45am-12pm, SCOB 101

Lectures and Modules for Undergraduates on:

Speech and Audio Coders, Andreas Spanias

Channel Coders, Tolga Duman

Time-Varying Signal Processing, Antonia Papandreou-Suppappola

Multicarrier and OFDM Systems, Cihan Tepedelenlioglu

Sponsored by the NSF Combined Research and Curriculum Development Grant 0417604


SS EEE 303, RDA EEE 350, DSP EEE 407, CS EEE 455

Summer Freshman and Sophomore Research Camps

EEE 498 Intro to SP-COM Research

LARGE 6-LECTURE 498 MODULES (LM)

Source Coding (6 lect/1 lab)

Channel Coding (6 lect/1 lab)

Multi-carrier (6 lect/1 lab)

Time-varying signaling (6 lect/1 lab)

SMALL 1-LECTURE/LAB MODULES (SM): 1 lecture/1 exercise

4 Module Summaries to inject in 303, 350, 407, 455

DEMO MODULES (DM)

ASU J-DSP Technology for on-line Java Computer Labs

SP-COM Research drawn from ASU SP-COM research activities and from published research work from other universities

Feedback/Improvement

Pedagogies for transition of research to UG curriculum




Wireless Communications (cell phone appl.)

Input Speech -> Source Coder -> Channel Coding -> Modulator -> Channel -> Demodulator -> Channel Decoding -> Source Decoding -> Output Speech


Speech and Audio Coding for Mobile and Multimedia Applications

CRCD Activity, April 14, 2006

by

Andreas Spanias, Professor

DSP and Speech Processing Labs.

Dept. of Electrical Engineering

Arizona State University

Tempe, AZ 85287-5706

email: [email protected]

http://www.eas.asu.edu/~spanias




Topics

1. The Speech Coding Problem

2. Speech Processing Analysis-Synthesis Algorithms

3. Historical Perspective on Algorithmic Research

4. The Standards on Speech Coding

5. Algorithm Examples

6. Research / Remarks


Digital Speech

$s(n) = s(nT) = s(t)\big|_{t=nT}$

Why Digital Speech?

- Can be Manipulated with Software

- Opportunities for Encryption and Enhanced Privacy

- Stored with High Fidelity

- Error Control

- Mixing Voice/Data/Video - Multimedia




Continuous vs Discrete-time (digital) Speech

[Figure: a continuous-time (analog) signal x(t) plotted against t, and the corresponding discrete-time (digital) signal x(n) plotted against n at the sampling instants 0, T, 2T, ...; a block Q maps x(t) to x(n).]

A signal that is bandlimited to B must be sampled at a rate $f_s$ with $f_s \ge 2B$.

Telephone speech is typically bandlimited to 3.2 kHz and sampled at 8 kHz.
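As a rough illustration of the sampling condition $f_s \ge 2B$, the sketch below checks the condition and computes where a single tone would appear after sampling. The function names and the 5 kHz test tone are illustrative assumptions, not part of the lecture.

```python
def satisfies_nyquist(bandwidth_hz, fs_hz):
    """True when the sampling rate is at least twice the signal bandwidth."""
    return fs_hz >= 2 * bandwidth_hz


def apparent_frequency(f_hz, fs_hz):
    """Frequency in [0, fs/2] at which a real tone at f_hz appears after sampling."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)


if __name__ == "__main__":
    # Telephone speech: bandlimited to 3.2 kHz, sampled at 8 kHz.
    print(satisfies_nyquist(3200, 8000))
    # A tone inside the band keeps its frequency.
    print(apparent_frequency(1000, 8000))
    # A tone above fs/2 folds back into the band (aliasing).
    print(apparent_frequency(5000, 8000))
```

A 5 kHz tone violates the condition at 8 kHz sampling and becomes indistinguishable from a 3 kHz tone, which is why speech is bandlimited before it is sampled.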


Quantization Considerations

For uncompressed telephone speech : 8 bits per sample

8000 samples per second

for a total of 8000 x 8 = 64 kilobits per second (kbits/s)

PCM at 64 kbits/s is often used as a reference for comparison

To transmit this signal using a basic binary signaling scheme we need at least 32 kHz of bandwidth
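The arithmetic on this slide can be sketched as follows. The 32 kHz figure is assumed here to come from the Nyquist minimum of R/2 Hz for ideal binary signaling at R bits/s; the function names are illustrative.

```python
def pcm_bit_rate(fs_hz, bits_per_sample):
    """Bit rate of uncompressed PCM in bits per second."""
    return fs_hz * bits_per_sample


def min_binary_bandwidth_hz(bit_rate_bps):
    """Minimum bandwidth for binary signaling.

    Assumption: ideal (Nyquist) signaling, which carries 2 bits/s per Hz.
    """
    return bit_rate_bps / 2


if __name__ == "__main__":
    rate = pcm_bit_rate(8000, 8)
    print(rate, "bits/s")
    print(min_binary_bandwidth_hz(rate), "Hz")
```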




Speech Coding

Speech coding or Speech compression is the field concerned

with obtaining compact digital representations of voice

signals for the purpose of efficient transmission or storage.

Speech coding involves sampling and amplitude

quantization.

The objective of speech coding is to represent speech with a minimum number of bits while maintaining its perceptual

quality.


Medium, Low, and Very-low Rate Speech Coding

The speech methods discussed in this course are those intended

for digital speech communications where speech is generally

bandlimited to 4 kHz (or 3.2 kHz) and sampled at 8 kHz.

medium-rate: the range of 8 to 16 kbits/s

low-rate: the range below 8 kbits/s and down to 2.4 kbits/s

very-low-rate: the range below 2.4 kbits/s

Remark: Cellular, Voice-Over-IP and speech streaming

applications typically use low-rate coders
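The rate ranges above can be written as a small lookup. This is only a sketch: the function name, the label for rates above 16 kbits/s, and the choice to count exactly 8 kbits/s as medium-rate are assumptions.

```python
def rate_class(kbps):
    """Map a bit rate in kbits/s to the rate classes used in this lecture."""
    if kbps > 16:
        return "above medium-rate"  # hypothetical label, not defined in the lecture
    if kbps >= 8:
        return "medium-rate"
    if kbps >= 2.4:
        return "low-rate"
    return "very-low-rate"


if __name__ == "__main__":
    for r in (64, 12.2, 4.75, 2.4, 0.8):
        print(r, "kbits/s:", rate_class(r))
```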




Historical Perspective

The First Vocoder - Dudley's Channel Vocoder

[Figure: block diagram of Dudley's channel vocoder, with an Analysis side and a Synthesis side. Block labels: Pitch Channel (Frequency Discriminator, Frequency Meter, Filter 0-25~), Spectrum Channels (EQLZR, Filter 0-300~, Filter 0-25~, Modulator), Oscillator, Noise, Pitch. A total of ten channels.]

H. Dudley, "Remaking Speech," J. Acoust. Soc. Am., Vol. 11, p. 169, 1939.

H. Dudley, "The Vocoder," Bell Labs. Record., 17, p. 122, 1939.


Voiced and Unvoiced Speech

[Figure: two speech segments (tape times 3840 and 8014), each shown as a time-domain waveform (amplitude from -1.0 to 1.0 versus time, 0 to 32 ms) and a magnitude spectrum (dB versus frequency, 0 to 4 kHz). Annotations mark the fundamental frequency and the formant structure.]




Fine (Pitch) and Formant Structure of the

Short-time Speech Spectrum

Fine Harmonic Structure : reflects the quasi-periodicity of

speech and is attributed to the vibrating vocal cords.

Formant Structure (Spectral Envelope): is due to the

interaction of the source and the vocal tract. The vocal tract

consists of the pharynx and the mouth cavity.

Note the narrow peaks

Note the envelope peaks


Simple Speech Synthesis Model (2)

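[Figure: block diagram of the synthesis model. Inputs: voicing decision (V/UV), pitch, and gain. The excitation drives the VOCAL TRACT FILTER, whose output is the SYNTHETIC SPEECH. The model requires hard (binary) voicing information.]

Vocal tract filter:

$$H(z) = \frac{b_0}{1 + \sum_{i=1}^{M} a_i z^{-i}}$$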
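A minimal sketch of this synthesis model follows, assuming the all-pole vocal tract filter $H(z) = b_0 / (1 + \sum_{i=1}^{M} a_i z^{-i})$ with the gain folded into $b_0$, an impulse train as the voiced excitation, and white Gaussian noise as the unvoiced excitation. The function names and the example coefficients are illustrative, not taken from the lecture.

```python
import random


def excitation(n_samples, voiced, pitch_period, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(x, a, b0=1.0):
    """y[n] = b0*x[n] - sum_{i=1..M} a[i-1]*y[n-i]."""
    y = []
    for n, xn in enumerate(x):
        acc = b0 * xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y.append(acc)
    return y


def synthesize(n_samples, voiced, pitch_period, gain, a):
    """Hard V/UV switch, then gain, then the vocal tract filter."""
    e = excitation(n_samples, voiced, pitch_period)
    return all_pole_filter(e, a, b0=gain)


if __name__ == "__main__":
    # One-pole example filter (hypothetical coefficient) at a pitch period of 4 samples.
    print(synthesize(8, True, 4, 2.0, [-0.5]))
```

The single binary voicing flag is what the slide calls hard voicing information: every frame is treated as either fully periodic or fully noise-like.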




Code Excited Linear Prediction (2)

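The Nx1 error vector

$$e_c(k) = s_w - \hat{s}_w^0 - g_k \hat{s}_w(k)$$

where $\hat{s}_w^0$ is the output due to the initial filter state.

Minimizing $\epsilon_c(k) = e_c^T(k)\, e_c(k)$ w.r.t. $g_k$ we get

$$g_k = \frac{\bar{s}_w^T \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}, \qquad \bar{s}_w = s_w - \hat{s}_w^0$$


Code Excited Linear Prediction (3)

$$\epsilon_c(k) = \bar{s}_w^T \bar{s}_w - \frac{\left(\bar{s}_w^T \hat{s}_w(k)\right)^2}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$$

The k-th excitation vector, $X_c(k)$, that minimizes $\epsilon_c(k)$ is selected

closed-loop analysis is used for LTP parameters; the LTP lag ranges over the integers 20 to 147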

M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at

Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
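The closed-loop codebook search can be sketched as below: for each candidate, compute the optimal gain, evaluate the resulting squared error, and keep the best index. The sketch assumes the target vector already has the zero-input response of the filter removed and that each codebook entry has already been passed through the weighted synthesis filter; the names `target`, `filtered_codebook`, and `search_codebook` are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def search_codebook(target, filtered_codebook):
    """Return (k, gain, error) for the entry with the smallest squared error.

    target            : weighted speech minus the zero-input response
    filtered_codebook : list of filtered excitation vectors, one per index k
    """
    best = None
    target_energy = dot(target, target)
    for k, s_hat in enumerate(filtered_codebook):
        energy = dot(s_hat, s_hat)
        if energy == 0.0:
            continue  # an all-zero vector cannot reduce the error
        corr = dot(target, s_hat)
        gain = corr / energy                         # optimal gain for this entry
        error = target_energy - corr * corr / energy  # error at the optimal gain
        if best is None or error < best[2]:
            best = (k, gain, error)
    return best


if __name__ == "__main__":
    target = [1.0, 2.0, 0.0]
    codebook = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(search_codebook(target, codebook))
```

Because the target energy is the same for every candidate, minimizing the error is equivalent to maximizing the ratio of squared correlation to energy, which is how practical searches avoid recomputing the full error.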




LTP synthesis filter:

$$\frac{1}{1 - 0.95\, z^{-30}}$$

LTP excited by a random signal creates pseudo-periodicity

[Figure: impulse response and frequency response (magnitude response in dB versus normalized frequency, Nyquist = 1) of the LTP filter.]
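A small sketch of the long-term predictor (LTP) synthesis filter, assuming the form $1/(1 - 0.95 z^{-30})$, that is, a feedback gain of 0.95 at a lag of 30 samples. The helper names are illustrative. The impulse response repeats every 30 samples with decaying amplitude, and a noise input acquires a strong correlation at the lag.

```python
import random


def ltp_synthesis(x, gain=0.95, lag=30):
    """y[n] = x[n] + gain * y[n - lag]."""
    y = []
    for n, xn in enumerate(x):
        feedback = gain * y[n - lag] if n >= lag else 0.0
        y.append(xn + feedback)
    return y


def normalized_autocorr(y, lag):
    """Autocorrelation at the given lag divided by the signal energy."""
    num = sum(y[n] * y[n - lag] for n in range(lag, len(y)))
    den = sum(v * v for v in y)
    return num / den


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 90
    h = ltp_synthesis(impulse)
    print(h[0], h[30], h[60])  # taps of the impulse response at multiples of the lag

    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
    y = ltp_synthesis(noise)
    # Pseudo-periodicity: correlation at the lag versus at half the lag.
    print(normalized_autocorr(y, 30), normalized_autocorr(y, 15))
```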


Perceptual Weighting Filter (2)

[Figure: magnitude response of the perceptual filter for $\gamma = 0.9$.]

Perceptual weighting filter:

$$W(z) = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} \gamma^{i} a_i z^{-i}}$$

Short Term Predictor:

$$H(z) = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$$
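The perceptual weighting filter can be built directly from the short-term predictor coefficients. The sketch below assumes $W(z) = A(z)/A(z/\gamma)$ with $A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$, so the denominator coefficients are the numerator coefficients scaled by $\gamma^i$. The function names and the second-order example coefficients are illustrative.

```python
def weighting_filter_coeffs(a, gamma=0.9):
    """Return (numerator, denominator) of W(z) as coefficients of z^0, z^-1, ..."""
    num = [1.0] + [-ai for ai in a]
    den = [1.0] + [-(gamma ** i) * ai for i, ai in enumerate(a, start=1)]
    return num, den


def iir_filter(b, a, x):
    """Direct-form filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


if __name__ == "__main__":
    lpc = [0.5, -0.25]  # hypothetical predictor coefficients
    num, den = weighting_filter_coeffs(lpc, gamma=0.9)
    print(num, den)
    print(iir_filter(num, den, [1.0, 0.0, 0.0, 0.0]))
```

With $\gamma = 1$ the numerator and denominator coincide and the filter passes the signal unchanged; smaller values of $\gamma$ widen the formant bandwidths in the denominator, so the weighted error is de-emphasized near the formants.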




Performance and Computational Complexity

A speech coding algorithm is designed and evaluated

based on:

1. Bit rate

2. The quality of reconstructed (coded) speech

3. The complexity of the algorithm

4. The end-to-end delay


Subjective Speech Quality

Broadcast

Broadcast wideband speech refers to high quality

commentary speech at rates above 64 kbits/s.

Network or toll

Toll or Network quality refers to quality comparable

to the classical analog speech (200-3200 Hz)

Communications

Communications quality implies somewhat degraded

speech quality but adequate for cellular communications.

Synthetic

Synthetic speech is usually intelligible but can be

unnatural and associated with a loss of speaker recognizability.




Wideband CDMA

Objective to meet IMT 2000 requirements (at least 144 kb/s in a vehicular environment, 384 kb/s in a pedestrian environment, and 2048 kb/s in an indoor

office environment)

To support next generation data services envisioned up to 2 Mb/s (full coverage

and mobility for 144 kb/s, preferably 384 kb/s; limited coverage and mobility

for 2 Mb/s)

Enhanced Voice Services (audioconferencing & voice mail)

Concurrent high-quality video/audio

Backward compatible with IS-95B

high security & low power

Significantly enhanced version of EVRC for voice services

- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html

- D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998

- Vijay K. Garg (University of Illinois, Chicago), IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR (ECS Professional), December 1999


GSM Adaptive Multirate Coder

Adjusts its bit-rate according to network load

Rates 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75 kb/s

Based on CELP with 20 ms frame and 5 ms subframe

Multirate-ACELP with 10th order short-term LPC and perceptual

weighting (uses the Levinson-Durbin recursion)

Encodes LSPs using split VQ

An open loop LTP is first obtained and refined by closed loop

Highest bit rate provides toll quality & half rate provides communications

quality

- ETSI TS 126 090 V3.1.0 (2000-01), AMR Speech Codec; Transcoding Functions (3G TS 26.090), Technical Specification

- E. Ekudden, R. Hagen, I. Johansson, and J. Svedberg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999
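The frame structure above fixes how many samples and bits each mode handles per frame. A short sketch of that arithmetic, assuming the 8 kHz sampling rate used for telephone speech in this lecture; the constant and function names are illustrative.

```python
FS_HZ = 8000        # assumed telephone-band sampling rate
FRAME_MS = 20       # frame length from the slide
SUBFRAME_MS = 5     # subframe length from the slide
AMR_RATES_KBPS = [12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75]


def samples_per(duration_ms, fs_hz=FS_HZ):
    """Number of samples in a block of the given duration."""
    return fs_hz * duration_ms // 1000


def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """Bits available per frame: kbits/s times ms gives bits."""
    return round(rate_kbps * frame_ms)


if __name__ == "__main__":
    print("samples per frame:", samples_per(FRAME_MS))
    print("samples per subframe:", samples_per(SUBFRAME_MS))
    for rate in AMR_RATES_KBPS:
        print(rate, "kb/s ->", bits_per_frame(rate), "bits per frame")
```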




The Selectable Mode Vocoder

Algorithm to provide higher quality, flexibility, and capacity over existing IS-96C,

IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)

The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV

algorithm to be refined in the interim by participating companies according to the

publication below)

Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and

eighth rate at 800 bps

Pre-processing includes noise suppression similar to IS-127 EVRC

Full rate and half rate based on Conexant's eXtended CELP (eX-CELP), a core

technology also used in the ITU G.4 Conexant submission to ITU-4

Performed better than IS-733 and IS-127 in tests with and without background noise

Scored as high as 4.1 MOS at full rate with clean speech. Performed very well with

background noise

REFERENCES: [1] Y. Gao, E. Shlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, "The SMV algorithm selected by TIA and 3GPP2 for CDMA applications," conference paper by Conexant Systems (portions published at ICASSP-2001)


STANDARDS AT A GLANCE

ITU Wideband Coding

G.722: Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
G.WB1: Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
G.WB2: Coding of 7 kHz speech at 16 kbps or less (ongoing)

ITU Telephony

G.711: PCM (64 kbps), late 60s
G.726: ADPCM (32/40/24/16 kbps), 1988
G.728: LD-CELP coding (16 kbps), 1992
G.723.1: True Speech (5.3/6.3 kbps), 1995
G.729: CS-ACELP (8/12.8/6.4 kbps), 1996 and Annex in 1998
G.4kbps: Toll quality at 4 kbps (ongoing)

Non-ITU

MPEG1/Audio (includes MP3), 1991
MPEG2/Audio: 64 kbps (1992)
MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
MPEG7/Audio: audio/speech/MIDI coding (ongoing)




STANDARDS AT A GLANCE (2)

TIA

CDMA

IS-96: 8, 4, 2 kbps Q-CELP (Qualcomm CELP, 1992)
IS-127: 8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)
IS-733: 13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)
3GPP2: 0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)

TDMA

IS-54: 7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)
IS-641: 7.4 kbps CELP (similar to EFR but at lower rate, 1997)
PCS1800 (GSM variant working at 1800 MHz)
IS-136-410: 12.2 kbps US1 (1999)

ETSI (GSM)

13 kbps RPE-LTP (Full-rate GSM, 1988)
6.5 kbps VSELP (Half-rate GSM, 1993)
12.2 kbps EFR (Enhanced full-rate GSM, 1996)
12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)

ARIB Japan

Full-rate PDC (Personal Digital Communication): 6.7 kbps VSELP
Half-rate PDC: 3.45 kbps Multimode CELP


Vocoder/Waveform/Hybrid

[Figure: speech quality (MOS, scale 1-5) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for vocoders, hybrid coders, and waveform coders. Coders marked on the chart: LPC10e, MELP, CELP, SMV, ADPCM, PCM.]

