
    April 2006 Copyright (c) 2006 - Andreas Spanias II-1

    Signal Processing for Communications

    An Introduction to Advanced Technology and Research for

    Undergraduates

    Related Technologies and Applications:

    Digital Cell Phones

    Technologies for Cable Modems and Wi-Fi

    Secure Military Communications

    April 14, 2006, 9:45am-12pm, SCOB 101

    Lectures and Modules for Undergraduates on:

    Speech and Audio Coders, Andreas Spanias

    Channel Coders, Tolga Duman

    Time-Varying Signal Processing, Antonia Papandreou-Suppappola

    Multicarrier and OFDM Systems, Cihan Tepedelenlioglu

    Sponsored by the NSF Combined Research and Curriculum Development Grant 0417604


    [Diagram: curriculum plan]

    SS EEE 303, RDA EEE 350, DSP EEE 407, CS EEE 455

    Summer Freshman and Sophomore Research Camps

    EEE 498 Intro to SP-COM Research

    LARGE 6-LECTURE 498 MODULES (LM)
    - Source Coding (6 lect/1 lab)
    - Channel Coding (6 lect/1 lab)
    - Multi-carrier (6 lect/1 lab)
    - Time-varying signaling (6 lect/1 lab)

    SMALL 1-LECTURE/LAB MODULES (SM), 1 lecture/1 exercise
    - 4 Module Summaries to inject in 303, 350, 407, 455

    DEMO MODULES (DM)
    - ASU J-DSP Technology for on-line Java Computer Labs

    SP-COM Research drawn from ASU SP-COM research activities and from published research work from other universities

    Feedback/Improvement

    Pedagogies for transition of research to UG curriculum


    Wireless Communications (cell phone appl.)

    [Block diagram: Input Speech -> Source Coder -> Channel Coding -> Modulator -> Channel -> Demodulator -> Channel Decoding -> Source Decoding -> Output Speech]


    Speech and Audio Coding for Mobile and Multimedia Applications

    CRCD Activity, April 14, 2006

    by

    Andreas Spanias, Professor

    DSP and Speech Processing Labs.

    Dept. of Electrical Engineering

    Arizona State University

    Tempe, AZ 85287-5706

    email: [email protected]

    http://www.eas.asu.edu/~spanias


    Topics

    1. The Speech Coding Problem

    2. Speech Processing Analysis-Synthesis Algorithms

    3. Historical Perspective on Algorithmic Research

    4. The Standards on Speech Coding

    5. Algorithm Examples

    6. Research / Remarks


    Digital Speech

    $s(n) = s(nT) = s(t)\big|_{t=nT}$

    Why Digital Speech?

    - Can be Manipulated with Software

    - Opportunities for Encryption and Enhanced Privacy

    - Stored with High Fidelity

    - Error Control

    - Mixing Voice/Data/Video - Multimedia


    Continuous vs Discrete-time (digital) Speech

    [Figure: a continuous-time (analog) signal x(t) versus t, and the corresponding discrete-time (digital) signal x(n) versus n at sampling instants 0, T, 2T, ...; a block labeled Q maps x(t) to x(n)]

    A signal that is bandlimited to B must be sampled at a rate $f_s \geq 2B$.

    Telephone speech is typically bandlimited to 3.2 kHz and sampled at 8 kHz.
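
    As a minimal sketch of the sampling condition on this slide, the Python below checks whether a sampling rate is at least twice the signal bandwidth and shows where a tone would appear after sampling. The function names are illustrative and not part of the original slides.

```python
def satisfies_nyquist(bandwidth_hz, fs_hz):
    """Return True when fs >= 2B, the sampling condition stated on the slide."""
    return fs_hz >= 2 * bandwidth_hz


def alias_frequency(f_hz, fs_hz):
    """Apparent frequency (0..fs/2) of a real tone at f_hz after sampling at fs_hz."""
    f = f_hz % fs_hz          # sampling cannot distinguish f from f + k*fs
    return min(f, fs_hz - f)  # fold into the baseband 0..fs/2


# Telephone speech: bandlimited to 3.2 kHz, sampled at 8 kHz.
print(satisfies_nyquist(3200, 8000))   # True
print(alias_frequency(3200, 8000))     # 3200: inside the band, unchanged
print(alias_frequency(5000, 8000))     # 3000: a 5 kHz tone would alias to 3 kHz
```

    The 5 kHz case illustrates why the signal is bandlimited before sampling: content above half the sampling rate folds back into the speech band.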


    Quantization Considerations

    For uncompressed telephone speech : 8 bits per sample

    8000 samples per second

    for a total of 8000 x 8 = 64 kilo bits per second (kbits/s)

    PCM at 64 kbits/s is often used as a reference for comparison

    To transmit this signal using a basic binary signaling scheme we need at least 32 kHz of bandwidth
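
    The arithmetic on this slide can be written out as a short Python sketch. The helper names are illustrative, and the bandwidth figure uses the usual assumption that basic binary signaling needs at least half the bit rate in Hz.

```python
def pcm_bit_rate(fs_hz, bits_per_sample):
    """Bit rate in bits per second of uncompressed PCM."""
    return fs_hz * bits_per_sample


def min_binary_bandwidth_hz(bit_rate_bps):
    """Minimum bandwidth for basic binary signaling: half the bit rate (assumption)."""
    return bit_rate_bps / 2


rate = pcm_bit_rate(8000, 8)
print(rate)                           # 64000 bits/s = 64 kbits/s
print(min_binary_bandwidth_hz(rate))  # 32000.0 Hz = 32 kHz
```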


    Speech Coding

    Speech coding or Speech compression is the field concerned

    with obtaining compact digital representations of voice

    signals for the purpose of efficient transmission or storage.

    Speech coding involves sampling and amplitude

    quantization.

    The objective of speech coding is to represent speech with a minimum number of bits while maintaining its perceptual quality.


    Medium, Low, and Very-low Rate Speech Coding

    The speech methods discussed in this course are those intended

    for digital speech communications where speech is generally

    bandlimited to 4 kHz ( or 3.2 kHz ) and sampled at 8 kHz.

    medium-rate coding - the range of 8 - 16 kbits/s

    low-rate coding - the range below 8 kbits/s and down to 2.4 kbits/s

    very-low-rate coding - the range below 2.4 kbits/s

    Remark: Cellular, Voice-Over-IP and speech streaming

    applications typically use low-rate coders
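
    The rate ranges above can be expressed as a small lookup, sketched here in Python. The function name and the label for rates above 16 kbits/s are illustrative choices, since the slide does not name that range.

```python
def rate_class(kbps):
    """Classify a speech coder bit rate (kbits/s) using the ranges on this slide."""
    if kbps < 2.4:
        return "very-low-rate"
    if kbps < 8:
        return "low-rate"
    if kbps <= 16:
        return "medium-rate"
    return "above medium-rate"  # label chosen here; the slide does not name this range


for kbps in (0.8, 2.4, 4.75, 8, 13, 64):
    print(kbps, rate_class(kbps))
```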


    Historical Perspective

    The First Vocoder - Dudley's Channel Vocoder

    [Figure: block diagram of the channel vocoder with Analysis and Synthesis sides. Labeled blocks: Pitch Channel (Frequency Discriminator, Frequency Meter, Filter 0-25~), Spectrum Channels (Filter 0-300~, EQLZR), Oscillator, Noise, Pitch, Modulator. A total of ten channels.]

    H. Dudley, "Remaking Speech," J. Acoust. Soc. Am., Vol. 11, p. 169, 1939.

    H. Dudley, "The Vocoder," Bell Labs. Record., 17, p. 122, 1939.


    Voiced and Unvoiced Speech

    [Figure: two time-domain speech segments (amplitude from -1.0 to 1.0 versus time, 0 to 32 ms; tape times 3840 and 8014), each shown with its magnitude spectrum (dB versus frequency, 0 to 4 kHz). Annotations mark the fundamental frequency and the formant structure.]


    Fine (Pitch) and Formant Structure of the

    Short-time Speech Spectrum

    Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords.

    Formant Structure (Spectral Envelope): is due to the

    interaction of the source and the vocal tract. The vocal tract

    consists of the pharynx and the mouth cavity.

    Note the narrow peaks

    Note the envelope peaks
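
    To make the quasi-periodicity concrete, here is a minimal Python sketch that estimates the pitch period of a synthetic periodic signal from its normalized autocorrelation. The autocorrelation method is a common textbook technique chosen for illustration, not a method taken from these slides. The test signal, the function name, and the default lag range (borrowed from the 20 to 147 sample LTP range quoted later in the deck) are assumptions.

```python
import math


def estimate_pitch_lag(x, min_lag=20, max_lag=147):
    """Return the lag (in samples) that maximizes the normalized autocorrelation."""
    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        den = math.sqrt(sum(v * v for v in x[lag:]) *
                        sum(v * v for v in x[:len(x) - lag]))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


fs = 8000
period = 80  # synthetic test value: 100 Hz fundamental at 8 kHz sampling
# Periodic test signal: a fundamental plus two harmonics.
x = [sum(math.sin(2 * math.pi * h * n / period) / h for h in (1, 2, 3))
     for n in range(400)]
lag = estimate_pitch_lag(x)
print(lag, fs / lag)  # 80 100.0
```

    The harmonics of the test signal are what would appear as the narrow, evenly spaced peaks in a short-time spectrum; the spacing of those peaks equals the fundamental frequency.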


    Simple Speech Synthesis Model (2)

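    [Figure: an excitation selected by a voiced/unvoiced (V/UV) switch, with pitch information for the voiced branch, is scaled by a gain and passed through the VOCAL TRACT FILTER to produce SYNTHETIC SPEECH. Requires hard (binary) voicing info.]

    $H(z) = \frac{b_0}{1 + \sum_{i=1}^{M} a_i z^{-i}}$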
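
    A minimal Python sketch of this two-state synthesis model follows: an impulse train (voiced) or white noise (unvoiced) is scaled by a gain and passed through an all-pole filter of the form $b_0 / (1 + \sum_i a_i z^{-i})$. The function names, the single-pole coefficient, and the pitch period are hypothetical values for illustration only.

```python
import random


def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(x, a, b0=1.0):
    """y[n] = b0*x[n] - sum_i a[i-1]*y[n-i], i.e. H(z) = b0 / (1 + sum a_i z^-i)."""
    y = []
    for n, xn in enumerate(x):
        acc = b0 * xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y.append(acc)
    return y


def synthesize(num_samples, voiced, gain, a, pitch_period=80):
    """Gain-scaled excitation passed through the vocal tract filter."""
    e = [gain * v for v in excitation(num_samples, voiced, pitch_period)]
    return all_pole_filter(e, a)


a = [-0.9]  # hypothetical single-pole "vocal tract", for illustration only
voiced = synthesize(160, True, gain=1.0, a=a, pitch_period=40)
unvoiced = synthesize(160, False, gain=0.1, a=a)
print(voiced[:3])     # decaying response to the first pitch pulse
print(len(unvoiced))  # 160
```

    The hard voiced/unvoiced switch is what the slide refers to as binary voicing information; a real coder would update the gain, pitch, and filter coefficients every frame.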


    Code Excited Linear Prediction (2)

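    The Nx1 error vector:

    $e_c(k) = s_w - \hat{s}_w^0 - g_k \hat{s}_w(k)$

    where $\hat{s}_w^0$ is the output due to the initial filter state.

    Minimizing $\epsilon_c(k) = e_c^T(k)\, e_c(k)$ w.r.t. $g_k$ we get

    $g_k = \frac{(s_w - \hat{s}_w^0)^T \hat{s}_w(k)}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$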


    Code Excited Linear Prediction (3)

    $\epsilon_c(k) = (s_w - \hat{s}_w^0)^T (s_w - \hat{s}_w^0) - \frac{\left( (s_w - \hat{s}_w^0)^T \hat{s}_w(k) \right)^2}{\hat{s}_w^T(k)\, \hat{s}_w(k)}$

    The k-th excitation vector, $X_c(k)$, that minimizes $\epsilon_c(k)$ is selected.

    Closed-loop analysis is used for the LTP parameters; the range of values for the LTP lag is within the integers 20 to 147.

    M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
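
    The closed-loop search can be sketched in Python as an exhaustive loop over the codebook. This is a simplified illustration under stated assumptions, not the coder from the cited paper. The target vector is taken to be the weighted speech minus the output due to the initial filter state. The filter uses the convention $1 / (1 - \sum_i a_i z^{-i})$. The codebook size, subframe length, and filter coefficient are hypothetical.

```python
import random


def synth_filter(x, a):
    """Zero-state all-pole filtering: y[n] = x[n] + sum_i a[i-1]*y[n-i]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codevector minimizing the squared error."""
    best = None
    for k, code in enumerate(codebook):
        s_hat = synth_filter(code, a)      # filtered k-th excitation
        energy = dot(s_hat, s_hat)
        if energy == 0.0:
            continue
        corr = dot(target, s_hat)
        gain = corr / energy               # optimal gain for this codevector
        error = dot(target, target) - corr * corr / energy
        if best is None or error < best[2]:
            best = (k, gain, error)
    return best


rng = random.Random(1)
N, size = 40, 64                           # hypothetical subframe length and codebook size
a = [0.8]                                  # hypothetical filter coefficient
codebook = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(size)]
# Build a target that is exactly 2.5 times the filtered codevector 17.
target = [2.5 * v for v in synth_filter(codebook[17], a)]
k, gain, error = search_codebook(target, codebook, a)
print(k, round(gain, 6), abs(error) < 1e-6)
```

    Because the gain is solved in closed form for each candidate, the search only needs one correlation and one energy per codevector; the error expression is the minimized squared error given above.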


    $\frac{1}{1 - 0.95 z^{-30}}$

    LTP excited by a random signal creates pseudo-periodicity

    [Figure: impulse response and frequency response of the LTP; magnitude response (dB) versus normalized frequency (Nyquist = 1)]
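
    The pseudo-periodicity can be checked with a short Python sketch. It assumes the long-term predictor on this slide is the one-tap recursion $1 / (1 - 0.95 z^{-30})$, that is, a lag of 30 samples and a gain of 0.95; the function names are illustrative.

```python
import random


def ltp_synthesis(x, lag=30, gain=0.95):
    """y[n] = x[n] + gain*y[n-lag], i.e. 1 / (1 - gain * z^-lag)."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + (gain * y[n - lag] if n >= lag else 0.0))
    return y


def normalized_autocorr(y, lag):
    """Autocorrelation at the given lag divided by the signal energy."""
    num = sum(y[n] * y[n - lag] for n in range(lag, len(y)))
    den = sum(v * v for v in y)
    return num / den


# Impulse response: pulses at multiples of the lag, decaying by 0.95 each time.
impulse = [1.0] + [0.0] * 99
h = ltp_synthesis(impulse)
print([round(h[n], 4) for n in (0, 30, 60, 90)])  # [1.0, 0.95, 0.9025, 0.8574]

# Random excitation: the output is strongly correlated with itself 30 samples back.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
y = ltp_synthesis(noise)
print(round(normalized_autocorr(noise, 30), 2), round(normalized_autocorr(y, 30), 2))
```

    The input noise has almost no correlation at a lag of 30 samples, while the filtered output does, which is the pseudo-periodicity noted on the slide.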


    Perceptual Weighting Filter (2)

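    [Figure: responses of the short-term predictor and of the perceptual filter with $\gamma = 0.9$; horizontal axis ticks from 0 to 600, vertical axis ticks from -15 to 30]

    Perceptual Filter, $\gamma = 0.9$:

    $W(z) = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} \gamma^i a_i z^{-i}}$

    Short Term Predictor:

    $H(z) = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$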
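
    As a minimal sketch, the perceptual weighting filter can be applied in Python by scaling each predictor coefficient $a_i$ by $\gamma^i$ for the denominator. This assumes the form $W(z) = (1 - \sum_i a_i z^{-i}) / (1 - \sum_i \gamma^i a_i z^{-i})$; the function names and the second-order coefficients are hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Scale predictor coefficient a_i by gamma**i (i = 1..p)."""
    return [ai * gamma ** i for i, ai in enumerate(a, start=1)]


def weighting_filter(x, a, gamma=0.9):
    """Apply W(z) = (1 - sum a_i z^-i) / (1 - sum gamma^i a_i z^-i) to x."""
    ag = bandwidth_expand(a, gamma)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= a[i - 1] * x[n - i]    # numerator (FIR) part
                acc += ag[i - 1] * y[n - i]   # denominator (recursive) part
        y.append(acc)
    return y


a = [1.2, -0.5]  # hypothetical 2nd-order predictor, for illustration only
print(bandwidth_expand(a, 0.9))
print(weighting_filter([1.0, 0.0, 0.0, 0.0], a, gamma=1.0))  # gamma = 1 gives W(z) = 1
```

    With $\gamma = 1$ the numerator and denominator cancel and the filter passes the signal unchanged; smaller values of $\gamma$ broaden the formant peaks of the denominator, so more coding noise is tolerated near the formants.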


    Performance and Computational Complexity

    A speech coding algorithm is designed and evaluated

    based on:

    1. Bit rate

    2. The quality of reconstructed (coded) speech

    3. The complexity of the algorithm

    4. The end-to-end delay


    Subjective Speech Quality

    Broadcast

    Broadcast wideband speech refers to high quality

    commentary speech at rates above 64 kbits/s.

    Network or toll

    Toll or Network quality refers to quality comparable

    to the classical analog speech (200-3200 Hz)

    Communications

    Communications quality implies somewhat degraded

    speech quality but adequate for cellular communications.

    Synthetic

    Synthetic speech is usually intelligible but can be unnatural and associated with a loss of speaker recognizability.


    Wideband CDMA

    Objective: to meet IMT 2000 requirements (at least 144 kb/s in a vehicular environment, 384 kb/s in a pedestrian environment, and 2048 kb/s in an indoor office environment)

    To support next-generation data services envisioned up to 2 Mb/s (full coverage and mobility for 144 kb/s, preferably 384 kb/s; limited coverage and mobility for 2 Mb/s)

    Enhanced Voice Services (audioconferencing & voice mail)

    Concurrent high-quality video/audio

    Backward compatible with IS-95B

    high security & low power

    Significantly enhanced version of EVRC for voice services

    - http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html

    - D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998.

    - Vijay K. Garg (University of Illinois, Chicago, Illinois), IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR (ECS Professional), December 1999.


    GSM Adaptive Multirate Coder

    Adjusts its bit-rate according to network load

    Rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75 kb/s

    Based on CELP with 20 ms frame and 5 ms subframe

    Multirate-ACELP with 10th order short-term LPC and perceptual weighting (uses the Levinson recursion)

    Encodes LSPs using split VQ

    An open loop LTP is first obtained and refined by closed loop

    Highest bit rate provides toll quality & half rate provides communications

    quality

    - ETSI TS 126 090 V.3.1.0 2000-01, AMR Speech Codec Transcoding Functions, 3G-TS 26.090 Technical Specification.

    - E. Ekudden, R. Hagen, I. Johansson, and J. Svedberg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999.
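
    The frame structure above fixes how many bits each mode can spend per frame. The Python sketch below works this out from the listed rates, the 20 ms frame, and the 5 ms subframe, assuming the 8 kHz sampling rate used for narrowband speech elsewhere in these slides. The variable names are illustrative.

```python
FS_HZ = 8000          # narrowband speech sampling rate (assumed, as elsewhere in the slides)
FRAME_MS = 20
SUBFRAME_MS = 5
AMR_RATES_BPS = [12200, 10200, 7950, 6700, 5900, 5150, 4750]

samples_per_frame = FS_HZ * FRAME_MS // 1000        # 160
samples_per_subframe = FS_HZ * SUBFRAME_MS // 1000  # 40
subframes_per_frame = FRAME_MS // SUBFRAME_MS       # 4

# Bits available per 20 ms frame at each rate (integer arithmetic avoids rounding issues).
bits_per_frame = {rate: rate * FRAME_MS // 1000 for rate in AMR_RATES_BPS}
for rate, bits in bits_per_frame.items():
    print(f"{rate / 1000:5.2f} kb/s -> {bits} bits per 20 ms frame")
```

    At the highest rate this gives 244 bits per frame, compared with the 1280 bits (160 samples x 8 bits) that 64 kbits/s PCM would use for the same 20 ms.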


    The Selectable Mode Vocoder

    Algorithm to provide higher quality, flexibility, and capacity over existing IS-96C, IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)

    The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV

    algorithm to be refined in the interim by participating companies according to the

    publication below)

    Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and

    eighth rate at 800 bps

    Pre-processing includes noise suppression similar to IS 127 EVRC

    Full rate and half rate based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the ITU G.4 Conexant submission to ITU-4

    Performed better than IS-733 and IS-127 in tests with and without background noise

    Scored as high as 4.1 MOS at full rate with clean speech. Performed very well with

    background noise

    REFERENCES:

    [1] The SMV algorithm selected for TIA and 3GPP2 for CDMA applications, conference paper by Conexant Systems, Y. Gao, E. Schlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia (portions published at ICASSP-2001)


    STANDARDS AT A GLANCE

    ITU Wideband Coding
    G.722      Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
    G.WB1      Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
    G.WB2      Coding of 7 kHz speech at 16 kbps or less (ongoing)

    ITU Telephony
    G.711      PCM (64 kbps), late 60s
    G.726      ADPCM (32/40/24/16 kbps), 1988
    G.728      LD-CELP coding (16 kbps), 1992
    G.723.1    True Speech (5.3/6.3 kbps), 1995
    G.729      CS-ACELP (8/12.8/6.4 kbps), 1996 and Annex in 1998
    G.4kbps    Toll quality at 4 kbps (ongoing)

    Non-ITU
    MPEG1/Audio (includes MP3), 1991
    MPEG2/Audio: 64 kbps (1992)
    MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
    MPEG7/Audio: audio/speech/MIDI coding (ongoing)


    STANDARDS AT A GLANCE (2)

    TIA

    CDMA
    IS96         8, 4, 2 kbps              Q-CELP (Qualcomm CELP, 1992)
    IS127        8.55, 4, 0.8 kbps         EVRC (Enhanced Variable Rate Coder, 1996)
    IS733        13.3, 6.2, 2.7, 1 kbps    VRC (Variable Rate Coder, 1998)
    3GPP2        0.8-8.55 kbps             SMV (Selectable Mode Vocoder, 2001)

    TDMA
    IS54         7.95 kbps                 VSELP (Vector-Sum Excited Linear Prediction, 1989)
    IS641        7.4 kbps                  CELP (similar to EFR but at lower rate, 1997)
    PCS1800      (GSM variant working at 1800 MHz)
    IS136-410    12.2 kbps                 US1 (1999)

    ETSI (GSM):
    13 kbps            RPE-LTP (Full rate GSM, 1988)
    6.5 kbps           VSELP (Half-rate GSM, 1993)
    12.2 kbps          EFR (Enhanced full-rate GSM, 1996)
    12.2 - 4.75 kbps   AMR (Adaptive Multi Rate, 1999)

    ARIB Japan
    Full-rate PDC (Personal Digital Communication)    6.7 kbps     VSELP
    Half-rate PDC                                     3.45 kbps    Multimode CELP


    Vocoder/Waveform/Hybrid

    [Figure: MOS (1-5) versus bit rate (kbps; ticks at 1, 2, 4, 8, 16, 32, 64) for three coder classes: Vocoders, Hybrid Coders, and Waveform Coders. Labeled coders: LPC10e, MELP, CELP, SMV, ADPCM, PCM.]
