introduction to mpeg-1/2 l2 audio - ntut.edu.tw · 2015-08-10 · tone masks noise (tmn) ~ -18 db...

Introduction to MPEG-1/2 L2 Audio. 尤信程, Department of Computer Science and Information Engineering, National Taipei University of Technology


Post on 26-Jul-2020


  • Introduction to MPEG-1/2 L2 Audio

    尤信程

    Department of Computer Science and Information Engineering, National Taipei University of Technology

  • Contents

    Intro to audio coding
    Psychoacoustics
    Audio encoding
    Frame structure
    Audio decoding
    Conclusions

  • Intro to audio coding (1)

    Audio coding uses MPEG-1/2 (Part 3) Layer II, originally developed by Philips.
    MPEG-1 Layer III audio is known as MP3 (by FhG-IIS).
    Sampling rate: MPEG-1 = 48 kS/s, MPEG-2 = 24 kS/s.
    Each audio frame encodes 1152 PCM samples per channel.
    Audio frame duration: MPEG-1 = 24 ms, MPEG-2 = 48 ms.
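The frame durations follow directly from the 1152-sample frame length. A minimal Python sketch of that arithmetic; the helper names and the 192 kbps example bitrate are illustrative assumptions, not taken from the slides, and padding is assumed absent:

```python
SAMPLES_PER_FRAME = 1152  # PCM samples per channel in one Layer II frame


def frame_duration_ms(sampling_rate_hz):
    """Duration of one audio frame in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME / sampling_rate_hz


def frame_size_bytes(bitrate_bps, sampling_rate_hz):
    """Bytes per frame at a constant bitrate (assumes no padding)."""
    return bitrate_bps * SAMPLES_PER_FRAME // (8 * sampling_rate_hz)


print(frame_duration_ms(48000))         # 24.0
print(frame_duration_ms(24000))         # 48.0
print(frame_size_bytes(192000, 48000))  # 576
```

Halving the sampling rate doubles the frame duration because the number of samples per frame stays fixed.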

  • Intro to audio coding (2)

    DAB audio uses MPEG-1 Layer II and MPEG-2 LSF (Low Sampling Frequency) Layer II coding.
    DVB-T audio uses MPEG-1 Layer II.
    DVB-T audio can optionally use other audio coding techniques such as MPEG-2 MC Layer II and Dolby AC-3.
    DAB also uses PAD (Program Associated Data) for carrying data.
    DVB-T does not use PAD, as it can transmit data via the TS (transport stream).

  • Intro to audio coding (3)

    MPEG-1 audio is for higher bitrates; MPEG-2 is for lower bitrates (lower audio quality). This is contrary to common belief.
    Using a lower sampling frequency for low-bitrate coding is common practice in the audio community.
    The MPEG-2 Multi-Channel (MC) extension is not currently used in Taiwan.
    MPEG-4 HE-AAC could be a future coding scheme.

  • Intro to audio coding (4)

    Features of the MPEG-1/2 audio standard:
        Psychoacoustic model
        32-subband filterbank
        Bit allocation table
        SCaleFactor Selection Info (SCFSI)
        Data grouping

    Bitrate range: 32 kbps ~ 384 kbps

  • Psychoacoustics (1)

    Brief intro to psychoacoustics:
        Absolute hearing threshold
        Frequency masking (main factor)
        Temporal masking

    Layer I/II uses psychoacoustic model I.
    Layer III uses psychoacoustic model II.

  • Psychoacoustics (2)

    The absolute hearing threshold is the minimum sound level human beings can hear.
    Everyone has a (slightly) different threshold.
    SPL (Sound Pressure Level) = $20 \log_{10}(P/P_0)$ dB, where $P_0 = 2 \times 10^{-5}$ N/m².

    [Figure: absolute hearing threshold; axes: Freq, SPL]
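The SPL formula can be checked numerically. A small Python sketch, where the function name `spl_db` is an illustrative assumption:

```python
import math

P0 = 2e-5  # reference sound pressure in N/m^2


def spl_db(pressure):
    """Sound pressure level in dB relative to P0."""
    return 20.0 * math.log10(pressure / P0)


print(spl_db(2e-5))  # 0.0: a sound at the reference pressure
print(spl_db(2e-4))  # about 20 dB: ten times the reference pressure
```

Every tenfold increase in pressure adds 20 dB.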

  • Psychoacoustics (3)

    Frequency masking: human ears can be modeled as many bandpass filters.
    Different signals within the passband of one filter interfere with each other. The stronger one masks the weaker one.

    [Figure: frequency masking; axes: Freq, SPL]

  • Psychoacoustics (4)

    The passband of a bandpass filter is known as one critical band. There are around 25 critical bands.
    Noise Masks Tone (NMT) ~ -6 dB
    Tone Masks Noise (TMN) ~ -18 dB
    Inter-critical-band masking also exists.
    Frequency masking assumes stationary signals. This is not correct for transient signals.

  • Psychoacoustics (5)

    Temporal masking:
        Pre-masking (5 ms)
        Simultaneous masking
        Post-masking (200 ms)

    Pre-masking can be used to mask pre-echoes.
    In general, temporal masking is more difficult to use.

  • Audio encoding (1)

    Simple block diagram of MPEG-1/2 encoding.

    [Figure: encoder block diagram. Blocks: Digital Audio Input, Filter Bank, Psychoacoustic Model, Bit allocation and scalefactor calculation, Bitstream Formatting. Labeled signals: Signal to Mask Ratio, Quantized Samples, Encoded Bitstream]

  • Audio encoding (2)

    The filter bank, also called subband analysis, has 32 bandpass filters derived from the same prototype.
    Let $g(n)$ be the impulse response of the prototype.
    Filter $i$ in the filter bank is obtained by

    $$h_i(n) = g(n) \cdot \cos\left(\frac{(2i+1)(n-16)\pi}{64}\right)$$

    Recall that multiplication in the time domain is convolution in the frequency domain. The center frequency is shifted according to the cosine term.
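The cosine modulation of the prototype can be sketched in a few lines of Python. The function name `make_filter_bank` is hypothetical, and the all-ones prototype below is only a placeholder assumption so the example runs; the real 512 coefficients come from the spec.

```python
import math


def make_filter_bank(g):
    """Return 32 impulse responses h[i][n] = g[n] * cos((2i+1)(n-16)pi/64)."""
    return [
        [g[n] * math.cos((2 * i + 1) * (n - 16) * math.pi / 64)
         for n in range(len(g))]
        for i in range(32)
    ]


# Placeholder prototype (NOT the spec's coefficients), used only for shape.
placeholder_g = [1.0] * 512
h = make_filter_bank(placeholder_g)
print(len(h), len(h[0]))  # 32 512
```

Each value of `i` multiplies the same lowpass prototype by a cosine of a different frequency, which moves the passband to a different center frequency.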

  • Audio encoding (3)

    The prototype filter is a 512-coefficient FIR filter. The coefficients are given in the spec.
    Since the bandwidth of each bandpass filter is 1/32 of the full bandwidth, its output samples are decimated by 32.

    [Figure: analysis filter bank. The PCM input feeds filters h0(n) through h31(n), each followed by ↓32, producing the sub-band outputs]
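Decimation by 32 means only every 32nd filter output needs to be computed. A direct (unoptimized) Python sketch under that reading; `filter_and_decimate` is a hypothetical helper, and practical encoders use a faster polyphase form instead:

```python
def filter_and_decimate(x, h, factor=32):
    """FIR-filter x with impulse response h, keeping every `factor`-th output."""
    out = []
    for m in range(0, len(x), factor):
        acc = 0.0
        for n, coeff in enumerate(h):
            if m - n >= 0:  # samples before the start of x are taken as zero
                acc += coeff * x[m - n]
        out.append(acc)
    return out


# One frame of 1152 PCM samples yields 36 samples per subband.
print(len(filter_and_decimate([0.0] * 1152, [1.0])))  # 36
```

With 32 subbands each producing 36 samples, the total number of subband samples equals the 1152 input samples.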

  • Audio encoding (4)

    The number of bits used to quantize a subband sample, based on the psychoacoustic model, is recorded in the Bit Allocation (BA). 3 × 12 = 36 samples from the same subband share one BA info. 12 samples of one subband = 1 part.
    The amplification ratio of a subband sample in encoding is called the SCaleFactor (SCF). It is shared by 12 subband samples.

  • Audio encoding (5)

    It is possible to use one, two, or three SCFs for the 36 output samples of one subband.
    This info is called SCaleFactor Selection Information (SCFSI). Four different cases are identified:
        Three different SCFs
        First two parts share one SCF
        Last two parts share one SCF
        All three parts share one SCF
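The four SCFSI cases amount to a mapping from the transmitted SCFs to the three 12-sample parts. A Python sketch of that mapping; the string keys and the `expand_scfs` helper are illustrative assumptions, and the actual 2-bit SCFSI codes for each case must be taken from the spec:

```python
# For each case (in the order listed above), the index of the transmitted
# SCF that each of the three parts uses.
SCFSI_CASES = {
    "three different SCFs": (0, 1, 2),
    "first two parts share": (0, 0, 1),
    "last two parts share": (0, 1, 1),
    "all three parts share": (0, 0, 0),
}


def expand_scfs(case, transmitted):
    """Return the SCF applied to each of the three parts of one subband."""
    mapping = SCFSI_CASES[case]
    if len(transmitted) != max(mapping) + 1:
        raise ValueError("wrong number of scalefactors for this case")
    return [transmitted[k] for k in mapping]


print(expand_scfs("first two parts share", [10, 20]))  # [10, 10, 20]
print(expand_scfs("all three parts share", [7]))       # [7, 7, 7]
```

Sharing SCFs saves bits whenever the signal level in a subband changes little across the frame.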

  • Audio encoding (6)

    1152 PCM samples per channel are packed in one audio frame.
    Subband samples in the bitstream:
        3 samples from the L channel of SB 0
        3 samples from the R channel of SB 0
        Same for SB 1 through SB 31
        Repeat the above 12 times

    Remember that the SCF scope is 12 samples from the same SB.
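The sample ordering can be written as nested loops. A Python sketch, assuming plain two-channel stereo with every subband allocated (no joint-stereo sharing and no empty subbands); the generator name and tuple layout are hypothetical:

```python
def sample_order(channels=("L", "R"), subbands=32, repeats=12):
    """Yield (repeat, subband, channel, sample_index) in bitstream order."""
    for r in range(repeats):
        for sb in range(subbands):
            for ch in channels:
                for k in range(3):
                    # sample_index counts 0..35 within one subband and channel
                    yield (r, sb, ch, 3 * r + k)


order = list(sample_order())
print(len(order))  # 2304 = 2 channels x 1152 samples
print(order[3])    # (0, 0, 'R', 0): first R sample follows three L samples
```

Because each repeat carries 3 samples per subband, four consecutive repeats cover the 12 samples that share one SCF.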

  • Audio encoding (7)

    Quantized subband samples may be grouped together into one codeword.
    The following quantization levels use grouping when packing subband samples (three samples per codeword):
        3 levels: use 5 bits
        5 levels: use 7 bits
        9 levels: use 10 bits
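The codeword sizes follow from packing three samples as one base-`nlevel` number. A Python sketch of the idea; the function names are illustrative assumptions, and the digit order is chosen to match the de-grouping loop given in the decoding section:

```python
def group(samples, nlevel):
    """Pack three quantized samples (each in 0..nlevel-1) into one codeword."""
    s0, s1, s2 = samples
    return s0 + nlevel * s1 + nlevel * nlevel * s2


def codeword_bits(nlevel):
    """Bits needed to hold the largest grouped codeword, nlevel**3 - 1."""
    return (nlevel ** 3 - 1).bit_length()


for nlevel in (3, 5, 9):
    separate = 3 * (nlevel - 1).bit_length()  # bits if coded one by one
    print(nlevel, codeword_bits(nlevel), separate)
# 3 5 6
# 5 7 9
# 9 10 12
```

Grouping saves 1, 2, and 2 bits per triplet respectively, because 3, 5, and 9 levels each waste part of a whole-bit code.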

  • Audio frame structure (1)

    The audio frame in MPEG-1/2 has the following fields: header, CRC, audio data, and ancillary data.
    Audio data has a BA field, an SCFSI field, an SCF field, and a subband sample field.
    DAB uses the ancillary field to pack F-PAD (Fixed Program Associated Data), X-PAD (Extended PAD), and SCF-CRC.

  • Audio frame structure (2)

    The CRC field in MPEG-1/2 is optional, but it is mandatory in DAB.

    [Figure: DAB audio frame layout. Header | CRC | BA | SCFSI | SCF | Subband samples | Stuffing | X-PAD | SCF-CRC | F-PAD. BA through Subband samples form the audio data; Stuffing through F-PAD form the ancillary field]

  • Audio frame structure (3)

    SCF CRC = CRC for the next frame's SCF.
    F-PAD contains the following:
        Dynamic Range Control (DRC)
        X-PAD info (no, short, long)
        Music/Speech flag
        ISRC (Int'l Standard Recording Code)

  • Audio frame structure (4)

    X-PAD has many "application types," including
        label
        data
        ITTS (Interactive Text Transmission System)
        MOT (Multimedia Object Transfer protocol)
        etc.

  • Audio frame structure (5)

    The frame header has 32 bits, arranged as follows:
    AAAAAAAA AAAABCCD EEEEFFGH IIJJKLMM
        A = sync word = 11111111 1111
        B = ID. 0 = MPEG-2, 1 = MPEG-1.
        C = layer ID. 10 = Layer 2 (the only value DAB uses).
        D = protection bit. Use 0 (with CRC) for DAB.
        E = bitrate index. From 32 kbps ~ 384 kbps.
        F = sampling frequency. 01 = 48/24 kHz.
        G = padding = 0. No padding required.
        H = private bit. Not used.

  • Audio frame structure (6)

    AAAAAAAA AAAABCCD EEEEFFGH IIJJKLMM
        I = mode. 00 = stereo, 01 = joint stereo, 10 = dual channel, 11 = mono.
        J = mode extension. Used only if mode = 01. 00 = bound @ 4, 01 = bound @ 8, 10 = bound @ 12, 11 = bound @ 16.
        K = copyright. 0 = no, 1 = yes.
        L = original/copy. 0 = copy, 1 = original.
        M = emphasis. Use 00 = no emphasis.
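The 32-bit layout AAAAAAAA AAAABCCD EEEEFFGH IIJJKLMM maps onto shifts and masks. A minimal Python sketch; the function name, dictionary keys, and the example header bytes are illustrative assumptions, and the fields are returned as raw codes without table lookups:

```python
def parse_header(header_bytes):
    """Split a 4-byte MPEG audio frame header into its raw fields."""
    value = int.from_bytes(header_bytes[:4], "big")
    if value >> 20 != 0xFFF:  # A: 12-bit sync word
        raise ValueError("sync word not found")
    return {
        "id": (value >> 19) & 0x1,                  # B: 1 = MPEG-1
        "layer": (value >> 17) & 0x3,               # C: 0b10 = Layer 2
        "protection": (value >> 16) & 0x1,          # D: 0 = CRC present
        "bitrate_index": (value >> 12) & 0xF,       # E
        "sampling_frequency": (value >> 10) & 0x3,  # F: 0b01 = 48/24 kHz
        "padding": (value >> 9) & 0x1,              # G
        "private": (value >> 8) & 0x1,              # H
        "mode": (value >> 6) & 0x3,                 # I: 0b00 = stereo
        "mode_extension": (value >> 4) & 0x3,       # J
        "copyright": (value >> 3) & 0x1,            # K
        "original": (value >> 2) & 0x1,             # L
        "emphasis": value & 0x3,                    # M
    }


# Made-up example header: MPEG-1, Layer 2, CRC on, stereo, original.
fields = parse_header(bytes([0xFF, 0xFC, 0xA4, 0x04]))
print(fields["id"], fields["layer"], fields["bitrate_index"])  # 1 2 10
```

Translating `bitrate_index` into kbps requires the bitrate table from the spec, which is left out here.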

  • Audio Decoding (1)

    Sync to the frame header by finding 1111 1111 1111.
    Parse the frame header to determine the coding mode.
    Bitrate info can also be obtained from the FIC (Fast Information Channel).
    Use the assigned BA table to parse BA.
    Parse SCFSI. There is no SCFSI for a SB whose BA field = 0.
    Parse SCF based on SCFSI and BA.
    Parse SB samples. Watch out for grouped codewords.

  • Audio Decoding (2)

    Check whether the frame CRC and SCF CRC are OK. Do error concealment if necessary.
    Re-quantize the SB samples.
    Perform subband synthesis via better algorithms than ISO's.
    Synchronize the CODEC's Fs if necessary.

  • Audio Decoding (3)

  • Audio Decoding (4)

    Brief introduction to de-grouping and re-quantizing the SB samples.

    1. De-grouping: for i = 0 to 2
           S[i] = c % nlevel
           c = c div nlevel
    2. S' = invert the MSB of S. S' is a 2's-complement fractional number.
    3. S'' = C * (S' + D). C and D are in tables of the spec.
    4. S''' = S'' * SCF
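The four steps can be sketched in Python as follows. The function names are hypothetical, and `C` and `D` are passed in as parameters because their values come from the spec's tables; the values used in the demo (C = 4/3, D = 0.5 for 3 levels) are an assumption chosen because they give symmetric output levels, and should be checked against the spec.

```python
def degroup(c, nlevel):
    """Step 1: split a grouped codeword into three quantized samples."""
    samples = []
    for _ in range(3):
        samples.append(c % nlevel)
        c //= nlevel
    return samples


def requantize(s, nlevel, C, D, scf):
    """Steps 2 to 4: MSB inversion, C * (S' + D), then scaling by the SCF."""
    bits = (nlevel - 1).bit_length()  # bits per individual sample
    msb = 1 << (bits - 1)
    flipped = s ^ msb                 # step 2: invert the MSB
    if flipped >= msb:                # read as a 2's-complement integer
        flipped -= 1 << bits
    fraction = flipped / msb          # fractional value in [-1, 1)
    return C * (fraction + D) * scf   # steps 3 and 4


samples = degroup(11, 3)
print(samples)  # [2, 0, 1]
print([round(requantize(s, 3, 4 / 3, 0.5, 1.0), 4) for s in samples])
# [0.6667, -0.6667, 0.0]
```

With these assumed constants the three quantizer codes 0, 1, 2 map to -2/3, 0, and +2/3 before scaling.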

  • Audio Decoding (5)

    Subband synthesis from ISO's spec:
    Input 32 new subband samples $S_i$, i = 0, ..., 31
    Shifting: for i = 1023 down to 64 do
        V[i] = V[i-64]
    Matrixing: for i = 0 to 63 do

    $$V[i] = \sum_{k=0}^{31} \cos\left(\frac{(16+i)(2k+1)\pi}{64}\right) \cdot S_k$$

  • Audio Decoding (6)

    Build a 512-value vector U:
        for i = 0 to 7 do
            for j = 0 to 31 do
                U[i*64+j] = V[i*128+j]
                U[i*64+32+j] = V[i*128+96+j]

    Window by the 512-coefficient vector D:
        for i = 0 to 511 do
            W[i] = U[i] * D[i]

  • Audio Decoding (7)

    Calculate 32 PCM samples:
        for j = 0 to 31 do

    $$s_j = \sum_{i=0}^{15} W[j + 32i]$$

    Output 32 reconstructed PCM samples $s_j$.
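The shifting, matrixing, windowing, and summation steps of the ISO synthesis procedure can be put together as one straightforward (slow) Python routine. The function name `synthesize` is hypothetical, and the window `D` is a parameter because its 512 coefficients come from the spec; the all-ones window in the demo is a placeholder assumption only.

```python
import math

# Matrixing coefficients cos((16 + i)(2k + 1) * pi / 64), computed once.
N = [[math.cos((16 + i) * (2 * k + 1) * math.pi / 64) for k in range(32)]
     for i in range(64)]


def synthesize(V, S, D):
    """Turn 32 subband samples S into 32 PCM samples.

    V is the 1024-value state list (updated in place); D is the window.
    """
    V[64:] = V[:960]                      # shifting: V[i] = V[i-64]
    for i in range(64):                   # matrixing
        V[i] = sum(N[i][k] * S[k] for k in range(32))
    U = [0.0] * 512                       # build U from V
    for i in range(8):
        for j in range(32):
            U[i * 64 + j] = V[i * 128 + j]
            U[i * 64 + 32 + j] = V[i * 128 + 96 + j]
    W = [U[i] * D[i] for i in range(512)]  # windowing
    return [sum(W[j + 32 * i] for i in range(16)) for j in range(32)]


V = [0.0] * 1024
pcm = synthesize(V, [1.0] + [0.0] * 31, [1.0] * 512)  # placeholder window
print(len(pcm))  # 32
```

Calling `synthesize` 36 times with the same `V` produces the 1152 PCM samples of one frame for one channel.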

  • Audio Decoding (8)

    The subband synthesis algorithm described previously is slow.
    We can avoid data shifting by using a circular queue.
    Matrixing can be more efficient if a DCT is used. Cf. IEEE SPL, vol. 1, no. 2, pp. 26–28, 1994.
    Fully optimized code can be at least five times faster!
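The circular-queue idea replaces the 960-element shift with a moving start index. A Python sketch of one way to do this; the class name `CircularV` and its interface are assumptions for illustration, not the method from the cited paper:

```python
class CircularV:
    """1024-entry synthesis buffer addressed through a moving start offset."""

    def __init__(self):
        self.buf = [0.0] * 1024
        self.start = 0  # physical index that holds logical V[0]

    def push(self, block):
        """Store 64 new values as logical V[0..63]; older data moves up by 64."""
        if len(block) != 64:
            raise ValueError("expected 64 values")
        # start stays a multiple of 64, so the slice never wraps around
        self.start = (self.start - 64) % 1024
        self.buf[self.start:self.start + 64] = block

    def __getitem__(self, i):
        return self.buf[(self.start + i) % 1024]


v = CircularV()
v.push([1.0] * 64)
v.push([2.0] * 64)
print(v[0], v[64], v[128])  # 2.0 1.0 0.0
```

Each `push` writes only 64 values, whereas the shifting loop in the ISO procedure moves 960 values per call.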

  • Conclusions

    Introduction to audio coding
    Brief intro to psychoacoustics
    Audio encoding flow
    Audio frame structure
    Audio decoding, including subband synthesis