
H.264 Digital Audio/Video Technology

Wen-Jyi Hwang (黃文吉)

Department of Computer Science and Information Engineering,

National Taiwan Normal University

Introduction

H.264 is the newest video coding standard. It is also known as MPEG-4 Part 10, or MPEG-4 AVC (Advanced Video Coding).

H.264 was developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG).

The basic goal of H.264 is to create a standard capable of supporting good video quality at lower bit rates than previous standards.

An additional goal is to provide enough flexibility so that the standard can be applied effectively to a wide range of applications with low and high bit rates, including:
– digital video broadcast (DVB),
– DVD storage,
– IPTV.

Structure of the H.264/AVC Video Coder

VCL: designed to efficiently encode the video content.

NAL: formats the VCL representation of the video and provides header information for conveyance by a variety of transport layers or storage media.

[Figure: structure of the H.264/AVC coder. The Video Coding Layer (VCL) with data partitioning passes coded macroblocks, coded slices/partitions and control data to the Network Abstraction Layer (NAL), which maps them onto transports such as H.320, H.324, MPEG-2, H.323/IP, etc.]

Basic Structure of VCL

[Figure: basic structure of the VCL. The input video signal is split into 16x16-pixel macroblocks; the encoder comprises coder control, transform/scaling/quantization, scaling & inverse transform, intra-frame prediction, motion estimation, motion compensation, a de-blocking filter and entropy coding, producing the compressed video bits; the decoder path yields the decoded video.]

Intra-frame encoding in H.264 supports Intra_4x4, Intra_16x16 and I_PCM.

I_PCM allows the encoder to send the values of the encoded samples directly.

Intra_4x4 and Intra_16x16 allow intra prediction.

Intra_16x16: 4 prediction modes, used in flat areas.

Intra_4x4: 9 prediction modes, used in textured areas.

The four modes of Intra_16x16:
– Mode 0 (Vertical): extrapolation from the upper samples (H).
– Mode 1 (Horizontal): extrapolation from the left samples (V).
– Mode 2 (DC): mean of the upper and left-hand samples (H+V).
– Mode 3 (Plane): a linear "plane" function is fitted to the upper and left-hand samples H and V. This works well in areas of smoothly varying luminance.

[Figure: the four Intra_16x16 prediction modes, 0 (Vertical), 1 (Horizontal), 2 (DC: mean of H and V) and 3 (Plane), where H denotes the row of samples above the macroblock and V the column of samples to its left.]

Example: Original image

The nine modes of Intra_4x4:
– The prediction block P is calculated based on the samples labeled A–M.
– The encoder may select, for each block, the prediction mode that minimizes the residual between P and the block to be encoded.

[Figure: a 4x4 prediction block P with samples a–p, the neighbouring previously coded samples A–M (A–H above, I–L to the left, M above-left), and the eight directional prediction modes 0, 1, 3, 4, 5, 6, 7, 8 (mode 2, DC, has no direction).]

Mode 0 (Vertical)
Pred(a, e, i, m) = Pixel(A)
Pred(b, f, j, n) = Pixel(B)
Pred(c, g, k, o) = Pixel(C)
Pred(d, h, l, p) = Pixel(D)

Mode 1 (Horizontal)
Pred(a, b, c, d) = Pixel(I)
Pred(e, f, g, h) = Pixel(J)
Pred(i, j, k, l) = Pixel(K)
Pred(m, n, o, p) = Pixel(L)

Mode 2 (DC)
Pred(a, b, c, ..., p) = Pixel(A+B+C+D+I+J+K+L)/8, i.e., the mean of A..D and I..L.

Mode 3 (Diagonal Down-Left)
Pred(a) = Pixel(A + 2*B + C + 2)/4
Pred(b,e) = Pixel(B + 2*C + D + 2)/4
Pred(c,f,i) = Pixel(C + 2*D + E + 2)/4
Pred(d,g,j,m) = Pixel(D + 2*E + F + 2)/4
Pred(h,k,n) = Pixel(E + 2*F + G + 2)/4
Pred(l,o) = Pixel(F + 2*G + H + 2)/4
Pred(p) = Pixel(G + 3*H + 2)/4


Mode 4 (Diagonal Down-Right)
Pred(a,f,k,p) = Pixel(I + 2*M + A + 2)/4
Pred(b,g,l) = Pixel(M + 2*A + B + 2)/4
Pred(c,h) = Pixel(A + 2*B + C + 2)/4
Pred(d) = Pixel(B + 2*C + D + 2)/4
Pred(e,j,o) = Pixel(M + 2*I + J + 2)/4
Pred(i,n) = Pixel(I + 2*J + K + 2)/4
Pred(m) = Pixel(J + 3*K + L + 2)/4

Mode 5 (Vertical-Right)
Pred(a,j) = Pixel(M + A + 1)/2
Pred(b,k) = Pixel(A + B + 1)/2
Pred(c,l) = Pixel(B + C + 1)/2
Pred(d) = Pixel(C + D + 1)/2
Pred(e,n) = Pixel(I + 2*M + A + 2)/4
Pred(f,o) = Pixel(M + 2*A + B + 2)/4
Pred(g,p) = Pixel(A + 2*B + C + 2)/4
Pred(h) = Pixel(B + 2*C + D + 2)/4
Pred(i) = Pixel(M + 2*I + J + 2)/4
Pred(m) = Pixel(I + 2*J + K + 2)/4

Mode 6 (Horizontal-Down)
Pred(a,g) = Pixel(M + A + 1)/2
Pred(e,k) = Pixel(I + J + 1)/2
Pred(i,o) = Pixel(J + K + 1)/2
Pred(m) = Pixel(K + L + 1)/2
Pred(b,h) = Pixel(I + 2*M + A + 2)/4
Pred(c) = Pixel(M + 2*A + B + 2)/4
Pred(d) = Pixel(A + 2*B + C + 2)/4
Pred(f,l) = Pixel(M + 2*I + J + 2)/4
Pred(j,p) = Pixel(I + 2*J + K + 2)/4
Pred(n) = Pixel(J + 2*K + L + 2)/4

Mode 7 (Vertical-Left)
Pred(a) = Pixel(A + B + 1)/2
Pred(b,i) = Pixel(B + C + 1)/2
Pred(c,j) = Pixel(C + D + 1)/2
Pred(d,k) = Pixel(D + E + 1)/2
Pred(l) = Pixel(E + F + 1)/2
Pred(e) = Pixel(A + 2*B + C + 2)/4
Pred(f,m) = Pixel(B + 2*C + D + 2)/4
Pred(g,n) = Pixel(C + 2*D + E + 2)/4
Pred(h,o) = Pixel(D + 2*E + F + 2)/4
Pred(p) = Pixel(E + 2*F + G + 2)/4

Mode 8 (Horizontal-Up)
Pred(a) = Pixel(I + J + 1)/2
Pred(c,e) = Pixel(J + K + 1)/2
Pred(g,i) = Pixel(K + L + 1)/2
Pred(b) = Pixel(I + 2*J + K + 2)/4
Pred(d,f) = Pixel(J + 2*K + L + 2)/4
Pred(h,j) = Pixel(K + 3*L + 2)/4
Pred(k,m) = Pred(l,n) = Pred(o) = Pred(p) = Pixel(L)
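As an illustration (not from the slides), the following Python sketch implements three of the nine Intra_4x4 predictors and picks the mode with the smallest sum of absolute differences (SAD); the array layout, sample values and function names are assumptions made for this example.

```python
import numpy as np

def intra4x4_vertical(A):
    # Mode 0: every row is a copy of the samples above the block (A, B, C, D)
    return np.tile(A, (4, 1))

def intra4x4_horizontal(I):
    # Mode 1: every column is a copy of the samples left of the block (I, J, K, L)
    return np.tile(I.reshape(4, 1), (1, 4))

def intra4x4_dc(A, I):
    # Mode 2: all samples predicted from the mean of A..D and I..L (slide formula)
    return np.full((4, 4), (A.sum() + I.sum()) // 8)

def best_mode(block, A, I):
    # Pick the predictor with the smallest SAD against the block to be coded
    candidates = {0: intra4x4_vertical(A),
                  1: intra4x4_horizontal(I),
                  2: intra4x4_dc(A, I)}
    sad = {m: int(np.abs(block - p).sum()) for m, p in candidates.items()}
    mode = min(sad, key=sad.get)
    return mode, candidates[mode], sad[mode]

# Illustrative sample values (not from the slides)
block = np.array([[74, 75, 76, 76],
                  [73, 74, 75, 76],
                  [72, 73, 74, 75],
                  [71, 72, 73, 74]])
A = np.array([75, 76, 77, 78])   # samples above: A, B, C, D
I = np.array([73, 72, 71, 70])   # samples to the left: I, J, K, L
mode, pred, cost = best_mode(block, A, I)
print(mode, cost)
```

A full encoder would evaluate all nine predictors in the same way and code the residual between the block and the winning prediction.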


Motion Estimation/Compensation


Features of H.264 motion estimation:
– various block sizes
– ¼-sample accuracy
  • 6-tap filtering to ½-sample accuracy
  • simplified filtering to ¼-sample accuracy
– multiple reference pictures
– generalized B-frames

Variable Block Size Block-Matching

– In H.264, a video frame is first split into fixed-size macroblocks.

– Each macroblock may then be divided into subblocks of different sizes.

– A macroblock has a dimension of 16x16 pixels; the smallest subblock size is 4x4.

[Figure: macroblock partition types 16x16, 16x8, 8x16 and 8x8; when the 8x8 type is chosen, each 8x8 partition may be further split into 8x8, 8x4, 4x8 or 4x4 sub-partitions.]

Example:

This example shows the effectiveness of block matching operations with smaller sizes.

[Figure: Frame 1 and Frame 2; the difference between Frame 1 and Frame 2; Frame 2 predicted by block matching with sizes 16x16, 8x8 and 4x4; and the differences between Frame 2 and each prediction.]

To use a subblock smaller than 8x8, it is necessary to first split the macroblock into four 8x8 subblocks.


Encoding a motion vector for each subblock can cost a significant number of bits, especially if small block sizes are chosen.

Motion vectors for neighboring subblocks are often highly correlated. Therefore, each motion vector can be effectively predicted from vectors of nearby, previously coded subblocks.

The difference between the motion vector of the current block and its prediction is encoded and transmitted.

The method of forming the prediction depends on the block size and on the availability of nearby vectors.

Let E be the current block, let A be the subblock immediately to the left of E, let B be the subblock immediately above E, and let C be the subblock above and to the right of E.

It is not necessary that A, B, C, and E have the same size.

[Figure: current block E with neighbouring blocks A (left), B (above), C (above-right) and D (above-left).]

There are two modes for the prediction of motion vectors:

• Median prediction: used for all block sizes excluding 16x8 and 8x16.

• Directional segmentation prediction: used for 16x8 and 8x16.


Median prediction:
If C does not exist, then C = D.
If B and C do not exist, then prediction = VA.
If A and C do not exist, then prediction = VB.
If A and B do not exist, then prediction = VC.
Otherwise, prediction = median(VA, VB, VC).

Directional segmentation prediction:

• Vector block size 8x16: left partition: prediction = VA; right partition: prediction = VC.

• Vector block size 16x8: upper partition: prediction = VB; lower partition: prediction = VA.
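As a rough illustration of the median rule above, here is a small Python sketch; the argument names and the treatment of remaining unavailable blocks as (0, 0) are assumptions made for the example.

```python
def median_mv(VA, VB, VC, VD):
    """Median motion-vector prediction. VA, VB, VC, VD are (x, y) motion
    vectors of the neighbouring blocks A, B, C, D, or None if unavailable."""
    if VC is None:                      # if C does not exist, use D in its place
        VC = VD
    if VB is None and VC is None:
        return VA
    if VA is None and VC is None:
        return VB
    if VA is None and VB is None:
        return VC
    # Remaining unavailable vectors are treated here as (0, 0) -- a simplification.
    VA, VB, VC = [v if v is not None else (0, 0) for v in (VA, VB, VC)]
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(VA[0], VB[0], VC[0]), med(VA[1], VB[1], VC[1]))

# Only the motion-vector difference is transmitted:
# mvd = (mv[0] - pred[0], mv[1] - pred[1])  with  pred = median_mv(VA, VB, VC, VD)
```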

Fractional Motion Estimation

In H.264, the motion vectors between the current block and the candidate block have ¼-pel resolution.

The samples at sub-pel positions do not exist in the reference frame and so it is necessary to create them using interpolation from nearby image samples.

[Figure: integer-sample positions (upper-case letters such as E, F, G, H, A, C, M, R, T) and fractional-sample positions (lower-case labels such as aa, bb, b, s, gg, hh, cc, dd, h, m, ee, ff and j) used in the interpolation formulas below.]

Interpolation of ½-pel samples:

b = round((E - 5F + 20G + 20H - 5I + J)/32)
h = round((A - 5C + 20G + 20M - 5R + T)/32)
j = round((aa - 5bb + 20b + 20s - 5gg + hh)/32)

Interpolation of ¼-pel samples:

a = round((G + b)/2)
d = round((G + h)/2)
e = round((b + h)/2)
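A small Python sketch of these filters ("+16 >> 5" is the integer equivalent of round(·/32) used in the formulas above; clipping to the sample range and the extra precision used for position j in the standard are omitted).

```python
def half_pel(p1, p2, p3, p4, p5, p6):
    # 6-tap filter with weights (1, -5, 20, 20, -5, 1), divided by 32 with rounding
    return (p1 - 5 * p2 + 20 * p3 + 20 * p4 - 5 * p5 + p6 + 16) >> 5

def quarter_pel(x, y):
    # Quarter-pel samples are rounded averages of two neighbouring samples,
    # e.g. a = (G + b + 1) >> 1, d = (G + h + 1) >> 1, e = (b + h + 1) >> 1
    return (x + y + 1) >> 1

# Position letters as in the figure above:
# b = half_pel(E, F, G, H, I, J)    horizontal half-pel between G and H
# h = half_pel(A, C, G, M, R, T)    vertical half-pel between G and M
# a = quarter_pel(G, b)             quarter-pel between G and b
```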

Multiple Reference Frames

[Figure: motivation for multiple reference frames. With a single reference, maximum compression cannot be achieved because periodic motion cannot be recognized.]

Motion estimation based on multiple reference frames provides opportunities for more precise inter-prediction, and also improves robustness to lost picture data.

The drawback of multiple reference frames is that both the encoder and decoder have to store the reference frames used for Inter-frame prediction in a multi-frame buffer.

[Figure: PSNR-Y [dB] (29–34) versus bit rate R [bit/s] (500k–1000k) for the Garden sequence (CIF, 30 fps), comparing 1 previous reference frame with 5 previous reference frames; the annotated gap is about 10%.]

Generalized B Frames

Basic B-frames: The basic B-frames cannot be used as reference frames.

Generalized B-frames: The generalized B-frames can be used as reference frames.

[Figure: the same Garden (CIF, 30 fps) rate-distortion comparison with a curve for classic B pictures added; the annotated gap is about 30%.]

[Figure: the same comparison including generalized B pictures; the annotated gap is about 35%.]

Weighted Prediction

Reference frames can be weighted for motion compensation.

There are 3 types of weighted prediction:
– P-frame, explicit weighted (weighting transmitted),
– B-frame, explicit weighted,
– B-frame, implicit weighted (weighting predicted at the decoder based on the temporal distance).

The weighted prediction scheme may be effective for frames with fade transition.

Transformation/Quantization


Transformation

The DCT operates on y, a block of NxN samples, and creates Y, an NxN block of coefficients.

The forward DCT is

Y = A y A^T

and the inverse DCT is therefore

y = A^T Y A

The DCT matrix A is orthogonal, that is, A A^T = I.

The elements of A are

A(u,m) = C_u cos( (2m+1)u*pi / 2N ),   u, m = 0, 1, ..., N-1

where

C_u = sqrt(1/N) for u = 0,   C_u = sqrt(2/N) for u > 0.

That is,

Y(u,v) = C_u C_v * sum_{m=0}^{N-1} sum_{n=0}^{N-1} y(m,n) cos( (2m+1)u*pi / 2N ) cos( (2n+1)v*pi / 2N )

y(m,n) = sum_{u=0}^{N-1} sum_{v=0}^{N-1} C_u C_v Y(u,v) cos( (2m+1)u*pi / 2N ) cos( (2n+1)v*pi / 2N )

Example:

The transform matrix A for a 4×4 DCT is:

A = | (1/2)cos(0)          (1/2)cos(0)           (1/2)cos(0)            (1/2)cos(0)           |
    | sqrt(1/2)cos(pi/8)   sqrt(1/2)cos(3pi/8)   sqrt(1/2)cos(5pi/8)    sqrt(1/2)cos(7pi/8)   |
    | sqrt(1/2)cos(2pi/8)  sqrt(1/2)cos(6pi/8)   sqrt(1/2)cos(10pi/8)   sqrt(1/2)cos(14pi/8)  |
    | sqrt(1/2)cos(3pi/8)  sqrt(1/2)cos(9pi/8)   sqrt(1/2)cos(15pi/8)   sqrt(1/2)cos(21pi/8)  |

That is,

A = |  a   a   a   a |
    |  b   c  -c  -b |
    |  a  -a  -a   a |
    |  c  -b   b  -c |

where a = 1/2, b = sqrt(1/2)cos(pi/8), c = sqrt(1/2)cos(3pi/8).

The H.264 transform is based on the 4x4 DCT, with the following simplifications:

1. The transform is an integer transform.
2. The core part of the transform can be realized using only shifts and additions.
3. A scaling multiplication is integrated into the quantizer, reducing the total number of multiplications.

Recall that

Y = A y A^T,   where   A = |  a   a   a   a |
                           |  b   c  -c  -b |
                           |  a  -a  -a   a |
                           |  c  -b   b  -c |

and a = 1/2, b = sqrt(1/2)cos(pi/8), c = sqrt(1/2)cos(3pi/8).

We can rewrite Y as (post-scaling)

Y = A y A^T = (C y C^T) ⊗ E

where

C = | 1   1   1   1 |        E = | a^2   ab    a^2   ab  |
    | 1   d  -d  -1 |            | ab    b^2   ab    b^2 |
    | 1  -1  -1   1 |            | a^2   ab    a^2   ab  |
    | d  -1   1  -d |            | ab    b^2   ab    b^2 |

and d = c/b.

1. We call (C y C^T) the core 2-D transform.
2. The matrix E is a matrix of scaling factors.
3. The symbol ⊗ indicates that each element of (C y C^T) is multiplied by the scaling factor in the same position in matrix E (i.e., ⊗ is element-by-element scalar multiplication rather than matrix multiplication).

To simplify the implementation of the transform, d is approximated by 0.5.

In order to ensure that the transform remains orthogonal, b also needs to be modified so that:

a = 1/2,   b = sqrt(2/5),   d = 1/2

The final forward transform becomes

Y = (C_f y C_f^T) ⊗ E_f

where

C_f = | 1   1   1   1 |        E_f = | a^2     ab/2    a^2     ab/2  |
      | 2   1  -1  -2 |              | ab/2    b^2/4   ab/2    b^2/4 |
      | 1  -1  -1   1 |              | a^2     ab/2    a^2     ab/2  |
      | 1  -2   2  -1 |              | ab/2    b^2/4   ab/2    b^2/4 |

Note that the modified core transform C_f y C_f^T involves only shifts and additions.
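For illustration, the core transform can also be evaluated directly as the matrix product C_f y C_f^T; the NumPy sketch below does exactly that (a real encoder would use the equivalent add/shift butterfly).

```python
import numpy as np

# Core forward transform matrix C_f: only the values +/-1 and +/-2 appear,
# so C_f y C_f^T needs nothing but additions and shifts.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core_transform(y):
    # W = C_f y C_f^T: unscaled coefficients; the scaling factors of E_f
    # are folded into the quantizer described in the next part.
    return Cf @ np.asarray(y) @ Cf.T
```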

The inverse transform (pre-scaling followed by the core inverse transform) is given by

y = C_i^T (Y ⊗ E_i) C_i

where

C_i = | 1     1     1     1   |        E_i = | a^2   ab    a^2   ab  |
      | 1    1/2  -1/2   -1   |              | ab    b^2   ab    b^2 |
      | 1    -1    -1     1   |              | a^2   ab    a^2   ab  |
      | 1/2  -1     1   -1/2  |              | ab    b^2   ab    b^2 |

Quantization

H.264 uses scalar quantization.

The quantization should satisfy the following requirements:

(a) avoid division and/or floating-point arithmetic;
(b) incorporate the post- and pre-scaling matrices E_f and E_i.

The basic forward quantizer operation is

Z(u,v)= round( Y(u,v)/QStep )

where Y(u,v) is a transform coefficient, Z(u,v) is a quantized coefficient, and QStep is a quantizer step size.

There are 52 quantizer step sizes, indexed by the Quantization Parameter (QP = 0–51).

An increase of 1 in QP increases QStep by approximately 12%.

An increase of 6 in QP increases QStep by a factor of 2.

QP      0      1       2       3      4    5      6     7      8      9     10   11    12   ...
QStep   0.625  0.6875  0.8125  0.875  1    1.125  1.25  1.375  1.625  1.75  2    2.25  2.5  ...

QP      ...  18  ...  24  ...  30  ...  36  ...  42  ...  48   ...  51
QStep   ...  5   ...  10  ...  20  ...  40  ...  80  ...  160  ...  224
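Because QStep doubles for every increase of 6 in QP, the whole table can be generated from its first six entries; a small sketch:

```python
# QStep values for QP = 0..5; each increase of 6 in QP doubles QStep.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

# qstep(4) == 1.0, qstep(10) == 2.0, qstep(24) == 10.0, qstep(51) == 224.0
```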

•Post-Scaling

The post-scaling factor PF (i.e., a^2, ab/2 or b^2/4) is incorporated into the forward quantizer in the following way:

1. The input block y is transformed to give a block of unscaled coefficients W = C_f y C_f^T.
2. Then each coefficient in W is quantized and scaled in a single operation:

Z(u,v) = round( W(u,v) × PF / QStep )

where PF depends on the position (u,v):

Position                          PF
(0,0), (2,0), (0,2) or (2,2)      a^2
(1,1), (1,3), (3,1) or (3,3)      b^2/4
Others                            ab/2

In order to simplify the arithmetic, the factor (PF/QStep) is implemented as a multiplication by a factor MF and a right shift, avoiding any division operations.

Z(u,v) = round( W(u,v) × MF / 2^qbits )

where

MF / 2^qbits = PF / QStep

and qbits = 15 + floor(QP/6).

Example:

Suppose QP = 4 and location (u,v) = (0,0). Then QStep = 1.0, PF = a^2 = 0.25, and qbits = 15.

From MF / 2^qbits = PF / QStep, we have MF = 0.25 × 2^15 = 8192.

The MF values for QP = 0–5 are shown in Table_for_MF below.

For QP > 5, the MF factors remain unchanged (the table repeats with period 6), but qbits increases by 1 for each increment of 6 in QP. That is, qbits = 16 for 6 ≤ QP ≤ 11, qbits = 17 for 12 ≤ QP ≤ 17, and so on.

Table_for_MF

QP   Positions (0,0),(2,0),(0,2),(2,2)   Positions (1,1),(1,3),(3,1),(3,3)   Other positions
0    13107                               5243                                8066
1    11916                               4660                                7490
2    10082                               4194                                6554
3    9362                                3647                                5825
4    8192                                3355                                5243
5    7282                                2893                                4559

Example:

Suppose QP=10.

Find MF value in the positions (0,0), (2,0) and (3,1) using Table_for_MF.

Sol.

The MF value for QP=10 is the same as that for QP=4.

Using Table_for_MF, we have
MF = 8192 for the coefficient at location (0,0),
MF = 8192 for the coefficient at location (2,0), and
MF = 3355 for the coefficient at location (3,1).
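A sketch of the forward quantizer using Table_for_MF (the position grouping follows the table; adding half of 2^qbits before the shift is one common way of implementing round()):

```python
# MF values for QP mod 6 = 0..5, grouped by coefficient position as in Table_for_MF
MF_TABLE = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
            ( 9362, 3647, 5825), ( 8192, 3355, 5243), ( 7282, 2893, 4559)]

def mf(qp, u, v):
    row = MF_TABLE[qp % 6]
    if (u, v) in {(0, 0), (2, 0), (0, 2), (2, 2)}:
        return row[0]
    if (u, v) in {(1, 1), (1, 3), (3, 1), (3, 3)}:
        return row[1]
    return row[2]

def quantize(W, qp):
    """Z(u,v) = round(W(u,v) * MF / 2^qbits), computed with integer shifts."""
    qbits = 15 + qp // 6
    Z = [[0] * 4 for _ in range(4)]
    for u in range(4):
        for v in range(4):
            mag = (abs(W[u][v]) * mf(qp, u, v) + (1 << (qbits - 1))) >> qbits
            Z[u][v] = mag if W[u][v] >= 0 else -mag
    return Z
```

For instance, a coefficient W(0,0) = 102 at QP = 10 gives round(102 × 8192 / 2^16) = 13, as in the worked example later in this section.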

•Pre-Scaling

The de-quantized coefficient is given by

Y'(u,v) = Z(u,v) × QStep

The inverse transform involving pre-scaling proceeds in the following way:

1. The dequantized block is pre-scaled to a block W' for the core 2-D inverse transform:

W'(u,v) = Z(u,v) × QStep × PF × 64

The pre-scaling factor PF (i.e., a^2, ab or b^2) is incorporated in the computation of W', together with a constant scaling factor of 64 to avoid rounding errors.

2. The reconstructed block is then given by

y' = C_i^T W' C_i

The values at the output of the inverse transform should be divided by 64 to remove the constant scaling factor.

The H.264 standard does not specify QStep or PF directly. Instead, the parameter V = QStep × PF × 64 is defined.

The V values for QP = 0–5 are shown in Table_for_V below.

Table_for_V

QP   Positions (0,0),(2,0),(0,2),(2,2)   Positions (1,1),(1,3),(3,1),(3,3)   Other positions
0    10                                  16                                  13
1    11                                  18                                  14
2    13                                  20                                  16
3    14                                  23                                  18
4    16                                  25                                  20
5    18                                  29                                  23

For QP > 5, the V value increases by a factor of 2 for each increment of 6 in QP. That is,

W'(u,v) = Z(u,v) × V(u,v)

where

V(u,v) = Table_for_V[QP mod 6](u,v) × 2^floor(QP/6)

The Complete Transformation, Quantization, Rescaling and Inverse Transformation

Encoding:
1. Input 4x4 block: y
2. Forward core transform: W = C_f y C_f^T
3. Post-scaling and quantization: Z(u,v) = round( W(u,v) × MF / 2^qbits )

Decoding:
1. Pre-scaling: W'(u,v) = Z(u,v) × V(u,v)
2. Inverse core transform: y' = C_i^T W' C_i
3. Re-scaling: y = y'/64

Example:

1. Suppose QP = 10, and the input block is

y = |  6   5   4   4 |
    | 12   6   6   3 |
    |  5   5   4  12 |
    |  8   8   8   6 |

2. Forward core transform:

W = | 102   14   10    2 |
    | -21   33    1   14 |
    |  -4    4  -12    2 |
    | -13  -61   13  -38 |

3. Because QP = 10, we have MF = 8192, 3355 or 5243 (depending on position) and qbits = 16:

Z = | 13   1   1   0 |
    | -2   2   0   1 |
    | -1   0  -2   0 |
    | -1  -3   1  -2 |

4. V = 32, 50 or 40, because 2^floor(QP/6) = 2:

W' = | 416    40    32     0 |
     | -80   100     0    50 |
     | -32     0   -64     0 |
     | -40  -150    40  -100 |

5. The output of the inverse core transform, after division by 64, is

y' = |  5   5   4   3 |
     | 13   6   6   3 |
     |  6   6   4  13 |
     |  7   8   8   7 |
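Under the matrix definitions above, the decoding half of this example can be sketched as follows (NumPy; the helper names are illustrative, and the result may differ from the slide by ±1 depending on the rounding convention).

```python
import numpy as np

# Inverse core transform matrix C_i
Ci = np.array([[1.0,  1.0,  1.0,  1.0],
               [1.0,  0.5, -0.5, -1.0],
               [1.0, -1.0, -1.0,  1.0],
               [0.5, -1.0,  1.0, -0.5]])

# V values for QP mod 6 = 0..5, grouped by position as in Table_for_V
V_TABLE = [(10, 16, 13), (11, 18, 14), (13, 20, 16),
           (14, 23, 18), (16, 25, 20), (18, 29, 23)]

def v_factor(qp, u, k):
    row = V_TABLE[qp % 6]
    if (u, k) in {(0, 0), (2, 0), (0, 2), (2, 2)}:
        base = row[0]
    elif (u, k) in {(1, 1), (1, 3), (3, 1), (3, 3)}:
        base = row[1]
    else:
        base = row[2]
    return base << (qp // 6)            # V doubles for every increase of 6 in QP

def rescale_and_inverse(Z, qp):
    W = np.array([[Z[u][k] * v_factor(qp, u, k) for k in range(4)] for u in range(4)])
    y = Ci.T @ W @ Ci                   # inverse core transform
    return np.rint(y / 64).astype(int)  # remove the constant scaling factor of 64
```

Calling rescale_and_inverse with the Z block and QP = 10 from the example above reproduces (up to this rounding) the reconstructed block y'.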

Entropy Coding


Here we present two basic variable length coding (VLC) techniques used by H.264: the Exp-Golomb code and context adaptive VLC (CAVLC).

Exp-Golomb code is used universally for all symbols except for transform coefficients.

CAVLC is used for coding of transform coefficients.

• No end-of-block, but the number of coefficients is decoded.
• Coefficients are scanned backward.
• Contexts are built dependent on transform coefficients.

Exp-Golomb codes are variable length codes with a regular construction.

First 9 codewords of Exp-Golomb codes

Exp-Golomb code

Code_num Codeword

0 1

1 010

2 011

3 00100

4 00101

5 00110

6 00111

7 0001000

8 0001001

… …

Each codeword of Exp-Golomb codes is constructed as follows:

[M zeros][1][INFO]

where INFO is an M-bit field carrying information.

Therefore, the length of a codeword is 2M+1.

Given a code_num, the corresponding Exp-Golomb codeword can be obtained by the following procedure:

(a) M = floor( log2(code_num + 1) )
(b) INFO = code_num + 1 - 2^M

Example:

code_num = 6
M = floor( log2(6 + 1) ) = 2
INFO = 6 + 1 - 2^2 = 3

The corresponding Exp-Golomb codeword = [M zeros][1][INFO] = 00111

Given a Exp-Golomb codeword, its code_num can be found as follows:

(a) Read in M leading zeros followed by a 1.
(b) Read the M-bit INFO field.
(c) code_num = 2^M + INFO - 1

Example:

Exp-Golomb codeword = 00111
(a) M = 2
(b) INFO = 3
(c) code_num = 2^2 + 3 - 1 = 6
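A compact Python sketch of both procedures (bit strings are used for readability):

```python
def exp_golomb_encode(code_num):
    # [M zeros][1][INFO]: code_num + 1 written in binary is "1" followed by INFO
    M = (code_num + 1).bit_length() - 1          # M = floor(log2(code_num + 1))
    return "0" * M + format(code_num + 1, "b")

def exp_golomb_decode(bits):
    M = bits.index("1")                          # count the leading zeros
    info = int(bits[M + 1:2 * M + 1] or "0", 2)  # read the M-bit INFO field
    return (1 << M) + info - 1                   # code_num = 2^M + INFO - 1

# exp_golomb_encode(6) == '00111', exp_golomb_decode('00111') == 6
```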

A parameter v to be encoded is mapped to code_num in one of 3 ways:

ue(v) : Unsigned direct mapping, code_num=v. (Mainly used for macroblock type and reference frame index)

se(v): Signed mapping. v is mapped to code_num as follows.

code_num = 2|v|    (v ≤ 0)
code_num = 2v - 1  (v > 0)

(Mainly used for motion vector difference and delta QP)

me(v): Mapped symbols. Parameter v is mapped to code_num according to a table specified in the standard.

This mapping is used for coded_block_pattern parameters. An example of such a mapping is shown below.

Coded_block_pattern (Inter prediction) code_num

0 (no non-zero blocks) 0

16 (chroma DC block non-zero) 1

1 (top-left 8x8 luma block non-zero) 2

2 (top-right 8x8 luma block non-zero) 3

CAVLC

This is the method used to encode residual, zig-zag-ordered blocks of 4x4 transform coefficients.

The CAVLC is designed to take advantage of several characteristics of quantized 4×4 blocks:

• After prediction, transformation and quantization, blocks are usually sparse (containing many zeros).

• The highest-frequency non-zero coefficients after the zig-zag ordering are often sequences of +/-1.

• The number of non-zero coefficients in adjacent blocks is correlated.

• The level (magnitude) of non-zero coefficients tends to be higher at the start of the zig-zag scan, and lower towards the high frequencies.

The procedure described below is based on the document entitled

JVT Document JVT-C028, Gisle Bjøntegaard and Karl Lillevold, “Context-adaptive VLC (CVLC) coding of coefficients,” Fairfax, VA, May 2002.

The H.264 CAVLC is an extension of this work.

The CAVLC encoding of a block of transform coefficients proceeds as follows.

1. Encode the number of coefficients and trailing ones.
2. Encode the sign of each trailing one.
3. Encode the levels of the remaining non-zero coefficients.
4. Encode the total number of zeros before the last coefficient.
5. Encode each run of zeros.

•Encode the number of coefficients and trailing ones

The first step is to encode the number of non-zero coefficients (NumCoeff) and the number of trailing ones (T1s).

NumCoeff ranges from 0 (no coefficients in the block) to 16 (16 non-zero coefficients).

The number of T1s ranges from 0 (no T1) to 3 (three or more T1s). If there are more than 3 T1s, only the last three are treated as special cases; the others are coded as normal coefficients.

Example:

Consider the 4×4 block shown below

-2 4 0 -1

3 0 0 0

0 0 1 0

-1 1 0 0

Here NumCoeff = 7 (i.e., -2, 4, 3, -1, -1, 1, 1) and the number of T1s = 3.
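A small sketch that gathers these statistics from a 4x4 block (zig-zag scan order and the cap of three trailing ones as described above; TotZeros, used in step 4, is computed here as well):

```python
ZIGZAG = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
          (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]

def cavlc_stats(block):
    scan = [block[r][c] for r, c in ZIGZAG]
    coeffs = [x for x in scan if x != 0]
    num_coeff = len(coeffs)
    # Trailing ones: +/-1 values at the end of the coefficient list, at most 3
    t1s = 0
    for x in reversed(coeffs):
        if abs(x) == 1 and t1s < 3:
            t1s += 1
        else:
            break
    # Total zeros before the last non-zero coefficient
    last = max((i for i, x in enumerate(scan) if x != 0), default=-1)
    tot_zeros = sum(1 for x in scan[:last] if x == 0)
    return num_coeff, t1s, tot_zeros

block = [[-2, 4, 0, -1],
         [ 3, 0, 0,  0],
         [ 0, 0, 1,  0],
         [-1, 1, 0,  0]]
print(cavlc_stats(block))   # -> (7, 3, 5) for the example block above
```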

Three tables can be used for the encoding of Num_Coeff and T1: Num-VLC0, Num-VLC1 and Num-VLC2.

Num-VLC0

[Table: Num-VLC0 codewords, indexed by NumCoeff (0–16) and the number of T1s (0–3); NumCoeff = 0 is coded with the single bit 1, and the codewords grow longer as NumCoeff increases.]

Num-VLC1

NumCoeff\T1s   0   1   2   3

0 11 - - -

1 000011 011 - -

2 000010 00011 010 -

3 001001 001000 001010 101

4 1000001 001011 100101 0011

5 00000111 1000000 1000010 00010

6 00000110 1000011 1001101 10001

7 000001001 10011101 10011100 100100

8 000001000 000001011 000000101 1001100

9 0000000111 000001010 000000100 10011111

10 0000000110 0000001101 0000001100 10011110

11 00000000101 00000000111 00000001001 000000111

12 00000000100 00000000110 00000001000 0000000101

13 000000000011 000000000010 000000000100 000000000111

14 0000000000011 000000000101 0000000000010 0000000001101

15 00000000000001 00000000000000 000000000000111 0000000001100

16 000000000000101 000000000000100 0000000000001101 0000000000001100

Num-VLC2

NumCoeff\T1s   0   1   2   3
0    0011           -              -              -
1    0000011        0010           -              -
2    0000010        101110         1101           -
3    000011         101001         010110         1100
4    000010         101000         010001         1111
5    101101         101011         010000         1110
6    101100         101010         010011         1001
7    101111         010101         010010         1000
8    0110101        010100         011101         00011
9    0110100        010111         011100         00010
10   0110111        0110110        0110000        011111
11   01111001       0110001        01111010       0110011
12   01111000       01111011       01100101       01100100
13   000000011      000000010      000000100      000000111
14   0000000011     000000101      0000001101     0000001100
15   0000000010     00000000011    00000000010    00000000001
16   0000000000001  000000000001   00000000000001 00000000000000

The selection of tables (Num-VLC0, Num-VLC1 and Num-VLC2) depends on the number of non-zero coefficients in upper and left-hand previously coded blocks NU and NL. A parameter N is then computed as follows:

If blocks U and L are available (i.e., in the same coded slice), N = (NU + NL)/2.
If only block U is available, N = NU.
If only block L is available, N = NL.
If neither is available, N = 0.

The selection of table is based on N in the following way:

N Selected Table

0,1 Num-VLC0

2,3 Num-VLC1

4,5,6,7 Num-VLC2

8 or above FLC

The FLC is of the following form:

xxxxyy (i.e., 6 bits)

where xxxx and yy represent Num_Coeff and T1,respectively.

•Encode the sign of each trailing one

For each T1, a single bit encodes the sign (0 = +, 1 = -).

The T1s are encoded backward, beginning with the highest-frequency (i.e., last) T1.

•Encode the levels of the remaining non-zero coefficients

The level (sign and magnitude) of each remaining non-zero coefficient in the block is encoded in reverse order.

There are 5 VLC tables to choose from, Lev-VLC0 to Lev-VLC4.

Lev-VLC0 is biased towards lower magnitudes; Lev-VLC1 is biased towards slightly higher magnitudes, and so on.

The five level VLC tables have the following structure (x denotes a suffix bit):

Lev-VLC0:  1;  01;  001;  ...;  0000000000000 1;  00000000000000 1 xxxx;  000000000000000 1 xxxxxxxxxxxx
Lev-VLC1:  1 x;  01 x;  001 x;  ...;  0000000000000 1 x;  00000000000000 1 xxxx;  000000000000000 1 xxxxxxxxxxxx
Lev-VLC2:  1 xx;  01 xx;  001 xx;  ...;  0000000000000 1 xx;  00000000000000 1 xxxx;  000000000000000 1 xxxxxxxxxxxx
Lev-VLC3:  1 xxx;  01 xxx;  001 xxx;  ...;  0000000000000 1 xxx;  00000000000000 1 xxxx;  000000000000000 1 xxxxxxxxxxxx
Lev-VLC4:  1 xxxx;  01 xxxx;  001 xxxx;  ...;  00000000000000 1 xxxx;  000000000000000 1 xxxxxxxxxxxx

The Level' column below is used only when it is impossible for a coefficient to have the value +/-1; this happens when T1s < 3.

Lev-VLC0

Code no.  Code                          Level (+/-1, +/-2, ...)   Level' (+/-2, +/-3, ...)
0         1                             1                         2
1         01                            -1                        -2
2         001                           2                         3
3         0001                          -2                        -3
4         00001                         3                         4
...       ...                           ...                       ...
13        00000000000001                -7                        -8
14        000000000000001xxxx           +/-8 to +/-15             +/-9 to +/-16
15        000000000000001xxxxxxxxxxxx   +/-16 and beyond          +/-17 and beyond

Lev-VLC1

Code no.  Code                          Level (+/-1, +/-2, ...)   Level' (+/-2, +/-3, ...)
0         10                            1                         2
1         11                            -1                        -2
2         010                           2                         3
3         011                           -2                        -3
4         0010                          3                         4
5         0011                          -3                        -4
...       ...                           ...                       ...
          000000000000010               14                        15
          000000000000011               -14                       -15
          000000000000001xxxx           +/-15 to +/-22            +/-16 to +/-23
          000000000000001xxxxxxxxxxxx   +/-23 and beyond          +/-24 and beyond

Lev-VLC2

Code no.  Code                          Level (+/-1, +/-2, ...)
0         100                           1
1         101                           -1
2         110                           2
3         111                           -2
4         0100                          3
5         0101                          -3
6         0110                          4
7         0111                          -4
8         00100                         5
...       ...                           ...
          0000000000000110              28
          0000000000000111              -28
          000000000000001xxxx           +/-29 to +/-36
          000000000000001xxxxxxxxxxxx   +/-37 and beyond

Lev-VLC3

Code no.  Code                          Level (+/-1, +/-2, ...)
0         1000                          1
1         1001                          -1
2         1010                          2
3         1011                          -2
4         1100                          3
5         1101                          -3
6         1110                          4
7         1111                          -4
8         01000                         5
...       ...                           ...
          00000000000001110             56
          00000000000001111             -56
          000000000000001xxxx           +/-57 to +/-64
          000000000000001xxxxxxxxxxxx   +/-65 and beyond

Lev-VLC4

Code no.  Code                          Level (+/-1, +/-2, ...)
0         10000                         1
1         10001                         -1
2         10010                         2
3         10011                         -2
...       ...                           ...
          11110                         8
          11111                         -8
          010000                        9
...       ...                           ...
          0000000000000011110           120
          0000000000000011111           -120
          0000000000000001xxxxxxxxxxxx  +/-121 and beyond

To improve coding efficiency, the level VLC table in use is switched during the coding process according to the following procedure.

Inter, and Intra with QP ≥ 9:
  Encode the first coefficient with VLC0, then move to VLC1.
  Increase the VLC number by one (up to VLC2) if |Level| > 3.

Intra with QP < 9:
  If the number of coefficients > 10: encode the first coefficient with VLC1, then move to VLC2.
  Otherwise: encode the first coefficient with VLC0, then move to VLC1.
  If VLC = VLC1, change to VLC2 if |Level| > 3.
  If VLC ≥ VLC2, increase the VLC number by one (up to VLC4) if |Level| > 5.

•Encode the total number of zeros before the last coefficient

The following table is used to encode the total number of zeros before the last non-zero coefficient (TotZeros).

TotZeros\NumCoeff   1          2        3       4       5      6       7
0                   1          111      0010    111101  01000  101100  111000
1                   011        101      1101    1110    01010  101101  111001
2                   010        011      000     0110    01011  1010    11101
3                   0011       001      010     1010    1110   001     1001
4                   0010       000      1011    000     011    010     1111
5                   00011      1000     1111    100     100    000     00
6                   00010      0101     011     110     1111   110     01
7                   000011     1001     100     1011    110    111     101
8                   000010     1100     0011    010     101    100     110
9                   0000011    01000    1110    001     001    011     100
10                  0000010    11011    1010    0111    000    10111   -
11                  00000001   11010    11000   1111    01001  -       -
12                  00000000   010010   110011  111100  -      -       -
13                  00000011   0100111  110010  -       -      -       -
14                  000000101  0100110  -       -       -      -       -
15                  000000100  -        -       -       -      -       -

TotZeros\NumCoeff   8       9       10     11     12    13   14  15
0                   101000  111000  10000  11000  1000  100  00  0
1                   101001  111001  10001  11001  1001  101  01  1
2                   10101   11101   1001   1101   101   11   1   -
3                   1011    1111    101    111    0     0    -   -
4                   110     00      01     0      11    -    -   -
5                   00      01      11     10     -     -    -   -
6                   111     10      00     -      -     -    -   -
7                   01      110     -      -      -     -    -   -
8                   100     -       -      -      -     -    -   -
9-15                -       -       -      -      -     -    -   -

The decoder can use TotZeros and NumCoeff to determine the position of the last non-zero coefficient.

•Encode each run of zeros

After TotZeros is known, we are now ready to encode the number of preceding zeros before each non-zero coefficient (called RunBefore).

Let ZerosLeft indicate how many zeros are left to distribute during this encoding process. ZerosLeft is used for encoding the RunBefore values in CAVLC.

When encoding the RunBefore of the first non-zero coefficient, ZerosLeft begins at TotZeros. ZerosLeft then decreases as the RunBefore values of more non-zero coefficients are encoded. The encoding of each RunBefore depends on the current ZerosLeft value, as shown in the following table.

RunBefore\ZerosLeft   1   2   3   4   5   6   >6

0 1 1 01 01 01 01 000

1 0 01 00 00 00 00 010

2 - 00 11 11 11 101 101

3 - - 10 101 101 100 100

4 - - - 100 1001 111 111

5 - - - - 1000 1101 110

6 - - - - - 1100 0011

7 - - - - - - 0010

8 - - - - - - 00011

9 - - - - - - 00010

10 - - - - - - 00001

11 - - - - - - 0000011

12 - - - - - - 0000010

13 - - - - - - 0000001

14 - - - - - - 00000001

Why is the maximum RunBefore value 14?

Example:

Consider the following inter-frame residual 4x4 block:

 0   3  -1   0
 0  -1   1   0
 1   0   0   0
 0   0   0   0

The zigzag re-ordering of the block is:
0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0

Therefore, NumCoeff = 5, TotZeros = 3, T1s = 3.

Assume N=0.

Encoding:

Value Code Comments

NumCoeff=5, T1s=3 0001011 N=0 Use Num-VLC0

sign of T1 (1) 0 Starting at highest frequency

sign of T1(-1) 1

sign of T1(-1) 1

Level= +1 1 Inter frame Use Lev-VLC0

Level= +3 0010 Use Lev-VLC1

TotZeros=3 1110 Also depends on NumCoeff

ZerosLeft=3;RunBefore=1 00 RunBefore of the 1st Coeff

ZerosLeft=2;RunBefore=0 1 RunBefore of the 2nd Coeff

ZerosLeft=2;RunBefore=0 1 RunBefore of the 3rd Coeff

ZerosLeft=2;RunBefore=1 01 RunBefore of the 4th Coeff

ZerosLeft=1;RunBefore=1 No code required; last coeff

The transmitted bitstream for this block is 0001011011100101110001101

Decoding:

Code Value Output Array Comments

0001011 NumCoeff=5, T1s=3

Empty

0 + 1 T1 sign

1 - -1,1 T1 sign

1 - -1,-1,1 T1 sign

1 +1 1,-1,-1,1 level value

0010 +3 +3,1,-1,-1,1 level value

1110 TotZeros=3 +3,1,-1,-1,1

00 RunBefore=1 +3,1,-1,-1,0,1 RunBefore of the 1st Coeff

1 RunBefore=0 +3,1,-1,-1,0,1 RunBefore of the 2nd Coeff

1 RunBefore=0 +3,1,-1,-1,0,1 RunBefore of the 3rd Coeff

01 RunBefore=1 +3,0,1,-1,-1,0,1 RunBefore of the 4th Coeff

0,+3,0,1,-1,-1,0,1 ZeroLeft=1

De-blocking Filter


The deblocking filter improves subjective visual quality. The filter is highly context adaptive. It operates on the boundary of 4×4 blocks as shown below.

[Figure: samples across a 4x4 block boundary, q3 q2 q1 q0 | p0 p1 p2 p3, shown for both a vertical and a horizontal boundary.]

The choice of filtering outcome depends on the boundary strength and on the gradient of image samples across the boundary.

Given two adjacent blocks p and q, the boundary strength parameter Bs is selected according to the following rules.

Bs = 4 (strongest filtering):
1) p or q is intra coded, and
2) the boundary is a macroblock boundary.

Bs = 3:
1) p or q is intra coded, and
2) the boundary is not a macroblock boundary.

Bs = 2:
1) neither p nor q is intra coded, and
2) p or q contains coded coefficients.

Bs = 1:
1) neither p nor q is intra coded,
2) neither p nor q contains coded coefficients, and
3) p and q have different reference frames, a different number of reference frames, or different motion vector values.

Bs = 0 (no filtering):
1) neither p nor q is intra coded,
2) neither p nor q contains coded coefficients, and
3) p and q have the same reference frames and identical motion vectors.

A group of samples from the set (p2, p1, p0, q0, q1, q2) is filtered only if:

(a) Bs > 0, and
(b) |p0 - q0| < alpha, |p1 - p0| < beta and |q1 - q0| < beta

where alpha and beta are thresholds defined in the standard.

The threshold values increase with the average quantizer parameter QP of the two blocks p and q.

When QP is small, a gradient across the boundary is likely to be due to image features that should be preserved; therefore the thresholds alpha and beta are low for small QP.

When QP is larger, blocking distortion is likely to be more significant, and alpha and beta are higher so that more boundary samples are filtered.
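A sketch of the Bs decision and the filtering test in Python (the boolean parameter names are illustrative, and alpha/beta are passed in directly; in the standard they are derived from QP):

```python
def boundary_strength(p_intra, q_intra, mb_boundary,
                      p_coeffs, q_coeffs, same_refs_and_mvs):
    """Boundary strength Bs following the rules above."""
    if p_intra or q_intra:
        return 4 if mb_boundary else 3
    if p_coeffs or q_coeffs:
        return 2
    return 0 if same_refs_and_mvs else 1

def filter_decision(bs, p0, p1, q0, q1, alpha, beta):
    # Samples across the edge are filtered only if Bs > 0 and the local
    # gradients are below the QP-dependent thresholds alpha and beta.
    return (bs > 0 and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta and abs(q1 - q0) < beta)
```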

[Figure: the same decoded frame without de-blocking filtering and with de-blocking filtering.]

Data Partitioning and Network Abstraction Layer


A video frame is coded as one or more slices.

Each slice contains an integral number of macroblocks, from 1 up to the total number of macroblocks in a picture.

The number of macroblocks per slice need not be constant within a picture.

There are five slice modes. Three commonly used modes are:

1. I-slice: A slice where all macroblocks of the slice are coded using intra prediction.

2. P-slice: In addition to the coding types of the I-slice, some macroblocks of the P-slice can be coded using inter-prediction (predicted from one reference picture buffer only).

3. B-slice: In addition to the coding types available in a P-slice, some macroblocks of the B-slice can be predicted from two reference picture buffers.

In H.264, the slices within a frame are not necessarily all of the same mode.

That is, a frame may contain I-slices, P-slices and B-slices.

In addition to I-, P- and B-slices, the other two slice modes defined by H.264 are the SP-slice and the SI-slice.

Both the SP- and SI- slices are used for video streaming applications.

The SP- slice may be beneficial for switching bitstreams from the same video sequence (but with different bitrates).

The SI- slice may be adopted for switching the bitstreams from different video sequences.

[Figure: SP-frame. Two bitstreams of the same sequence, Bitstream 1 (..., P1,n-1, P1,n, P1,n+1, ..., I1,n+3) and Bitstream 2 (..., P2,n-1, P2,n, P2,n+1, ..., I2,n+3); the switching frame SP12,n allows the decoder to switch from Bitstream 1 to Bitstream 2 at frame n.]

[Figure: SI-frame. The same switching arrangement using the switching frame SI2,n, used when the two bitstreams come from different video sequences.]

Note that the coded data in a slice can be placed in three separate Data Partitions (A, B and C) for robust transmission.

Partition A contains the slice header and the header data for each macroblock in the slice.

Partition B contains coded residual data for Intra slice macroblocks.

Partition C contains coded residual data for Inter slice macroblocks.

In H.264, the VCL data are mapped into NAL units prior to transmission or storage.

Each NAL unit contains a Raw Byte Sequence Payload (RBSP), a set of data corresponding to coded video data or header information.

The NAL units can be delivered over a packet-based network or a bitstream transmission link or stored in a file.

[Figure: a coded sequence is a series of NAL units, each consisting of a NAL header followed by an RBSP.]

RBSP type                              Description
Parameter Set                          Global parameters for a sequence, such as picture dimensions and video format.
Supplemental Enhancement Information   Side messages that are not essential for correct decoding of the video sequence.
Picture Delimiter                      Boundary between pictures (optional). If not present, the decoder infers the boundary from the frame number contained in each slice header.
Coded Slice                            Header and data for a slice; this RBSP contains actual coded video data.
Data Partition A, B or C               Three units containing data-partitioned slice layer data (useful for error-resilient decoding).
End of Sequence
End of Stream
Filler Data                            Contains 'dummy' data.

Example:

The following shows an example sequence of RBSP elements:

Sequence parameter set, SEI, Picture parameter set, I Slice (coded slice), Picture delimiter, P Slice (coded slice), P Slice (coded slice), ...

Profiles

Baseline:
– For lower-cost applications with limited computing resources. This profile is widely used in mobile applications.

Main:
– For broadcast and storage applications. This profile may be replaced by the High profile for those applications.

Extended:
– For video streaming applications. This profile has relatively high compression ratio and error-resilience capabilities.

High:
– For digital video broadcast and disc storage applications, particularly for high-definition television applications.

[Figure: nesting of profile capabilities. Baseline profile: I and P slices, CAVLC, slice groups and ASO, redundant slices. Extended profile: adds SP/SI slices, data partitioning, B slices, weighted prediction and interlace. Main profile: I/P/B slices, weighted prediction, interlace, CAVLC and CABAC. High profile: Main profile tools plus the 8x8 integer DCT and 4:2:2/4:4:4 support.]