
NCTU, EE, Vision Lab

Implementation and Parallelization of H.264 Based System on Multi-DSPs Board

陳奕安 2008.06.11


Outline

- System Architecture
- Multithreading of this system
- Reference framework 5
- Parallelism of H.264
- Memory issue

System Architecture

[Figure: end-to-end system. Sending side (PC 1, MEX Board 1): Capture Frame -> H.264 Encode -> Send to Network. Receiving side (PC 2, MEX Board 2): Receive from Network -> H.264 Decode -> Display.]

System Architecture

[Figure: task pipeline. Camera -> Input task -> H.264 Encode processing task -> TX networking task -> RX networking task -> H.264 Decode processing task -> Output task -> Computer.]

Host/MEX Communication

Data transfer from the 4 DSP (SDRAM) to PCI [7]

MEX (DSP side):
1. DSP started: fill memory
2. Initialize transfer
3. DSP to PCI transfer request
4. Start transfer
5. Transfer finished

Operations noted beside the DSP steps: set DSP FIFO direction; set FIFO full flag value; DSP FIFO is reset; start EDMA; unreset DSP1 FIFO; clear PCI interrupt.

PC (PCI side):
1. PCI started: wait for interrupt
2. Initialize transfer
3. PCI to DSP start transfer request
4. Wait for transfer finished
5. Transfer finished

Operations noted beside the PCI steps: set transfer size; set PCI FIFO direction; select DSP data sources; set transfer destination address; start PCI FIFO; clear DSP interrupt.
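The two flowchart columns can be read as a pair of threads that synchronise on interrupts. The following is a minimal simulation in Python, not MEX driver code: the function `run_transfer`, the three events, and the in-memory `fifo` are hypothetical stand-ins, and `threading.Event` plays the role of the PCI and DSP interrupts.

```python
import threading

def run_transfer(payload: bytes) -> bytes:
    """Simulate one DSP-to-host transfer and return what the host received."""
    dsp_request = threading.Event()   # stands in for the DSP-to-PCI transfer request
    host_start = threading.Event()    # stands in for the PCI-to-DSP start transfer request
    finished = threading.Event()      # stands in for the "transfer finished" signal
    fifo = bytearray()                # stands in for the FIFO between DSP and PCI
    received = bytearray()

    def dsp_side():
        sdram = bytes(payload)        # DSP started: fill memory
        fifo.clear()                  # initialize transfer (FIFO is reset)
        dsp_request.set()             # DSP to PCI transfer request
        host_start.wait()             # wait until the host side is armed
        fifo.extend(sdram)            # start transfer (EDMA in the real system)
        finished.set()                # transfer finished

    def host_side():
        dsp_request.wait()            # PCI started: wait for interrupt
        received.clear()              # initialize transfer
        host_start.set()              # PCI to DSP start transfer request
        finished.wait()               # wait for transfer finished
        received.extend(fifo)         # transfer finished: data is on the host

    threads = [threading.Thread(target=dsp_side),
               threading.Thread(target=host_side)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(received)
```

Calling `run_transfer(b"frame")` walks both columns once; neither side moves data before the other has signalled that it is ready.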

Host/MEX Communication

[Figure: diagram with the labels "Data" and "Image".]


Networking of H.264 Video

H.264 High Level Architecture

[Figure: H.264 VCL and NAL [6]. Layers: Application; AVC / H.264 (Video Coding Layer, Network Abstraction Layer); Transport (H.320 System, MPEG-2 System, AVC Storage, RTP Payload). Data labels: reconstructed picture, VCL data, parameter sets, supplemental enhancement information, NAL unit, bitstream adaptation, packet adaptation.]

Networking of H.264 Video

Layer              Packet contents
Application layer  Video packet
Session layer      RTP header + video packet
Transport layer    UDP header + RTP header + video packet
Network layer      IP header + UDP header + RTP header + video packet
Data link layer    MAC header + IP header + UDP header + RTP header + video packet
Physical layer

Other labels in the figure: NAL unit of H.264; Video Packetization; TMS320C6000 Network Developer's Kit.
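The session-layer step of this stack, wrapping one NAL unit in an RTP header, can be sketched as follows. This is an illustrative sketch and not the presentation's code: it builds the 12-byte RTP fixed header (version 2, no padding, no extension, no CSRC entries) and carries one NAL unit per packet. The function name `rtp_packet`, the payload type 96, and the SSRC value are assumptions chosen for the example.

```python
import struct

RTP_VERSION = 2

def rtp_packet(nal_unit: bytes, seq: int, timestamp: int,
               ssrc: int = 0x12345678, payload_type: int = 96,
               marker: bool = False) -> bytes:
    """Prefix one NAL unit with a 12-byte RTP fixed header."""
    byte0 = RTP_VERSION << 6                          # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,                # 16-bit sequence number
                         timestamp & 0xFFFFFFFF,      # 32-bit timestamp
                         ssrc & 0xFFFFFFFF)           # 32-bit source identifier
    return header + nal_unit

# One hypothetical 11-byte NAL unit becomes a 23-byte RTP packet.
pkt = rtp_packet(b"\x65" + b"\x00" * 10, seq=1, timestamp=90000)
```

The UDP, IP, and MAC headers lower in the table would be added by the network stack rather than by application code.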


I/O buffer management

Input buffers and output buffers

[Figure: separate input and output buffer queues, each with Head and Tail pointers, shown over successive "Inputting" and "Outputting" steps.]

I/O buffer management

Input / output buffers

[Figure: a shared input/output buffer queue with Head and Tail pointers, shown over alternating "Inputting" and "Outputting" steps.]
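A head/tail buffer queue of this kind is commonly implemented as a ring of fixed slots. The sketch below is a hypothetical illustration, not the system's code: the class name `FrameRing` is invented, and it assumes that the input side fills the slot at the tail while the output side drains the slot at the head. The slides do not state which pointer belongs to which side.

```python
class FrameRing:
    """Fixed-size ring of buffer slots shared by an input and an output side."""

    def __init__(self, slots: int):
        self.slots = [None] * slots
        self.head = 0      # oldest filled slot, next to be output
        self.tail = 0      # next free slot, next to be filled
        self.count = 0     # number of filled slots

    def put(self, frame) -> bool:
        """Input side: fill the slot at tail; False if every slot is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = frame
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def get(self):
        """Output side: drain the slot at head; None if nothing is ready."""
        if self.count == 0:
            return None
        frame = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return frame
```

Because both pointers only move forward modulo the slot count, inputting and outputting can alternate without copying frames between slots.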

System Architecture

Multithreading of this system

[Figure: task pipeline. Camera -> Input task -> H.264 Encode processing task -> TX networking task -> RX networking task -> H.264 Decode processing task -> Output task -> Computer.]

Reference framework for DSP

- Reference framework 5 (RF5)
  - DSP/BIOS, TMS320 DSP Algorithm Standard
- Processing flow of RF5

[Figure: processing flow of RF5. A Split stage feeds three parallel channels (F0 and V0, F1 and V1, F2 and V2) that meet again at a Join stage. Legend: task, channel, cell; Fi and Vi are XDAIS algorithms.]

Reference framework for DSP

Data communication of RF5

- SIO: task and device
- SCOM: task and task

[Figure, SIO: a task and a device driver exchange a data pointer to a data buffer through an SIO object.]

[Figure, SCOM: a writer task and a reader task exchange a SCOM message, which holds a data pointer to a data buffer, through a SCOM queue.]
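The task-to-task pattern, a message that carries a pointer to a buffer and travels through a queue, can be imitated in a few lines. This sketch uses Python's `queue.Queue` and threads; it is not the RF5 SCOM API, and the names `Msg` and `pass_buffers` are hypothetical. It assumes the reader hands each message back to the writer once the data has been consumed, so a single buffer is reused.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class Msg:
    """Message that carries a reference to a data buffer, not a copy of the data."""
    buf: bytearray

def pass_buffers(values):
    to_reader = queue.Queue()            # queue read by the reader task
    to_writer = queue.Queue()            # queue read by the writer task
    to_writer.put(Msg(bytearray(4)))     # one message circulates between the tasks
    seen = []

    def writer():
        for value in values:
            msg = to_writer.get()            # wait for an empty buffer
            msg.buf[:] = bytes([value]) * 4  # fill the buffer in place
            to_reader.put(msg)               # hand the full buffer to the reader
        to_reader.put(None)                  # sentinel: no more data

    def reader():
        while True:
            msg = to_reader.get()
            if msg is None:
                break
            seen.append(bytes(msg.buf))      # consume the data
            to_writer.put(msg)               # return the message for reuse

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Only the small message object moves between the tasks; the buffer itself stays where it is.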

Reference framework for DSP

Data communication of RF5

- ICC: cell and cell

[Figure: cells 1, 2 and 3, each with an "in" list and an "out" list. Legend: data buffer; data pointer; cell; ICC object describing a buffer; element in a list of pointers to ICC objects.]

Reference framework for DSP

Application control of RF5

- Task receiving both SCOM messages and control messages

[Figure: a task with two inputs, a SCOM queue for data messages (SCOM message) and an MBX mailbox for control messages.]
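One way a task can serve both inputs is to drain the control mailbox without blocking and then block on the data queue. The sketch below illustrates that pattern with Python queues; it is not RF5 code, and the function `task_loop`, the `"set_qp"` command, and the default value 28 are hypothetical.

```python
import queue

def task_loop(data_q: queue.Queue, control_q: queue.Queue):
    """Process data messages, applying pending control messages first."""
    qp = 28                                # hypothetical default parameter
    out = []
    while True:
        # Poll the control mailbox without blocking so data is never held up.
        try:
            while True:
                cmd, value = control_q.get_nowait()
                if cmd == "set_qp":
                    qp = value
        except queue.Empty:
            pass
        item = data_q.get()                # block until the next data message
        if item is None:                   # sentinel ends the sketch
            return out
        out.append((item, qp))
```

With `("set_qp", 30)` already waiting in `control_q`, a data message `"f0"` is processed as `("f0", 30)`.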

System Architecture

The present system

[Figure: tasks: Input task, H.264 Encode processing task, TX networking task, Control task (with Rx). Data labels: Frame i, Frame i+1, Slice, NAL.]

System Architecture

Multithreading of this system

[Figure: tasks: Input task, H.264 Encode processing task, TX networking task, Control task (with Rx). Data labels: Frame i, Frame i+1, three MB blocks, NAL.]

Parallelizing H.264

- Task-level decomposition
  - Divide the algorithm into balanced tasks
  - Accelerate each task
- Data-level decomposition
  - GOP-level parallelism
  - Frame-level parallelism
  - Slice-level parallelism
  - Macroblock-level parallelism

H.264 Encoder Block Diagram

[Figure: H.264 encoder. Forward path: Fn (current) minus prediction P gives Dn -> T -> Q -> X -> Reorder -> Entropy encode -> NAL. Prediction P is either Inter (ME and MC using F'n-1 (reference)) or Intra (choose intra prediction, intra prediction using uF'n). Reconstruction path: X -> Q^-1 -> T^-1 -> D'n, plus P gives uF'n -> Filter -> F'n (reconstructed).]

H.264 Decoder Block Diagram

[Figure: H.264 decoder. NAL -> Entropy decode -> Reorder -> Q^-1 -> T^-1 -> D'n, plus prediction P gives uF'n -> Filter -> F'n (reconstructed). Prediction P is either Inter (MC using F'n-1 (reference)) or Intra (intra prediction).]

Task-level Decomposition

Task profile for H.264 [2]

Parallelizing H.264

H.264 data structure

- Video sequence: GOP0, GOP1, GOP2, …, GOPn
- Group of pictures: F0, F1, F2, …, Fn
- Frame: Slice 0, Slice 1, Slice 2, Slice 3, …
- Slice: MB0, MB1, MB2, …, MBn
- Macroblock: Y, Cb, Cr

Data-level Decomposition

- GOP-level parallelism: high latency, large memory
- Frame-level parallelism: I, P, B frame imbalance
- Slice-level parallelism: bitrate increases
- Macroblock-level parallelism

Macroblock-level Parallelism

- Spatial parallelism
- Temporal parallelism
- Spatial & temporal parallelism
- Possible data dependencies for a macroblock

[Figure: data dependencies of the current MB in frame i+1. Of the four neighbouring MBs shown in the same frame, two are labelled "Intra Pred., MV Pred." and two are labelled "Intra Pred., MV Pred., Deblocking Filter". The search window lies in frame i.]

Macroblock-level Parallelism

Spatial parallelism

Time slot in which each MB(x, y) is processed:

       x=0   x=1   x=2   x=3   x=4
y=0    T1    T2    T3    T4    T5
y=1    T3    T4    T5    T6    T7
y=2    T5    T6    T7    T8    T9
y=3    T7    T8    T9    T10   T11
y=4    T9    T10   T11   T12   T13

Legend: MBs processed, MBs processing, MBs to be processed.
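The time slots in this grid follow T = x + 2y + 1, which is what results when every MB starts one slot after the last of its neighbours finishes. The sketch below reproduces the grid under an assumption the slides do not state explicitly: that each MB depends on its left, upper-left, upper, and upper-right neighbours. The function name `wavefront_slots` is hypothetical.

```python
from collections import Counter

def wavefront_slots(width: int, height: int):
    """Earliest time slot per MB, given left, upper-left, upper, upper-right dependencies."""
    slot = [[0] * width for _ in range(height)]
    for y in range(height):                # raster order: dependencies are already filled in
        for x in range(width):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [slot[j][i] for i, j in deps
                     if 0 <= i < width and 0 <= j < height]
            slot[y][x] = 1 + max(ready, default=0)
    return slot

slots = wavefront_slots(5, 5)

# Number of MBs that share the busiest time slot, i.e. the peak parallelism.
peak = max(Counter(t for row in slots for t in row).values())
```

For the 5 x 5 example at most three MBs are in flight at once, so the whole frame takes 13 slots instead of the 25 needed in raster order.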

Macroblock-level Parallelism

Temporal parallelism

Time slot in which each MB(x, y) is processed, frame i:

       x=0   x=1   x=2   x=3   x=4
y=0    T1    T2    T3    T4    T5
y=1    T6    T7    T8    T9    T10
y=2    T11   T12   T13   T14   T15
y=3    T16   T17   T18   T19   T20
y=4    T21   T22   T23   T24   T25

Frame i+1:

       x=0   x=1   x=2   x=3   x=4
y=0    T1    T2    T13   T14   T15
y=1    T16   T17   T18   T19   T20
y=2    T21   T22   T23   T24   T25
y=3    T26   T27   T28   T29   T30
y=4    T31   T32   T33   T34   T35

The first two entries of frame i+1 read T1 and T2 in the source; every other entry follows raster order with a constant offset of 10 slots from frame i.

Legend: MBs processed, MBs processing, MBs to be processed.

Macroblock-level Parallelism

Spatial & temporal parallelism

Time slot in which each MB(x, y) is processed, frame i:

       x=0   x=1   x=2   x=3   x=4
y=0    T1    T2    T3    T4    T5
y=1    T3    T4    T5    T6    T7
y=2    T5    T6    T7    T8    T9
y=3    T7    T8    T9    T10   T11
y=4    T9    T10   T11   T12   T13

Frame i+1:

       x=0   x=1   x=2   x=3   x=4
y=0    T5    T6    T7    T8    T9
y=1    T7    T8    T9    T10   T11
y=2    T9    T10   T11   T12   T13
y=3    T11   T12   T13   T14   T15
y=4    T13   T14   T15   T16   T17
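The constant offset of four slots between the two grids can be reproduced by adding a temporal dependency to the spatial wavefront. The sketch below is hedged on two assumptions that the slides do not spell out: the spatial dependencies are the left, upper-left, upper, and upper-right neighbours, and the search window in the reference frame spans one MB in every direction around the co-located MB. The function name `slots_for_frame` and the `window` parameter are hypothetical.

```python
def slots_for_frame(width, height, ref_slots=None, window=1):
    """Earliest time slot per MB with spatial and, optionally, temporal dependencies."""
    slot = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ready = []
            # Spatial dependencies inside the current frame.
            for i, j in [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]:
                if 0 <= i < width and 0 <= j < height:
                    ready.append(slot[j][i])
            # Temporal dependencies: every reference MB inside the search window.
            if ref_slots is not None:
                for j in range(max(0, y - window), min(height, y + window + 1)):
                    for i in range(max(0, x - window), min(width, x + window + 1)):
                        ready.append(ref_slots[j][i])
            slot[y][x] = 1 + max(ready, default=0)
    return slot

frame_i = slots_for_frame(5, 5)                        # T1 .. T13
frame_next = slots_for_frame(5, 5, ref_slots=frame_i)  # T5 .. T17
```

Under these assumptions MB(0,0) of frame i+1 must wait for MB(1,1) of frame i, which finishes in slot T4, so the second frame starts at T5 and stays four slots behind throughout.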


Memory Issue

- Limited memory of DM642
- Use memory buffer to reduce memory access

[Figure: two-level cache architecture of DM642. DM642 DSP core; L1P cache: direct mapped, 16 Kbytes total; L1D cache: 2-way set associative, 16 Kbytes total; L2 cache/memory: 256 Kbytes total; EDMA controller; peripherals.]
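A quick calculation shows why the on-chip memory is a constraint. The sketch below compares the 256 Kbytes of L2 from the figure with the size of one video frame; the CIF resolution (352 x 288) and the 8-bit 4:2:0 format are assumptions made for illustration, since the slides do not state the frame size.

```python
L2_BYTES = 256 * 1024          # DM642 L2 cache/memory size from the figure

def yuv420_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit 4:2:0 frame: full-size luma plus two quarter-size chroma planes."""
    return width * height * 3 // 2

cif = yuv420_frame_bytes(352, 288)   # 152064 bytes
fits_one = cif <= L2_BYTES           # True: a single CIF frame fits in L2
fits_two = 2 * cif <= L2_BYTES       # False: current plus reference frame do not
```

Since a current frame and one reference frame together already exceed L2 under these assumptions, only part of the working set can be held on chip at a time.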

Memory Issue

Memory hierarchy for inter prediction

[Figure: memory hierarchy [4]]

Memory Issue

Slice memory buffer for intra prediction and deblocking filter

[Figure: slice memory [5]]

Reference

[1] Texas Instruments, Incorporated, “Reference Frameworks for eXpressDSP Software: RF5, An Extensive, High-Density System” (spru795a).

[2] T.-C. Chen, H.-C. Fang, C.-J. Lian, C.-H. Tsai, “Algorithm analysis and architecture design for HDTV applications - a look at the H.264/AVC video compressor system,” IEEE Circuits & Devices Magazine, May/June 2006.

[3] Cor Meenderinck, Arnaldo Azevedo and Ben Juurlink, “Parallel Scalability of Video Decoders,” April 29, 2008.

[4] Denolf, K., De Vleeschouwer, et al., “Memory centric design of an MPEG-4 video encoder,” IEEE Trans. CSVT, Vol. 15, No. 5, pp. 609-619, May 2005.

[5] Tsu-Ming Liu et al., “A 125μW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications,” ISSCC Digest of Technical Papers, pp. 402-403, Feb. 2006.

[6] T. Wiegand et al., “Overview of H.264/AVC Video Coding Standard,” IEEE Trans. on Circ. and Sys. for Video Technology, Vol. 13, No. 7, pp. 560–576, July 2003.

[7] VITEC MULTIMEDIA, “MEX User manual Revision 1.7”.