NCTU, EE, Vision Lab
Implementation and Parallelization of H.264 Based System on Multi-DSPs Board
陳奕安 2008.06.11
Outline
System Architecture
Multithreading of this system
Reference framework 5
Parallelism of H.264
Memory issue
System Architecture
[Figure: two stations connected over the network]
PC 1 with MEX Board 1: Capture Frame, H.264 Encode, Send to Network
PC 2 with MEX Board 2: Receive from Network, H.264 Decode, Display
System Architecture
[Figure: task pipeline between the camera and the computer]
Sending side: Input task, H.264 encode processing task, TX networking task
Receiving side: RX networking task, H.264 decode processing task, Output task
Host/MEX Communication
Data transfer from the 4 DSPs (SDRAM) to PCI [7]
MEX (DSP) side:
    DSP started: fill memory
    Initialize transfer
    DSP to PCI transfer request
    Start transfer
    Transfer finished
    Actions: set DSP FIFO direction; set FIFO full flag value; DSP FIFO is reset; start EDMA; unreset DSP1 FIFO; clear PCI interrupt
PC (PCI) side:
    PCI started: wait for interrupt
    Initialize transfer
    PCI to DSP start transfer request
    Wait for transfer finished
    Transfer finished
    Actions: set transfer size; set PCI FIFO direction; select DSP data sources; set transfer destination address; start PCI FIFO; clear DSP interrupt
Host/MEX Communication
[Figure labels: Data, Image]
Networking of H.264 Video
H.264 High Level Architecture
[Figure: H.264 VCL and NAL [6]]
Application: supplies and receives the reconstructed picture
Video Coding Layer (VCL): produces VCL data
Network Abstraction Layer (NAL): wraps VCL data, parameter sets, and supplemental enhancement information into NAL units
Bitstream adaptation and packet adaptation: map NAL units to the transport
Transport: H.320 system, MPEG-2 system, AVC storage, RTP payload
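The NAL unit is the interface between the coding layer and the transport. As a small illustration, the sketch below splits the one-byte NAL unit header into its three fields. The class and function names are made up for this example, and it is a Python model rather than code from the presented system.

```python
from dataclasses import dataclass


@dataclass
class NalHeader:
    forbidden_zero_bit: int  # must be 0 in a valid stream
    nal_ref_idc: int         # 0 means the unit is not used for reference
    nal_unit_type: int       # e.g. 1 = non-IDR slice, 5 = IDR slice, 7 = SPS, 8 = PPS


def parse_nal_header(first_byte: int) -> NalHeader:
    """Split the one-byte NAL unit header into its three bit fields."""
    return NalHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x01,
        nal_ref_idc=(first_byte >> 5) & 0x03,
        nal_unit_type=first_byte & 0x1F,
    )


def is_vcl(header: NalHeader) -> bool:
    """Types 1 to 5 carry coded slice data (VCL); the rest are non-VCL."""
    return 1 <= header.nal_unit_type <= 5
```

For instance, a first byte of 0x67 parses as a sequence parameter set (type 7, non-VCL), while 0x65 parses as an IDR slice (type 5, VCL).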
Networking of H.264 Video
Video Packetization (TMS320C6000 Network Developer's Kit)
[Figure: encapsulation of a video packet down the protocol stack]
Application layer: NAL unit of H.264 (video packet)
Session layer: RTP header + video packet
Transport layer: UDP header + RTP header + video packet
Network layer: IP header + UDP header + RTP header + video packet
Data link layer: MAC header + IP header + UDP header + RTP header + video packet
Physical layer
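The RTP step of the packetization above can be sketched as follows. This is a minimal Python model, not the NDK code used on the board. It assumes one NAL unit per RTP packet, and the function name, the dynamic payload type 96, and the argument values are illustrative choices.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 96, marker: bool = False) -> bytes:
    """Prepend the 12-byte fixed RTP header (RFC 3550) to one video packet."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload
```

The returned bytes would then be handed to a UDP socket, which adds the UDP header; the IP and MAC headers are added by the lower layers of the network stack.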
I/O buffer management
Separate input buffers and output buffers
[Figure: two buffer queues, one for input and one for output. Each queue has a Head and a Tail pointer that advance as frames are input (Inputting) or output (Outputting).]
I/O buffer management
Shared input/output buffers
[Figure: one buffer queue used for both input and output. The Head and Tail pointers move around the queue over successive steps as slots alternate between Inputting and Outputting.]
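The head and tail bookkeeping in the buffer diagrams can be sketched as a circular queue. This is a minimal single-threaded Python model under assumed names (`FrameRing`, `put`, `get`); the real system would also need synchronization between the input and output tasks.

```python
class FrameRing:
    """Fixed-size circular queue of frame buffers with head and tail indices."""

    def __init__(self, slots: int):
        self.buffers = [None] * slots
        self.head = 0    # next slot to read (output side)
        self.tail = 0    # next slot to write (input side)
        self.count = 0   # number of filled slots

    def put(self, frame) -> bool:
        """Store a frame at the tail; return False if every slot is full."""
        if self.count == len(self.buffers):
            return False
        self.buffers[self.tail] = frame
        self.tail = (self.tail + 1) % len(self.buffers)
        self.count += 1
        return True

    def get(self):
        """Remove and return the frame at the head, or None if empty."""
        if self.count == 0:
            return None
        frame = self.buffers[self.head]
        self.buffers[self.head] = None
        self.head = (self.head + 1) % len(self.buffers)
        self.count -= 1
        return frame
```

Keeping an explicit `count` distinguishes a full queue from an empty one, since head and tail are equal in both cases.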
Multithreading of this system
Reference framework for DSP: Reference Framework 5 (RF5)
Built on DSP/BIOS and the TMS320 DSP Algorithm Standard (XDAIS)
Processing flow of RF5
[Figure: processing flow with Split and Join stages around the blocks F0, F1, F2 and V0, V1, V2. Legend: cell, channel, task. Fi and Vi are XDAIS algorithms.]
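Reference framework for DSP: Data communication of RF5
SIO: between task and device
SCOM: between task and task
[Figure, SIO: a task and a device driver exchange data buffers through an SIO object; only the data pointer is passed.]
[Figure, SCOM: a writer task sends a SCOM message holding a data pointer to a reader task through a SCOM queue; the data buffer itself is not copied.]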
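The task-to-task pattern described for SCOM, where a message carrying a buffer pointer travels through a queue and is returned for reuse, can be modeled as below. This is a Python analogy built on `queue.Queue` and threads, not the TI SCOM API; the function names and the dictionary message layout are assumptions made for the sketch.

```python
import queue
import threading


def writer(to_reader: queue.Queue, from_reader: queue.Queue, frames: list) -> None:
    """Fill the shared buffer, then hand a message that points at it to the reader."""
    msg = from_reader.get()            # wait until the buffer is free
    for frame in frames:
        msg["buffer"][:] = frame       # write in place; only the reference travels
        to_reader.put(msg)
        msg = from_reader.get()        # wait for the reader to give it back
    to_reader.put(None)                # sentinel: no more data


def reader(to_reader: queue.Queue, from_reader: queue.Queue, results: list) -> None:
    """Consume each message, then return it so the writer can reuse the buffer."""
    while True:
        msg = to_reader.get()
        if msg is None:
            break
        results.append(bytes(msg["buffer"]))
        from_reader.put(msg)


def run_pair(frames: list) -> list:
    """Run one writer and one reader over a single shared buffer."""
    to_reader, from_reader = queue.Queue(), queue.Queue()
    from_reader.put({"buffer": bytearray(len(frames[0]))})
    results: list = []
    t_w = threading.Thread(target=writer, args=(to_reader, from_reader, frames))
    t_r = threading.Thread(target=reader, args=(to_reader, from_reader, results))
    t_w.start()
    t_r.start()
    t_w.join()
    t_r.join()
    return results
```

Because the single message circulates between the two queues, the writer can never overwrite the buffer while the reader is still using it.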
Reference framework for DSP: Data communication of RF5
ICC: between cell and cell
[Figure: cells 1, 2, and 3, each with an "in" and an "out" list. Each list element is a pointer to an ICC object; an ICC object describes a data buffer and holds its data pointer.]
Reference framework for DSP: Application control of RF5
A task receives both SCOM messages and control messages
[Figure: a task with two inputs, a SCOM queue for data messages (SCOM message) and an MBX mailbox for control messages.]
System Architecture: the present system
Pipeline: Input task, H.264 encode processing task, TX networking task
[Figure: the tasks work on frame i and frame i+1 at the same time. Other labels: Slice, NAL, Control task, Rx.]
System Architecture: multithreading of this system
Pipeline: Input task, H.264 encode processing task, TX networking task
[Figure: the tasks work on frame i and frame i+1 at the same time, with several macroblocks (MB) handled in parallel. Other labels: NAL, Control task, Rx.]
Parallelizing H.264
Task-level decomposition
    Divide the algorithm into balanced tasks
    Accelerate each task
Data-level decomposition
    GOP-level parallelism
    Frame-level parallelism
    Slice-level parallelism
    Macroblock-level parallelism
H.264 Encoder Block Diagram
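[Figure: H.264 encoder block diagram]
Forward path: Fn (current) minus the prediction P gives the residual Dn; then T, Q, Reorder, Entropy encode, NAL
Prediction P: Inter (ME and MC on the reference F'n-1) or Intra (choose intra prediction, intra prediction)
Reconstruction path: X, Q^-1, T^-1 give D'n; D'n plus P gives uF'n; Filter gives F'n (reconstructed)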
H.264 Decoder Block Diagram
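[Figure: H.264 decoder block diagram]
Decoding path: NAL, Entropy decode, Reorder, Q^-1, T^-1 give D'n
Prediction P: Inter (MC on the reference F'n-1) or Intra (intra prediction)
Reconstruction: D'n plus P gives uF'n; Filter gives F'n (reconstructed)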
Task-level Decomposition
Task profile for H.264
[Figure: task profile for H.264 [2]]
Parallelizing H.264
H.264 data structure
Video sequence: GOP0, GOP1, GOP2, ..., GOPn
Group of pictures: F0, F1, F2, ..., Fn
Frame: Slice 0, Slice 1, Slice 2, Slice 3, ...
Slice: MB0, MB1, MB2, ..., MBn
Macroblock: Y, Cb, Cr
Data-level Decomposition
GOP-level parallelism: high latency, large memory
Frame-level parallelism: I, P, B frame imbalance
Slice-level parallelism: bitrate increases
Macroblock-level parallelism
Macroblock-level Parallelism
Spatial parallelism
Temporal parallelism
Spatial & temporal parallelism
Possible data dependencies for a macroblock
[Figure: the current MB depends on neighboring MBs in the same frame (intra prediction, MV prediction, deblocking filter) and, between frame i and frame i + 1, on the search window in the reference frame.]
Macroblock-level Parallelism
Spatial parallelism
Time slot in which MB(x, y) is processed:
        x=0   x=1   x=2   x=3   x=4
y=0     T1    T2    T3    T4    T5
y=1     T3    T4    T5    T6    T7
y=2     T5    T6    T7    T8    T9
y=3     T7    T8    T9    T10   T11
y=4     T9    T10   T11   T12   T13
Legend: MBs processed, MBs being processed, MBs to be processed
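The time slots in the spatial-parallelism diagram follow the pattern T = x + 2y + 1. The factor of 2 leaves room for the top-right neighbor MB(x+1, y-1), which intra and MV prediction rely on, to finish one slot earlier. The sketch below reproduces the 5x5 schedule and counts how many macroblocks share a slot; the function names are chosen for this example only.

```python
from collections import Counter


def spatial_slot(x: int, y: int) -> int:
    """Time slot of MB(x, y) in the 2D wavefront: T = x + 2*y + 1."""
    return x + 2 * y + 1


def schedule(width: int, height: int) -> dict:
    """Map every macroblock coordinate (x, y) to its time slot."""
    return {(x, y): spatial_slot(x, y)
            for y in range(height) for x in range(width)}


def peak_parallelism(width: int, height: int) -> int:
    """Largest number of macroblocks that share one time slot."""
    return max(Counter(schedule(width, height).values()).values())
```

For the 5x5 grid the last macroblock lands in slot 13 instead of slot 25 for strictly sequential processing, and at most 3 macroblocks are processed in the same slot.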
Macroblock-level Parallelism
Temporal parallelism
Time slot in which MB(x, y) is processed, frame i:
        x=0   x=1   x=2   x=3   x=4
y=0     T1    T2    T3    T4    T5
y=1     T6    T7    T8    T9    T10
y=2     T11   T12   T13   T14   T15
y=3     T16   T17   T18   T19   T20
y=4     T21   T22   T23   T24   T25
Time slot in which MB(x, y) is processed, frame i + 1:
        x=0   x=1   x=2   x=3   x=4
y=0     T11   T12   T13   T14   T15
y=1     T16   T17   T18   T19   T20
y=2     T21   T22   T23   T24   T25
y=3     T26   T27   T28   T29   T30
y=4     T31   T32   T33   T34   T35
Legend: MBs processed, MBs being processed, MBs to be processed
Macroblock-level Parallelism
Spatial & temporal parallelism
Time slot in which MB(x, y) is processed, frame i:
        x=0   x=1   x=2   x=3   x=4
y=0     T1    T2    T3    T4    T5
y=1     T3    T4    T5    T6    T7
y=2     T5    T6    T7    T8    T9
y=3     T7    T8    T9    T10   T11
y=4     T9    T10   T11   T12   T13
Time slot in which MB(x, y) is processed, frame i + 1:
        x=0   x=1   x=2   x=3   x=4
y=0     T5    T6    T7    T8    T9
y=1     T7    T8    T9    T10   T11
y=2     T9    T10   T11   T12   T13
y=3     T11   T12   T13   T14   T15
y=4     T13   T14   T15   T16   T17
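To compare the temporal scheme with the combined spatial and temporal scheme, the sketch below computes the slot in which the last macroblock of the last frame is processed. The frame lags (10 slots for raster order, 4 slots for the wavefront) are read off the diagrams and are an interpretation of the figures; the function names are illustrative.

```python
def raster_slot(x: int, y: int, frame: int, width: int, lag: int) -> int:
    """Temporal parallelism: raster order in a frame, each frame `lag` slots later."""
    return y * width + x + 1 + frame * lag


def wavefront_slot(x: int, y: int, frame: int, lag: int) -> int:
    """Spatial and temporal parallelism: wavefront in a frame, frames offset by `lag`."""
    return x + 2 * y + 1 + frame * lag


def last_slot(slot_fn, width: int, height: int, frames: int) -> int:
    """Slot in which the final macroblock of the final frame is processed."""
    return max(slot_fn(x, y, f)
               for f in range(frames)
               for y in range(height)
               for x in range(width))


temporal_end = last_slot(lambda x, y, f: raster_slot(x, y, f, 5, 10), 5, 5, 2)
combined_end = last_slot(lambda x, y, f: wavefront_slot(x, y, f, 4), 5, 5, 2)
```

With two 5x5 frames, `temporal_end` is 35 and `combined_end` is 17, matching the last entries of the two diagrams.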
Memory Issue
Two-level cache architecture of DM642
    DM642 DSP core
    L1P cache: direct mapped, 16 Kbytes total
    L1D cache: 2-way set associative, 16 Kbytes total
    L2 cache/memory: 256 Kbytes total
    EDMA controller and peripherals
Limited memory of DM642
Use memory buffers to reduce memory access
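A quick calculation shows why the memory is tight. The sketch below compares frame and macroblock-row sizes with the cache sizes listed above. The CIF resolution (352x288) and 8-bit 4:2:0 sampling are assumptions for illustration, since the slides do not state the video format, and the helper names are hypothetical.

```python
L1D_BYTES = 16 * 1024    # L1D cache size from the slide
L2_BYTES = 256 * 1024    # L2 cache/memory size from the slide


def yuv420_bytes(width: int, height: int) -> int:
    """Bytes for 8-bit 4:2:0 video: full-size Y plus quarter-size Cb and Cr."""
    return width * height * 3 // 2


def fits(n_bytes: int, capacity: int) -> bool:
    """True if a buffer of n_bytes fits within the given capacity."""
    return n_bytes <= capacity


cif_frame = yuv420_bytes(352, 288)   # one assumed CIF frame
mb_row = yuv420_bytes(352, 16)       # one row of macroblocks in that frame
```

Under these assumptions one frame takes 152064 bytes, so a current frame plus a reference frame already exceeds the 256 Kbytes of L2, while a single macroblock row (8448 bytes) is small enough to buffer on chip.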
Memory Issue
Memory hierarchy for inter prediction
[Figure: memory hierarchy [4]]
Memory Issue
Slice memory buffer for intra prediction and deblocking filter
[Figure: slice memory [5]]
Reference
[1] Texas Instruments, Incorporated, "Reference Frameworks for eXpressDSP Software: RF5, An Extensive, High-Density System" (spru795a).
[2] TC Chen, HC Fang, CJ Lian, CH Tsai, "Algorithm analysis and architecture design for HDTV applications - a look at the H.264/AVC video compressor system," IEEE Circuits & Devices Magazine, May/June 2006.
[3] Cor Meenderinck, Arnaldo Azevedo, and Ben Juurlink, "Parallel Scalability of Video Decoders," April 29, 2008.
[4] Denolf, K., De Vleeschouwer, et al., "Memory centric design of an MPEG-4 video encoder," IEEE Trans. CSVT, Vol. 15, No. 5, pp. 609-619, May 2005.
[5] Tsu-Ming Liu et al., "A 125μW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications," ISSCC Digest of Technical Papers, pp. 402-403, Feb. 2006.
[6] T. Wiegand et al., "Overview of H.264/AVC Video Coding Standard," IEEE Trans. on Circ. and Sys. for Video Technology, Vol. 13, No. 7, pp. 560-576, July 2003.
[7] VITEC MULTIMEDIA, "MEX User manual Revision 1.7".