youn-long lin-update on super hdtv decoder...

42
Update on Super HDTV Decoder Project Youn-Long Lin 林永隆 Department of Computer Science National Tsing Hua University IC-DFN 2007, Rizhao

Upload: dinhnhu

Post on 22-Jul-2018

230 views

Category:

Documents


0 download

TRANSCRIPT

  • Update on Super HDTV Decoder Project

    Youn-Long Lin Department of Computer Science

    National Tsing Hua University

    IC-DFN 2007, Rizhao

  • YLLIN NTHU-CS 2

    More Pixels

  • YLLIN NTHU-CS 3

    NHK Proposes UHD TV Broadcast

    Super HiVision 7680x4320 pixels at 60 fps (16XHDTV)

    Baseband signal is 24 Gbps. Using 16 MPEG-2 encoding chips, the signal was compressed to 250 Mbps for transmission.

    HDTV signals at present are 1.5 Gbps for baseband and 20 Mbps for compressed signals.

    High Performance compression / decompression and transmission / storage are needed for

    24 Gbps ~300 Mbps

  • YLLIN NTHU-CS 4

    3840x2160 QFHD TV

    7680x4320 UHD TV

    SDTV

    1920x1080 HDTV

  • YLLIN NTHU-CS 5

    Applications

    QFHD1080HD720HDD2CIFQCIF

  • YLLIN NTHU-CS 6

    Video Coding Technology Trend

    H.264 50% 69%

  • YLLIN NTHU-CS 7

    Features of Video Coding Standards

    64kbps ~ 150Mbps64kbps~2Mbps2-15 MbpsUp to 1.5 Mbps

    Transmission rate

    I, P, BI, P, BI, P, BI, P, BPicture type

    Multiple (5) framesOne frameOne frameOne frameReference frames

    pel pel pel pelPixel accuracy

    41 MVs per MBYesYesYesME, MC

    VLC, CAVLC and CABACVLCVLCVLCEntropy coding

    4*4 int transformDCT/ WaveletDCTDCTTransform

    16*16, 16*8, 8*16, 8*8, 8*4, 4*8, 4*4

    16*16, 8*88*88*8Block size

    16*1616*1616*16(frame)16*16MB size

    H.264/AVCMPEG-4MPEG-2MPEG-1Standard

  • YLLIN NTHU-CS 8

    Not all H.264/AVC systems are equal

    32168

    8.872.5411

    55.724.616.95

    Search Range#Ref Frames

    Video Coding with H.264/AVC: Tools, Performance and Complexity, J. Ostermann et al, IEEE CAS Mag., Q1 2004.

    Relative Computational Complexity

  • YLLIN NTHU-CS 9

    Quality vs Bit-rate vs Decoding Throughput

    6530726557042144172316FpsBit Rate (Kbps)QP

    H.264/AVC Baseline Profile Decoder Complexity Analysis, M. Horowitz, IEEE T-CSVT, July 2003

    Decoding Capability of a 600MHz CPU

  • YLLIN NTHU-CS 10

    Our Target

    Single-Chip Decoder for QFHD (3840x2160) H.264/AVC High Profile Video.264 bitstream Video outputCABAD Advanced Entropy CoderMixed 4x4/8x8 Transform Commodity External MemoryPlatform-Based Design

  • YLLIN NTHU-CS 11

    Resolution vs. Needed Frequency

    Clock Frequency (MHz)

    Res

    olut

    ion

    100 200 400 800

    1080 HD

    QFHD

    720 HD

    16 VGA

    50

    Lin [30],Chen[32],Chien [46], Peng[48]Lin [31], Liu[42]Conexant [36]C&S [39]Kawakami [44]

    66 % Frequency Saving

    4 x larger frame size

  • YLLIN NTHU-CS 12

    Frequency Budget

    675.0

    170.0

    75.0

    28.1

    8.3

    2.0

    1.0

    Size

    Digital signageMedical videoSatellite imageSpace exploration

    249 MHzQFHD (3840 x 2160)

    62 MHz1080HD (1920 x 1088)

    Home theater

    30 MHz720HD (1080 x 720)

    Car TVSurveillance10 MHzD2 (720 x 480)

    Mobile TV3 MHzCIF (352 x 288)

    0.8 MHzQCIF (176 x 144)

    Video phone

    0.4 MHzSQCIF (128 x 96)

    ApplicationClock FrequencyResolution

  • YLLIN NTHU-CS 13

    Essential Issues

    MemoryTradeoff Between the Size of Internal Memory and

    Bandwidth of External AccessMassive Parallelism (Pipelining)Macroblock Decoding SchedulingPower

  • YLLIN NTHU-CS 14

    NTHU H.264 Decoder Architecture

    Parser CAVLD/CABAD

    IQ & IT

    MVG

    IPRED

    INTERP

    BSG

    DF

    MAU & AMBA Interface Translator

    H.264 Video Decoder

    CPU Display MemoryController Ethernet

    AHB

    para & predinfo

    reconbs

    residualm

    v&

    ridx

    coeffm

    vdinfo

  • Memory

  • YLLIN NTHU-CS 16

    size vs. b/w in ME

    Memory Bandwidth (MB/s)

    Memory Size (Bytes)

    19658

    240

    1200

    4977

    124929

    151631762

    A

    B

    C

    D

    Full HD 30fps, # of rf =1 , SRV=SRH=64 Level A : 240 Bytes, 19658 MB/sLevel B : 1200 Bytes, 1516MB/sLevel C: 4977 Bytes, 317MB/sLevel D: 124,929 Bytes, 62 MB/s

  • YLLIN NTHU-CS 17

    CB memrf0 mem rf1 mem

    CMBreg

    CB AGrf AG

    rf regarray

    comparator comparator comparator comparator

    MV mem

    IME block diagram

    CMBreg

    CMBreg

    CMBreg

    rf router

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MVGenrf0

    MV AG

  • YLLIN NTHU-CS 18

    size vs. b/w in ME

    Memory Bandwidth (MB/s)

    Memory Size (Bytes)

    19658

    240

    1200

    4977

    124929

    151631762

    A

    B

    C

    D

    ours

  • YLLIN NTHU-CS 19

    Reference-data Pre-fetch System

    No redundant fetching Collecting several MBs motion vectors, and read the same place

    by only one single operation

    Minimize the number of burst initials On average, 2 burst initials per MB (1 for luma, 1 for chroma)

    : a group of sequentially read (burst read)

  • YLLIN NTHU-CS 20

    CABAD

    Reference-data Pre-fetch System (Cont)

    MB7

    MB8

    MB9

    MB10

    Motion VectorGenerator

    Translator

    Reference Region & Index Register

    Region Analyzer/ Searcher

    OES manager

    MB6MB7

    MB4MB5MB6MB7

    MB4MB5

    MB2MB3MB4

    MB1MB2

    MB0MB1MB2

    MB0

    R0

    R1

    R0R1R2R3R4R5R6R7 Buffer

    R2

    MB7 MB6 MB5 MB4 MB3 MB2 MB1 MB0

    R2 Information

    R2 Information

    Interp

    R0/R1 Data

    R2 Data from SDRAM

    MB7 Information

    MB7 MV

    MB7 Region Information

    . . . .

    MAU Interface

  • Massive Parallelism

  • YLLIN NTHU-CS 22

    4

    IQ/IDCT Timing Diagram

    t3 212

    1 1 1 1 1 1

    195

    0~16

    0~15chromaac_6_7

    1

    0~16luma

    ac_0_1

    0~16luma

    ac_14_15

    0~16luma

    ac_0_1

    0~16luma

    ac_14_15

    1 1 1

    4

    1 1 1 1

    4

    0~4dc

    0~15chromaac_0_1

    0~15chromaac_6_7

    0~15chromaac_0_1

    0~15chromaac_6_7

    1 1 1 1 1 4 1 1 1 1

    4

    122 140 144 161

    2

    4 4 4

    219

    0~16luma

    ac_0_1

    0~16

    0~16

    0~16

    0~16

    0~15chromaac_0_1

    0~15

    0~15

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~16

    0~15

    0~15

    0~15

    0~15

    0~16luma

    ac_14_15

    4 4 4 4 4

    4 4 4 4 4 4

    4 4

    4 4

    4

    1 1 1 1 1 1

    4 1

    1

    IDCTstage 1

    coeflag_memread

    coeff_memread

    IQstage 1

    IQ stage 2

    residual_memwrite

    IDCTstage 2 4 4 4 4 4 4 4 4 4

  • YLLIN NTHU-CS 23

    Deblocking Filter Timing Diagram

  • YLLIN NTHU-CS 24

    L31L30 L32 L33

    L21L20 L22 L23

    L11L10 L12 L13

    L01L00 L02 L03

    Strong filter (Bs=4)/ Left delta calculation

    M01M00 M02 M03 R01R00 R02 R03

    M11M10 M12 M13

    M21M20 M22 M23

    M31M30 M32 M33

    R11R10 R12 R13

    R21R20 R22 R23

    R31R30 R32 R33

    Right Weak filter (Bs

  • System-Level Optimization

    Cyclic-Queue-Based IP Interface

  • YLLIN NTHU-CS 26

    Main Controller

    PARSERFSM

    PARSER

    CABADFSM

    CABAD

    CAVLDFSM

    CAVLD

    IQ/ITFSM

    IQ/IT

    MVGFSM

    MVG

    INTERPFSM

    INTERP

    IPREDFSM

    IPRED

    BSGFSM

    BSG

    DFFSM

    DF

    MFUFSM

    MFU

    MAIN CONTROL FSM

    Decoder Controller

    H.264 Video Decoder

    CPU Display MemoryController Ethernet

    AHB

  • YLLIN NTHU-CS 27

    Performance Gap in Elastic Pipeline

    0

    20

    40

    60

    80

    100

    120

    140

    160

    180

    Idle Cycles/MB 0 36.16 89.87 31.85 116.85 42.85

    Procesing Cycles/MB 160.85 124.69 70.98 129 44 118

    Whole CABAD IQ/IT IPRED BSG DF

    Actual

    IdealPattern: pedestrian QP: 28 Resolution:720*480 GOP: IIIFrame #: 30

    About 25% performance drop between the actual and the ideal situation

  • YLLIN NTHU-CS 28

    Elastic Pipeline Decoder Timing Diagram (I Frame)

    CABAD

    PARSER

    IQ/IT

    BSG

    DF

    (time)Header information decode

    Initial context table and condition offset

    IPRED

    MB-Level decode

    Bubble cycles

    100 to 1000 cycles per MB

  • YLLIN NTHU-CS 29

    Elastic Pipeline Decoder Timing Diagram (I Frame)

    CABAD

    PARSER

    IQ/IT

    BSG

    DF

    (time)Header information decode

    Initial context table and condition offset

    IPRED

    MB-Level decode

    Bubble cycles

  • YLLIN NTHU-CS 30

    Timing Diagram after ASAP Scheduling

    CABAD

    PARSER

    IQ/IT

    BSG

    DF

    (time)Header information decode

    Initial context table and condition offset

    IPRED

    MB-Level decode

    Reduced bubble cycles

    However, bubble cycles still exist

  • YLLIN NTHU-CS 31

    Timing Diagram after ASAP Scheduling

    CABAD

    PARSER

    IQ/IT

    BSG

    DF

    (time)Header information decode

    Initial context table and condition offset

    IPRED

    MB-Level decode

    Remaining bubble cycles

  • YLLIN NTHU-CS 32

    Timing Diagram after Cyclic Queue Insertion

    CABAD

    IQ/IT

    BSG

    DF

    (time)Header information decode

    Initial context table and condition offset

    IPRED

    PARSER

    MB-Level decode

    Reduced remaining bubble cycles

    Total reduced bubble cycles

  • YLLIN NTHU-CS 33

    Elastic Pipeline Decoder Timing Diagram (P/B Frame)

    CABAD

    IQ/IT

    MVG

    DF

    (time)Header information decode

    Initial context table and condition offset

    INTERP

    PARSER

    BSG

    MB-Level decode

  • YLLIN NTHU-CS 34

    Cyclic Queue Decoder Timing Diagram (P/B Frame)

    CABAD

    IQ/IT

    MVG

    DF

    (time)Header information decode

    Initial context table and condition offset

    INTERP

    PARSER

    BSG

    MB-Level decode

    Reduced Bubble Cycles

    Reduced Processing Cycles

  • YLLIN NTHU-CS 35

    Comparison of Different Scheduling Methods

    2.62

    5.6 5.6

    8.3

    486

    620 644

    540

    486

    161 159 140

    0

    50

    100

    150

    200

    250

    300

    350

    400

    450

    500

    550

    600

    650

    Sequential Elastic Pipeline ASAP Ping-Pong ASAP Cyclic-queue

    0

    1

    2

    3

    4

    5

    6

    7

    8

    9

    SRAM Usage Turnaround Cycle Processing Cycle

    (Cycles/ MB)KB

    Test Pattern: pedestrianResolution: 720*480 QP: 28 GOP: IIIFrame #: 30

  • YLLIN NTHU-CS 36

    Clock Gating

    Manual clock gatingassign gclk = clk & ip_en;

    clk

    cabac_en picrec_en idct_en ipred_en df_en mc_en

    cabac picrec idct ipred df mc

  • YLLIN NTHU-CS 37

    Two Clock Gating Methods

    Yes, FunctionNo, MalfunctionViability

    3790(32.28%)1037(8.9%)# of un-gated register

    7952(67.72%)10619(91.1%)# of gated registers

    61420# of clock gating elements

    Module-basedRegister-based

  • YLLIN NTHU-CS 38

    mfu

    parser cabad idct ipred interp df

    bsgmain_ctrltopamba_wrap mvg

    def

    rtl syn vn nlint gate_sim

    rtl_simfilelist tbench

    Sub IPhd_amba

    Verification Environment

    H264

    filelist

    fpga_lib

    gate_simasic_lib

    synjm11.0

    mem netlist

    rtl_sim

    tbench

    lm_wrap

    nlintvn

    xilinx_mem altera_mem artisan_mem

    Easy Bug Tracing

  • YLLIN NTHU-CS 39

    Verification Flow

    IP Spec

    IP design IP testbenchdesign

    RTL code (3)IP linting

    no error? RTL-Sim

    func. currect?

    CoverageAnalysis

    IP Synthesis& Gate-Sim

    meetcriterion?

    func. currect? IP Delivering

    Sys Spec

    Sys testbenchdesign

    RTL code (2)

    IP/SysIntegration

    RTL code (5)

    IP/SysRTL-Sim

    Sysfail?

    IPfail?

    IP/Sys Synthesis& Gate-Sim

    RTL code (4)

    func.currect?

    HW imageDelivering

    YesNo

    YesNo

    YesNo

    YesNo Yes No

    Yes

    No

    No

    Yes

    SW Spec

    SW design

    C code

    Compilation

    HW image

    Sys building

    Prototype

    SW image

    Design SpecmodifyReference SWSW profiling

    Test dataextraction

    C model

    GoldenTest Data

    (1)

    Sys design

    (2)

    (1)

    (2)(1)

    (3)

    (3)

    (1)

    (1) (3)

    (5)

    (4)

    (4)

    (1)

    SW DesignerSys Designer

    IP Designer

  • YLLIN NTHU-CS 40

    A Multimedia SOC PlatformCPU Accelerator(FPGA)

    USB(PHY)Daughter Board

    ROM/Flash Memory

    SRAMSDRAM

    VIC USB 2.0 Staticmemory SDRAM Controller(4-CH)

    JPEGCodec DMA SRAM PWM WDT TIMER

    APBBridge Capture

    DisplayController

    DAI SSI SD SM UART GPIO 12C

    Audio CodecI2S

    Flash memory with SSI Flash Card Button LED

    Video-InCCIR601 TV/LCD

    High-Speed Bus

    Peripheral Bus

    FPGA

  • YLLIN NTHU-CS 41

    Summary

    Super High Definition Video Capturing, Delivery and Display are on the Horizon

    Massive Parallelism is Essential for Making Consumer Applications Possible

    Tradeoff Among Memory Usage, Bandwidth and Logic Has Profound Impact on the Overall System Performance

    System Design Should Be Adaptable to Content, Quality Variation

  • Thank You!!