HEVC encoder r15 final, 심동규 (Donggyu Sim)

Upload: amameede

Post on 07-Jan-2016


  • HEVC Encoder, Kwangwoon University (KWU)

    Donggyu Sim ([email protected])

  • Contents

    Overview of HEVC

    Encoding issues for HEVC test model (HM)

    Complexity analysis of HEVC encoder

    Fast encoding algorithms and performances

    Issues of parallel processing

    Conclusion

  • OVERVIEW OF HEVC

  • Block diagram of HEVC standard

    Typical block-based hybrid codec structure + additional enhanced tools

    Inter prediction: ME, MC, DCT-IF, AMVP, Merge
    Intra prediction: reference sample padding, Planar, DC, 33 angular, MDIS
    Transform: TU size 32x32 ~ 4x4, residual quadtree
    Quantization: delta QP, RDOQ
    Entropy coding: CABAC
    Loop filter: deblocking filter, sample adaptive offset
    Reconstruction path: inverse quantization, inverse transform, picture buffer holding the reference pictures

    FIGURE. Block diagram of HEVC encoder

  • Block structure in HEVC

    Three block structures are defined in HEVC
      Coding unit (CU)
      Prediction unit (PU)
      Transform unit (TU)

    PU partitions: 2Nx2N, 2NxN, Nx2N, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD

    FIGURE. An example of CU, PU, and TU partition in HEVC (64x64 CTUs split into 32x32, 16x16, and 8x8 CUs; TU depth 0 to 2)

  • ENCODING STRUCTURES OF HEVC

  • Decision level for HEVC encoder

    Sequence level
      Coding structure (All intra, Low delay, Random access)
      Profile, tier, level
      Max/Min CTU size, CU depth
      Max/Min TU size, TU depth
      Tool on/off (SAO, deblocking, WPP, tile)

    Picture level
      # ref frame, rate control
      Tile, slice

    Slice or tile level
      Ref frames
      Deblocking filter parameters

    CTU level
      CU partitioning
      Sample adaptive offset parameters

    CU level
      PU and TU partitioning

    PU & TU level
      Prediction modes, motion vectors
      cbf, coefficients

    Hierarchy: Sequence > Picture > Slice or Tile > CTU > CU > PU & TU

  • Temporal prediction structure (1/3)

    All intra (AI)
      Every picture is coded as an instantaneous decoding refresh (IDR) picture
      No temporal prediction is allowed

    FIGURE. All-intra structure: pictures 0 to 7 along the time axis are all IDR pictures coded with QPI; coding order = POC

  • Temporal prediction structure (2/3)

    Low delay (LD)
      The first picture shall be coded as an IDR picture
      Generalized P and B (GPB) pictures shall be used for the other successive pictures
      A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
      The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer

    FIGURE. Low-delay structure: pictures 0 to 8 along the time axis; coding order = POC
      Picture types: IDR or intra picture, GPB (generalized P and B) picture
      QP labels: QPI, QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3
      Legend: Depth == 0, Depth == 1, Depth == 2

  • Temporal prediction structure (3/3)

    Random access (RA)
      A hierarchical B structure shall be used for coding
      An IDR, intra, or clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
      The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer

    FIGURE. Random-access structure: pictures 0 to 8 along the time axis, labeled with POC and coding order
      Picture types: IDR or intra picture, GPB (generalized P and B) picture, referenced B picture, non-referenced B picture
      QP labels: QPI, QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3, QPBL4 = QPI + 4
      Legend: Depth == 0, Depth == 1, Depth == 2, Depth == 3
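    The QP labels of the random-access figure can be sketched as a small function. This is a minimal illustration, assuming a GOP of 8 pictures with dyadic temporal depths, only picture 0 coded as intra, and an offset of depth + 1 as read off the QPBL1 to QPBL4 labels; the function names are hypothetical and are not taken from HM.

```python
def temporal_depth(poc, gop_size=8):
    """Temporal depth of a picture inside a dyadic hierarchical GOP (assumed layout)."""
    pos = poc % gop_size
    depth, step = 0, gop_size
    while pos % step != 0:      # halve the step until it divides the position
        step //= 2
        depth += 1
    return depth


def picture_qp(poc, qp_i, gop_size=8):
    """QPI for the first (intra) picture, QPI + depth + 1 for inter pictures."""
    if poc == 0:
        return qp_i
    return qp_i + temporal_depth(poc, gop_size) + 1


qps = [picture_qp(poc, qp_i=32) for poc in range(9)]
print(qps)
```

    With QPI = 32, picture 8 (depth 0) gets QPI + 1, picture 4 gets QPI + 2, pictures 2 and 6 get QPI + 3, and the odd pictures get QPI + 4.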

  • Picture partitioning

    Picture: a picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 color format

    The coding order of coding tree units (CTUs) is raster scan order

    CTU: an NxN block of luma samples together with the two corresponding blocks of chroma samples
      Analogous to the macroblock in previous standards
      The maximum allowed size of the luma block in a CTU is specified to be 64x64 in Main profile

    * CTU & CTB: the CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements

    FIGURE. Example of a picture divided into CTUs (30 CTU columns x 17 CTU rows)

    Example) Class B (1920x1080) BQTerrace, CTU size: 64x64, 30x17 CTU partition
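    The 30x17 figure follows from rounding the picture size up to whole CTUs. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows needed to cover a picture."""
    cols = (width + ctu_size - 1) // ctu_size    # ceiling division
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows


cols, rows = ctu_grid(1920, 1080, 64)
print(cols, rows, cols * rows)
```

    Since 1080 is not a multiple of 64, the last CTU row only partly overlaps the picture (1080 - 16 * 64 = 56 luma rows).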

  • Picture partitioning

    A slice is a sequence of coding tree units (CTUs)

    Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan

    At least one of the following conditions should be true for each slice and tile in a picture
      All CTBs in a slice belong to the same tile, or
      All CTBs in a tile belong to the same slice

    FIGURE. A picture with 30x17 coding tree units that is partitioned into three slices

    FIGURE. A picture with 30x17 coding tree units that is partitioned into three tiles

  • Coding unit (CU) and coding tree structure

    Coding unit (CU): the leaf node of a quadtree structure
      Square blocks
      Size: from 8x8 up to the size of the CTU (8x8 ~ 64x64)
      The size of the CTU is specified in the sequence parameter set (SPS)

    The quadtree partitioning structure allows recursive splitting into four equally sized nodes

    TABLE. Syntax for size of CTU in SPS
      seq_parameter_set_rbsp() {                          Descriptor
        log2_min_coding_block_size_minus3                 ue(v)
        log2_diff_max_min_coding_block_size_minus2        ue(v)
      }

    FIGURE. Example of coding tree structure

  • Example of CU quadtree structure

    Coding unit quadtree structure
      Starting from the CTU, each CU can be split into 4 smaller CUs

      64 CU: split_coding_unit_flag (1)
        32 CU: split_coding_unit_flag (0)
        32 CU: split_coding_unit_flag (1)
          16 CU: split_coding_unit_flag (0)
          16 CU: split_coding_unit_flag (0)
          16 CU: split_coding_unit_flag (1)

    FIGURE. Example of CU quadtree structure

    TABLE. Syntax for CU split flag in coding tree
      coding_tree(x0, y0, log2CbSize, ctDepth) {          Descriptor
        if(x0+(1 [...] log2MinCbSize && NumPCMBlock == 0 )
          split_coding_unit_flag[x0][y0]                  ae(v)
      }
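    The way a decoder walks this syntax can be sketched as a recursive parse. This is a simplified illustration, assuming a 64x64 CTU, an 8x8 minimum CU size, no picture-boundary or PCM conditions, and a split flag that is read only when the CU is larger than the minimum size; `parse_coding_tree` is a hypothetical name.

```python
def parse_coding_tree(flags, ctu_size=64, min_size=8):
    """Return leaf CUs as (x, y, size) in z-scan order, consuming split flags."""
    flags = iter(flags)
    leaves = []

    def walk(x, y, size):
        # A CU of minimum size cannot split, so no flag is read for it.
        split = size > min_size and next(flags) == 1
        if not split:
            leaves.append((x, y, size))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                walk(x + dx, y + dy, half)

    walk(0, 0, ctu_size)
    return leaves


# 64 splits; 1st 32 stays; 2nd 32 splits into 16, 16, 16 and four 8x8; 3rd and 4th 32 stay.
leaves = parse_coding_tree([1, 0, 1, 0, 0, 0, 1, 0, 0])
print(leaves)
```

    The four 8x8 CUs at the end of the second 32x32 block appear without any flags of their own, because they are already at the minimum CU size.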

  • Coding unit (CU) decision

    Coding unit quadtree structure
      Starting from the CTU, each CU can be split into 4 smaller CUs

    Best CU RD cost calculation for each CU level

    Competition between the best CU and its sub-partitioned CUs

    FIGURE. CU decision order within a 64x64 CTU (CU sizes 64x64, 32x32, 16x16, 8x8)
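    The competition between a CU and its four sub-CUs can be sketched as a bottom-up recursion. This is a toy illustration rather than the HM CompressCU code: `rd_cost` stands in for the best PU/TU RD cost of one CU, the cost of signalling the split flag is ignored, and the cost model below is invented for the example.

```python
def decide_cu(x, y, size, rd_cost, min_size=8):
    """Return (best_cost, leaf_cus) for the block at (x, y) of the given size."""
    cost_here = rd_cost(x, y, size)
    if size == min_size:
        return cost_here, [(x, y, size)]
    half = size // 2
    split_cost, split_leaves = 0, []
    for dy in (0, half):
        for dx in (0, half):
            cost, leaves = decide_cu(x + dx, y + dy, half, rd_cost, min_size)
            split_cost += cost
            split_leaves += leaves
    # Keep the unsplit CU unless the four sub-CUs are cheaper in total.
    if split_cost < cost_here:
        return split_cost, split_leaves
    return cost_here, [(x, y, size)]


def toy_cost(x, y, size):
    """Hypothetical cost: blocks larger than 8x8 covering the detail at (40, 8) are expensive."""
    covers_detail = x <= 40 < x + size and y <= 8 < y + size
    return 10 + (3 * size * size if covers_detail and size > 8 else 0)


best_cost, best_leaves = decide_cu(0, 0, 64, toy_cost)
print(best_cost, best_leaves)
```

    In this toy case only the branch containing the detail is split down to 8x8, while the flat 32x32 and 16x16 blocks are kept whole.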

  • Prediction unit (PU) types

    Prediction unit (PU): a region used for carrying the information related to the prediction processes

    2 PU types for intra prediction
      2Nx2N (smallest CU: additionally NxN)

    8 PU types for inter prediction
      Smallest CU
        8x8: 2Nx2N, Nx2N, 2NxN
        Others: 2Nx2N, Nx2N, 2NxN, NxN
      Others: 2Nx2N, Nx2N, 2NxN, nLx2N, nRx2N, 2NxnU, 2NxnD

    FIGURE. PU partitions in HEVC (2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD)

  • Prediction unit (PU) types

    FIGURE. Flowchart of the candidate PU types, branching on current CU size vs. SCU size, the AMP enable flag, and current CU size == 8x8
      Candidate sets shown: Intra 2Nx2N; Intra NxN; Inter 2Nx2N, Inter 2NxN, Inter Nx2N; Inter NxN; Inter AMP
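    The PU type rules listed on the previous slide can be written as a small lookup. This is a sketch of those rules only, assuming AMP is tried only for CUs larger than the smallest CU; the function name and return structure are hypothetical.

```python
def candidate_pu_modes(cu_size, scu_size=8, amp_enabled=True):
    """Candidate intra and inter PU partitions for one CU."""
    intra = ["2Nx2N"]
    inter = ["2Nx2N", "Nx2N", "2NxN"]
    if cu_size == scu_size:
        intra.append("NxN")              # NxN intra only at the smallest CU
        if cu_size != 8:
            inter.append("NxN")          # inter NxN is not allowed for 8x8
    elif amp_enabled:
        inter += ["nLx2N", "nRx2N", "2NxnU", "2NxnD"]
    return {"intra": intra, "inter": inter}


print(candidate_pu_modes(8))
print(candidate_pu_modes(32))
```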

  • Transform unit (TU) and transform tree structure

    Transform unit (TU): a region sharing the transform and quantization processes
      Square shape
      Size: from 4x4 up to 32x32 (4x4 ~ 32x32)
      Available transform block sizes and the max transform hierarchy depth are specified in the SPS

    The root of the TU quadtree is the CU to which the TU belongs

    FIGURE. TU quadtree structure in HEVC (32x32 example)

    TABLE. Syntax for size of TU in SPS
      seq_parameter_set_rbsp() {                          Descriptor
        log2_min_transform_block_size_minus_2             ue(v)
        log2_diff_max_min_transform_block_size            ue(v)
        max_transform_hierarchy_depth_inter               ue(v)
        max_transform_hierarchy_depth_intra               ue(v)
      }

  • INTER/INTRA PREDICTION AND PU/TU DECISION

  • Overall HM encoding process

    Sequence > Picture > CTU decisions in a slice or a tile > Deblocking filter > SAO > Entropy coding

    CTU decisions in a slice or a tile
      CU partitioning decision
      PU & TU partitioning decision
      RDO process

  • FIGURE. Recursive CompressCU flow over a 64x64 CTU (32x32, 16x16, and 8x8 CUs)

    RDO process to decide PU & TU
      Merge skip
      Inter 2Nx2N, Inter NxN, Inter Nx2N, Inter 2NxN, Inter AMP
      Intra 2Nx2N, Intra NxN
      Intra PCM

    CU size vs. SCU check
      Yes: CompressCU for each of the four sub-CUs
      No: Finish

  • Intra prediction flow

    Prediction modes
      Luma (35 modes)
        Planar, DC, Angular prediction (33 directions)
      Chroma (5 modes)
        Planar, DC, Vertical, Horizontal, DM

    Filtering
      MDIS (mode-dependent intra smoothing)
      DC filtering, Ver/Hor filtering

    3 MPM

    FIGURE. Intra prediction flow for a 2Nx2N PU: reference sample padding > MDIS > intra prediction > RD cost, intra mode > best mode decision (looping over modes)

  • Fast intra prediction & TU decision in HM

    Intra prediction steps in HM
      1) Rough prediction mode decision
           35 predictions: select N prediction modes
           Cost: distortion (SATD) + lambda * mode bits
           # of candidate prediction modes: N modes + MPM (3)
      2) Best intra prediction mode decision with transform
           Transform (RQT depth = 1): 1 best intra mode decision
      3) Best RQT decision with RD costs
           RQT depth = 3

    FIGURE. 35 modes > N modes + MPM > 1 best mode > best mode RD cost
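    Step 1 above can be sketched as follows. This is a minimal illustration, assuming the SATD and mode-bit values are already available per mode; the function name, the input layout, and the numbers in the example are made up and are not HM data.

```python
def rough_mode_decision(satd, bits, mpm, n, lam):
    """Pick the N cheapest modes by SATD + lambda * bits, then add missing MPMs."""
    cost = {mode: satd[mode] + lam * bits[mode] for mode in satd}
    ranked = sorted(cost, key=lambda mode: cost[mode])
    candidates = ranked[:n]
    for mode in mpm:                  # MPMs always enter the full RD stage
        if mode not in candidates:
            candidates.append(mode)
    return candidates


# Hypothetical SATD values for the 35 luma modes; modes 10 and 26 fit best.
satd = {mode: 1000 + 10 * mode for mode in range(35)}
satd[10], satd[26] = 300, 250
bits = {mode: 6 for mode in range(35)}
for mode in (0, 1, 26):               # the three MPMs are cheaper to signal
    bits[mode] = 2

candidates = rough_mode_decision(satd, bits, mpm=[0, 1, 26], n=3, lam=20)
print(candidates)
```

    Only the returned candidates go on to step 2, which is how the 35-mode search is reduced to N + MPM full RD evaluations.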

  • Inter prediction

    Skip: Merge skip

    Non-skip
      Uni-directional prediction
      Bi-directional prediction
      Half-pel / quarter-pel motion refinement
        DCT-IF (8-tap / 4-tap)
      Merge

    FIGURE. Flowchart - Inter prediction
      For the current CU: Merge skip > Inter 2Nx2N, Inter 2NxN, Inter Nx2N > AMP (nLx2N, nRx2N, 2NxnU, 2NxnD) > best mode decision (RD cost, best mode)
      For each PU: uni-directional prediction, bi-directional prediction, and merge > best mode decision
      Merge: spatial candidates derivation > temporal candidate derivation > additional candidates derivation > RD cost calculation

  • Inter prediction

    Inter coding modes
      Merge skip mode (CU level)
        skip_flag = 1 and merge_idx
        No reference index
        No motion vector
        No residual

      Merge mode (PU level)
        skip_flag = 0, pred_mode_flag, and part_mode
        merge_flag = 1 and merge_idx
        No reference index and motion vector
        no_residual_syntax_flag: whether the residual is encoded or not

      General PU modes
        skip_flag = 0, pred_mode_flag, and part_mode
        merge_flag = 0
        ref_idx_lx and mvp_lx_flag based on AMVP (x = 0 or 1)
        MVD is encoded
        no_residual_syntax_flag: whether the residual is encoded or not

  • Inter prediction flow

    BEGIN input : current PU part mode for a CU
      FOR PU partition
        FOR List = 0 to 1 DO
          FOR 0 to refidx DO
            Motion estimation (diamond search, SR : 64)
            Decide best RD-cost for uni-prediction
          ENDFOR
        ENDFOR
        IF bi-directional prediction THEN
          FOR iteration = 0 to 3 DO
            FOR 0 to refidx DO
              Motion estimation (full search, SR : 4)
              Decide best RD-cost for bi-prediction
            ENDFOR
          ENDFOR
        ENDIF
      ENDFOR
      Merge
      RD-cost competition among uni/bi-prediction and merge
    END output : inter prediction syntax

    FIGURE. Pseudo code - Inter prediction flow

    Fast encoder decision (FEN)
      Subsampled SAD for integer ME: use subsampled SAD when rows > 8 for integer ME
      Only 1 iteration for bi-predictive motion search (default number: 4)

    Fast decision for merge RD cost (FDM)
      After merge with merge idx X, if all cbf are zero, then the merge process is terminated

    FIGURE. Uni-prediction and bi-prediction of the current PU from LIST_0 and LIST_1 references along the time axis

  • Bi-prediction

    Search P0 and P1 which produce the minimum error with O
      R = (O - P), where P = (P0 + P1) / 2

    Practical bi-predictive search
      1) Search P1 which produces the minimum 2R with (2O - P0)
           R = O - (P0 + P1) / 2
           2R = (2O - P0) - P1
      2) Search P0 which produces the minimum error with (2O - P1)
           R = O - (P0 + P1) / 2
           2R = (2O - P1) - P0

    BipredSearchRange: 4
    FEN: 1 (iteration: 1)

    FIGURE. P0 in the List 0 reference, O in the current frame, P1 in the List 1 reference
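    The alternating search can be sketched on 1-D blocks. This is a toy illustration under simplifying assumptions: the candidate predictors are given as explicit lists instead of being fetched with motion vectors inside a search range, the cost is plain SAD without a rate term, and all names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bipred_search(orig, cands0, cands1, start0=0, iterations=4):
    """Alternately refine P1 against 2O - P0 and P0 against 2O - P1."""
    i0, i1 = start0, 0
    for _ in range(iterations):
        target = [2 * o - p for o, p in zip(orig, cands0[i0])]
        i1 = min(range(len(cands1)), key=lambda i: sad(target, cands1[i]))
        target = [2 * o - p for o, p in zip(orig, cands1[i1])]
        i0 = min(range(len(cands0)), key=lambda i: sad(target, cands0[i]))
    return i0, i1


orig = [10, 20, 30, 40]
cands0 = [[8, 18, 28, 38], [0, 0, 0, 0]]          # List 0 candidates
cands1 = [[50, 50, 50, 50], [12, 22, 32, 42]]     # List 1 candidates
best = bipred_search(orig, cands0, cands1)
print(best)
```

    Here the chosen pair averages exactly to the original block, so 2R is zero for every sample.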

  • Example) Bi-prediction

    FIGURE. Uni-directional prediction (search range: 64) followed by bi-directional prediction iterations 1 to 4 (BipredSearchRange: 4), each showing P0 in the List 0 reference, O in the current frame, and P1 in the List 1 reference

  • Motion estimation (Integer-pel)

    Practical motion estimation (diamond search)
      First search & early termination
        Max 3 (default) more rounds after a recent best match
      Raster refinement search
        If the integer-pel distance is bigger than 5, then conduct the raster refinement search
      Star refinement search & early termination
        Diamond search centered on the best match from the earlier two steps
        Max 2 rounds after the best match

    FIGURE. Raster refinement search

    FIGURE. First search & star refinement

  • Motion estimation (Sub-pel refinement)

    Integer-pel motion search
      Cost function: SAD

    Sub-pel motion refinement
      Cost function: SATD
      Half-pel refinement
      Quarter-pel refinement

    FIGURE. Integer-pel motion search (within the search range)

    FIGURE. Half-pel motion search

    FIGURE. Quarter-pel motion search

  • Interpolation

    DCT-IF in HEVC
      Fixed 8-tap (7-tap) and 4-tap interpolation filters based on DCT
      2-D separable filter
        8 * horizontal 1-D filter + 1 * vertical 1-D filter

    TABLE. Interpolation filter coefficients

      Component  Position  Filter coefficients
      Luma       1/4       {-1, 4, -10, 58, 17, -5, 1, 0}
                 1/2       {-1, 4, -11, 40, 40, -11, 4, -1}
      Chroma     1/8       {-2, 58, 10, -2}
                 1/4       {-4, 54, 16, -2}
                 3/8       {-6, 46, 28, -4}
                 1/2       {-4, 36, 36, -4}

    FIGURE. Integer and fractional sample positions for luma and chroma interpolation (integer positions A and B; fractional positions a to r for luma, ab to hh for chroma)
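    Applying one of the luma filters along a row can be sketched as below. This is a simplified single-stage 1-D example, assuming 8-bit samples, normalization by 64 (the sum of the taps) with rounding, and a final clip; the 2-D case and the intermediate precision used in HM are not modeled, and the function name is illustrative.

```python
LUMA_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)
LUMA_QUARTER = (-1, 4, -10, 58, 17, -5, 1, 0)


def interpolate(row, i, taps, bit_depth=8):
    """Fractional sample between row[i] and row[i + 1] using an 8-tap filter."""
    first = i - (len(taps) // 2 - 1)          # taps cover row[i-3] .. row[i+4]
    acc = sum(t * row[first + k] for k, t in enumerate(taps))
    value = (acc + 32) >> 6                   # round and divide by 64
    return min(max(value, 0), (1 << bit_depth) - 1)


flat = [100] * 8
ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(interpolate(flat, 3, LUMA_HALF), interpolate(ramp, 3, LUMA_HALF))
```

    A flat row stays flat because the taps sum to 64, and on the linear ramp the half-pel sample lands midway between 30 and 40.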

  • Example of PU decision

    FIGURE. CompressCU flow (Merge skip, Inter 2Nx2N, Inter NxN, Inter Nx2N, Inter 2NxN, Inter AMP, Intra 2Nx2N, Intra NxN, Intra PCM; CU size vs. SCU check with Yes/No branches; Finish)

    Example
      No TU decision, no reconstruction
        Uni-prediction RD cost = SAD/SATD + lambda * Bmode = 12000
        Merge RD cost = SAD/SATD + lambda * Bmode = 11000
        Bi-prediction RD cost = SAD/SATD + lambda * Bmode = 9000
      TU decision, reconstruction
        Bi-prediction RD cost = SSE + lambda * Bmode = 8500

  • TU decision flow (Inter)

    Residual quadtree
      PU partitions: 2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
      TU depth: 0, TU depth: 1, TU depth: 2

    Original - Predictor = Residual

    T/Q > IT/IQ (recon) > RD cost (SSE + lambda * Bmode)

  • TU decision flow (Intra)

    Example) intra_pred_mode = 10 (vertical mode)

    Intra prediction using reference samples > T/Q > IT/IQ > RD cost (SSE + lambda * Bmode)

    FIGURE. Reference samples, prediction direction, and residual at TU depth N and TU depth N+1 (at depth N+1 the reference samples are taken after the above block is reconstructed)

  • Transform

    Implementation of transform in HEVC
      Matrix multiplication
        Straightforward / few code lines
        Huge number of operations, but SIMD friendly
      Partial butterfly implementation
        Utilizes the symmetry / anti-symmetry properties of the basis vectors
        Fewer multiplications / additions
        Increased number of code lines

    FIGURE. Transform implemented as matrix multiplications
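    The two implementations can be compared on the 4-point core transform. This is a 1-D sketch using the HEVC 4x4 integer DCT matrix, without the scaling shifts applied between the row and column stages; the function names are illustrative.

```python
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct4_matrix(x):
    """Direct matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Partial butterfly: even/odd split exploits (anti-)symmetry, 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]     # even part (symmetric rows 0 and 2)
    o0, o1 = x[0] - x[3], x[1] - x[2]     # odd part (anti-symmetric rows 1 and 3)
    return [64 * (e0 + e1), 83 * o0 + 36 * o1,
            64 * (e0 - e1), 36 * o0 - 83 * o1]


residual = [5, -3, 7, 1]
print(dct4_matrix(residual), dct4_butterfly(residual))
```

    Both functions return the same coefficients; the butterfly trades a few extra additions and more code for fewer multiplications.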

  • Partitioning syntax for a CTU

    Syntax
      64 CU: split_coding_unit_flag (1)
        32 CU: split_coding_unit_flag (0)
          PU partition & Pred_mode info
          TU split flags & Coefficients
        32 CU: split_coding_unit_flag (1)
          16 CU: split_coding_unit_flag (0)
            PU partition & Pred_mode info
            TU split flags & Coefficients
          16 CU: split_coding_unit_flag (0)
          16 CU: split_coding_unit_flag (1)

    PU partition & Pred_mode info
      SKIP flag (merge idx)
      Prediction mode flag (intra or inter)
      PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
      Prediction info. (intra mode or mv and ref. idx., merge idx, AMVP idx)

    TU split flags & Coefficients
      32x32 TU: split flag (1)
        16x16 TU: split flag (0)
        16x16 TU: split flag (1)
          8x8 TU: split flag (0)
          8x8 TU: split flag (0)
          8x8 TU: split flag (0)
          8x8 TU: split flag (0)
        16x16 TU: split flag (1)
          8x8 TU: split flag (0)
          8x8 TU: split flag (1)
            4x4 TU: split flag (0)
            4x4 TU: split flag (0)
            4x4 TU: split flag (0)
            4x4 TU: split flag (0)

    FIGURE. Example of CU quadtree structure

    FIGURE. Example of TU quadtree structure

  • ENCODING PROCESS OF LOOP FILTER

  • In-loop filter

    In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied

    DBF: similar to the DBF of the H.264/AVC standard
    SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)

    On/off syntax elements for in-loop filters
      1. slice_disable_deblocking_filter_flag: slice-level on/off
      2. sample_adaptive_offset_enabled_flag: slice-level on/off

  • Deblocking filter (DBF)

    Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
      In-loop filtering
        Coding performance for inter frames
      Frame-based filtering
      On/off control is provided
      Adaptive filtering: boundary strength
      Filtering on the block boundaries: transform and prediction boundaries
      Sequential filtering for vertical and horizontal edges
        Sample values modified during the filtering of vertical edges are used as input for the filtering of the horizontal edges

  • Deblocking filter (DBF)

    Features of the HEVC deblocking filter compared to H.264/AVC
      For TUs and PUs with edges less than 8 samples in either the vertical or horizontal direction, only the edges lying on the 8x8 sample grid are filtered

    Filtering order in both H.264/AVC and HEVC [e.g. 16x16 coding unit]
      1. Vertical edges > horizontal filtering
      2. Horizontal edges > vertical filtering

    FIGURE. Derivation process for the boundary filter strength in AVC and HEVC: (a) H.264/AVC, (b) HEVC

  • Processing flow of DBF

    Boundary decision
      Three kinds of boundaries are involved in the filtering: CU, TU, and PU boundaries
      CU boundaries are always involved in the filtering
      TU boundaries on the 8x8 block grid and PU boundaries between PUs inside a CU are involved in the filtering
      [Exception] If a PU boundary is inside a TU, the boundary shall not be filtered

    Bs calculation
      Bs is calculated on a 4x4 block basis and then remapped to the 8x8 grid
      Two Bs values belong to the 8 pixels forming a line on the 4x4 grid; the maximum Bs is selected as the Bs for boundaries on the 8x8 grid

    FIGURE. Overall processing flow of the deblocking filter process
      Boundary decision > Bs calculation (4x4 > 8x8) > β, tc decision > filter on/off decision > strong/weak filter selection > strong filtering or weak filtering

  • Overview of sample adaptive offset (1/2)

    Artifacts
      Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
      A larger transform could introduce more artifacts
        HEVC: 4x4 ~ 32x32 transform
        Artifacts exist at medium and low bit rates
      A large number of interpolation taps can also lead to more serious ringing artifacts
        HEVC: 8-tap (luma), 4-tap (chroma)

    Sample adaptive offset
      To reduce sample distortion (reconstructed pixels - original pixels)
      Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)
      SAO is located after the deblocking filter (DF) and also belongs to in-loop filtering

  • Overview of sample adaptive offset (2/2)

    SAO features
      Each color component may have its own SAO parameters
      Two SAO types
        Edge offset (EO; 4 EO classes)
        Band offset (BO; 1 BO class)
      SAO merging (left CTU or above CTU)
        SAO merge information is shared for the three color components

    SAO objective and subjective results

    FIGURE. SAO is enabled (QP=32) vs. SAO is disabled (QP=32)

    TABLE. Y BD-rate (Anchor: disabling SAO; Test: enabling SAO; CTU size in luma: 64x64; CTU boundary: option 1)

                     All intra (AI)  Random access (RA)  Low delay B (LB)  Low delay P (LP)
      Class summary
      Class A        0.6%            2.3%
      Class B        0.5%            2.1%                2.0%              11.1%
      Class C        0.5%            1.1%                1.8%              7.1%
      Class D        0.4%            0.3%                0.7%              4.4%
      Class E        0.6%                                2.3%              11.0%
      Class F        1.5%            2.6%                5.7%              12.3%
      Overall summary
      All            0.7%            1.7%                2.5%              9.2%
      Enc. time (%)  101%            100%                100%              100%
      Dec. time (%)  103%            103%                102%              102%

  • Edge offset of SAO

    Four 1-D directional patterns
      Horizontal, vertical, 135° diagonal, 45° diagonal

    Only one EO class can be selected for each CTB for which EO is enabled
      Each sample inside the CTB is classified into one of five categories
      One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
      No information is sent for the classification into five categories (encoder and decoder use the same rules)

    FIGURE. Four 1-D directional patterns for EO sample classification (current sample c with neighbors a and b)

    TABLE. Sample classification rules for edge offset

      Category  Condition
      1         c < a and c < b
      2         (c < a and c == b) or (c == a and c < b)
      3         (c > a and c == b) or (c == a and c > b)
      4         c > a and c > b
      0         None of the above (SAO is not applied)

    FIGURE. Pixel level vs. pixel index (x-1, x, x+1) for categories 1 to 4
      Categories 1 and 2: positive edge offset
      Categories 3 and 4: negative edge offset
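    The classification rules can be sketched for the horizontal EO class. This is a minimal illustration, assuming the categories of the HEVC edge offset (local minimum, two kinds of corner, local maximum); clipping to the sample range is omitted and the offsets in the example are made up.

```python
def eo_category(a, c, b):
    """Edge offset category of sample c given its two neighbors a and b."""
    if c < a and c < b:
        return 1                                   # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                                   # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                                   # convex corner
    if c > a and c > b:
        return 4                                   # local maximum
    return 0                                       # SAO is not applied


def apply_eo_row(row, offsets):
    """Apply horizontal-class edge offsets; neighbors come from the unmodified row."""
    out = list(row)
    for i in range(1, len(row) - 1):
        cat = eo_category(row[i - 1], row[i], row[i + 1])
        if cat:
            out[i] = row[i] + offsets[cat]
    return out


row = [10, 8, 10, 10, 12, 10]
smoothed = apply_eo_row(row, {1: 2, 2: 1, 3: -1, 4: -2})
print(smoothed)
```

    Positive offsets for categories 1 and 2 and negative offsets for categories 3 and 4 pull valleys up and peaks down, which is the smoothing effect the figure illustrates.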

  • Band offset of SAO

    BO implies that one offset is added to all samples of the same band
      The sample value range is equally divided into 32 bands
      For 8-bit samples ranging from 0 to 255, the width of a band is 8

    Only the offsets of four consecutive bands and the starting band position are signaled to the decoder
      The average difference between the original samples and the reconstructed samples in a band is signaled to the decoder
      Four offsets are transmitted in the case of BO

    FIGURE. Sample range from 0 to max: the first band for which an offset is transmitted, and the four consecutive bands for which four offsets are transmitted
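    Band offset can be sketched as a band lookup followed by an add. This is a minimal illustration, assuming 32 equal bands (band index = sample >> (bit depth - 5)), four consecutive bands starting at `start_band`, and a final clip; the names and the offsets in the example are hypothetical.

```python
def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[k] to samples falling in band start_band + k, for k = 0..3."""
    shift = bit_depth - 5                 # 32 bands: width 8 for 8-bit samples
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = (s >> shift) - start_band
        if 0 <= k < 4:
            s = min(max(s + offsets[k], 0), max_val)
        out.append(s)
    return out


# Bands 8..11 cover sample values 64..95 for 8-bit video.
result = apply_band_offset([0, 63, 64, 71, 72, 95, 96, 255], 8, [1, 2, 3, 4])
print(result)
```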

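  • A fast distortion estimation for SAO

    Distortions have to be calculated many times
      Let k, s(k), and x(k) be sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of samples

    Distortion between original samples and pre-SAO samples

    \[ D_{pre} = \sum_{k \in C} \bigl( s(k) - x(k) \bigr)^2 \]

    Distortion between original samples and post-SAO samples

    \[ D_{post} = \sum_{k \in C} \bigl( s(k) - (x(k) + h) \bigr)^2 \]

    h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as follows (N and E can be calculated only once)

    \[ \Delta D = D_{post} - D_{pre} = \sum_{k \in C} \bigl( h^2 - 2h\,(s(k) - x(k)) \bigr) = N h^2 - 2hE \]

    \[ E = \sum_{k \in C} \bigl( s(k) - x(k) \bigr) \]

    \[ \Delta J = \Delta D + \lambda R \]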
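    The shortcut can be checked numerically: the delta distortion computed from N and E alone matches the difference of the two full sums. A small sketch with made-up sample values and illustrative names:

```python
def sse(a, b):
    """Sum of squared errors between two sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def delta_distortion(n, e, h):
    """Fast estimate: N * h^2 - 2 * h * E."""
    return n * h * h - 2 * h * e


orig = [10, 12, 9, 11]           # s(k)
pre_sao = [8, 11, 9, 9]          # x(k)
N = len(orig)
E = sum(s - x for s, x in zip(orig, pre_sao))

for h in (-1, 0, 1, 2):
    post_sao = [x + h for x in pre_sao]
    brute = sse(orig, post_sao) - sse(orig, pre_sao)
    print(h, brute, delta_distortion(N, E, h))
```

    Because N and E are gathered once per sample set, each candidate offset h costs only a few multiplications instead of a pass over all samples.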

  • Offset refinement

    The initial offset value h is E/N
    All the numbers between zero and the initial offset are used in the offset refinement process

    \[ E = \sum_{k \in C} \bigl( s(k) - x(k) \bigr) \]

    FIGURE. Offset candidates 0 to 6 between zero and the initial offset
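    The refinement can be sketched as a search over the integers between zero and the initial offset, scored with the fast delta distortion plus a rate term. This is a minimal illustration: the rate model (|h| + 1 bits) and the function names are assumptions for the example, not the HM rate estimate.

```python
def refine_offset(e, n, lam, rate_bits=lambda h: abs(h) + 1):
    """Best offset between 0 and round(E / N) by delta distortion + lambda * rate."""
    if n == 0:
        return 0
    initial = int(round(e / n))
    lo, hi = min(0, initial), max(0, initial)

    def cost(h):
        return n * h * h - 2 * h * e + lam * rate_bits(h)

    return min(range(lo, hi + 1), key=cost)


print(refine_offset(43, 10, lam=0), refine_offset(43, 10, lam=20))
```

    With no rate penalty the initial offset is kept; with a larger lambda a smaller offset wins because its bits are cheaper.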

  • Encoding flow of SAO in HM

    CTU-based processing
      1) Calculate SAO statistics (E, N)
           BO: 32 bands, sum of difference, pixel count
           EO class0 to class3: per category, sum of difference, pixel count
      2) Calculate SAO RD cost
           EO class0 to class3: rdcost0 to rdcost3 = distortion + rate (a fast distortion estimation, offset refinement)
           BO band position (a fast distortion estimation, offset refinement): rdcostBO = distortion + rate
           Rdcost type (BO, EO class0, EO class1, EO class2, EO class3)
      3) Merge left or up
           Left merge, up merge rdcost

    Slice-level flow: Compress slice > Deblocking filter (DBF) > Sample adaptive offset (SAO): RDO of SAO, process SAO > Encode slice

    FIGURE. Flowchart - Sample adaptive offset

  • Slice-level on/off control of SAO

    Hierarchical quantization parameter (QP) settings for each group of pictures
      Depth = 0: pictures 8k
      Depth = 1: pictures (8k+4)
      Depth = 2: pictures (8k+2), (8k+6)
      Depth = 3: pictures (8k+1), (8k+3), (8k+5), (8k+7)
      A higher QP is used at a higher depth

    A slice-level on/off decision algorithm
      For depth = 0 pictures, SAO is always enabled in the slice header
      Other depths: if the previous picture (the last picture of depth N-1 in decoding order) disables SAO for more than 75% of CTUs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers

  • CTU-based encoding issues about SAO

    Since SAO is after DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available

    In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTU may be unavailable

    Two practical CTU-based SAO decisions
      Case 1. Avoid using the bottom rows and right columns (current HM)
      Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)

    FIGURE. Deblock-filtered pixels and non-deblock-filtered pixels within a CTU

    TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
      Option 1: skip right and bottom samples in the CTU during parameter estimation
      Option 2: use pre-deblocked samples near the right and bottom boundaries in the CTU during parameter estimation

      CTU size in luma   Option 1                Option 2
                         Y      Cb     Cr        Y      Cb     Cr
      64x64              3.5%   4.8%   5.8%      3.3%   5.3%   6.6%
      32x32              2.0%   1.1%   1.5%      2.5%   2.0%   2.7%
      16x16              0.0%   0.3%   0.3%      0.8%   0.4%   0.1%

  • COMPLEXITY ANALYSIS OF HEVC ENCODER

  • Complexity analysis of HM encoder

    Test sequences
      Sequences: Class B (1920x1080), Class C (832x480)
        Class B: Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
        Class C: BasketballDrill, BQMall, PartyScene, RaceHorse
      QP: 22, 27, 32, 37
      Main profile
      Random access, low delay

    Test environment
      HM 7.0 software
      Intel Core(TM) [email protected]
      4 GB memory
      Windows 7 (64-bit)
      Analysis tool: Intel VTune(TM) Amplifier XE

    FIGURE. Class B - BasketballDrive

    FIGURE. Class C - BQMall

  • Profiling result of HEVC encoder

    TABLE. Complexity ratio of HM 7.0 encoder (RA), in %

      Class  Module       QP 22   QP 27   QP 32   QP 37
      B      Entropy      6.6     3.4     1.0     0.9
             Intra        3.3     2.2     2.1     1.4
             Inter        68.4    78.1    83.9    85.7
             TR+Q         20.4    15.2    11.7    10.6
             Loop filter  0.2     0.2     0.2     0.1
             etc          1.2     1.1     1.3     1.5
      C      Entropy      6.5     3.9     2.8     1.3
             Intra        2.9     2.7     2.2     1.8
             Inter        68.8    74.9    79.8    83.3
             TR+Q         20.7    17.0    13.9    12.4
             Loop filter  0.2     0.2     0.2     0.1
             etc          1.0     1.5     1.4     1.2

    TABLE. Complexity ratio of HM 7.0 encoder (LD), in %

      Class  Module       QP 22   QP 27   QP 32   QP 37
      B      Entropy      6.1     2.8     0.4     0.3
             Intra        3.4     2.0     1.2     1.2
             Inter        71.3    81.2    87.3    89.1
             TR+Q         18.6    13.0    9.9     8.5
             Loop filter  0.2     0.2     0.2     0.1
             etc          0.8     1.2     0.8     0.9
      C      Entropy      5.3     3.1     1.1     0.4
             Intra        3.0     2.5     1.8     1.5
             Inter        72.6    79.1    83.5    87.2
             TR+Q         18.2    14.9    12.1    10.1
             Loop filter  0.2     0.2     0.2     0.1
             etc          1.1     0.6     1.6     1.0

  • Complexity portions of HM encoder

    Inter prediction: 77~81%
    Tr + Q: 14~16%
    Entropy coding: 2~4%
    Intra prediction: 1~2%
    Loop filter: 0.1~0.2%

    FIGURE. HEVC encoder block diagram and profiling result (inter prediction, transform + Q, intra prediction, loop filter, entropy coding, etc)

  • Complexity portions for CU sizes and modes

    FIGURE. Example of CU quadtree structure

    TABLE. Complexity portions for CU sizes and modes

      Size    Mode   RA (%)  LD (%)  Average (%)
      64x64   Intra  2.1     1.0     1.6
              Inter  19.0    31.9    25.5
              Skip   3.9     3.4     3.7
      32x32   Intra  1.9     0.7     1.3
              Inter  25.0    27.4    26.2
              Skip   4.5     3.2     3.9
      16x16   Intra  2.3     0.2     1.3
              Inter  17.0    12.5    14.8
              Skip   3.2     1.7     2.5
      8x8     Intra  2.4     0.4     1.4
              Inter  8.7     4.9     6.8
              Skip   1.7     0.6     1.2

  • Selected ratios of CU, PU and TU

    TABLE. Selected ratio of CU size and PU mode (%)

                              Class B (QP)               Class C (QP)
      CU size  PU mode        22    27    32    37       22    27    32    37
      64x64    Merge skip     10.6  26.6  43.3  55.2     11.7  20.6  30.6  39.5
               Inter 2Nx2N    4.5   7.1   7.2   6.0      5.8   7.5   6.7   5.5
               Inter Nx2N     1.4   2.2   1.8   1.3      1.6   1.8   1.7   1.7
               Inter 2NxN     1.5   1.9   1.3   0.9      1.2   1.0   0.8   0.7
               Inter AMP      1.2   1.4   1.0   0.7      1.0   1.1   1.0   1.1
               Intra 2Nx2N    0.3   0.4   0.6   1.0      0.0   0.0   0.0   0.1
      32x32    Merge skip     9.9   12.4  19.9  8.4      12.2  13.5  15.2  16.8
               Inter 2Nx2N    8.1   6.9   4.6   3.1      9.1   7.2   5.4   4.3
               Inter Nx2N     1.8   1.4   0.9   0.4      2.2   1.9   1.9   1.7
               Inter 2NxN     1.7   1.3   0.7   1.0      1.4   1.0   0.9   0.8
               Inter AMP      4.4   2.9   1.6   0.6      4.2   3.5   3.1   2.6
               Intra 2Nx2N    2.3   2.3   2.6   2.6      0.2   0.4   0.7   1.1
      16x16    Merge skip     6.8   5.6   3.9   2.9      8.0   7.7   7.3   6.1
               Inter 2Nx2N    9.1   3.7   1.7   0.8      6.9   4.8   3.1   2.0
               Inter Nx2N     1.6   0.7   0.3   0.1      2.0   1.4   1.0   0.6
               Inter 2NxN     1.7   0.6   0.2   0.1      1.2   0.8   0.5   0.3
               Inter AMP      4.1   1.4   0.5   0.2      4.1   2.7   1.7   0.9
               Intra 2Nx2N    2.6   2.1   1.7   1.4      1.2   1.6   1.8   1.7
      8x8      Merge skip     2.8   1.9   1.2   0.9      3.9   3.3   2.3   1.4
               Inter 2Nx2N    5.8   1.3   0.4   0.1      4.9   2.5   1.1   0.4
               Inter Nx2N     0.3   0.2   0.1   0.0      1.2   0.7   0.3   0.1
               Inter 2NxN     0.4   0.2   0.1   0.0      0.7   0.4   0.2   0.1
               Intra 2Nx2N    2.9   1.2   0.1   0.5      2.1   1.7   1.2   0.8
               Intra NxN      0.8   0.6   0.7   0.2      1.9   1.1   0.6   0.3

    TABLE. Selected ratio of TU (%)

      Class  Size    QP 22   QP 27   QP 32   QP 37
      B      32x32   33.5    55.0    63.0    65.7
             16x16   19.8    20.9    20.1    19.7
             8x8     36.2    15.5    10.7    10.0
             4x4     10.5    8.5     6.2     4.5
      C      32x32   35.7    43.4    49.2    52.2
             16x16   27.7    27.7    27.5    29.0
             8x8     21.7    18.1    15.8    13.9
             4x4     14.8    10.8    7.5     4.9

  • BDBR vs. encoding time depending on CTU size

    CTU size: 32x32
      3.3~3.4% BD-bitrate
      78~79% encoding time

    CTU size: 16x16
      15.4~17.5% BD-bitrate
      50~54% encoding time

    FIGURE. Two plots of BD-bitrate vs. encoding time, with CTU size 64x64 as reference
      Plot 1: CTU size 32x32, Enc T: 79.22%, BD-bitrate: 3.31%; CTU size 16x16, Enc T: 50.8%, BD-bitrate: 17.53%
      Plot 2: CTU size 32x32, Enc T: 78.92%, BD-bitrate: 3.43%; CTU size 16x16, Enc T: 54.7%, BD-bitrate: 15.43%

    SW: HM 7.1, Seq: Class B, cfg: Random access & Low delay

  • BDBR vs. encoding time depending on TU size

    Transform size
      16x16 to 4x4 on case
        3.2~3.5% BD-bitrate
        96% encoding time
      8x8 to 4x4 on case
        10.2~11.2% BD-bitrate
        91~92% encoding time

    FIGURE. Two plots of BD-bitrate vs. encoding time, with max TU size 32x32 and quadtree max depth 3 as reference
      Max TU size 16x16, quadtree max depth 2: Enc T: 96.8%, BD-bitrate: 3.2%; Enc T: 96.5%, BD-bitrate: 3.5%
      Max TU size 8x8, quadtree max depth 1: Enc T: 92.4%, BD-bitrate: 11.2%; Enc T: 91.4%, BD-bitrate: 10.24%

    SW: HM 7.1, Seq: Class B, cfg: Random access & Low delay

  • Tool on/off test

  • Fast encoding algorithms in HM software

    Contents (with note)

    Fast Encoding Setting: FEN, JCTVC-A0124
    - Early CU termination
    - Subsampled SAD operation
    - Simple bi-prediction (the number of iterations: 4 -> 1)

    Fast Decision for Merge RD Cost: FDM, JCTVC-H178 (PU level)
    - 2Nx2N Merge: CBF-based early termination

    Rough Mode Decision (for Intra): RMD, JCTVC-C311/D283 (PU level)
    - 35 intra modes -> SATD-based candidate selection -> RD check -> full RQT

    AMP Speedup: AMPS, JCTVC-E316 (PU level)
    - AMP: ME or Merge

    CBF Fast Mode Setting: CFM, JCTVC-F045 (PU level)
    - If the PU CBF is 0, ME for the remaining PU modes is skipped

    Early CU Setting: ECU, JCTVC-F092 (CU level)
    - If the CU is coded as Skip, the sub-CUs are not evaluated

    Early Skip Detection Setting: ESD, JCTVC-G543 (CU level)
    - Early skip detection after Inter 2Nx2N

    TABLE. Fast encoding algorithms in HM software
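
    The CBF- and skip-based rules in the table can be illustrated with a toy mode-decision loop. This is a minimal sketch and not HM code: the function names, the mode order, and the RD costs are hypothetical, and the real HM conditions differ in detail (for example in how intra modes are handled after an early exit).

```python
# Toy sketch of CFM- and ECU-style early termination (not HM code).
# All names and numbers below are hypothetical.

PU_MODES = ["merge_skip", "inter_2Nx2N", "inter_Nx2N",
            "inter_2NxN", "inter_AMP", "intra_2Nx2N"]


def choose_pu_mode(evaluate, use_cfm=True):
    """Try PU modes in order; evaluate(mode) returns (rd_cost, cbf).

    Returns (best_mode, best_cost, tested_modes).
    """
    best_mode, best_cost = None, float("inf")
    tested = []
    for mode in PU_MODES:
        cost, cbf = evaluate(mode)
        tested.append(mode)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
        # CFM-style rule: an inter mode with no coded residual (CBF == 0)
        # stops the search over the remaining modes.
        if use_cfm and mode.startswith("inter") and cbf == 0:
            break
    return best_mode, best_cost, tested


def should_split_cu(best_mode, use_ecu=True):
    """ECU-style rule: do not evaluate sub-CUs when the best mode is skip."""
    return not (use_ecu and best_mode == "merge_skip")


# Made-up (rd_cost, cbf) pairs for one CU.
toy_costs = {
    "merge_skip": (10.0, 0), "inter_2Nx2N": (8.0, 0), "inter_Nx2N": (9.0, 1),
    "inter_2NxN": (9.5, 1), "inter_AMP": (9.2, 1), "intra_2Nx2N": (12.0, 1),
}
best_mode, best_cost, tested = choose_pu_mode(lambda m: toy_costs[m])
print(best_mode, tested, should_split_cu(best_mode))
```

    With these made-up costs, the search stops after Inter 2Nx2N, so four of the six modes are never evaluated. The encoding time is saved at the risk of missing a cheaper mode, which is where the BD-BR loss of fast algorithms comes from.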

  • IPSL

  • HM encoder for FHD (BQTerrace.seq)

    CPU

    Compress Slice
    - Interpolation filter (IF)
    - Motion estimation (ME)
    - Transform-Quantization (TR-Q)
    - Intra prediction
    - MV derivation
    - Mode decision
    - Entropy encoding (CABAC update)

    DBF

    SAO

    Encode Slice
    - Entropy encoding

    One frame: 57930 ms

    For real-time? 33.33 ms

    IF: 21548.62 ms, RDOQ: 2645.55 ms, TR: 1687.37 ms, ITR: 653.2829 ms

    DBF: 9.42 ms, SAO: 77.33 ms

    Intel i7 CPU, 2.x GHz
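
    The gap between the measured frame time and the real-time budget can be worked out directly from the numbers on this slide. The short script below is only an arithmetic sketch; the variable names are illustrative, and the stage times are the ones listed above (ME and the other stages are not broken out on the slide).

```python
# Arithmetic on the profiling numbers from the slide (times in ms).
frame_ms = 57930.0           # measured time for one FHD frame
realtime_budget_ms = 33.33   # per-frame budget quoted on the slide

stage_ms = {
    "IF": 21548.62,
    "RDOQ": 2645.55,
    "TR": 1687.37,
    "ITR": 653.2829,
    "DBF": 9.42,
    "SAO": 77.33,
}

# Speed-up factor needed to reach the real-time budget.
required_speedup = frame_ms / realtime_budget_ms

# Share of the frame time spent in each listed stage, in percent.
stage_share = {name: 100.0 * ms / frame_ms for name, ms in stage_ms.items()}

print(f"required speed-up: {required_speedup:.0f}x")
for name, share in stage_share.items():
    print(f"{name:5s} {share:5.1f}%")
```

    The required speed-up is on the order of 1700x, and the interpolation filter alone takes more than a third of the frame time.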

  • KWHEVC encoder

    ANSI C HEVC encoder software based on the HM encoder
    - Clean-up of functions and variables
    - Non-recursive function calls

    Minimum memory allocation and bandwidth
    - Explicit minimum memory allocations (using static memory)
    - Removal of code related to duplicate variables and structures to avoid redundant memory copies
    - Removal of unnecessary memory allocation

    Software optimization
    - SIMD implementation (cost function, transform, interpolation, deblocking, ...)
    - Frame-level interpolation filter

    Parallel processing
    - Slice-level parallel processing using OpenMP
    - Motion estimation using CUDA

  • Performance of KWHEVC

    1) C conversion: 18% ATS gain (without any BD-BR or BD-PSNR loss)
    2) + SIMD + frame-level IF: 2x speedup (without any BD-BR or BD-PSNR loss)
    3) + Fast mode decision: 5x speedup (1~2% BD-BR loss)
    4) + Slice-level parallel: 20x speedup (4~6% BD-BR loss)
    5) + CUDA ME & MD (low-delay P, adjusted configuration): 200x speedup (15~20% BD-BR loss)

    {Intel i7 (3.3 GHz), GeForce 660} => 10 fps

    FIGURE. Encoding speed in terms of the development steps

    Class  Sequence         Frames  FPS at QP 22  QP 27  QP 32  QP 37
    B      Kimono           240     5.74          7.25   8.38   9.40
           ParkScene        240     5.51          7.52   8.87   10.03
           Cactus           500     5.19          7.70   9.09   10.09
           BasketballDrive  500     4.80          6.71   8.09   9.18
           BQTerrace        600     4.14          7.68   9.60   10.62
    C      BasketballDrill  500     14.86         19.07  23.60  28.12
           BQMall           600     14.81         19.88  24.91  29.20
           PartyScene       500     11.09         16.46  22.03  27.60
           RaceHorses       300     10.48         14.60  19.46  24.49

    TABLE. Encoding speed of KWHEVC

  • Comparison of decoder complexity

    HM 10.0 (C++) vs. KWHEVC decoder (C89)
    - C conversion
    - Software optimization

    Decoding performance

    Sequence                          HM 10.0 (sec)  FPS    KWHEVC (sec)  FPS    Ratio
    BQTerrace_1920x1080_60_qp22.bin   98.271         6.11   71.007        8.45   1.38
    BQTerrace_1920x1080_60_qp27.bin   46.531         12.89  30.778        19.49  1.51
    BQTerrace_1920x1080_60_qp32.bin   32.737         18.33  19.234        31.19  1.70
    BQTerrace_1920x1080_60_qp37.bin   28.189         21.28  15.912        37.71  1.77
    Cactus_1920x1080_50_qp22.bin      51.355         9.74   36.270        13.79  1.42
    Cactus_1920x1080_50_qp27.bin      31.371         15.94  20.155        24.81  1.56
    Cactus_1920x1080_50_qp32.bin      25.506         19.60  15.381        32.51  1.66
    Cactus_1920x1080_50_qp37.bin      21.933         22.80  12.792        39.09  1.71
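
    The FPS and Ratio columns follow from the decoding times. The sketch below recomputes them for two rows. It assumes the full sequences were decoded (600 frames for BQTerrace and 500 for Cactus, the frame counts given in the encoding speed table); the function name is illustrative.

```python
def decode_stats(frames, hm_sec, kw_sec):
    """Return (HM fps, KWHEVC fps, speed ratio) from decoding times in seconds."""
    hm_fps = frames / hm_sec
    kw_fps = frames / kw_sec
    ratio = hm_sec / kw_sec   # how many times faster KWHEVC is than HM
    return hm_fps, kw_fps, ratio


# (label, assumed frame count, HM 10.0 seconds, KWHEVC seconds)
rows = [
    ("BQTerrace qp22", 600, 98.271, 71.007),
    ("Cactus qp22", 500, 51.355, 36.270),
]

for label, frames, hm_sec, kw_sec in rows:
    hm_fps, kw_fps, ratio = decode_stats(frames, hm_sec, kw_sec)
    print(f"{label}: HM {hm_fps:.2f} fps, KWHEVC {kw_fps:.2f} fps, ratio {ratio:.2f}")
```

    The ratio grows with QP in the table (1.38 at QP 22 up to 1.77 at QP 37 for BQTerrace), so the optimized decoder gains more at lower bitrates.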

  • Parallelism and SIMD processing

    Parallelism
    - The decoder cannot expect tile or slice partitioning of pictures
    - The decoder should consider worst-case bitstreams
    - The entropy decoder cannot be parallelized
    - CTU-based 2D wavefront parallel processing is a promising way for parallelism
    - Deblocking filter and SAO are more suitable for parallelism (less data dependency)
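
    The CTU-based 2D wavefront mentioned above can be sketched as a scheduling problem. The example assumes that a CTU can start once its left neighbour and the top-right neighbour in the row above are done, which gives the earliest start step x + 2*y for the CTU at column x and row y. The function name is hypothetical and this is not the KWHEVC scheduler.

```python
def wavefront_schedule(ctus_wide, ctus_high):
    """Group CTU coordinates (x, y) by the earliest step at which they can run.

    Assumes each CTU depends on its left neighbour and on the top-right
    neighbour of the row above, so row y lags row y-1 by two CTUs.
    """
    steps = {}
    for y in range(ctus_high):
        for x in range(ctus_wide):
            steps.setdefault(x + 2 * y, []).append((x, y))
    return [steps[t] for t in sorted(steps)]


# Small example: a picture of 4 x 3 CTUs.
for step, ctus in enumerate(wavefront_schedule(4, 3)):
    print(step, ctus)
```

    Under this assumption a 1920x1080 picture with 64x64 CTUs (30 x 17 CTUs) needs 62 wavefront steps instead of 510 sequential CTUs, with the available parallelism ramping up at the start of the picture and down at the end.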

    SIMD processing
    - Inverse transform (X = A^T Y A)
    - Motion compensation: about 40% of decoder complexity, 8-tap and 4-tap filters
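
    The separable form X = A^T Y A can be written out for the 4x4 case. The sketch below uses the 4x4 HEVC core transform matrix and plain scalar Python. It deliberately omits the intermediate shifts, rounding, and clipping that the standard specifies, and it is not SIMD code; it only shows the two matrix products that a SIMD implementation would vectorize.

```python
# 4x4 HEVC core transform matrix (integer approximation of the DCT).
A = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inverse_transform(y):
    """Unscaled inverse transform X = A^T * Y * A (no shifts or clipping)."""
    return matmul(matmul(transpose(A), y), A)


# A block with only a DC coefficient reconstructs to a flat block.
dc_only = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
flat = inverse_transform(dc_only)
print(flat)
```

    Each output row is a sum of products over one row of coefficients, which is the multiply-accumulate pattern that maps well onto SIMD instructions.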

  • Performance of the optimized KWHEVC decoder

    SIMD and parallelization
    - Pixel reconstruction, interpolation (partial)
    - Task-level parallelism (entropy, pixel decoding)
    - Data-level parallelism (deblocking filter)

    2.934.98

    2.28Mbps

  • Conclusion

    - Overview of HEVC
    - Encoding parameters for HEVC test model (HM)
    - Complexity analysis of HEVC encoder
    - Fast encoding algorithms and performances
    - Issues of parallel processing
