HEVC Encoder (r15, final), Donggyu Sim
-
HEVC Encoder
Kwangwoon University (KWU)
Donggyu Sim ([email protected])
-
Contents
Overview of HEVC
Encoding issues for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing
Conclusion
-
OVERVIEW OF HEVC
-
Block diagram of HEVC standard
Typical block-based hybrid codec structure + additional enhanced tools
  Inter prediction: ME/MC, DCT-IF, AMVP, Merge
  Intra prediction: reference sample padding, Planar, DC, 33 angular modes, MDIS
  Transform: TU size 32x32 ~ 4x4, residual quadtree
  Quantization: delta QP, RDOQ
  Entropy coding: CABAC
  Loop filter: deblocking filter, sample adaptive offset
  Reconstruction path: inverse quantization, inverse transform, picture buffer holding previously reconstructed pictures
FIGURE. Block diagram of HEVC encoder
-
Block structure in HEVC
Three block structures are defined in HEVC
  Coding unit (CU)
  Prediction unit (PU)
  Transform unit (TU)
[Figure: a 64x64 CTU split by quadtree into 32x32, 16x16, and 8x8 CUs; PU partition types 2Nx2N, 2NxN, Nx2N, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD; TU depths 0 to 2]
FIGURE. An example of CU, PU, and TU partition in HEVC
-
ENCODING STRUCTURES OF HEVC
-
Decision level for HEVC encoder
Sequence level
  Coding structure (All intra, Low delay, Random access)
  Profile, tier, level
  Max/Min CTU size, CU depth
  Max/Min TU size, TU depth
  Tool on/off (SAO, deblocking, WPP, tile)
Picture level
  Number of reference frames, rate control
  Tile, slice
Slice or tile level
  Reference frames
  Deblocking filter parameters
CTU level
  CU partitioning
  Sample adaptive offset parameters
CU level
  PU and TU partitioning
PU & TU level
  Prediction modes, motion vectors
  cbf, coefficients
[Figure: decision hierarchy Sequence -> Picture -> Slice or Tile -> CTU -> CU -> PU & TU]
-
Temporal prediction structure (1/3)
All intra (AI)
  Every picture is coded as an instantaneous decoding refresh (IDR) picture
  No temporal prediction is allowed
[Figure: IDR pictures 0 to 7 along the time axis; coding order = POC; every picture uses QPI]
-
Temporal prediction structure (2/3)
Low delay (LD)
  The first picture shall be coded as an IDR picture
  Generalized P and B (GPB) pictures shall be used for the other successive pictures
  A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
[Figure: an IDR or intra picture followed by GPB (generalized P and B) pictures, numbered 0 to 8 along the time axis; coding order = POC; QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3; legend: Depth == 0, Depth == 1, Depth == 2]
-
Temporal prediction structure (3/3)
Random access (RA)
  A hierarchical B structure shall be used for coding
  An IDR intra picture or clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
[Figure: hierarchical B structure over pictures 0 to 8 (POC and coding order differ), with an IDR or intra picture, a GPB (generalized P and B) picture, referenced B pictures, and non-referenced B pictures; QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3, QPBL4 = QPI + 4; legend: Depth == 0, Depth == 1, Depth == 2, Depth == 3]
-
Picture partitioning
Picture: a picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 color format
The coding order of coding tree units (CTUs) is raster scan order
CTU: an NxN block of luma samples together with the two corresponding blocks of chroma samples
  Analogous to the macroblock in previous standards
  The maximum allowed size of the luma block in a CTU is specified to be 64x64 in the Main profile
* CTU & CTB: the CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements
FIGURE. Example of a picture divided into CTUs (30 columns x 17 rows)
Example) Class B (1920x1080) BQTerrace, CTU size 64x64: 30x17 CTU partition
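The 30x17 partition in the example follows from a ceiling division of the picture size by the CTU size. A minimal sketch, where `ctu_grid` and `ctu_origin` are hypothetical helper names chosen for illustration (they are not HM functions):

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows needed to cover a picture.

    A partial CTU at the right or bottom edge still counts as one CTU,
    hence the ceiling division.
    """
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows


def ctu_origin(addr, cols, ctu_size=64):
    """Top-left luma sample (x, y) of the CTU at raster-scan address addr."""
    return (addr % cols) * ctu_size, (addr // cols) * ctu_size


cols, rows = ctu_grid(1920, 1080)  # Class B picture with 64x64 CTUs
print(cols, rows)
print(ctu_origin(31, cols))        # second CTU of the second CTU row
```

For 1920x1080 the last CTU row is only partially filled: 16 full rows cover 1024 luma lines, so the 17th row contains the remaining 56 lines.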
-
Picture partitioning
A slice is a sequence of coding tree units (CTUs)
Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan
At least one of the following conditions should be true for each slice and tile in a picture
  All CTBs in a slice belong to the same tile
  All CTBs in a tile belong to the same slice
FIGURE. A picture with 30x17 coding tree units that is partitioned into three slices
FIGURE. A picture with 30x17 coding tree units that is partitioned into three tiles
-
Coding unit (CU) and coding tree structure
Coding unit (CU): the leaf node of a quadtree structure
  Square blocks
  Size: from 8x8 up to the size of the CTU (8x8 ~ 64x64)
  The size of the CTU is specified in the sequence parameter set (SPS)
The quadtree partitioning structure allows recursive splitting into four equally sized nodes
TABLE. Syntax for size of CTU in SPS
  seq_parameter_set_rbsp() {                         Descriptor
    log2_min_coding_block_size_minus3                ue(v)
    log2_diff_max_min_coding_block_size_minus2       ue(v)
  }
FIGURE. Example of coding tree structure
[Figure: a CTU split by quadtree into 32x32, 16x16, and 8x8 CUs]
-
Example of CU quadtree structure
Coding unit quadtree structure
  Starting from the CTU, each CU can be split into 4 smaller CUs
  64 CU: split_coding_unit_flag (1)
    32 CU: split_coding_unit_flag (0)
    32 CU: split_coding_unit_flag (1)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (1)
FIGURE. Example of CU quadtree structure
[Figure: a CTU split by quadtree into 32x32, 16x16, and 8x8 CUs]
TABLE. Syntax for CU split flag in coding tree
coding_tree(x0, y0, log2CbSize, ctDepth) { Descriptor
if(x0+(1 log2MinCbSize && NumPCMBlock == 0 )split_coding_unit_flag[x0][y0] ae(v)
}
-
Coding unit (CU) decision
Coding unit quadtree structure
  Starting from the CTU, each CU can be split into 4 smaller CUs
Best CU RD-cost calculation for each CU level
Competition between the best CU and its sub-partitioned CUs
[Figure: numbered evaluation order of the 64x64, 32x32, 16x16, and 8x8 CUs inside one CTU]
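The competition between a CU and its four sub-partitioned CUs can be sketched as a depth-first recursion. This is only an illustration under stated assumptions: `rd_cost` is a hypothetical callback that returns the best RD cost of coding one CU without splitting, and the cost of the split flag itself is ignored, which a real encoder such as HM would include.

```python
def best_cu_partition(x, y, size, min_size, rd_cost):
    """Return (cost, tree) for the CU at (x, y).

    tree is the integer CU size for a leaf, or a list of four subtrees
    (in z-scan order) when splitting is cheaper.
    """
    cost_unsplit = rd_cost(x, y, size)
    if size == min_size:
        return cost_unsplit, size  # smallest CU: cannot split further

    half = size // 2
    cost_split = 0
    subtrees = []
    for dy in (0, half):
        for dx in (0, half):
            cost, tree = best_cu_partition(x + dx, y + dy, half, min_size, rd_cost)
            cost_split += cost
            subtrees.append(tree)

    # Keep the split only if the four sub-CUs together are cheaper.
    if cost_split < cost_unsplit:
        return cost_split, subtrees
    return cost_unsplit, size


# Toy cost model (an assumption): a 64x64 CU costs 100, any smaller CU costs 20.
toy_cost = lambda x, y, size: 100 if size == 64 else 20
print(best_cu_partition(0, 0, 64, 8, toy_cost))
```

With the toy cost model, splitting the 64x64 CU into four 32x32 CUs costs 80, which beats 100, while splitting any 32x32 CU further would cost 80 against 20, so the recursion stops there.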
-
Prediction unit (PU) types
Prediction unit (PU): a region used for carrying the information related to the prediction processes
2 PU types for intra prediction
  2Nx2N (smallest CU: additionally NxN)
8 PU types for inter prediction
  Smallest CU
    8x8: 2Nx2N, Nx2N, 2NxN
    Others: 2Nx2N, Nx2N, 2NxN, NxN
  Others: 2Nx2N, Nx2N, 2NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
FIGURE. PU partitions in HEVC (2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD)
-
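Prediction unit (PU) types
[Flowchart: the candidate PU modes depend on whether the current CU size is larger than the SCU size, on the AMP enable flag, and on whether the current CU size equals 8x8]
  Current CU size > SCU size, AMP enabled: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter AMP
  Current CU size > SCU size, AMP disabled: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N
  Current CU size == SCU size, CU size == 8x8: Intra 2Nx2N, Intra NxN, Inter 2Nx2N, Inter 2NxN, Inter Nx2N
  Current CU size == SCU size, CU size != 8x8: Intra 2Nx2N, Intra NxN, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter NxN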
-
Transform unit (TU) and transform tree structure
Transform unit (TU): a region sharing the transform and quantization processes
  Square shape
  Size: from 4x4 up to 32x32 (4x4 ~ 32x32)
  Available transform block sizes and the maximum transform hierarchy depth are specified in the SPS
The root of the TU quadtree is the CU to which the TU belongs
FIGURE. TU quadtree structure in HEVC (32x32 root)
TABLE. Syntax for size of TU in SPS
  seq_parameter_set_rbsp() {                      Descriptor
    log2_min_transform_block_size_minus_2         ue(v)
    log2_diff_max_min_transform_block_size        ue(v)
    max_transform_hierarchy_depth_inter           ue(v)
    max_transform_hierarchy_depth_intra           ue(v)
  }
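As an illustration of how the two size fields in the table map to the 4x4 ~ 32x32 range, the following sketch derives the minimum and maximum transform block sizes. The function name is hypothetical; the arithmetic assumes the usual reading of the field names (the first field stores log2 of the minimum size minus 2, the second stores the log2 difference between maximum and minimum).

```python
def transform_size_range(log2_min_tb_size_minus_2, log2_diff_max_min_tb_size):
    """Return (min_size, max_size) of the transform blocks in luma samples."""
    log2_min = log2_min_tb_size_minus_2 + 2
    log2_max = log2_min + log2_diff_max_min_tb_size
    return 1 << log2_min, 1 << log2_max


# Values 0 and 3 give the full range stated on the slide.
print(transform_size_range(0, 3))
```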
-
INTER/INTRA PREDICTION AND PU/TU DECISION
-
Overall HM encoding process
[Figure: Sequence -> Picture -> CTU decisions in a slice or a tile (CU partitioning decision, PU & TU partitioning decision, RDO process) -> Deblocking filter -> SAO -> Entropy coding]
-
[Flowchart: CompressCU is called recursively from the 64x64 CU through the 32x32, 16x16, and 8x8 CUs until the CU size reaches the SCU size. In each CompressCU call, the RDO process to decide PU & TU evaluates Merge skip, Inter 2Nx2N, Inter NxN, Inter Nx2N, Inter 2NxN, Inter AMP, Intra 2Nx2N, Intra NxN, and Intra PCM]
-
Intra prediction flow
Prediction modes
  Luma (35 modes): Planar, DC, Angular prediction (33 directions)
  Chroma (5 modes): Planar, DC, Vertical, Horizontal, DM
Filtering
  MDIS (mode dependent intra smoothing)
  DC filtering, Ver/Hor filtering
3 MPM
[Flowchart: 2Nx2N PU -> reference sample padding -> MDIS -> intra prediction -> RD cost, Intra_mode -> best mode decision, repeated for each mode]
-
Fast intra prediction & TU decision in HM
Intra prediction steps in HM
1) Rough prediction mode decision
  Select N prediction modes out of the 35 prediction modes
  Cost: distortion (SATD) + lambda * mode bits
  Number of candidate prediction modes: N modes + MPM (3)
2) Best intra prediction mode decision with transform
  Transform (RQT depth = 1)
  1 best intra mode decision
3) Best RQT decision with RD costs
  RQT depth = 3
[Figure: 35 modes -> N modes + MPM -> 1 best mode -> best mode RD cost]
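Step 1) above can be sketched as follows. This is a simplified illustration rather than HM code: the function name and the dictionary-based inputs are assumptions, and the SATD values and mode bits are taken as given instead of being computed from a prediction.

```python
def rough_mode_candidates(satd, mode_bits, lam, n, mpm):
    """Pick the N cheapest modes by SATD + lambda * bits, then add missing MPMs.

    satd      -- dict: intra mode -> SATD distortion of its prediction
    mode_bits -- dict: intra mode -> estimated bits to signal the mode
    lam       -- Lagrange multiplier (lambda)
    n         -- number of modes kept by the rough decision
    mpm       -- list of most probable modes (3 in HEVC)
    """
    cost = {m: satd[m] + lam * mode_bits[m] for m in satd}
    candidates = sorted(cost, key=cost.get)[:n]
    for m in mpm:
        if m not in candidates:
            candidates.append(m)
    return candidates


# Toy numbers (assumptions) for five of the 35 modes.
satd = {0: 100, 1: 50, 2: 300, 10: 60, 26: 55}
bits = {m: 6 for m in satd}
print(rough_mode_candidates(satd, bits, lam=1.0, n=2, mpm=[0, 1, 26]))
```

Only the returned candidates go through the costlier transform-based decision of step 2), which is where the speed-up comes from.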
-
Inter prediction
Skip: Merge skip
Non-skip
  Uni-directional prediction
  Bi-directional prediction
  Half-pel / quarter-pel motion refinement
  DCT-IF (8-tap / 4-tap)
  Merge
[Flowchart: Cur. CU -> Merge skip -> Inter 2Nx2N, Inter 2NxN, Inter Nx2N (each with uni-directional prediction, bi-directional prediction, and Merge using spatial, temporal, and additional candidates derivation, followed by a best mode decision) -> AMP (nLx2N, nRx2N, 2NxnU, 2NxnD) RD cost calculation -> best mode decision -> RD cost, best mode]
FIGURE. Flowchart of inter prediction
-
Inter prediction
Inter coding mode
Merge skip mode (CU level)
  skip_flag = 1 and merge_idx
  No reference index
  No motion vector
  No residual
Merge mode (PU level)
  skip_flag = 0, pred_mode_flag, and part_mode
  merge_flag = 1 and merge_idx
  No reference index and motion vector
  no_residual_syntax_flag: residual is encoded or not
General PU modes
  skip_flag = 0, pred_mode_flag, and part_mode
  merge_flag = 0
  ref_idx_lx and mvp_lx_flag based on AMVP (x = 0 or 1)
  MVD is encoded
  no_residual_syntax_flag: residual is encoded or not
-
Inter prediction flow
BEGIN input : current PU part mode for a CU
  FOR PU partition
    FOR List = 0 to 1 DO
      FOR 0 to refidx DO
        Motion estimation (diamond search, SR : 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR 0 to refidx DO
          Motion estimation (full search, SR : 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR
  Merge
  RD-cost competition among uni/bi-prediction and merge
END output : inter prediction syntax
FIGURE. Pseudo code - Inter prediction flow
Fast encoder decision (FEN)
  Subsampled SAD for integer ME: use subsampled SAD when rows > 8 for integer ME
  Only 1 iteration for bi-predictive motion search (default number: 4)
Fast Decision for Merge RD cost (FDM)
  After merge with merge idx X, if all cbf are zero, the merge process is terminated
[Figure: current PU (Cur) with uni-prediction and bi-prediction from LIST_0 and LIST_1 along the time axis]
-
Bi-prediction
Search P0 and P1 which produce the minimum error with O
  $R = O - P$, where $P = (P_0 + P_1)/2$
Practical bi-predictive search
1) Search P1 which produces the minimum 2R with (2O - P0)
  $R = O - (P_0 + P_1)/2 \;\Rightarrow\; 2R = (2O - P_0) - P_1$
2) Search P0 which produces the minimum error with (2O - P1)
  $R = O - (P_0 + P_1)/2 \;\Rightarrow\; 2R = (2O - P_1) - P_0$
BipredSearchRange: 4
FEN: 1 (iteration: 1)
[Figure: P0 in the List 0 reference, O in the current frame, P1 in the List 1 reference]
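One refinement step of the practical search can be illustrated on flattened sample lists. The sketch below fixes P0, forms the target 2O - P0, and picks the P1 candidate with the smallest SAD against that target, which is the same as minimizing the SAD of 2R. The function name and the list-of-candidates interface are assumptions made for illustration; an encoder would instead scan motion vectors within BipredSearchRange.

```python
def refine_p1(orig, p0, p1_candidates):
    """Choose the P1 block that minimizes |(2*O - P0) - P1| summed over samples."""
    target = [2 * o - a for o, a in zip(orig, p0)]

    def sad_to_target(block):
        return sum(abs(t - v) for t, v in zip(target, block))

    return min(p1_candidates, key=sad_to_target)


orig = [10, 20, 30]          # O, current block
p0 = [8, 18, 28]             # fixed List 0 predictor
candidates = [[10, 20, 30],  # candidate List 1 predictors
              [12, 22, 33]]
best_p1 = refine_p1(orig, p0, candidates)
residual = [o - (a + b) / 2 for o, a, b in zip(orig, p0, best_p1)]
print(best_p1, residual)
```

Step 2) is the same operation with the roles of P0 and P1 swapped, and the iteration count (4 by default, 1 with FEN) bounds how often the two steps alternate.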
-
Example) Bi-prediction
Uni-directional prediction
  Search range: 64
Bi-directional prediction
  Iterations 1 to 4, each with BipredSearchRange: 4
[Figure: for each iteration, P0 (List 0 reference), O (current frame), and P1 (List 1 reference); P0 and P1 are refined alternately against 2O and the other predictor]
-
Motion estimation (integer-pel)
Practical motion estimation (diamond search)
First search & early termination
  At most 3 (default) more rounds after the most recent best match
Raster refinement search
  If the integer-pel distance is bigger than 5, the raster refinement search is conducted
Star refinement search & early termination
  Diamond search centered on the best match from the earlier two steps
  At most 2 rounds after the best match
FIGURE. Raster refinement search
FIGURE. First search & star refinement (diamond pattern at distances 1, 2, and 3 around the center 0)
-
Motion estimation (sub-pel refinement)
Integer-pel motion search
  Cost function: SAD
Sub-pel motion refinement
  Cost function: SATD
  Half-pel refinement
  Quarter-pel refinement
FIGURE. Integer-pel motion search (within the search range)
FIGURE. Half-pel motion search
FIGURE. Quarter-pel motion search
-
Interpolation
DCT-IF in HEVC
  Fixed 8-tap (7-tap) and 4-tap interpolation filters based on DCT
  2D separable filter
    8 * horizontal 1D filter + 1 * vertical 1D filter
TABLE. Interpolation filter coefficients
  Component  Position  Filter coefficients
  Luma       1/4       {-1, 4, -10, 58, 17, -5, 1, 0}
  Luma       1/2       {-1, 4, -11, 40, 40, -11, 4, -1}
  Chroma     1/8       {-2, 58, 10, -2}
  Chroma     3/8       {-6, 46, 28, -4}
  Chroma     1/4       {-4, 54, 16, -2}
  Chroma     1/2       {-4, 36, 36, -4}
FIGURE. Integer and fractional sample positions for luma and chroma interpolation
[Figure: integer luma samples A(i,j) with the fractional positions a to r between them; integer chroma samples B(i,j) with the fractional positions ab to hh between them]
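To make the use of the luma coefficients concrete, here is a one-dimensional sketch that interpolates a single fractional sample. It is deliberately simplified: it applies one filter stage with rounding and a 6-bit shift (each coefficient set sums to 64) and clips to 8 bits, whereas the standard keeps higher intermediate precision between the horizontal and vertical stages. The function name is an assumption.

```python
LUMA_QUARTER = [-1, 4, -10, 58, 17, -5, 1, 0]    # 1/4 position (7 non-zero taps)
LUMA_HALF = [-1, 4, -11, 40, 40, -11, 4, -1]     # 1/2 position (8 taps)


def interpolate_1d(samples, pos, taps):
    """Fractional sample between samples[pos] and samples[pos + 1].

    The 8 taps cover samples[pos - 3] .. samples[pos + 4]; the caller must
    provide that much padding around pos.
    """
    acc = sum(c * samples[pos - 3 + i] for i, c in enumerate(taps))
    value = (acc + 32) >> 6          # normalize by 64 with rounding
    return max(0, min(255, value))   # clip to the 8-bit sample range


flat = [100] * 8
ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(interpolate_1d(flat, 3, LUMA_HALF))     # a flat signal is preserved
print(interpolate_1d(ramp, 3, LUMA_HALF))     # midpoint between 30 and 40
```

The 4-tap chroma filters can be applied the same way with `samples[pos - 1] .. samples[pos + 2]` as support.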
-
Example of PU decision
[Flowchart, repeated from the CompressCU slide: recursive CompressCU calls until the CU size reaches the SCU size; candidate modes Merge skip, Inter 2Nx2N, Inter NxN, Inter Nx2N, Inter 2NxN, Inter AMP, Intra 2Nx2N, Intra NxN, Intra PCM]
Example
  No TU decision, no reconstruction
    Uni-prediction RD cost = SAD/SATD + λ * B_mode = 12000
    Merge RD cost = SAD/SATD + λ * B_mode = 11000
    Bi-prediction RD cost = SAD/SATD + λ * B_mode = 9000
  vs.
  TU decision, reconstruction
    Bi-prediction RD cost = SSE + λ * B_mode = 8500
-
TU decision flow (Inter)
Residual quadtree
  PU partitions: 2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
  TU depth: 0, 1, 2
  T/Q -> IT/IQ (recon) -> RD cost (SSE + λ * B_mode)
[Figure: original, predictor, and residual blocks]
-
TU decision flow (Intra)
Example) intra_pred_mode = 10 (vertical mode)
  Intra prediction using reference samples -> T/Q -> IT/IQ -> RD cost (SSE + λ * B_mode)
[Figure: reference samples, prediction direction, and residual at TU depth N and TU depth N+1; at depth N+1 the reference samples are available after the above block is reconstructed]
-
Transform
Implementation of transform in HEVC
Matrix multiplication
  Straightforward / few code lines
  Huge number of operations, but SIMD friendly
Partial butterfly implementation
  Utilizes symmetry / anti-symmetry properties of basis vectors
  Fewer multiplications / additions
  Increased number of code lines
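The two implementation styles can be compared on the 4-point core transform. The sketch below is one-dimensional and omits the shift and rounding stages of a real transform; the function names are assumptions. The even rows of the matrix are symmetric and the odd rows anti-symmetric, so the butterfly first forms sums and differences of mirrored inputs and then needs 6 multiplications instead of 16.

```python
# 4-point integer DCT basis used by HEVC.
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Straightforward matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_partial_butterfly(x):
    """Same result using the symmetry of the basis vectors: 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part (symmetric rows 0 and 2)
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part (anti-symmetric rows 1 and 3)
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]


x = [10, -3, 7, 25]
print(dct4_matrix(x))
print(dct4_partial_butterfly(x))
```

The matrix form maps directly onto SIMD multiply-accumulate instructions, which is why the slide calls it SIMD friendly despite the higher operation count.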
-
Partitioning syntax for a CTU
Syntax
FIGURE. Example of CU quadtree structure
  64 CU: split_coding_unit_flag (1)
    32 CU: split_coding_unit_flag (0)
    32 CU: split_coding_unit_flag (1)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (1)
  Each coded CU carries
    SKIP flag (merge idx)
    Prediction mode flag (intra or inter)
    PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
    Prediction info. (intra mode, or mv and ref. idx., merge idx, AMVP idx)
    PU partition & Pred_mode info
    TU split flags & coefficients
FIGURE. Example of TU quadtree structure
  32x32 TU: split flag (1)
    16x16 TU: split flag (0)
    16x16 TU: split flag (1)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
    16x16 TU: split flag (1)
      8x8 TU: split flag (0)
      8x8 TU: split flag (1)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
-
ENCODING PROCESS OF LOOP FILTER
-
In-loop filter
In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
  DBF: similar to the DBF of the H.264/AVC standard
  SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)
On/off syntax elements for in-loop filters
  1. slice_disable_deblocking_filter_flag: slice-level on/off
  2. sample_adaptive_offset_enabled_flag: slice-level on/off
-
Deblocking filter (DBF)
Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
  In-loop filtering
    Coding performance for inter frames
    Frame-based filtering
    On/off control is provided
  Adaptive filtering
    Boundary strength
  Filtering on the block boundaries
    Transform and prediction boundaries
  Sequential filtering for vertical and horizontal edges
    Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges
-
Deblocking filter (DBF)
Features of the HEVC deblocking filter compared to H.264/AVC
  For TUs and PUs with edges shorter than 8 samples in either the vertical or horizontal direction, only the edges lying on the 8x8 sample grid are filtered
  Filtering order, shown for H.264/AVC and for HEVC [e.g. 16x16 coding unit]
    1. Vertical edges -> horizontal filtering
    2. Horizontal edges -> vertical filtering
FIGURE. Derivation process for the boundary filter strength in AVC and HEVC: (a) H.264/AVC, (b) HEVC
-
Processing flow of DBF
Boundary decision
  Three kinds of boundaries are involved in the filtering: CU, TU, and PU boundaries
  CU boundaries are always involved in the filtering
  TU boundaries on the 8x8 block grid and PU boundaries between PUs inside a CU are involved in the filtering
  [Exception] If a PU boundary is inside a TU, the boundary shall not be filtered
Bs calculation
  Bs is calculated on a 4x4 block basis and then remapped to the 8x8 grid
  Two Bs values belong to the 8 pixels forming a line on the 4x4 grid; the maximum of the two is selected as the Bs for boundaries on the 8x8 grid
[Flowchart: boundary decision -> Bs calculation (4x4 -> 8x8) -> β, tc decision -> filter on/off decision -> strong/weak filter selection -> strong filtering or weak filtering]
FIGURE. Overall processing flow of the deblocking filter process
-
Overview of sample adaptive offset (1/2)
Artifacts
  Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
  A larger transform could introduce more artifacts
    HEVC: 4x4 ~ 32x32 transform
    Artifacts exist at medium and low bit rates
  A larger number of interpolation taps can also lead to more serious ringing artifacts
    HEVC: 8-tap (luma), 4-tap (chroma)
Sample adaptive offset
  To reduce sample distortion (reconstructed pixels - original pixels)
  Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)
  SAO is located after the DF and also belongs to in-loop filtering
-
Overview of sample adaptive offset (2/2)
SAO features
  Each color component may have its own SAO parameters
  Two SAO types
    Edge offset (EO; 4 EO classes)
    Band offset (BO; 1 BO class)
  SAO merging (left CTU or above CTU)
    SAO merge information is shared by the three color components
SAO objective and subjective results
  [Figure: decoded pictures with SAO enabled (QP = 32) and SAO disabled (QP = 32)]
  Anchor: disabling SAO, Test: enabling SAO
  CTU size in luma: 64x64, CTU boundary: option 1
Y BD-rate
                 All intra (AI)  Random access (RA)  Low delay B (LB)  Low delay P (LP)
  Class summary
  Class A        0.6%            2.3%                n/a               n/a
  Class B        0.5%            2.1%                2.0%              11.1%
  Class C        0.5%            1.1%                1.8%              7.1%
  Class D        0.4%            0.3%                0.7%              4.4%
  Class E        0.6%            n/a                 2.3%              11.0%
  Class F        1.5%            2.6%                5.7%              12.3%
  Overall summary
  All            0.7%            1.7%                2.5%              9.2%
  Enc. time (%)  101%            100%                100%              100%
  Dec. time (%)  103%            103%                102%              102%
-
Edge offset of SAO
Four 1-D directional patterns
  Horizontal, vertical, 135° diagonal, 45° diagonal
Only one EO class can be selected for each CTB for which EO is enabled
Each sample inside the CTB is classified into one of five categories
  One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
  No information is signaled for the classification into the five categories (encoder and decoder use the same rules)
FIGURE. Four 1-D directional patterns for EO sample classification (c is the current sample; a and b are its two neighbors along the pattern)
TABLE. Sample classification rules for edge offset
  Category  Condition
  1         c < a && c < b
  2         (c < a && c == b) || (c == a && c < b)
  3         (c > a && c == b) || (c == a && c > b)
  4         c > a && c > b
  0         None of the above (SAO is not applied)
[Figure: pixel level versus pixel index (x-1, x, x+1) for categories 1 to 4; categories 1 and 2 receive a positive edge offset, categories 3 and 4 a negative edge offset]
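The classification rules can be written as a small function. This sketch follows the edge-offset rules of the HEVC standard (local minimum, two kinds of edge corner, local maximum); the function name and the argument order (neighbor a, current sample c, neighbor b) are choices made for this example.

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c given its two neighbors a and b."""
    if c < a and c < b:
        return 1                      # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                      # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                      # convex corner
    if c > a and c > b:
        return 4                      # local maximum
    return 0                          # flat or monotonic: SAO is not applied


row = [10, 5, 5, 9, 9, 12, 7]
# Horizontal EO class: neighbors are the left and right samples.
print([eo_category(row[i - 1], row[i], row[i + 1]) for i in range(1, len(row) - 1)])
```

Because the rules depend only on reconstructed samples, the decoder repeats exactly the same classification and no category information has to be transmitted.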
-
Band offset of SAO
BO implies that one offset is added to all samples of the same band
  The sample value range is equally divided into 32 bands
  For 8-bit samples ranging from 0 to 255, the width of a band is 8
Only the offsets of four consecutive bands and the starting band position are signaled to the decoder
  The average difference between the original samples and the reconstructed samples in a band is signaled to the decoder
  Four offsets are transmitted in the case of BO
[Figure: sample value axis from 0 to max, marking the first band for which an offset is transmitted and the four consecutive bands whose offsets are transmitted]
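A short sketch of the band mapping and of applying the four signaled offsets. With 32 equal bands, the band index is the five most significant bits of the sample. The function names are assumptions, and the sketch ignores any wrap-around of the four bands past band 31.

```python
def band_index(sample, bit_depth=8):
    """Index (0..31) of the band that contains the sample."""
    return sample >> (bit_depth - 5)


def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[k] to samples in band start_band + k, for k = 0..3."""
    max_value = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = band_index(s, bit_depth) - start_band
        if 0 <= k < 4:
            s = max(0, min(max_value, s + offsets[k]))  # clip to the sample range
        out.append(s)
    return out


print(band_index(8), band_index(255))
print(apply_band_offset([7, 8, 40, 255], start_band=1, offsets=[2, 0, 0, -1]))
```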
-
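A fast distortion estimation for SAO
Distortions have to be calculated many times
Let $k$, $s(k)$, and $x(k)$ be sample positions, original samples, and pre-SAO samples, respectively, with $k$ running over the sample set $C$
Distortion between original samples and pre-SAO samples
  $D_{pre} = \sum_{k \in C} (s(k) - x(k))^2$
Distortion between original samples and post-SAO samples
  $D_{post} = \sum_{k \in C} (s(k) - (x(k) + h))^2$
With $h$ the offset for the sample set and $N$ the number of samples in the set, the delta distortion is defined as follows ($N$ and $E$ need to be calculated only once)
  $\Delta D = D_{post} - D_{pre} = \sum_{k \in C} (h^2 - 2h(s(k) - x(k))) = N h^2 - 2hE$
  $E = \sum_{k \in C} (s(k) - x(k))$
  $\Delta J = \Delta D + \lambda R$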
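The closed form N h^2 - 2hE can be checked numerically against a direct computation of the two distortions. The helper names below are assumptions; the point is that N and E are gathered once per sample set, after which the delta distortion of any candidate offset h costs a few multiplications.

```python
def sao_statistics(orig, pre_sao):
    """Return (N, E): sample count and sum of differences s(k) - x(k)."""
    n = len(orig)
    e = sum(s - x for s, x in zip(orig, pre_sao))
    return n, e


def delta_distortion(n, e, h):
    """Fast estimate of D_post - D_pre for offset h."""
    return n * h * h - 2 * h * e


def delta_distortion_direct(orig, pre_sao, h):
    """Reference computation that revisits every sample."""
    d_pre = sum((s - x) ** 2 for s, x in zip(orig, pre_sao))
    d_post = sum((s - (x + h)) ** 2 for s, x in zip(orig, pre_sao))
    return d_post - d_pre


orig = [10, 12, 15]
pre = [8, 11, 13]
n, e = sao_statistics(orig, pre)
print(delta_distortion(n, e, 2), delta_distortion_direct(orig, pre, 2))
```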
-
Offset refinement
The initial offset value h is E/N, where $E = \sum_{k \in C} (s(k) - x(k))$
All the integers between zero and the initial offset are tried in the offset refinement process
[Figure: candidate offsets 0 to 6 with the initial offset marked]
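A sketch of the refinement loop: start from the rounded initial offset E/N, step toward zero, and keep the offset with the smallest RD cost, using N h^2 - 2hE as the distortion change. The rate model (|h| + 1 bits per offset, 1 bit for a zero offset) is an assumption made for illustration, as is the function name; an encoder may instead measure the rate with its entropy coder.

```python
def refine_offset(n, e, lam):
    """Return the offset between 0 and round(E/N) with the lowest RD cost."""
    if n == 0:
        return 0
    # Initial offset: E/N rounded to the nearest integer, sign preserved.
    h = int(abs(e) / n + 0.5)
    if e < 0:
        h = -h

    best_h = 0
    best_cost = lam * 1  # offset 0: no distortion change, 1 bit (assumed)
    while h != 0:
        rate = abs(h) + 1                          # assumed rate model
        cost = (n * h * h - 2 * h * e) + lam * rate
        if cost < best_cost:
            best_h, best_cost = h, cost
        h += -1 if h > 0 else 1                    # move one step toward zero
    return best_h


print(refine_offset(10, 30, 0.0))    # rate ignored: keeps the initial offset
print(refine_offset(10, 30, 20.0))   # a larger lambda favors a smaller offset
```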
-
Encoding flow of SAO in HM
CTU-based processing
1) Calculate SAO statistics (E: sum of differences, N: pixel count)
  BO: 32 bands, sum of differences and pixel count
  EO class 0: per category, sum of differences and pixel count
  EO class 1: per category, sum of differences and pixel count
  EO class 2: per category, sum of differences and pixel count
  EO class 3: per category, sum of differences and pixel count
2) Calculate SAO RD cost
  EO class 0: rdcost0 = distortion + rate (fast distortion estimation, offset refinement)
  EO class 1: rdcost1 = distortion + rate (fast distortion estimation, offset refinement)
  EO class 2: rdcost2 = distortion + rate (fast distortion estimation, offset refinement)
  EO class 3: rdcost3 = distortion + rate (fast distortion estimation, offset refinement)
  BO: band position (fast distortion estimation, offset refinement), rdcostBO = distortion + rate
  RD cost of each type (BO, EO class 0, EO class 1, EO class 2, EO class 3)
3) Merge left or up
  Left merge, up merge RD cost
[Flowchart: Compress slice -> Deblocking filter (DBF) -> Sample adaptive offset (SAO): RDO of SAO, Process SAO -> Encode slice]
FIGURE. Flowchart of sample adaptive offset
-
Slice-level on/off control of SAO
Hierarchical quantization parameter (QP) settings for each group of pictures
A slice-level on/off decision algorithm
  For a depth = 0 picture, SAO is always enabled in the slice header
  Other depths
    If the previous picture (the last picture of depth N-1 in decoding order) disables SAO for more than 75% of CTUs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers
[Figure: hierarchical group of pictures with pictures 8k, (8k+4), (8k+2), (8k+6), (8k+1), (8k+3), (8k+5), (8k+7) arranged over Depth = 0 to Depth = 3; a higher depth uses a higher QP]
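The slice-level rule reduces to a single comparison. A minimal sketch, with a hypothetical function name and with the CTU counts of the previous picture passed in explicitly:

```python
def sao_enabled_for_picture(depth, prev_disabled_ctus, prev_total_ctus):
    """Decide whether SAO is enabled in the slice headers of the current picture.

    prev_* refer to the last picture of depth - 1 in decoding order.
    """
    if depth == 0:
        return True  # depth-0 pictures always enable SAO
    # Disable SAO when more than 75% of the CTUs of that picture turned it off.
    return prev_disabled_ctus <= 0.75 * prev_total_ctus


# 1920x1080 with 64x64 CTUs gives 30 * 17 = 510 CTUs per picture.
print(sao_enabled_for_picture(2, 400, 510))
print(sao_enabled_for_picture(2, 100, 510))
```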
-
CTU-based encoding issues about SAO
Since SAO comes after the DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available
In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTU may be unavailable
Two practical CTU-based SAO decisions
  Case 1. Avoid using the bottom rows and right columns (current HM)
  Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)
[Figure: deblock-filtered pixels and non-deblock-filtered pixels inside a CTU]
TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
  Option 1: skip right and bottom samples in the CTU during parameter estimation
  Option 2: use pre-deblocked samples near the right and bottom boundaries in the CTU during parameter estimation
  CTU size in luma   Option 1 (Y / Cb / Cr)   Option 2 (Y / Cb / Cr)
  64x64              3.5% / 4.8% / 5.8%       3.3% / 5.3% / 6.6%
  32x32              2.0% / 1.1% / 1.5%       2.5% / 2.0% / 2.7%
  16x16              0.0% / 0.3% / 0.3%       0.8% / 0.4% / 0.1%
-
COMPLEXITY ANALYSIS OF HEVC ENCODER
-
Complexity analysis of HM encoder
Test sequences
  Sequences: Class B (1920x1080), Class C (832x480)
    Class B: Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
    Class C: BasketballDrill, BQMall, PartyScene, RaceHorses
  QP: 22, 27, 32, 37
  Main profile
  Random access, low delay
Test environment
  HM 7.0 software
  Intel Core(TM) [email protected]
  4 GB memory
  Windows 7 (64-bit)
  Analysis tool: Intel VTune(TM) Amplifier XE
FIGURE. Class B BasketballDrive
FIGURE. Class C BQMall
-
Profiling result of HEVC encoder
  Class  Module       QP 22  QP 27  QP 32  QP 37
  B      Entropy        6.6    3.4    1.0    0.9
  B      Intra          3.3    2.2    2.1    1.4
  B      Inter         68.4   78.1   83.9   85.7
  B      TR+Q          20.4   15.2   11.7   10.6
  B      Loop filter    0.2    0.2    0.2    0.1
  B      etc            1.2    1.1    1.3    1.5
  C      Entropy        6.5    3.9    2.8    1.3
  C      Intra          2.9    2.7    2.2    1.8
  C      Inter         68.8   74.9   79.8   83.3
  C      TR+Q          20.7   17.0   13.9   12.4
  C      Loop filter    0.2    0.2    0.2    0.1
  C      etc            1.0    1.5    1.4    1.2

  Class  Module       QP 22  QP 27  QP 32  QP 37
  B      Entropy        6.1    2.8    0.4    0.3
  B      Intra          3.4    2.0    1.2    1.2
  B      Inter         71.3   81.2   87.3   89.1
  B      TR+Q          18.6   13.0    9.9    8.5
  B      Loop filter    0.2    0.2    0.2    0.1
  B      etc            0.8    1.2    0.8    0.9
  C      Entropy        5.3    3.1    1.1    0.4
  C      Intra          3.0    2.5    1.8    1.5
  C      Inter         72.6   79.1   83.5   87.2
  C      TR+Q          18.2   14.9   12.1   10.1
  C      Loop filter    0.2    0.2    0.2    0.1
  C      etc            1.1    0.6    1.6    1.0
TABLE. Complexity ratio (%) of HM 7.0 encoder (RA)   TABLE. Complexity ratio (%) of HM 7.0 encoder (LD)
-
Complexity portions of HM encoder
  Inter prediction: 77 to 81%
  Tr + Q: 14 to 16%
  Entropy coding: 2 to 4%
  Intra prediction: 1 to 2%
  Loop filter: 0.1 to 0.2%
FIGURE. HEVC encoder block diagram and profiling result (inter prediction, transform + Q, intra prediction, loop filter, entropy coding, etc.)
-
Complexity portions for CU sizes and modes
FIGURE. Example of CU quadtree structure
TABLE. Complexity portions for CU sizes and modes
  Size   Mode   RA (%)  LD (%)  Average (%)
  64x64  Intra    2.1     1.0      1.6
  64x64  Inter   19.0    31.9     25.5
  64x64  Skip     3.9     3.4      3.7
  32x32  Intra    1.9     0.7      1.3
  32x32  Inter   25.0    27.4     26.2
  32x32  Skip     4.5     3.2      3.9
  16x16  Intra    2.3     0.2      1.3
  16x16  Inter   17.0    12.5     14.8
  16x16  Skip     3.2     1.7      2.5
  8x8    Intra    2.4     0.4      1.4
  8x8    Inter    8.7     4.9      6.8
  8x8    Skip     1.7     0.6      1.2
-
Selected ratios of CU, PU and TU
TABLE. Selected ratio of CU size and PU mode
                         Class B (QP)              Class C (QP)
  CU size  PU mode       22    27    32    37      22    27    32    37
  64x64    Merge skip    10.6  26.6  43.3  55.2    11.7  20.6  30.6  39.5
  64x64    Inter 2Nx2N    4.5   7.1   7.2   6.0     5.8   7.5   6.7   5.5
  64x64    Inter Nx2N     1.4   2.2   1.8   1.3     1.6   1.8   1.7   1.7
  64x64    Inter 2NxN     1.5   1.9   1.3   0.9     1.2   1.0   0.8   0.7
  64x64    Inter AMP      1.2   1.4   1.0   0.7     1.0   1.1   1.0   1.1
  64x64    Intra 2Nx2N    0.3   0.4   0.6   1.0     0.0   0.0   0.0   0.1
  32x32    Merge skip     9.9  12.4  19.9   8.4    12.2  13.5  15.2  16.8
  32x32    Inter 2Nx2N    8.1   6.9   4.6   3.1     9.1   7.2   5.4   4.3
  32x32    Inter Nx2N     1.8   1.4   0.9   0.4     2.2   1.9   1.9   1.7
  32x32    Inter 2NxN     1.7   1.3   0.7   1.0     1.4   1.0   0.9   0.8
  32x32    Inter AMP      4.4   2.9   1.6   0.6     4.2   3.5   3.1   2.6
  32x32    Intra 2Nx2N    2.3   2.3   2.6   2.6     0.2   0.4   0.7   1.1
  16x16    Merge skip     6.8   5.6   3.9   2.9     8.0   7.7   7.3   6.1
  16x16    Inter 2Nx2N    9.1   3.7   1.7   0.8     6.9   4.8   3.1   2.0
  16x16    Inter Nx2N     1.6   0.7   0.3   0.1     2.0   1.4   1.0   0.6
  16x16    Inter 2NxN     1.7   0.6   0.2   0.1     1.2   0.8   0.5   0.3
  16x16    Inter AMP      4.1   1.4   0.5   0.2     4.1   2.7   1.7   0.9
  16x16    Intra 2Nx2N    2.6   2.1   1.7   1.4     1.2   1.6   1.8   1.7
  8x8      Merge skip     2.8   1.9   1.2   0.9     3.9   3.3   2.3   1.4
  8x8      Inter 2Nx2N    5.8   1.3   0.4   0.1     4.9   2.5   1.1   0.4
  8x8      Inter Nx2N     0.3   0.2   0.1   0.0     1.2   0.7   0.3   0.1
  8x8      Inter 2NxN     0.4   0.2   0.1   0.0     0.7   0.4   0.2   0.1
  8x8      Intra 2Nx2N    2.9   1.2   0.1   0.5     2.1   1.7   1.2   0.8
  8x8      Intra NxN      0.8   0.6   0.7   0.2     1.9   1.1   0.6   0.3
TABLE. Selected ratio of TU
  Class  Size    QP 22  QP 27  QP 32  QP 37
  B      32x32    33.5   55.0   63.0   65.7
  B      16x16    19.8   20.9   20.1   19.7
  B      8x8      36.2   15.5   10.7   10.0
  B      4x4      10.5    8.5    6.2    4.5
  C      32x32    35.7   43.4   49.2   52.2
  C      16x16    27.7   27.7   27.5   29.0
  C      8x8      21.7   18.1   15.8   13.9
  C      4x4      14.8   10.8    7.5    4.9
-
BDBR vs. encoding time depending on CTU size
CTU size 32x32: 3.3 to 3.4% BD-bitrate, 78 to 79% encoding time
CTU size 16x16: 15.4 to 17.5% BD-bitrate, 50 to 54% encoding time
[Figure: two plots relative to the CTU size 64x64 reference]
  Plot 1: CTU size 16x16, Enc T 50.8%, BD-bitrate 17.53%; CTU size 32x32, Enc T 79.22%, BD-bitrate 3.31%
  Plot 2: CTU size 16x16, Enc T 54.7%, BD-bitrate 15.43%; CTU size 32x32, Enc T 78.92%, BD-bitrate 3.43%
SW: HM 7.1, Seq: Class B, cfg: Random access & Low delay
-
BDBR vs. encoding time depending on TU size
Transform size
  16x16 to 4x4 on case: 3.2 to 3.5% BD-bitrate, 96% encoding time
  8x8 to 4x4 on case: 10.2 to 11.2% BD-bitrate, 91 to 92% encoding time
[Figure: two plots relative to the reference with max TU size 32x32, quadtree max depth 3]
  Plot 1: max TU size 8x8, quadtree max depth 1, Enc T 92.4%, BD-bitrate 11.2%; max TU size 16x16, quadtree max depth 2, Enc T 96.8%, BD-bitrate 3.2%
  Plot 2: max TU size 8x8, quadtree max depth 1, Enc T 91.4%, BD-bitrate 10.24%; max TU size 16x16, quadtree max depth 2, Enc T 96.5%, BD-bitrate 3.5%
SW: HM 7.1, Seq: Class B, cfg: Random access & Low delay
-
Tool on/off test
-
Fast encoding algorithms in HM software
TABLE. Fast encoding algorithms in HM software
  Contents                                                    Note
  Fast Encoding Setting: FEN, JCTVC-A0124
    Early CU termination
    Subsampled SAD operation
    Simple bi-prediction (number of iterations 4 -> 1)
  Fast Decision for Merge RD Cost: FDM, JCTVC-H178            PU level
    2Nx2N Merge CBF early termination
  Rough Mode Decision (for Intra): RMD, JCTVC-C311/D283       PU level
    35 Intra mode SATD RD RD RD Full RQT
  AMP Speedup: AMPS, JCTVC-E316                               PU level
    AMP ME or Merge
  CBF Fast Mode Setting: CFM, JCTVC-F045                      PU level
    PU CBF 0 PU ME
  Early CU Setting: ECU, JCTVC-F092                           CU level
    CU Skip, CU
  Early Skip Detection Setting: ESD, JCTVC-G543               CU level
    Inter 2Nx2N Early Skip Detection
-
IPSL
-
HM encoder for FHD (BQTerrace.seq)
CPU: Intel i7 CPU, 2.x GHz
  Compress Slice
    Interpolation filter (IF)
    Motion estimation (ME)
    Transform-Quantization (TR-Q)
    Intra prediction
    MV derivation
    Mode decision
    Entropy encoding (CABAC update)
  DBF
  SAO
  Encode Slice
    Entropy encoding
Measured time
  One frame: 57930 ms (for real-time? 33.33 ms)
  IF: 21548.62 ms
  RDOQ: 2645.55 ms
  TR: 1687.37 ms
  ITR: 653.2829 ms
  DBF: 9.42 ms
  SAO: 77.33 ms
-
KWHEVC encoder
ANSI C HEVC encoder software based on the HM encoder
  Clean-up of functions and variables
  Non-recursive function calls
Minimum memory allocation and bandwidth
  Explicit minimum memory allocations (using static memory)
  Removal of code related to duplicate variables and structures to avoid redundant memory copies
  Removal of unnecessary memory allocation
Software optimization
  SIMD implementation (cost function, transform, interpolation, deblocking, ...)
  Frame-level interpolation filter
Parallel processing
  Slice-level parallel processing using OpenMP
  Motion estimation using CUDA
-
Performance of KWHEVC
1) C converting: 18% ATS gain (any BDBR, BDPSNR loss)
2) + SIMD + frame-level IF: 2x speed-up (any BDBR, BDPSNR loss)
3) + Fast mode decision: 5x speed-up (1 to 2% BDBR loss)
4) + Slice-level parallel: 20x speed-up (4 to 6% BDBR loss)
5) + CUDA ME & MD (low delay P, adjustment Config.): 200x speed-up (15 to 20% BDBR loss)
  {Intel i7 (3.3 GHz), GeForce 660} => 10 fps
FIGURE. Encoding speed in terms of the development steps
TABLE. Encoding speed of KWHEVC (FPS)
  Class  Sequence         Frames  QP 22  QP 27  QP 32  QP 37
  B      Kimono           240      5.74   7.25   8.38   9.40
  B      ParkScene        240      5.51   7.52   8.87  10.03
  B      Cactus           500      5.19   7.70   9.09  10.09
  B      BasketballDrive  500      4.80   6.71   8.09   9.18
  B      BQTerrace        600      4.14   7.68   9.60  10.62
  C      BasketballDrill  500     14.86  19.07  23.60  28.12
  C      BQMall           600     14.81  19.88  24.91  29.20
  C      PartyScene       500     11.09  16.46  22.03  27.60
  C      RaceHorses       300     10.48  14.60  19.46  24.49
-
Comparison of decoder complexity
HM 10.0 (C++) vs. KWHEVC decoder (C89)
  C conversion
  Software optimization
Decoding performance
  Sequence                           HM 10.0 (sec)  FPS    KWHEVC (sec)  FPS    Ratio
  BQTerrace_1920x1080_60_qp22.bin    98.271          6.11  71.007         8.45  1.38
  BQTerrace_1920x1080_60_qp27.bin    46.531         12.89  30.778        19.49  1.51
  BQTerrace_1920x1080_60_qp32.bin    32.737         18.33  19.234        31.19  1.70
  BQTerrace_1920x1080_60_qp37.bin    28.189         21.28  15.912        37.71  1.77
  Cactus_1920x1080_50_qp22.bin       51.355          9.74  36.270        13.79  1.42
  Cactus_1920x1080_50_qp27.bin       31.371         15.94  20.155        24.81  1.56
  Cactus_1920x1080_50_qp32.bin       25.506         19.60  15.381        32.51  1.66
  Cactus_1920x1080_50_qp37.bin       21.933         22.80  12.792        39.09  1.71
-
Parallelism and SIMD processing
Parallelism
  A decoder cannot expect the tile or slice partitioning of pictures
  A decoder should consider worst-case bitstreams
  The entropy decoder cannot be parallelized
  CTU-based 2D wavefront parallel processing is a promising way to achieve parallelism
  The deblocking filter and SAO are better suited to parallelism
    Less data dependency
SIMD processing
  Inverse transform ($X = A^T Y A$)
  Motion compensation
    About 40% of decoder complexity
    8-tap and 4-tap filters
-
Performance of the optimized KWHEVC decoder
SIMD and parallelization
  Pixel reconstruction, interpolation (partial)
  Task-level parallelism (entropy, pixel decoding)
  Data-level parallelism (deblocking filter)
2.934.98
2.28Mbps
-
Conclusion
Overview of HEVC
Encoding parameters for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing
-