A Near Real-Time Decoding for LDPC-Based Distributed Video Coding Using CUDA

CMLab, CSIE, NTU

Su, Tse-Chung (蘇則仲)
Advisor: Prof. Wu, Ja-Ling (吳家麟)
2011/6/9
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA
- Early Stop Detection Mechanism Using CUDA
- Evaluation of Decoding Speed
- Conclusions and Future Work
Conventional Video Codec
MPEG-2, H.264, HEVC (H.265)
Heavyweight ENCODER, Lightweight DECODER
Distributed Video Coding (DVC)
A new paradigm for video compression
Lightweight ENCODER, Heavyweight DECODER
Application of DVC
Video conferencing with mobile devices: a DVC-to-H.264 transcoder running as a real-time system.
[Diagram: the mobile device runs a low-complexity DVC encoder and sends the DVC encoded bitstream to the cloud (computational resource), which transcodes it into an H.264 encoded bitstream for a low-complexity H.264 decoder on the receiving device.]
Distributed Video Coding
[Diagram: the Channel Encoder / Channel Decoder pair is realized by an LDPC Encoder at the DVC encoder and an LDPC Decoder at the DVC decoder.]
D. Varodayan, A. Aaron, and B. Girod, "Rate-Adaptive Codes for Distributed Source Coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.
Decoding Complexity of DVC
Our DVC codec (state-of-the-art), parallelized with OpenMP and CUDA on 12 CPU cores + a GPGPU (Fermi), still runs at only ~1 FPS: the DECODER remains heavyweight.

Amdahl's law: the maximum speedup is reached by improving the most time-critical part of the system, which is LDPC decoding in the DVC decoder. LDPCA decoding accounts for 86%~94% of total decoding time; after the optimizations presented here it drops to 29%~36% of the total, and the decoder reaches 15.39 FPS (QCIF).
LDPC Decoding: Sum-Product Algorithm (Message Passing)
[Tanner graph: variable nodes 1-7 exchange messages a-g with three check nodes. Soft side information (real numbers) enters at the variable nodes; the syndrome bits (0, 1, 1) from the DVC encoder enter at the check nodes. Each iteration alternates vertical processing (variable-node updates) and horizontal processing (check-node updates), refining the messages from a1..g1 up to a25..g25; a final hard decision on each variable node gives the decoded output.]
Kschischang, F.R., Frey, B.J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory.
Sum-Product Algorithm: Vertical Processing
[Figure: a variable node with channel value a combines the messages arriving from its check nodes.]
From the figure, the message P sent to one check node is the sum of the messages from the other check nodes plus the channel value: P = K + F + a, and likewise Z = F + P + a.
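As a CPU reference for the vertical update (function and variable names here are illustrative, not from the thesis code): the message to each check node is the channel value plus the sum of the messages from all other check nodes, which can be computed with one total and a per-edge subtraction:

```cpp
#include <cassert>
#include <vector>

// Vertical processing: the message a variable node sends to check node j is
// its channel value plus the incoming messages from all OTHER check nodes
// (e.g. P = K + F + a on the slide). Computing the full total once and
// subtracting per edge avoids re-summing for every outgoing message.
std::vector<double> variable_node_update(double channel,
                                         const std::vector<double>& from_checks) {
    double total = channel;
    for (double m : from_checks) total += m;
    std::vector<double> out(from_checks.size());
    for (std::size_t j = 0; j < from_checks.size(); ++j)
        out[j] = total - from_checks[j];  // exclude the target edge's own input
    return out;
}
```

For example, a channel value of 1.0 with incoming check messages {2.0, 3.0, 4.0} yields outgoing messages {8.0, 7.0, 6.0}.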
Sum-Product Algorithm: Horizontal Processing
[Figure: a check node with its syndrome bit combines the incoming variable-to-check messages P, Q, R, S, T.]
The outgoing magnitude on an edge excludes that edge's own input, e.g.
mag(to P) = φ(φ(|Q|) + φ(|R|) + φ(|S|) + φ(|T|))
mag(to S) = φ(φ(|P|) + φ(|Q|) + φ(|R|) + φ(|T|))
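A CPU reference sketch of this magnitude computation, assuming the standard φ(x) = -ln(tanh(x/2)) (which is its own inverse on positive reals); the helper names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// phi(x) = -ln(tanh(x / 2)); on (0, inf), phi is its own inverse.
double phi(double x) { return -std::log(std::tanh(x / 2.0)); }

// Outgoing magnitude on edge `skip`: phi of the sum of phi(|m|) over the
// OTHER edges, e.g. mag(to P) = phi(phi(|Q|) + phi(|R|) + phi(|S|) + phi(|T|)).
double check_node_magnitude(const std::vector<double>& msgs, std::size_t skip) {
    double sum = 0.0;
    for (std::size_t i = 0; i < msgs.size(); ++i)
        if (i != skip) sum += phi(std::fabs(msgs[i]));
    return phi(sum);
}
```

Because φ is decreasing and self-inverse, adding more terms inside shrinks the result: the outgoing magnitude is dominated by the weakest incoming message, which is the expected behavior of a check-node update.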
LDPC Accumulate (LDPCA) Codes
- Rate adaptivity
- 65 LDPC codes
D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Section on Distributed Source Coding, 2006.
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA (Kernel Design)
- Early Stop Detection Mechanism Using CUDA
- Evaluation of Decoding Speed
- Conclusions and Future Work
Vertical Processing Kernel (VPK)
- Column degree is a constant 3 (regular LDPC)
- Uses shared memory

Horizontal Processing Kernel (HPK)
- Each message is updated by one thread (SIMD)
- Row degree varies within each LDPC code
- Data structure: circular linked list

Previous CUDA implementation
[Figure: messages A-L live in global memory at positions 0-11; the index array (1 2 3 4 0 6 7 5 9 10 11 8) stores, for each position, the position of the next message in the same check-node row, wrapping around. Each CUDA thread block copies its portion into its own shared memory.]
Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for LDPC based distributed video coding. In Proc. of ACM International Conference on Multimedia.
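That circular index layout can be exercised with a small CPU sketch (the 12-entry array below is the one from the figure; `row_members` is an illustrative helper, not code from the paper):

```cpp
#include <cassert>
#include <vector>

// index[i] = position of the next message in the same check-node row,
// wrapping around: rows {A..E}, {F..H}, {I..L} from the figure.
const std::vector<int> kNext = {1, 2, 3, 4, 0, 6, 7, 5, 9, 10, 11, 8};

// Walk the circular list to collect every message position in start's row.
std::vector<int> row_members(const std::vector<int>& next, int start) {
    std::vector<int> row{start};
    for (int i = next[start]; i != start; i = next[i]) row.push_back(i);
    return row;
}
```

A thread holding any slot can thus visit its whole row without knowing the row boundaries in advance, which is what lets one SIMD kernel handle rows of different degree.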
CUDA Implementation Strategy 1: Reduction of φ Function in HPK, Texture Memory in VPK

Texture Binding in VPK: the read-only message data in global memory is bound to texture memory, which gives a speedup on both 1.x and 2.x compute capability and tolerates the non-coalesced reads of the VPK access pattern.

LDPCA decoding time in the previous CUDA implementation (100 iterations):

LDPC (n, m)     Decoding time (HPK + VPK)
(1584, 48)      8.29 ms + 1.49 ms
(1584, 192)     3.40 ms + 1.52 ms
(1584, 336)     3.04 ms + 1.53 ms
(1584, 480)     2.31 ms + 1.55 ms
(1584, 624)     2.29 ms + 1.54 ms
(1584, 768)     2.00 ms + 1.52 ms
(1584, 912)     1.82 ms + 1.52 ms
(1584, 1056)    1.81 ms + 1.52 ms
(1584, 1200)    1.79 ms + 1.50 ms
(1584, 1344)    1.79 ms + 1.51 ms
(1584, 1488)    1.78 ms + 1.56 ms
Reduction of φ Function in HPK

Naive HPK: each thread copies its message from global memory (via the circular index) to shared memory and then computes the full nested sum itself, e.g.
t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))

Optimized: each thread calculates φ(|·|) of its own message once, before copying to shared memory (shared memory then holds φ(|A|), φ(|B|), ..., φ(|H|)), so the inner φ values are shared by all threads of the row. The number of φ(x) evaluations per outgoing message drops from the row degree to 2.
LDPCA Performance -- foreman sequence (QCIF)

Step                                   LDPCA Time   Speedup   Cumulative Speedup
Previous Implementation                124.47 sec
Strategy 1: reduce φ, texture memory   52.94 sec    2.35x     2.35x
CUDA Implementation Strategy 2: Parallel Partial Reduction in HPK

Parallel reduction with sequential addressing (conflict free):

Values (shared memory): 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Step 1, stride 8 (threads 0-7):  8 -2 10 6 0 9 3 7 ...
Step 2, stride 4 (threads 0-3):  8 7 13 13 ...
Step 3, stride 2 (threads 0-1):  21 20 ...
Step 4, stride 1 (thread 0):     41 (the total, left in element 0)

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology.
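The stride pattern above can be simulated on the CPU; each inner pass mirrors one synchronized step of the CUDA threads (this is a sketch of the access pattern, not the thesis kernel):

```cpp
#include <cassert>
#include <vector>

// Sequential-addressing reduction: at each step, "thread" i adds the element
// stride positions away; halving the stride leaves the total in element 0.
int reduce_sequential_addressing(std::vector<int> v) {
    for (std::size_t stride = v.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // one pass = one synchronized step
            v[i] += v[i + stride];
    return v[0];
}
```

Feeding it the slide's sixteen values leaves the total, 41, in element 0, matching the worked example step by step.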
Computation Overlapping in HPK

Naive per-thread computation repeats the whole sum for every edge:
t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))
t3: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|E|))
t4: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|))
Equivalently, the full row magnitude Mag = φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|) is computed once by parallel partial reduction, and each thread then subtracts only its own term:
t0: φ(Mag - φ(|A|))
t1: φ(Mag - φ(|B|))
t2: φ(Mag - φ(|C|))
t3: φ(Mag - φ(|D|))
t4: φ(Mag - φ(|E|))

Parallel Partial Reduction
[Figure: rows of different degree are reduced side by side in shared memory. Row A..E is padded with three zeros to width 8 and rows F..H (padded with one zero) and I..L occupy width-4 slots, so each row's magnitude (Mag0, Mag1, ...) is obtained in log2(rowDeg) = 3 strided steps; padding slots carry index (0,0) and their threads stay idle. Each message carries a (rowDeg, rowPos) index pair, e.g. (8,0)..(8,4) for the first row and (4,0)..(4,3) for the last, so one kernel handles mixed row degrees. In the worked example, φ values 0.1 0.2 0.4 0.7 0.3 0 0 0 reduce to Mag0 = 1.7 and 0.1 0.7 0.4 0 reduce to Mag1 = 1.2.]
CUDA Implementation Strategy 3: Check Node Re-ordering & Complete Unrolling

Check Node Re-ordering
[Figure: check nodes are sorted by row degree so that each CUDA thread block processes rows of equal degree in its shared memory; the messages are re-ordered accordingly: A B C D E I J K L M F G H.]

Generic loop version (redundant if/else and __syncthreads(); branch divergence harms performance):

    int i = threadIdx.x;
    int half = rowDeg >> 1;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    do {
        if (rowPos < half) {
            s_mag[i]  += s_mag[i + half];
            s_sign[i] ^= s_sign[i + half];
        }
        half >>= 1;
        __syncthreads();
    } while (half);
    int base = i - rowPos;
    myMag  = s_mag[base] - myMag;
    mySign = s_sign[base] ^ mySign;

Completely unrolled version (no branch divergence):

    int i = threadIdx.x;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    if (rowDeg == 16) {
        s_mag[i] += s_mag[i + 8];  s_sign[i] ^= s_sign[i + 8];
        s_mag[i] += s_mag[i + 4];  s_sign[i] ^= s_sign[i + 4];
        s_mag[i] += s_mag[i + 2];  s_sign[i] ^= s_sign[i + 2];
        s_mag[i] += s_mag[i + 1];  s_sign[i] ^= s_sign[i + 1];
    } else if (rowDeg == 8) {
        .....
    }
    int base = i - rowPos;
    myMag  = s_mag[base] - myMag;
    mySign = s_sign[base] ^ mySign;

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology.
CUDA Implementation Strategy 4: Combination of VPK and HPK

Launching VPK and HPK as separate kernels every iteration (HPK, VPK, HPK, VPK, ...) pays the kernel launch overhead repeatedly (NVIDIA CUDA Programming Guide 3.2, Sec. 5.2.1) and has two further costs:
1. Parallelism is broken (implicit inter-block synchronization)
2. Extra global memory traffic
The two kernels are therefore merged into a single Unified Message Kernel: VPK + HPK = UMK.
LDPCA Performance -- foreman sequence (QCIF)

Step                                                      Time         Speedup   Cumulative Speedup
Previous Implementation                                   124.47 sec
Strategy 1: reduce φ, texture memory                      52.94 sec    2.35x     2.35x
Strategy 2: PPR in HPK                                    40.66 sec    1.30x     3.06x
Strategy 3: Merge HPK & VPK                               28.80 sec    1.41x     4.32x
Strategy 4: Check Node Re-ordering & Complete Unrolling   22.29 sec    1.29x     5.58x
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA
- Early Stop Detection Mechanism Using CUDA (CUDA API)
- Evaluation of Decoding Speed
- Conclusions and Future Work
CUDA Implementation Strategy 5: Early Stop Detection

Early Stop Detection in the Sum-Product Algorithm
With a fixed schedule, the GPU simply runs SPA iterations 1 through 100 back to back (each iteration = horizontal processing + vertical processing). With early stopping, after each UMK iteration the codeword and decoded info are transmitted over PCI-E and checked by the early stop detection kernel (EDK), with the CPU inspecting the result per iteration, so decoding can be terminated as soon as either
1. the codeword is successfully decoded, or
2. the decoder converges to a wrong codeword
(e.g. terminated at iteration 30 instead of running all 100).
Combination of EDK and UMK
- The SPA is memory intensive in CUDA, and the index data of UMK is also used by early stop detection (EDK)
- EDK + UMK = EDUMK
- ~14% additional complexity in terms of execution time
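Conceptually, the early stop test is a syndrome check on the current hard decisions; a CPU sketch follows (the row layout and names are illustrative; the thesis performs this on the GPU using UMK's index data):

```cpp
#include <cassert>
#include <vector>

// rows[r] lists the variable-node positions covered by check row r. Decoding
// may stop once every row's XOR of hard-decision bits equals its syndrome bit.
bool syndrome_satisfied(const std::vector<std::vector<int>>& rows,
                        const std::vector<int>& hard_bits,
                        const std::vector<int>& syndrome) {
    for (std::size_t r = 0; r < rows.size(); ++r) {
        int parity = 0;
        for (int v : rows[r]) parity ^= hard_bits[v];
        if (parity != syndrome[r]) return false;  // keep iterating
    }
    return true;  // converged: terminate early
}
```

Note that this test only detects convergence; as the slides point out, the decoder can also converge to a wrong codeword, which this check cannot distinguish from a correct one.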
Concurrent Kernel Execution and Data Transfer
[Ideal timeline: the GPU keeps launching UMK iterations back to back while, concurrently, the PCI-E transfer ships iteration i's codeword and decoded info and the early stop detection for iteration i is performed; e.g. the detection for iteration 1 overlaps the UMK runs for iterations 2 and 3, and the detection for iteration 5 overlaps the UMK run for iteration 6 and beyond, so checking never stalls decoding.]
Practical CUDA Implementation of Early Stop Detection
- One CPU thread, one GPU
- The CUDA Driver API is used instead of the Runtime API
- Nearly no stream management instructions (cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent()) are needed
[Timeline: three streams keep #overlap = 3; EDUMK launches in one stream overlap UMK launches and PCI-E transfers in the others, and the host performs explicit synchronization only when it consumes a detection result.]
Speed-up Ratio of Early Stop Detection
- Total number of LDPCA iterations: 20000 with a fixed iteration count vs. 10000 with early stop detection, i.e. a 2.0x theoretical speedup
- With ~10% overhead, the actual speedup is 1.8x
- Overhead on CPU: 5%
- Overhead on GPU: 20% using the Runtime API, 7% using the Driver API
LDPCA Performance -- foreman sequence (QCIF)

Step                                                                     Time         Speedup   Cumulative Speedup
Previous Implementation                                                  124.47 sec
Strategy 1 (fix 100 iter): reduce φ, texture memory                      52.94 sec    2.35x     2.35x
Strategy 2 (fix 100 iter): PPR in HPK                                    40.66 sec    1.30x     3.06x
Strategy 3 (fix 100 iter): Merge HPK & VPK                               28.80 sec    1.41x     4.32x
Strategy 4 (fix 100 iter): Check Node Re-ordering & Complete Unrolling   22.29 sec    1.29x     5.58x
Strategy 5 (max 100 iter): Early Stop Detection (Driver API)             10.86 sec    2.02x     11.27x

449.63x faster than the sequential program!
Test Conditions
- CPU: Intel(R) Xeon(R) X5650 @ 2.67 GHz, 12 cores / 24 logical processors
- GPU: Tesla M2050: 14 MPs x 32 cores/MP = 448 cores; CUDA compute capability 2.0; 48 KB shared memory; up to 1024 threads per block; concurrent copy and execution; concurrent kernel execution
- Test sequences: QCIF, 15 Hz, all frames; GOP size 8; Qindex 8; bitrate and PSNR measured on the luminance component only
- Sequences, from high to low motion: Soccer, Foreman, Coastguard, Hall Monitor
Speedup Ratio of the LDPCA Decoder Using CUDA
[Chart, per test sequence (Soccer, Foreman, Coastguard, Hall Monitor): the overall decoder frame rate rises from 0.79, 0.96, 1.05, and 1.14 fps to 4.99, 7.14, 10.29, and 15.39 FPS respectively (overall speedups of 6.32x, 7.43x, 9.8x, and 13.5x), with LDPCA-only speedups of 12.88x, 15.35x, 22.51x, and 36.91x; the cost is a 0.2% bit-rate increase.]
LDPCA decoding time comparison

Proposed implementation:
GPU           100 iter (QCIF)   50 iter (QCIF)   100 iter (CIF)   50 iter (CIF)
9800 GTX      1.93~1.83 ms      1.09~1.27 ms     3.26~3.34 ms     1.87~2.12 ms
Tesla T10     1.23~1.26 ms      0.67~0.70 ms     2.39~2.52 ms     1.27~1.34 ms
Tesla C2050   0.55~0.60 ms      0.29~0.31 ms     1.25~1.34 ms     0.65~0.69 ms

Previous work:
GTX260        35 ms             18 ms            46 ms            24 ms

GPU specifications:
                     GeForce 9800 GTX+   Tesla C1060   GeForce GTX260   Tesla C2050
Compute Capability   1.1                 1.3           1.3              2.0
MP x Cores/MP        16x8                30x8          27x8             14x32

Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of wyner-ziv video codec using gpgpu. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.
Real-time Decoding Quality
[Figure: decoded frames shown side by side with the original sequences, at 27.44 dB / 76 kbps, 29.21 dB / 93.17 kbps, 35.34 dB / 263.52 kbps, and 39.46 dB / 147.64 kbps.]
Conclusion
- A fully parallelized LDPCA decoder using CUDA with various features
- The proposed early stop detection mechanism reduces the latency between the CPU and the GPU
- Surveillance videos (e.g. Hall Monitor) can be decoded in real time with negligible RD performance loss
Future Work
- Bitplane-level parallelization for LDPCA
- UV components
- Frame-level parallelization
Thank You