A near real time decoding for LDPC based distributed video coding using CUDA



Page 1: A near real time decoding for  LDPC based distributed video coding  using CUDA

A near real time decoding for LDPC based distributed video coding using CUDA

CMLab, CSIE, NTU

Su, Tse-Chung (蘇則仲)
Advisor: Prof. Wu, Ja-Ling (吳家麟)

2011/6/9

Page 2: A near real time decoding for  LDPC based distributed video coding  using CUDA

Outline
  Motivation and Introduction
  LDPC decoding & LDPCA in DVC
  Parallel LDPCA Decoding in CUDA
  Early Stop Detection Mechanism Using CUDA
  Evaluation of Decoding Speed
  Conclusions and future work

Page 3: A near real time decoding for  LDPC based distributed video coding  using CUDA

Conventional Video Codec
  MPEG-2, H.264, HEVC (H.265)
  ENCODER: heavyweight
  DECODER: lightweight

Page 4: A near real time decoding for  LDPC based distributed video coding  using CUDA

Distributed Video Coding (DVC)
  A new paradigm for video compression
  ENCODER: lightweight
  DECODER: heavyweight

Page 5: A near real time decoding for  LDPC based distributed video coding  using CUDA

Application of DVC
  Video conferencing with mobile devices
  [Diagram: a DVC encoder (low complexity) sends a DVC encoded bitstream to a DVC to H.264 transcoder running on cloud computational resources; the transcoder sends an H.264 encoded bitstream to an H.264 decoder (low complexity). The chain is meant to run as a realtime system.]

Page 6: A near real time decoding for  LDPC based distributed video coding  using CUDA

Distributed Video Coding
  [Diagram labels: Channel Encoder and Channel Decoder, realized as LDPC Encoder and LDPC Decoder]

D. Varodayan, A. Aaron, and B. Girod, "Rate-Adaptive Codes for Distributed Source Coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.

Page 7: A near real time decoding for  LDPC based distributed video coding  using CUDA

Decoding Complexity of DVC
  Our DVC codec (state-of-the-art)
  Parallelized with OpenMP and CUDA
  12 core + GPGPU (Fermi)
  ~1 FPS
  DECODER: heavyweight

Page 8: A near real time decoding for  LDPC based distributed video coding  using CUDA

Amdahl's law
  The maximum speedup is reached by improving the most critical part of the system: LDPC decoding in the DVC decoder.
  [Pie charts of decoding time, QCIF. Chart 1: LDPCA 86%~94%, others the remainder. Chart 2: LDPCA 29%~36%, others the remainder, at 15.39 FPS.]
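As a rough worked example of Amdahl's law, assume the 86%~94% figure is the fraction p of total decoding time spent in LDPCA, and let s be the speedup applied to that part only (p and s are symbols introduced here, not taken from the slides). The bound on the overall speedup S then follows directly:

```latex
S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
S_{\max}(p) = \lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}

S_{\max}(0.86) = \frac{1}{0.14} \approx 7.1,
\qquad
S_{\max}(0.94) = \frac{1}{0.06} \approx 16.7
```

Under that assumption, accelerating LDPCA alone cannot make the whole decoder more than roughly 7x to 17x faster, which is why the remaining stages start to matter once LDPCA is fast.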

Page 9: A near real time decoding for  LDPC based distributed video coding  using CUDA

Outline
  Motivation and Introduction
  LDPC decoding & LDPCA in DVC
  Parallel LDPCA Decoding in CUDA
  Early Stop Detection Mechanism Using CUDA
  Evaluation of Decoding Speed
  Conclusions and future work

Page 10: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPC decoding
Sum-Product Algorithm (Message Passing)
  [Figure: factor graph with variable nodes a to g (numbered 1 to 7) and three check nodes. Side information (real numbers) enters at the variable nodes; syndrome bits from the DVC encoder enter at the check nodes. Messages alternate between horizontal processing and vertical processing; the variable-node values are labeled a1 ... g1 up to a25 ... g25, and the decode output is a hard decision.]

Kschischang, F.R., Frey, B.J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory.

Page 11: A near real time decoding for  LDPC based distributed video coding  using CUDA

Sum-Product Algorithm
Vertical Processing
  [Figure: message matrix with rows (A B C D E), (F G H I J), (K L M N O), syndrome bits 0, 1, 1, and variable nodes a to g; P and Z mark updated messages.]
  P = K + F + a
  Z = F + P + a

Page 12: A near real time decoding for  LDPC based distributed video coding  using CUDA

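Sum-Product Algorithm
Horizontal Processing
  [Figure: message matrix with rows (P Q R S T), (U V W X Y), (Z A B C D), syndrome bits 0, 1, 1, and variable nodes a to g; H and K mark updated messages in the first row.]
  H mag = φ(φ(|Q|) + φ(|R|) + φ(|S|) + φ(|T|))
  K mag = φ(φ(|P|) + φ(|Q|) + φ(|R|) + φ(|T|))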
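The magnitude update above can be sketched on the CPU as follows. This is a minimal illustration, not the deck's kernel: it assumes the usual sum-product definition φ(x) = -ln(tanh(x/2)), which the slides do not spell out, and the function names are hypothetical. Sign and syndrome handling are left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed standard definition (not given on the slides):
// phi(x) = -ln(tanh(x / 2)) for x > 0. phi is its own inverse.
inline double phi(double x) {
    return -std::log(std::tanh(0.5 * x));
}

// Horizontal processing for one check node (one row), magnitudes only.
// in[k] is the incoming message on edge k; out[k] combines every
// message of the row except in[k] itself.
std::vector<double> horizontal_magnitudes(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t k = 0; k < in.size(); ++k) {
        double sum = 0.0;
        for (std::size_t j = 0; j < in.size(); ++j) {
            if (j != k) sum += phi(std::fabs(in[j]));
        }
        out[k] = phi(sum);
    }
    return out;
}
```

For a row with only two messages, each output magnitude is simply the magnitude of the other message, because φ(φ(x)) = x.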

Page 13: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPC Accumulate (LDPCA) codes
  Rate adaptivity

D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Section on Distributed Source Coding, 2006.

Page 14: A near real time decoding for  LDPC based distributed video coding  using CUDA

[Figure: 65 LDPC codes]

Page 15: A near real time decoding for  LDPC based distributed video coding  using CUDA

Outline
  Motivation and Introduction
  LDPC decoding & LDPCA in DVC
  Parallel LDPCA Decoding in CUDA (Kernel Design)
  Early Stop Detection Mechanism Using CUDA
  Evaluation of Decoding Speed
  Conclusions and future work

Page 16: A near real time decoding for  LDPC based distributed video coding  using CUDA

Previous CUDA implementation
  Vertical Processing Kernel (VPK)
    Column degree is constant 3 (regular LDPC)
    Shared memory
  Horizontal Processing Kernel (HPK)
    Each message can be updated by one thread (SIMD)
    Variable row degree in each LDPC code
    Data structure: circular linked list
  [Figure: CUDA thread blocks 0 and 1, each with its own shared memory.
   Position:     0  1  2  3  4  5  6  7  8  9 10 11
   Message data: A  B  C  D  E  F  G  H  I  J  K  L
   Index data:   1  2  3  4  0  6  7  5  9 10 11  8
   CUDA thread 3 is highlighted.]

Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for LDPC based distributed video coding. In Proc. of ACM International Conference on Multimedia.
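The circular linked list in the index data can be read as "index[i] is the position of the next message in the same row". A minimal CPU sketch of how one thread could walk its row, assuming that reading of the figure (the function name is hypothetical):

```cpp
#include <vector>

// next[i] holds the position of the next message in the same row (check
// node); the last message of a row points back to the first one.
// Returns the positions of all other messages in the row of `self`,
// in the order a thread assigned to `self` would visit them.
std::vector<int> other_messages_in_row(const std::vector<int>& next, int self) {
    std::vector<int> others;
    for (int j = next[self]; j != self; j = next[j]) {
        others.push_back(j);
    }
    return others;
}
```

With the index data 1 2 3 4 0 6 7 5 9 10 11 8, the thread at position 3 visits positions 4, 0, 1, 2, that is, messages E, A, B, C. The walk length equals the row degree minus one, so rows of different degree give threads different amounts of work.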

Page 17: A near real time decoding for  LDPC based distributed video coding  using CUDA

CUDA Implementation
Strategy 1
  Reduction of φ Function in HPK
  Texture memory in VPK

Page 18: A near real time decoding for  LDPC based distributed video coding  using CUDA

Texture Binding in VPK
  [Figure: messages A to L in global memory; thread t0 reads them with a non-coalescing access pattern, shown once directly from global memory and once through texture binding.]
  Speedup on both 1.x and 2.x compute capability

Page 19: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPCA decoding time in previous CUDA implementation
(100 iterations)

  LDPC (n, m)     HPK        VPK
  (1584, 48)      8.29 ms    1.49 ms
  (1584, 192)     3.40 ms    1.52 ms
  (1584, 336)     3.04 ms    1.53 ms
  (1584, 480)     2.31 ms    1.55 ms
  (1584, 624)     2.29 ms    1.54 ms
  (1584, 768)     2.00 ms    1.52 ms
  (1584, 912)     1.82 ms    1.52 ms
  (1584, 1056)    1.81 ms    1.52 ms
  (1584, 1200)    1.79 ms    1.50 ms
  (1584, 1344)    1.79 ms    1.51 ms
  (1584, 1488)    1.78 ms    1.56 ms
  ……

Page 20: A near real time decoding for  LDPC based distributed video coding  using CUDA

Reduction of φ Function in HPK
  [Figure: global memory holds index data 1 2 3 4 0 6 7 5 9 10 11 8 and message data A to L. CUDA thread block 0 copies index entries 1 2 3 4 0 6 7 5 and messages A to H to shared memory, one element per thread (t0 to t7).]
  Copy to shared memory
  t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
  t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
  t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))

Page 21: A near real time decoding for  LDPC based distributed video coding  using CUDA

Reduction of φ Function in HPK
  [Figure: same global memory layout as before. Threads t0 to t7 of CUDA thread block 0 now store φ(|A|), φ(|B|), φ(|C|), φ(|D|), φ(|E|), φ(|F|), φ(|G|), φ(|H|) in shared memory.]
  Calculate φ functions before copying to shared memory
  t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
  t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
  t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))

Number of φ(x): row degree

2
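To make the saving concrete, the sketch below evaluates φ once per message before the per-thread sums and counts the calls. It is a CPU illustration under the assumption φ(x) = -ln(tanh(x/2)); `PhiCounter` and `hpk_row_precomputed` are hypothetical names. For a row of degree d it performs d inner plus d outer evaluations, 2d in total, whereas recomputing the inner φ terms in every thread would cost d·(d-1) + d = d² evaluations.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// phi with a call counter, so the number of evaluations can be inspected.
// The definition of phi is an assumption; it is not given on the slides.
struct PhiCounter {
    std::size_t calls = 0;
    double operator()(double x) {
        ++calls;
        return -std::log(std::tanh(0.5 * x));
    }
};

// One row of the horizontal step with phi(|message|) computed once per
// message (as if done before the copy to shared memory).
std::vector<double> hpk_row_precomputed(const std::vector<double>& in,
                                        PhiCounter& phi) {
    std::vector<double> p(in.size());
    for (std::size_t k = 0; k < in.size(); ++k) {
        p[k] = phi(std::fabs(in[k]));          // inner phi: once per message
    }
    std::vector<double> out(in.size());
    for (std::size_t k = 0; k < in.size(); ++k) {
        double sum = 0.0;
        for (std::size_t j = 0; j < in.size(); ++j) {
            if (j != k) sum += p[j];           // reuse stored values
        }
        out[k] = phi(sum);                     // outer phi: once per thread
    }
    return out;
}
```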

Page 22: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPCA Performance -- foreman sequence (QCIF)

  Step                                      LDPCA Time    Step Speedup    Cumulative Speedup
  Previous Implementation                   124.47 sec
  Strategy 1: reduce φ, texture memory      52.94 sec     2.35x           2.35x

Page 23: A near real time decoding for  LDPC based distributed video coding  using CUDA

CUDA Implementation
Strategy 2
  Parallel Partial Reduction in HPK
  t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))

Page 24: A near real time decoding for  LDPC based distributed video coding  using CUDA

Parallel Reduction
  Sequential addressing is conflict free

  Values (shared memory):            10   1   8  -1   0  -2   3   5  -2  -3   2   7   0  11   0   2
  Step 1, stride 8, thread IDs 0-7:   8  -2  10   6   0   9   3   7  -2  -3   2   7   0  11   0   2
  Step 2, stride 4, thread IDs 0-3:   8   7  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2
  Step 3, stride 2, thread IDs 0-1:  21  20  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2
  Step 4, stride 1, thread ID 0:     41  20  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology
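The stride pattern of the reduction can be checked with a small CPU emulation. This is only a sketch: the inner loop stands in for the threads that are active in one step, and each pass of the outer loop corresponds to one step between barriers on the GPU. The function name is hypothetical and the input length is assumed to be a power of two.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of a shared-memory tree reduction with sequential
// addressing. In each step, "thread" tid adds the element `stride`
// positions to its right; the stride halves until only v[0] is updated.
int reduce_sequential_addressing(std::vector<int>& v) {
    for (std::size_t stride = v.size() / 2; stride > 0; stride >>= 1) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            v[tid] += v[tid + stride];
        }
    }
    return v.empty() ? 0 : v[0];
}
```

Running it on the 16 values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 reproduces the stride 8, 4, 2, 1 steps and leaves the total, 41, in element 0.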

Page 25: A near real time decoding for  LDPC based distributed video coding  using CUDA

Computation Overlapping in HPK
  t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
  t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
  t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))
  t3: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|E|))
  t4: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|))
  =
  t0: φ(Mag − φ(|A|))
  t1: φ(Mag − φ(|B|))
  t2: φ(Mag − φ(|C|))
  t3: φ(Mag − φ(|D|))
  t4: φ(Mag − φ(|E|))
  Mag (Magnitude) = φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|), computed by threads t0 to t4 with a parallel partial reduction
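The identity behind this step is that the sum over "all other messages" equals the row total minus the thread's own term. A minimal CPU sketch, assuming φ(x) = -ln(tanh(x/2)) and using a plain loop where the GPU would use the parallel partial reduction (names are hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed standard definition; the slides only name the function.
inline double phi(double x) {
    return -std::log(std::tanh(0.5 * x));
}

// One row of the horizontal step using "row total minus own term".
// Rows are assumed to have degree >= 2, so mag - p[k] stays positive.
std::vector<double> hpk_row_total_minus_own(const std::vector<double>& in) {
    std::vector<double> p(in.size());
    double mag = 0.0;                       // Mag: sum of phi(|m|) over the row
    for (std::size_t k = 0; k < in.size(); ++k) {
        p[k] = phi(std::fabs(in[k]));
        mag += p[k];
    }
    std::vector<double> out(in.size());
    for (std::size_t k = 0; k < in.size(); ++k) {
        out[k] = phi(mag - p[k]);           // exclude the thread's own term
    }
    return out;
}
```

Each thread then does a constant amount of work after the reduction, independent of the row degree. The subtraction introduces a small floating-point rounding difference compared with summing the other terms directly.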

Page 26: A near real time decoding for  LDPC based distributed video coding  using CUDA

Parallel Partial Reduction
  Global memory
    message: A B C D E 0 0 0 | F G H 0 | I J K L
    index:   (8,0) (8,1) (8,2) (8,3) (8,4) (0,0) (0,0) (0,0) | (4,0) (4,1) (4,2) (0,0) | (4,0) (4,1) (4,2) (4,3)
    rowDeg = 8 for the first row, rowDeg = 4 for the following rows; log(rowDeg) = 3 for the first row
    Threads on the zero padding entries are idle threads.
  Shared memory, CUDA thread block 0
    Start:  φ(|A|)=0.1  φ(|B|)=0.2  φ(|C|)=0.4  φ(|D|)=0.7  φ(|E|)=0.3  0  0  0 | φ(|F|)=0.1  φ(|G|)=0.7  φ(|H|)=0.4  0
    Step 1: 0.4  0.2  0.4  0.7  0.3  0  0  0 | 0.5  0.7  0.4  0    (threads t0 to t3, t8, t9)
    Step 2: 0.8  0.9  0.4  0.7  0.3  0  0  0 | 1.2  0.7  0.4  0    (threads t0, t1, t8)
    Step 3: 1.7  0.9  0.4  0.7  0.3  0  0  0 | 1.2  0.7  0.4  0    (thread t0)
    Mag0 = 1.7, Mag1 = 1.2

Page 27: A near real time decoding for  LDPC based distributed video coding  using CUDA

CUDA Implementation
Strategy 3
  Check Node Re-ordering
  Completely Unrolling

Page 28: A near real time decoding for  LDPC based distributed video coding  using CUDA

Check Node Re-ordering
  [Figure: bipartite graph with variable nodes 0 to 6 and check nodes 0 to 2, stored in shared memory across CUDA thread block 0 and CUDA thread block 1.
   Before re-ordering: A B C D E | F G H | I J K L M, with rowDeg = 8, rowDeg = 4, rowDeg = 8.
   After re-ordering:  A B C D E | I J K L M | F G H, so rows with the same rowDeg are adjacent.]

Page 29: A near real time decoding for  LDPC based distributed video coding  using CUDA

Completely unrolling

Redundant if else & __syncthreads(): branch divergence harms performance

    int i = threadIdx.x;
    int half = rowDeg >> 1;
    float myMag = s_mag[i];          // this thread's own magnitude term
    char mySign = s_sign[i];         // this thread's own sign bit
    do {
        if (rowPos < half) {
            s_mag[i] += s_mag[i + half];
            s_sign[i] ^= s_sign[i + half];
        }
        half >>= 1;
        __syncthreads();
    } while (half);
    int base = i - rowPos;           // first element of this row
    myMag = s_mag[base] - myMag;     // row total minus own term
    mySign = s_sign[base] ^ mySign;

Completely unrolled: no branch divergence

    int i = threadIdx.x;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    if (rowDeg == 16) {
        s_mag[i] += s_mag[i + 8];  s_sign[i] ^= s_sign[i + 8];
        s_mag[i] += s_mag[i + 4];  s_sign[i] ^= s_sign[i + 4];
        s_mag[i] += s_mag[i + 2];  s_sign[i] ^= s_sign[i + 2];
        s_mag[i] += s_mag[i + 1];  s_sign[i] ^= s_sign[i + 1];
    } else if (rowDeg == 8) {
        …..
    }
    int base = i - rowPos;
    myMag = s_mag[base] - myMag;
    mySign = s_sign[base] ^ mySign;

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology

Page 30: A near real time decoding for  LDPC based distributed video coding  using CUDA

CUDA Implementation
Strategy 4
  Combination of VPK and HPK

Page 31: A near real time decoding for  LDPC based distributed video coding  using CUDA

Kernel Launch Overhead
  1. Parallelism is broken (implicit inter-block synchronization)
  2. Extra global memory traffic
  NVIDIA CUDA Programming Guide (3.2), 5.2.1
  [Timeline: separate VPK and HPK launches repeated for every iteration, compared with one merged kernel per iteration.]
  VPK + HPK = UMK

Page 32: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPCA Performance -- foreman sequence (QCIF)

  Step                                                            Time          Step Speedup    Cumulative Speedup
  Previous Implementation                                         124.47 sec
  Strategy 1: reduce φ, texture memory                            52.94 sec     2.35x           2.35x
  Strategy 2: PPR in HPK                                          40.66 sec     1.30x           3.06x
  Strategy 3: Merge HPK & VPK                                     28.80 sec     1.41x           4.32x
  Strategy 4: Check Node Re-ordering & Completely Unrolling       22.29 sec     1.29x           5.58x

Page 33: A near real time decoding for  LDPC based distributed video coding  using CUDA

Outline
  Motivation and Introduction
  LDPC decoding & LDPCA in DVC
  Parallel LDPCA Decoding in CUDA
  Early Stop Detection Mechanism Using CUDA (CUDA API)
  Evaluation of Decoding Speed
  Conclusions and future work

Page 34: A near real time decoding for  LDPC based distributed video coding  using CUDA

CUDA Implementation
Strategy 5
  Early Stop Detection

Page 35: A near real time decoding for  LDPC based distributed video coding  using CUDA

Early Stop Detection in Sum-Product Algorithm
  UMK: Horizontal Processing + Vertical Processing
  EDK: Early stop Detection Kernel
  [Timeline without early stop: the GPU runs UMK for SPA iteration 1, SPA iteration 2, ..., SPA iteration 100.]
  [Timeline with early stop detection: in every iteration the GPU runs UMK and then EDK, a PCI-E transfer transmits the codeword and decoded info, and the CPU checks the decode info (iter. 1, check iter. 1, iter. 2, check iter. 2, ...).]
  Terminated at iteration 30:
    1. Successfully decoded
    2. Converge to wrong codeword
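The slides do not spell out the stopping test itself. One common criterion for syndrome-based LDPC decoding, assumed here purely for illustration, is to stop once the parity of the hard decisions at every check node equals the received syndrome bit. A CPU sketch with a hypothetical adjacency-list layout:

```cpp
#include <cstddef>
#include <vector>

// rows[c] lists the variable-node indices connected to check node c
// (hypothetical layout). hardBits holds the current hard decisions and
// syndrome the bits received from the encoder. Returns true when every
// check node's parity matches its syndrome bit.
bool syndrome_matches(const std::vector<std::vector<int>>& rows,
                      const std::vector<int>& hardBits,
                      const std::vector<int>& syndrome) {
    for (std::size_t c = 0; c < rows.size(); ++c) {
        int parity = 0;
        for (int v : rows[c]) {
            parity ^= hardBits[v] & 1;
        }
        if (parity != (syndrome[c] & 1)) {
            return false;
        }
    }
    return true;
}
```

A match does not by itself distinguish a correct result from convergence to a wrong codeword that happens to satisfy the same syndrome; a check of this kind only tells the host that further iterations are unlikely to change the output.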

Page 36: A near real time decoding for  LDPC based distributed video coding  using CUDA

Combination of EDK and UMK
  The SPA algorithm is memory intensive in CUDA
  The index data of UMK is also used by early stop detection (EDK)
  EDK + UMK = EDUMK
  14% additional complexity in terms of execution time
  [Timeline: the GPU runs UMK followed by EDK and a PCI-E transfer to the CPU in each iteration.]

Page 37: A near real time decoding for  LDPC based distributed video coding  using CUDA

Concurrent Kernel Execution and Data Transfer
  Ideal Timeline
  [GPU: EDUMK, UMK, UMK, UMK, EDUMK, UMK, UMK, UMK, EDUMK, ...
   EDUMK at iter. 1: early stop detection for iter. 1, then UMK runs for iter. 2, iter. 3, ...
   EDUMK at iter. 5: early stop detection for iter. 5, then UMK runs for iter. 6, ...
   EDUMK again at iter. 9.
   PCI-E transfer: overlaps with the following UMK kernels.
   CPU: receives decode info & codeword for iter. 1, then for iter. 5.]

Page 38: A near real time decoding for  LDPC based distributed video coding  using CUDA

Practical CUDA Implementation for Early Stopping Detection
  Use 1 CPU thread, 1 GPU
  Use CUDA Driver API instead of Runtime API
  Nearly no stream management instructions:
    cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent()
  [Timeline: Stream 0, Stream 1 and Stream 2 each run EDUMK followed by UMK kernels and a PCI-E transfer, staggered in time with #overlap = 3; the host performs explicit synchronization.]

Page 39: A near real time decoding for  LDPC based distributed video coding  using CUDA

Speed-up ratio of early stop detection
  Total number of LDPCA iterations
    Fix iteration: 20000
    Early stop detection: 10000
  Theoretical speedup: 2.0x
  Actual speedup: 1.8x (10% overhead)
  [Chart labels: Overhead on CPU; Overhead on GPU using Runtime API; Overhead on GPU using driver API. Values shown: 5%, 20%, 7%.]
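The two speedup figures are consistent with a simple estimate, assuming the run time is proportional to the total number of LDPCA iterations and that the 10% overhead multiplies the early-stop run time (both assumptions are mine, not stated on the slide):

```latex
S_{\text{theoretical}} = \frac{20000}{10000} = 2.0,
\qquad
S_{\text{actual}} \approx \frac{20000}{10000 \times (1 + 0.10)} \approx 1.82
```

which rounds to the 1.8x reported as the actual speedup.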

Page 40: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPCA Performance -- foreman sequence (QCIF)

  Step                                                                         Time          Step Speedup    Cumulative Speedup
  Previous Implementation                                                      124.47 sec
  Strategy 1 (fix 100 iter): reduce φ, texture memory                          52.94 sec     2.35x           2.35x
  Strategy 2 (fix 100 iter): PPR in HPK                                        40.66 sec     1.30x           3.06x
  Strategy 3 (fix 100 iter): Merge HPK & VPK                                   28.80 sec     1.41x           4.32x
  Strategy 4 (fix 100 iter): Check Node Re-ordering & Completely Unrolling     22.29 sec     1.29x           5.58x
  Strategy 5 (max 100 iter): Early Stop Detection (Driver API)                 10.86 sec     2.02x           11.27x

  449.63x faster than sequential program!

Page 41: A near real time decoding for  LDPC based distributed video coding  using CUDA

Outline
  Motivation and Introduction
  LDPC decoding & LDPCA in DVC
  Parallel LDPCA Decoding in CUDA
  Early Stop Detection Mechanism Using CUDA
  Evaluation of Decoding Speed
  Conclusions and future work

Page 42: A near real time decoding for  LDPC based distributed video coding  using CUDA

Test condition
  12 CPU, 24 processor
    Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
  GPU: Tesla M2050
    14 (MP) x 32 (Cores/MP) = 448 (Cores)
    CUDA capability 2.0
    Shared memory: 48K
    Maximum threads in block: 1024
    Concurrent copy and execution
    Concurrent kernel execution

Page 43: A near real time decoding for  LDPC based distributed video coding  using CUDA

Test condition
  Test sequences (from high to low motion): Soccer, Foreman, Coastguard, Hall Monitor
    QCIF, 15Hz, all frames
    GOP size: 8
    Qindex: 8
    Bitrate and PSNR: only luminance component

Page 44: A near real time decoding for  LDPC based distributed video coding  using CUDA

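Speedup Ratio of LDPCA decoder Using CUDA
  [Chart: decoding speed per test sequence; the sequence labels did not survive extraction.]

  Faster (FPS)    Slower (fps)    Overall speedup
  15.39           1.14            13.5 ↑
  7.14            0.96            7.43 ↑
  4.99            0.79            6.32 ↑
  10.29           1.05            9.8 ↑

  LDPCA speedups shown on the chart: 15.35 ↑, 22.51 ↑, 12.88 ↑, 36.91 ↑
  0.2% bit rate ↑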

Page 45: A near real time decoding for  LDPC based distributed video coding  using CUDA

LDPCA decoding time comparison

  GPU            100 iteration (QCIF)    50 iteration (QCIF)    100 iteration (CIF)    50 iteration (CIF)
  9800GTX        1.93~1.83 ms            1.09~1.27 ms           3.26~3.34 ms           1.87~2.12 ms
  Tesla T10      1.23~1.26 ms            0.67~0.70 ms           2.39~2.52 ms           1.27~1.34 ms
  Tesla C2050    0.55~0.60 ms            0.29~0.31 ms           1.25~1.34 ms           0.65~0.69 ms

  GPU            100 iteration (QCIF)    50 iteration (QCIF)    100 iteration (CIF)    50 iteration (CIF)
  GTX260         35 ms                   18 ms                  46 ms                  24 ms

                        GeForce 9800 GTX+    Tesla C1060    GeForce GTX260    Tesla C2050
  Compute Capability    1.1                  1.3            1.3               2.0
  MP x Cores/MP         16x8                 30x8           27x8              14x32

Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of Wyner-Ziv video codec using GPGPU. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.

Page 46: A near real time decoding for  LDPC based distributed video coding  using CUDA

Realtime Decoding Quality
  [Four decoded sequences, each shown next to its original sequence]
  27.44 dB, 76 kbps
  39.46 dB, 147.64 kbps
  35.34 dB, 263.52 kbps
  29.21 dB, 93.17 kbps

Page 47: A near real time decoding for  LDPC based distributed video coding  using CUDA

Conclusion
  Fully parallelized LDPCA decoder using CUDA with various features
  The proposed early stop detection mechanism reduces the latency between the CPU and the GPU
  Surveillance-type sequences (e.g. Hall Monitor) can be decoded in real time with negligible RD performance loss

Page 48: A near real time decoding for  LDPC based distributed video coding  using CUDA

Future Work
  Bitplane level parallelization for LDPCA
  UV component
  Frame level parallelization

Vitor Silva


Page 49: A near real time decoding for  LDPC based distributed video coding  using CUDA

Thank You
