A Near Real-Time Decoding for LDPC-Based Distributed Video Coding Using CUDA

CMLab, CSIE, NTU

Su, Tse-Chung (蘇則仲)
Advisor: Prof. Wu, Ja-Ling (吳家麟)
2011/6/9
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA
- Early Stop Detection Mechanism Using CUDA
- Evaluation of Decoding Speed
- Conclusions and Future Work
Conventional Video Codec
MPEG-2, H.264, HEVC (H.265)
Heavyweight ENCODER, Lightweight DECODER
Distributed Video Coding (DVC)
A new paradigm for video compression
Lightweight ENCODER, Heavyweight DECODER
Application of DVC
Video conferencing with mobile devices: a DVC-to-H.264 transcoder running as a real-time system.
[Diagram: the mobile device runs a low-complexity DVC encoder and sends the DVC encoded bitstream to the cloud (computational resource), which transcodes it into an H.264 encoded bitstream for a low-complexity H.264 decoder on the receiving device.]
Distributed Video Coding
[Diagram: the Channel Encoder / Channel Decoder pair is realized by an LDPC Encoder at the DVC encoder and an LDPC Decoder at the DVC decoder.]
D. Varodayan, A. Aaron, and B. Girod, "Rate-Adaptive Codes for Distributed Source Coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.
Decoding Complexity of DVC
Our DVC codec (state-of-the-art), parallelized with OpenMP and CUDA on 12 CPU cores + a GPGPU (Fermi), still runs at only ~1 FPS: the DECODER remains heavyweight.

Amdahl's law: the maximum speedup is reached by improving the most time-critical part of the system, which is LDPC decoding in the DVC decoder. LDPCA decoding accounts for 86%~94% of total decoding time; after the optimizations presented here it drops to 29%~36% of the total, and the decoder reaches 15.39 FPS (QCIF).
LDPC Decoding: Sum-Product Algorithm (Message Passing)
[Tanner graph: variable nodes 1-7 exchange messages a-g with three check nodes. Soft side information (real numbers) enters at the variable nodes; the syndrome bits (0, 1, 1) from the DVC encoder enter at the check nodes. Each iteration alternates vertical processing (variable-node updates) and horizontal processing (check-node updates), refining the messages from a1..g1 up to a25..g25; a final hard decision on each variable node gives the decoded output.]
Kschischang, F.R., Frey, B.J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory.
Sum-Product Algorithm: Vertical Processing
[Figure: a variable node with channel value a combines the messages arriving from its check nodes.]
From the figure, the message P sent to one check node is the sum of the messages from the other check nodes plus the channel value: P = K + F + a, and likewise Z = F + P + a.
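As a CPU reference for the vertical update (function and variable names here are illustrative, not from the thesis code): the message to each check node is the channel value plus the sum of the messages from all other check nodes, which can be computed with one total and a per-edge subtraction:

```cpp
#include <cassert>
#include <vector>

// Vertical processing: the message a variable node sends to check node j is
// its channel value plus the incoming messages from all OTHER check nodes
// (e.g. P = K + F + a on the slide). Computing the full total once and
// subtracting per edge avoids re-summing for every outgoing message.
std::vector<double> variable_node_update(double channel,
                                         const std::vector<double>& from_checks) {
    double total = channel;
    for (double m : from_checks) total += m;
    std::vector<double> out(from_checks.size());
    for (std::size_t j = 0; j < from_checks.size(); ++j)
        out[j] = total - from_checks[j];  // exclude the target edge's own input
    return out;
}
```

For example, a channel value of 1.0 with incoming check messages {2.0, 3.0, 4.0} yields outgoing messages {8.0, 7.0, 6.0}.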
Sum-Product Algorithm: Horizontal Processing
[Figure: a check node with its syndrome bit combines the incoming variable-to-check messages P, Q, R, S, T.]
The outgoing magnitude on an edge excludes that edge's own input, e.g.
mag(to P) = φ(φ(|Q|) + φ(|R|) + φ(|S|) + φ(|T|))
mag(to S) = φ(φ(|P|) + φ(|Q|) + φ(|R|) + φ(|T|))
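A CPU reference sketch of this magnitude computation, assuming the standard φ(x) = -ln(tanh(x/2)) (which is its own inverse on positive reals); the helper names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// phi(x) = -ln(tanh(x / 2)); on (0, inf), phi is its own inverse.
double phi(double x) { return -std::log(std::tanh(x / 2.0)); }

// Outgoing magnitude on edge `skip`: phi of the sum of phi(|m|) over the
// OTHER edges, e.g. mag(to P) = phi(phi(|Q|) + phi(|R|) + phi(|S|) + phi(|T|)).
double check_node_magnitude(const std::vector<double>& msgs, std::size_t skip) {
    double sum = 0.0;
    for (std::size_t i = 0; i < msgs.size(); ++i)
        if (i != skip) sum += phi(std::fabs(msgs[i]));
    return phi(sum);
}
```

Because φ is decreasing and self-inverse, adding more terms inside shrinks the result: the outgoing magnitude is dominated by the weakest incoming message, which is the expected behavior of a check-node update.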
LDPC Accumulate (LDPCA) Codes
- Rate adaptivity
- 65 LDPC codes
D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Section on Distributed Source Coding, 2006.
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA (Kernel Design)
- Early Stop Detection Mechanism Using CUDA
- Evaluation of Decoding Speed
- Conclusions and Future Work
Vertical Processing Kernel (VPK)
- Column degree is a constant 3 (regular LDPC)
- Uses shared memory

Horizontal Processing Kernel (HPK)
- Each message is updated by one thread (SIMD)
- Row degree varies within each LDPC code
- Data structure: circular linked list

Previous CUDA implementation
[Figure: messages A-L live in global memory at positions 0-11; the index array (1 2 3 4 0 6 7 5 9 10 11 8) stores, for each position, the position of the next message in the same check-node row, wrapping around. Each CUDA thread block copies its portion into its own shared memory.]
Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for LDPC based distributed video coding. In Proc. of ACM International Conference on Multimedia.
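That circular index layout can be exercised with a small CPU sketch (the 12-entry array below is the one from the figure; `row_members` is an illustrative helper, not code from the paper):

```cpp
#include <cassert>
#include <vector>

// index[i] = position of the next message in the same check-node row,
// wrapping around: rows {A..E}, {F..H}, {I..L} from the figure.
const std::vector<int> kNext = {1, 2, 3, 4, 0, 6, 7, 5, 9, 10, 11, 8};

// Walk the circular list to collect every message position in start's row.
std::vector<int> row_members(const std::vector<int>& next, int start) {
    std::vector<int> row{start};
    for (int i = next[start]; i != start; i = next[i]) row.push_back(i);
    return row;
}
```

A thread holding any slot can thus visit its whole row without knowing the row boundaries in advance, which is what lets one SIMD kernel handle rows of different degree.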
CUDA Implementation Strategy 1: Reduction of φ Function in HPK, Texture Memory in VPK

Texture Binding in VPK: the read-only message data in global memory is bound to texture memory, which gives a speedup on both 1.x and 2.x compute capability and tolerates the non-coalesced reads of the VPK access pattern.

LDPCA decoding time in the previous CUDA implementation (100 iterations):

LDPC (n, m)     Decoding time (HPK + VPK)
(1584, 48)      8.29 ms + 1.49 ms
(1584, 192)     3.40 ms + 1.52 ms
(1584, 336)     3.04 ms + 1.53 ms
(1584, 480)     2.31 ms + 1.55 ms
(1584, 624)     2.29 ms + 1.54 ms
(1584, 768)     2.00 ms + 1.52 ms
(1584, 912)     1.82 ms + 1.52 ms
(1584, 1056)    1.81 ms + 1.52 ms
(1584, 1200)    1.79 ms + 1.50 ms
(1584, 1344)    1.79 ms + 1.51 ms
(1584, 1488)    1.78 ms + 1.56 ms
Reduction of φ Function in HPK

Naive HPK: each thread copies its message from global memory (via the circular index) to shared memory and then computes the full nested sum itself, e.g.
t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))

Optimized: each thread calculates φ(|·|) of its own message once, before copying to shared memory (shared memory then holds φ(|A|), φ(|B|), ..., φ(|H|)), so the inner φ values are shared by all threads of the row. The number of φ(x) evaluations per outgoing message drops from the row degree to 2.
LDPCA Performance -- foreman sequence (QCIF)

Step                                   LDPCA Time   Speedup   Cumulative Speedup
Previous Implementation                124.47 sec
Strategy 1: reduce φ, texture memory   52.94 sec    2.35x     2.35x
CUDA Implementation Strategy 2: Parallel Partial Reduction in HPK

Parallel reduction with sequential addressing (conflict free):

Values (shared memory): 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Step 1, stride 8 (threads 0-7):  8 -2 10 6 0 9 3 7 ...
Step 2, stride 4 (threads 0-3):  8 7 13 13 ...
Step 3, stride 2 (threads 0-1):  21 20 ...
Step 4, stride 1 (thread 0):     41 (the total, left in element 0)

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology.
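The stride pattern above can be simulated on the CPU; each inner pass mirrors one synchronized step of the CUDA threads (this is a sketch of the access pattern, not the thesis kernel):

```cpp
#include <cassert>
#include <vector>

// Sequential-addressing reduction: at each step, "thread" i adds the element
// stride positions away; halving the stride leaves the total in element 0.
int reduce_sequential_addressing(std::vector<int> v) {
    for (std::size_t stride = v.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // one pass = one synchronized step
            v[i] += v[i + stride];
    return v[0];
}
```

Feeding it the slide's sixteen values leaves the total, 41, in element 0, matching the worked example step by step.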
Computation Overlapping in HPK

Naive per-thread computation repeats the whole sum for every edge:
t0: φ(φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|))
t1: φ(φ(|A|) + φ(|C|) + φ(|D|) + φ(|E|))
t2: φ(φ(|A|) + φ(|B|) + φ(|D|) + φ(|E|))
t3: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|E|))
t4: φ(φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|))
Equivalently, the full row magnitude Mag = φ(|A|) + φ(|B|) + φ(|C|) + φ(|D|) + φ(|E|) is computed once by parallel partial reduction, and each thread then subtracts only its own term:
t0: φ(Mag - φ(|A|))
t1: φ(Mag - φ(|B|))
t2: φ(Mag - φ(|C|))
t3: φ(Mag - φ(|D|))
t4: φ(Mag - φ(|E|))

Parallel Partial Reduction
[Figure: rows of different degree are reduced side by side in shared memory. Row A..E is padded with three zeros to width 8 and rows F..H (padded with one zero) and I..L occupy width-4 slots, so each row's magnitude (Mag0, Mag1, ...) is obtained in log2(rowDeg) = 3 strided steps; padding slots carry index (0,0) and their threads stay idle. Each message carries a (rowDeg, rowPos) index pair, e.g. (8,0)..(8,4) for the first row and (4,0)..(4,3) for the last, so one kernel handles mixed row degrees. In the worked example, φ values 0.1 0.2 0.4 0.7 0.3 0 0 0 reduce to Mag0 = 1.7 and 0.1 0.7 0.4 0 reduce to Mag1 = 1.2.]
CUDA Implementation Strategy 3: Check Node Re-ordering & Complete Unrolling

Check Node Re-ordering
[Figure: check nodes are sorted by row degree so that each CUDA thread block processes rows of equal degree in its shared memory; the messages are re-ordered accordingly: A B C D E I J K L M F G H.]

Generic loop version (redundant if/else and __syncthreads(); branch divergence harms performance):

    int i = threadIdx.x;
    int half = rowDeg >> 1;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    do {
        if (rowPos < half) {
            s_mag[i]  += s_mag[i + half];
            s_sign[i] ^= s_sign[i + half];
        }
        half >>= 1;
        __syncthreads();
    } while (half);
    int base = i - rowPos;
    myMag  = s_mag[base] - myMag;
    mySign = s_sign[base] ^ mySign;

Completely unrolled version (no branch divergence):

    int i = threadIdx.x;
    float myMag = s_mag[i];
    char mySign = s_sign[i];
    if (rowDeg == 16) {
        s_mag[i] += s_mag[i + 8];  s_sign[i] ^= s_sign[i + 8];
        s_mag[i] += s_mag[i + 4];  s_sign[i] ^= s_sign[i + 4];
        s_mag[i] += s_mag[i + 2];  s_sign[i] ^= s_sign[i + 2];
        s_mag[i] += s_mag[i + 1];  s_sign[i] ^= s_sign[i + 1];
    } else if (rowDeg == 8) {
        .....
    }
    int base = i - rowPos;
    myMag  = s_mag[base] - myMag;
    mySign = s_sign[base] ^ mySign;

Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology.
CUDA Implementation Strategy 4: Combination of VPK and HPK

Launching VPK and HPK as separate kernels every iteration (HPK, VPK, HPK, VPK, ...) pays the kernel launch overhead repeatedly (NVIDIA CUDA Programming Guide 3.2, Sec. 5.2.1) and has two further costs:
1. Parallelism is broken (implicit inter-block synchronization)
2. Extra global memory traffic
The two kernels are therefore merged into a single Unified Message Kernel: VPK + HPK = UMK.
LDPCA Performance -- foreman sequence (QCIF)

Step                                                      Time         Speedup   Cumulative Speedup
Previous Implementation                                   124.47 sec
Strategy 1: reduce φ, texture memory                      52.94 sec    2.35x     2.35x
Strategy 2: PPR in HPK                                    40.66 sec    1.30x     3.06x
Strategy 3: Merge HPK & VPK                               28.80 sec    1.41x     4.32x
Strategy 4: Check Node Re-ordering & Complete Unrolling   22.29 sec    1.29x     5.58x
Outline
- Motivation and Introduction
- LDPC decoding & LDPCA in DVC
- Parallel LDPCA Decoding in CUDA
- Early Stop Detection Mechanism Using CUDA (CUDA API)
- Evaluation of Decoding Speed
- Conclusions and Future Work
CUDA Implementation Strategy 5: Early Stop Detection

Early Stop Detection in the Sum-Product Algorithm
With a fixed schedule, the GPU simply runs SPA iterations 1 through 100 back to back (each iteration = horizontal processing + vertical processing). With early stopping, after each UMK iteration the codeword and decoded info are transmitted over PCI-E and checked by the early stop detection kernel (EDK), with the CPU inspecting the result per iteration, so decoding can be terminated as soon as either
1. the codeword is successfully decoded, or
2. the decoder converges to a wrong codeword
(e.g. terminated at iteration 30 instead of running all 100).
Combination of EDK and UMK
- The SPA is memory intensive in CUDA, and the index data of UMK is also used by early stop detection (EDK)
- EDK + UMK = EDUMK
- ~14% additional complexity in terms of execution time
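Conceptually, the early stop test is a syndrome check on the current hard decisions; a CPU sketch follows (the row layout and names are illustrative; the thesis performs this on the GPU using UMK's index data):

```cpp
#include <cassert>
#include <vector>

// rows[r] lists the variable-node positions covered by check row r. Decoding
// may stop once every row's XOR of hard-decision bits equals its syndrome bit.
bool syndrome_satisfied(const std::vector<std::vector<int>>& rows,
                        const std::vector<int>& hard_bits,
                        const std::vector<int>& syndrome) {
    for (std::size_t r = 0; r < rows.size(); ++r) {
        int parity = 0;
        for (int v : rows[r]) parity ^= hard_bits[v];
        if (parity != syndrome[r]) return false;  // keep iterating
    }
    return true;  // converged: terminate early
}
```

Note that this test only detects convergence; as the slides point out, the decoder can also converge to a wrong codeword, which this check cannot distinguish from a correct one.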
Concurrent Kernel Execution and Data Transfer
[Ideal timeline: the GPU keeps launching UMK iterations back to back while, concurrently, the PCI-E transfer ships iteration i's codeword and decoded info and the early stop detection for iteration i is performed; e.g. the detection for iteration 1 overlaps the UMK runs for iterations 2 and 3, and the detection for iteration 5 overlaps the UMK run for iteration 6 and beyond, so checking never stalls decoding.]
Practical CUDA Implementation of Early Stop Detection
- One CPU thread, one GPU
- The CUDA Driver API is used instead of the Runtime API
- Nearly no stream management instructions (cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent()) are needed
[Timeline: three streams keep #overlap = 3; EDUMK launches in one stream overlap UMK launches and PCI-E transfers in the others, and the host performs explicit synchronization only when it consumes a detection result.]
Speed-up Ratio of Early Stop Detection
- Total number of LDPCA iterations: 20000 with a fixed iteration count vs. 10000 with early stop detection, i.e. a 2.0x theoretical speedup
- With ~10% overhead, the actual speedup is 1.8x
- Overhead on CPU: 5%
- Overhead on GPU: 20% using the Runtime API, 7% using the Driver API
LDPCA Performance -- foreman sequence (QCIF)

Step                                                                     Time         Speedup   Cumulative Speedup
Previous Implementation                                                  124.47 sec
Strategy 1 (fix 100 iter): reduce φ, texture memory                      52.94 sec    2.35x     2.35x
Strategy 2 (fix 100 iter): PPR in HPK                                    40.66 sec    1.30x     3.06x
Strategy 3 (fix 100 iter): Merge HPK & VPK                               28.80 sec    1.41x     4.32x
Strategy 4 (fix 100 iter): Check Node Re-ordering & Complete Unrolling   22.29 sec    1.29x     5.58x
Strategy 5 (max 100 iter): Early Stop Detection (Driver API)             10.86 sec    2.02x     11.27x

449.63x faster than the sequential program!
Test Conditions
- CPU: Intel(R) Xeon(R) X5650 @ 2.67 GHz, 12 cores / 24 logical processors
- GPU: Tesla M2050: 14 MPs x 32 cores/MP = 448 cores; CUDA compute capability 2.0; 48 KB shared memory; up to 1024 threads per block; concurrent copy and execution; concurrent kernel execution
- Test sequences: QCIF, 15 Hz, all frames; GOP size 8; Qindex 8; bitrate and PSNR measured on the luminance component only
- Sequences, from high to low motion: Soccer, Foreman, Coastguard, Hall Monitor
Speedup Ratio of the LDPCA Decoder Using CUDA
[Chart, per test sequence (Soccer, Foreman, Coastguard, Hall Monitor): the overall decoder frame rate rises from 0.79, 0.96, 1.05, and 1.14 fps to 4.99, 7.14, 10.29, and 15.39 FPS respectively (overall speedups of 6.32x, 7.43x, 9.8x, and 13.5x), with LDPCA-only speedups of 12.88x, 15.35x, 22.51x, and 36.91x; the cost is a 0.2% bit-rate increase.]
LDPCA decoding time comparison

Proposed implementation:
GPU           100 iter (QCIF)   50 iter (QCIF)   100 iter (CIF)   50 iter (CIF)
9800 GTX      1.93~1.83 ms      1.09~1.27 ms     3.26~3.34 ms     1.87~2.12 ms
Tesla T10     1.23~1.26 ms      0.67~0.70 ms     2.39~2.52 ms     1.27~1.34 ms
Tesla C2050   0.55~0.60 ms      0.29~0.31 ms     1.25~1.34 ms     0.65~0.69 ms

Previous work:
GTX260        35 ms             18 ms            46 ms            24 ms

GPU specifications:
                     GeForce 9800 GTX+   Tesla C1060   GeForce GTX260   Tesla C2050
Compute Capability   1.1                 1.3           1.3              2.0
MP x Cores/MP        16x8                30x8          27x8             14x32

Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of wyner-ziv video codec using gpgpu. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.
Real-time Decoding Quality
[Figure: decoded frames shown side by side with the original sequences, at 27.44 dB / 76 kbps, 29.21 dB / 93.17 kbps, 35.34 dB / 263.52 kbps, and 39.46 dB / 147.64 kbps.]
Conclusion
- A fully parallelized LDPCA decoder using CUDA with various features
- The proposed early stop detection mechanism reduces the latency between the CPU and the GPU
- Surveillance videos (e.g. Hall Monitor) can be decoded in real time with negligible RD performance loss
Future Work
- Bitplane-level parallelization for LDPCA
- UV components
- Frame-level parallelization
Thank You