TRANSCRIPT
Effective Non-Blocking Cache Architecture for High-Performance Texture Mapping
Dukki Hong (1), Youngduke Seo (1), Youngsik Kim (2), Kwon-Taek Kwon (3),
Sang-Oak Woo (3), Seok-Yoon Jung (3), Kyoungwoo Lee (4), Woo-Chan Park (1)
(1) Media Processor Lab., Sejong University  (2) Korea Polytechnic University
(3) SAIT of Samsung Electronics Co., Ltd.  (4) Yonsei University
http://rayman.sejong.ac.kr
October 3, 2013
Contents

Introduction
Related Work
◦ Texture mapping
◦ Non-blocking scheme
Proposed Non-Blocking Texture Cache
◦ The proposed architecture
◦ Buffers for the non-blocking scheme
◦ Execution flow of the NBTC
Experimental Results
Conclusion
Introduction

Texture mapping
◦ Core technique for 3D graphics
◦ Maps texture images onto object surfaces

Problem: a huge amount of memory access is required
◦ Major bottleneck in graphics pipelines
◦ Modern GPUs generally use texture caches to solve this problem

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time
Mobile 3D games

The visual quality of mobile 3D games has evolved enough to compare with PC games.
◦ Detailed texture images (ex: Infinity Blade, 2048 [GDC 2011])
◦ Demands high texture mapping throughput

<Epic Games: Infinity Blade Series> <Gameloft: Asphalt Series>
Our approach

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time

In this presentation, we introduce a non-blocking texture cache (NBTC) architecture
◦ Out-of-order (OOO) execution
◦ Conditional in-order (IO) completion for texture requests with the same screen coordinate, to support the standard API effectively
Texture mapping

What is texture mapping?
◦ Texture mapping glues n-D images onto geometric objects to increase realism
<Texture> <Object> <Texture-Mapped Object>

Texture filtering
◦ Texture filtering is an operation for reducing the aliasing artifacts caused by texture mapping
  Bi-linear filtering: four samples per texture access
  Tri-linear filtering: eight samples per texture access
<Results of the texture filtering>
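As a concrete illustration of why filtering drives memory traffic, bi-linear filtering can be sketched as below: four texel fetches per sample, blended by fractional weights. The 2-D list texture layout and the function name are assumptions made for this sketch, not the presenters' implementation.

```python
# Sketch of bi-linear texture filtering: four texels around the sample
# point are fetched and blended by their fractional distances. Each
# sample therefore costs up to four cache accesses.

def bilinear_filter(texture, u, v):
    """Sample `texture` (rows x cols of floats) at real coordinates (u, v)."""
    rows, cols = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)                      # top-left texel of the 2x2 footprint
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = u - x0, v - y0                      # fractional blend weights

    # Four texel fetches -> four potential texture cache accesses.
    t00, t10 = texture[y0][x0], texture[y0][x1]
    t01, t11 = texture[y1][x0], texture[y1][x1]

    top = t00 * (1 - fx) + t10 * fx
    bot = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear_filter(tex, 0.5, 0.5))            # → 0.5 (midpoint of all four texels)
```

Tri-linear filtering repeats this on two adjacent mipmap levels and blends the two results, which is why it needs eight samples per access.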
Related Work

Cache performance studies
◦ In [Hakura and Gupta 1997], the performance of a texture cache was measured across various benchmarks
◦ In [Igehy et al. 1999], the performance of a texture cache was studied with regard to multiple pixel pipelines

Pre-fetching scheme
◦ In [Igehy et al. 1998], the latency generated during texture cache misses is hidden by applying an explicit pre-fetching scheme

Survey of texture caches
◦ The introduction of the texture cache and the integration of texture cache architectures into modern GPUs were surveyed in [Doggett 2012]
Related Work: Non-blocking Scheme

Non-blocking cache (NBC)
◦ Allows subsequent cache requests while a cache miss is being handled, reducing miss-induced processor stalls
◦ Kroft first published an NBC using miss information/status holding registers (MSHRs) that keep track of multiple outstanding misses [Kroft 1981]

<Kroft's MSHR: a block valid bit and block request address matched by a comparator, plus per-word valid bit, destination, and format fields>
<Blocking cache: the CPU stalls for the full miss penalty on every miss>
<Non-blocking cache with MSHRs: miss penalties overlap, and the CPU stalls only when a result is actually needed>
Related Work: Inverted MSHR

Performance studies of non-blocking caches
◦ [Farkas and Jouppi 1994] compared four different MSHR organizations:
  Implicitly addressed MSHR: Kroft's MSHR
  Explicitly addressed MSHR: a complement of the implicitly addressed MSHR
  In-cache MSHR: each cache line serves as an MSHR
  Inverted MSHR: a single entry per possible destination
◦ The first three MSHRs hold only one entry per missed block address
◦ In the inverted MSHR, the number of entries equals the number of usable registers in the processor (the possible destinations)

<Inverted MSHR organization: one entry per register (valid bit, request address, format, address in block); all entries are compared in parallel and fed through a match encoder to produce the matching register number>

◦ A recent study evaluated a high-performance out-of-order (OOO) processor on the latest SPEC benchmarks [Li et al. 2011]: a hit-under-two-misses non-blocking cache improved the OOO processor's performance by 17.76% over a blocking data cache
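The inverted MSHR's key behavior, one entry per destination and a parallel compare on every fill, can be sketched in a few lines. The class and field names below are assumptions for illustration, not from [Farkas and Jouppi 1994].

```python
# Sketch of an inverted MSHR: one slot per possible destination
# (register), each holding the miss address it is waiting on. When a
# block returns from memory, every slot is compared (in hardware, in
# parallel) and all matching destinations are released at once.

class InvertedMSHR:
    def __init__(self, num_destinations):
        # None = slot free; otherwise the block address being awaited.
        self.await_addr = [None] * num_destinations

    def record_miss(self, destination, block_addr):
        self.await_addr[destination] = block_addr

    def fill(self, block_addr):
        """Memory returned `block_addr`: release every matching destination."""
        ready = [d for d, a in enumerate(self.await_addr) if a == block_addr]
        for d in ready:
            self.await_addr[d] = None            # slot freed for reuse
        return ready

mshr = InvertedMSHR(num_destinations=4)
mshr.record_miss(0, 0x40)
mshr.record_miss(2, 0x40)                        # same block, different destination
mshr.record_miss(3, 0x80)
print(mshr.fill(0x40))                           # → [0, 2]
```

Unlike Kroft's MSHR, multiple outstanding misses to the same block address need no special merging here: each destination simply holds its own copy of the address.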
The Proposed Architecture

This architecture includes a typical blocking texture cache (BTC) as a level 1 (L1) cache, together with three kinds of buffers for the non-blocking scheme:
◦ Retry buffer: guarantees IO completion
◦ Waiting list buffer: keeps track of miss information
◦ Block address buffer: removes duplicate block addresses
<Proposed NBTC architecture: texture requests from the texture address generation unit are looked up in the retry buffer and the L1 cache; the hit/miss router forwards hit texture requests through a MUX to the texture mapping pipeline, sends missed texture requests to the waiting list buffer, and sends missed texel requests to the block address buffer, whose request address queue feeds the L2 cache or DRAM; completed results update the retry buffer, which forwards fragments in order to the shading unit>
Retry Buffer: Fragment Information

Features
◦ The most important property of the retry buffer (RB) is its support of IO completion
  The RB stores fragment information in input order
  The RB is designed as a FIFO

Data format of each RB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Screen coordinate: screen coordinate (x, y) for the output display unit
◦ Texture request: filtering information and texture address
◦ Ready bit: 0 = filtered texture data invalid, 1 = filtered texture data valid
◦ Filtered texture data: texture data for a completed texture mapping
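The RB's IO-completion property can be sketched as a FIFO whose head is forwarded only once its ready bit is set: results may arrive out of order, but entries leave in input order. The class and method names are assumptions for this sketch.

```python
# Sketch of the retry buffer (RB): a FIFO of fragment entries. Entries
# may become ready out of order (OOO execution), but they retire only
# from the head, which guarantees in-order (IO) completion.
from collections import deque

class RetryBuffer:
    def __init__(self):
        self.fifo = deque()                      # entries in fragment input order

    def push(self, screen_xy, texture_request):
        # The valid bit is implicit: present in the deque = occupied.
        self.fifo.append({"xy": screen_xy, "req": texture_request,
                          "ready": False, "filtered": None})

    def complete(self, screen_xy, filtered):
        # OOO: any entry may be marked ready when its texels arrive.
        for e in self.fifo:
            if e["xy"] == screen_xy and not e["ready"]:
                e["ready"] = True
                e["filtered"] = filtered
                return

    def drain(self):
        # IO completion: forward entries to the shading unit only while
        # the head of the FIFO is ready.
        out = []
        while self.fifo and self.fifo[0]["ready"]:
            out.append(self.fifo.popleft()["filtered"])
        return out

rb = RetryBuffer()
rb.push((0, 0), "reqA")
rb.push((1, 0), "reqB")
rb.complete((1, 0), "texB")      # the second fragment finishes first (OOO)
print(rb.drain())                # → [] : head not ready, order preserved
rb.complete((0, 0), "texA")
print(rb.drain())                # → ['texA', 'texB']
```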
Waiting List Buffer: Texture Requests

Features
◦ The waiting list buffer (WLB) is similar to the inverted MSHR of [Farkas and Jouppi 1994]
  The WLB stores information for both missed and hit addresses
  A texture address in the WLB plays a role similar to a register in the inverted MSHR

Data format of each WLB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Texture ID: ID number of a texture request
◦ Filtering information: the information needed to accomplish the texture mapping
◦ Texel addr N (0..7): the texture addresses of the required texture data
◦ Texel data N (0..7): the texel data for texel addr N
◦ Ready bit N (0..7): 0 = texel data N invalid, 1 = texel data N valid
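A WLB entry as described above becomes eligible for filtering only when all of its texel slots are filled. A minimal sketch (class and field names are assumptions) of that per-entry bookkeeping:

```python
# Sketch of one waiting list buffer (WLB) entry: up to eight texel
# slots, each with its own ready bit. The parked texture request is
# released for filtering only when every slot has been filled.

class WLBEntry:
    def __init__(self, texture_id, filtering_info, texel_addrs):
        self.texture_id = texture_id
        self.filtering_info = filtering_info         # e.g. "bilinear"
        self.texel_addrs = list(texel_addrs)         # texel addr 0..7
        self.texel_data = [None] * len(self.texel_addrs)
        self.ready = [False] * len(self.texel_addrs)

    def fill(self, addr, data):
        # One loaded block may satisfy several slots of the same entry.
        for i, a in enumerate(self.texel_addrs):
            if a == addr and not self.ready[i]:
                self.texel_data[i] = data
                self.ready[i] = True

    def is_ready(self):
        return all(self.ready)

e = WLBEntry(texture_id=7, filtering_info="bilinear",
             texel_addrs=[0x100, 0x104, 0x100, 0x108])
e.fill(0x100, "t0")              # fills slots 0 and 2 at once
e.fill(0x104, "t1")
print(e.is_ready())              # → False (0x108 still outstanding)
e.fill(0x108, "t2")
print(e.is_ready())              # → True
```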
Block Address Buffer: Texel Requests

Features
◦ The block address buffer (BAB) issues DRAM accesses sequentially for the texel requests that caused cache misses
  The BAB removes duplicate DRAM requests
  When the data are loaded, every request that was removed as a duplicate is satisfied as well
  The BAB is designed as a FIFO

<Request address queue: block address | block address | … | miss address>
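The BAB's duplicate removal can be sketched as a FIFO paired with a set of pending block addresses; the class and method names are assumptions for this illustration.

```python
# Sketch of the block address buffer (BAB): a FIFO of missed block
# addresses on their way to the L2 cache or DRAM, with duplicate
# requests suppressed so each block is fetched once.
from collections import deque

class BlockAddressBuffer:
    def __init__(self):
        self.queue = deque()     # request address queue, FIFO order
        self.pending = set()     # block addresses already enqueued

    def request(self, block_addr):
        # Duplicate removal: enqueue each missed block address only once.
        if block_addr not in self.pending:
            self.pending.add(block_addr)
            self.queue.append(block_addr)

    def issue(self):
        # Issue the next DRAM access sequentially.
        addr = self.queue.popleft()
        self.pending.discard(addr)
        return addr

bab = BlockAddressBuffer()
for a in [0x40, 0x80, 0x40, 0x40, 0xC0]:     # three misses map to block 0x40
    bab.request(a)
print(len(bab.queue))                        # → 3 : duplicates removed
bab.issue()                                  # first DRAM access fetches 0x40
```

When the single fetch of 0x40 returns, the WLB's parallel compare (previous slides) satisfies all three requests that mapped to it, which is why the bandwidth overhead stays low.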
Execution Flow of our NBTC

Start → execute the RB lookup → generate texture addresses → execute tag compare with the texel requests → all hits: hit handling case / miss occurred: miss handling case
Execution Flow: Hit Handling Case

Hit handling case: read the texel data from the L1 cache → input the texel data to the texture mapping unit via the MUX → execute texture mapping → update the RB
Execution Flow: Miss Handling Case
Miss handling case (concurrent execution):
◦ Read the hit texel data from the L1 cache
◦ Input the missed texture requests to the WLB
◦ Input the missed texel requests to the BAB, removing duplicate texel requests
◦ Process the next texture request
Execution Flow: Miss Handling Case (cont.)
Miss handling case, after the memory request completes:
◦ Complete the memory request and forward the loaded data to the WLB and the cache
◦ Determine the ready entry in the WLB and invalidate it
◦ Input its texel data to the texture mapping unit via the MUX
◦ Execute texture mapping
◦ Update the RB
Execution Flow: Update Retry Buffer
Update RB:
◦ Determine the ready entry in the RB
◦ Determine whether IO completion holds
◦ Forward the ready entry to the shading unit
◦ Process the next fragment information
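Putting the hit, miss, and update-RB cases together, the per-request flow of the preceding slides can be sketched end to end. Everything below is a simplified stand-in, assumed names and data structures, not the authors' hardware: the cache is a set of block addresses, the WLB a dict of outstanding texel sets, and the BAB a deduplicated list.

```python
# End-to-end sketch of the NBTC flow: hits go straight to filtering,
# misses park in the WLB while the BAB fetches their blocks, and the
# RB releases results to the shading unit strictly in input order.
from collections import deque

class SimpleRB:
    """Minimal retry buffer: FIFO of [screen_xy, ready] pairs."""
    def __init__(self):
        self.fifo = deque()
    def push(self, xy):
        self.fifo.append([xy, False])
    def complete(self, xy):
        for e in self.fifo:
            if e[0] == xy and not e[1]:
                e[1] = True
                return
    def drain(self):
        out = []
        while self.fifo and self.fifo[0][1]:     # IO completion at the head
            out.append(self.fifo.popleft()[0])
        return out

def handle_texture_request(xy, texel_addrs, cache, rb, wlb, bab):
    rb.push(xy)                                  # reserve the IO-completion slot
    missing = {a for a in texel_addrs if a not in cache}
    if not missing:
        rb.complete(xy)                          # hit handling case
    else:
        wlb[xy] = missing                        # miss handling case: park in WLB
        for a in missing:
            if a not in bab:                     # BAB removes duplicate requests
                bab.append(a)

def handle_dram_fill(addr, cache, rb, wlb):
    cache.add(addr)                              # update the L1 cache
    for xy in list(wlb):                         # forward loaded data to the WLB
        wlb[xy].discard(addr)
        if not wlb[xy]:                          # entry ready: filter and retire
            del wlb[xy]
            rb.complete(xy)

cache, rb, wlb, bab = {0x00}, SimpleRB(), {}, []
handle_texture_request((0, 0), [0x40], cache, rb, wlb, bab)   # miss
handle_texture_request((1, 0), [0x00], cache, rb, wlb, bab)   # hit behind it
print(rb.drain())                # → [] : the hit waits behind the miss (IO)
handle_dram_fill(0x40, cache, rb, wlb)
print(rb.drain())                # → [(0, 0), (1, 0)]
```

The demo shows the scheme's two halves at once: the later hit does not stall the pipeline (OOO), yet it is not forwarded to the shading unit before the earlier miss resolves (IO completion).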
Experimental Environment

Simulator configuration
◦ mRPsim: announced by SAIT [Yoo et al. 2010]
  An execution-driven, cycle-accurate simulator for an SRP-based GPU
  Modified texture mapping unit; eight pixel processors
  DRAM access latencies: 50, 100, 200, and 300 cycles
◦ Benchmark: Taiji, which has nearest, bi-linear, and tri-linear filtering modes

Cache configuration
◦ Four-way set associative, eight-word block size, 32 KByte cache size
◦ Number of entries in each buffer: 32
Pixel Shader Execution Cycle

Pixel shader cycles per frame
◦ PS run cycles: running cycles
◦ PS stall cycles: stall cycles
◦ NBTC stall cycles: stall cycles due to the WLB being full
◦ The pixel shader's execution cycle decreased by 12.47% (latency 50) up to 41.64% (latency 300)

<Chart: total PS cycles (run, stall, and NBTC stall) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Cache Miss Rate

Cache miss rates
◦ The NBTC's cache miss rate is slightly higher than the BTC's
  The NBTC can handle subsequent cache accesses while a cache update is still in flight

<Chart: miss rate (%) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Memory Bandwidth Requirement

Memory bandwidth requirement
◦ The memory bandwidth requirement of the NBTC increased by up to 11% over that of the BTC
  Since the block address buffer removes duplicate DRAM requests, the increase in the memory bandwidth requirement stayed relatively low

<Chart: memory bandwidth (MBytes) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Conclusion & Future Work

A non-blocking texture cache (NBTC) improves texture cache performance
◦ Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate
◦ Three buffers support the non-blocking scheme:
  The retry buffer: IO completion
  The waiting list buffer: tracking miss information
  The block address buffer: removing duplicate block addresses

We also plan to implement the proposed NBTC architecture in hardware and then measure both its power consumption and its hardware area