TRANSCRIPT
Effective Non-Blocking Cache Architecture for High-Performance Texture Mapping
Dukki Hong (1), Youngduke Seo (1), Youngsik Kim (2), Kwon-Taek Kwon (3),
Sang-Oak Woo (3), Seok-Yoon Jung (3), Kyoungwoo Lee (4), Woo-Chan Park (1)
(1) Media Processor Lab., Sejong University  (2) Korea Polytechnic University
(3) SAIT of Samsung Electronics Co., Ltd.  (4) Yonsei University
http://rayman.sejong.ac.kr
October 3, 2013
Contents

Introduction
Related Work
◦ Texture mapping
◦ Non-blocking scheme
Proposed Non-Blocking Texture Cache
◦ The proposed architecture
◦ Buffers for the non-blocking scheme
◦ Execution flow of the NBTC
Experimental Results
Conclusion
Introduction

Texture mapping
◦ Core technique for 3D graphics
◦ Maps texture images onto object surfaces

Problem: a huge amount of memory access is required
◦ Major bottleneck in graphics pipelines
◦ Modern GPUs generally use texture caches to solve this problem

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time
Mobile 3D games

The visual quality of mobile 3D games has evolved enough to compare with PC games.
◦ Detailed texture images (ex: Infinity Blade, 2048 [GDC 2011])
◦ Demands high texture mapping throughput

<Epic Games: Infinity Blade Series> <Gameloft: Asphalt Series>
Our approach

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time

In this presentation, we introduce a non-blocking texture cache (NBTC) architecture
◦ Out-of-order (OOO) execution
◦ Conditional in-order (IO) completion for texture requests with the same screen coordinate, to support the standard API effectively
Texture mapping

What is texture mapping?
◦ Texture mapping glues n-D images onto geometric objects to increase realism
<Texture> <Object> <Texture-Mapped Object>

Texture filtering
◦ Texture filtering is an operation for reducing the aliasing artifacts caused by texture mapping
  Bi-linear filtering: four samples per texture access
  Tri-linear filtering: eight samples per texture access
<Results of the texture filtering>
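As a concrete illustration of why filtering drives memory traffic, bi-linear filtering can be sketched as below: four texel fetches per sample, blended by fractional weights. The 2-D list texture layout and the function name are assumptions made for this sketch, not the presenters' implementation.

```python
# Sketch of bi-linear texture filtering: four texels around the sample
# point are fetched and blended by their fractional distances. Each
# sample therefore costs up to four cache accesses.

def bilinear_filter(texture, u, v):
    """Sample `texture` (rows x cols of floats) at real coordinates (u, v)."""
    rows, cols = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)                      # top-left texel of the 2x2 footprint
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = u - x0, v - y0                      # fractional blend weights

    # Four texel fetches -> four potential texture cache accesses.
    t00, t10 = texture[y0][x0], texture[y0][x1]
    t01, t11 = texture[y1][x0], texture[y1][x1]

    top = t00 * (1 - fx) + t10 * fx
    bot = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear_filter(tex, 0.5, 0.5))            # → 0.5 (midpoint of all four texels)
```

Tri-linear filtering repeats this on two adjacent mipmap levels and blends the two results, which is why it needs eight samples per access.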
Related Work

Cache performance studies
◦ In [Hakura and Gupta 1997], the performance of a texture cache was measured across various benchmarks
◦ In [Igehy et al. 1999], the performance of a texture cache was studied with regard to multiple pixel pipelines

Pre-fetching scheme
◦ In [Igehy et al. 1998], the latency generated during texture cache misses is hidden by applying an explicit pre-fetching scheme

Survey of texture caches
◦ The introduction of the texture cache and the integration of texture cache architectures into modern GPUs were surveyed in [Doggett 2012]
Related Work: Non-blocking Scheme

Non-blocking cache (NBC)
◦ Allows subsequent cache requests while a cache miss is being handled, reducing miss-induced processor stalls
◦ Kroft first published an NBC using miss information/status holding registers (MSHRs) that keep track of multiple outstanding misses [Kroft 1981]

<Kroft's MSHR: a block valid bit and block request address matched by a comparator, plus per-word valid bit, destination, and format fields>
<Blocking cache: the CPU stalls for the full miss penalty on every miss>
<Non-blocking cache with MSHRs: miss penalties overlap, and the CPU stalls only when a result is actually needed>
Related Work: Inverted MSHR

Performance studies of non-blocking caches
◦ [Farkas and Jouppi 1994] compared four different MSHR organizations:
  Implicitly addressed MSHR: Kroft's MSHR
  Explicitly addressed MSHR: a complement of the implicitly addressed MSHR
  In-cache MSHR: each cache line serves as an MSHR
  Inverted MSHR: a single entry per possible destination
◦ The first three MSHRs hold only one entry per missed block address
◦ In the inverted MSHR, the number of entries equals the number of usable registers in the processor (the possible destinations)

<Inverted MSHR organization: one entry per register (valid bit, request address, format, address in block); all entries are compared in parallel and fed through a match encoder to produce the matching register number>

◦ A recent study evaluated a high-performance out-of-order (OOO) processor on the latest SPEC benchmarks [Li et al. 2011]: a hit-under-two-misses non-blocking cache improved the OOO processor's performance by 17.76% over a blocking data cache
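The inverted MSHR's key behavior, one entry per destination and a parallel compare on every fill, can be sketched in a few lines. The class and field names below are assumptions for illustration, not from [Farkas and Jouppi 1994].

```python
# Sketch of an inverted MSHR: one slot per possible destination
# (register), each holding the miss address it is waiting on. When a
# block returns from memory, every slot is compared (in hardware, in
# parallel) and all matching destinations are released at once.

class InvertedMSHR:
    def __init__(self, num_destinations):
        # None = slot free; otherwise the block address being awaited.
        self.await_addr = [None] * num_destinations

    def record_miss(self, destination, block_addr):
        self.await_addr[destination] = block_addr

    def fill(self, block_addr):
        """Memory returned `block_addr`: release every matching destination."""
        ready = [d for d, a in enumerate(self.await_addr) if a == block_addr]
        for d in ready:
            self.await_addr[d] = None            # slot freed for reuse
        return ready

mshr = InvertedMSHR(num_destinations=4)
mshr.record_miss(0, 0x40)
mshr.record_miss(2, 0x40)                        # same block, different destination
mshr.record_miss(3, 0x80)
print(mshr.fill(0x40))                           # → [0, 2]
```

Unlike Kroft's MSHR, multiple outstanding misses to the same block address need no special merging here: each destination simply holds its own copy of the address.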
The Proposed Architecture

This architecture includes a typical blocking texture cache (BTC) as a level 1 (L1) cache, together with three kinds of buffers for the non-blocking scheme:
◦ Retry buffer: guarantees IO completion
◦ Waiting list buffer: keeps track of miss information
◦ Block address buffer: removes duplicate block addresses
<Proposed NBTC architecture: texture requests from the texture address generation unit are looked up in the retry buffer and the L1 cache; the hit/miss router forwards hit texture requests through a MUX to the texture mapping pipeline, sends missed texture requests to the waiting list buffer, and sends missed texel requests to the block address buffer, whose request address queue feeds the L2 cache or DRAM; completed results update the retry buffer, which forwards fragments in order to the shading unit>
Retry Buffer: Fragment Information

Features
◦ The most important property of the retry buffer (RB) is its support of IO completion
  The RB stores fragment information in input order
  The RB is designed as a FIFO

Data format of each RB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Screen coordinate: screen coordinate (x, y) for the output display unit
◦ Texture request: filtering information and texture address
◦ Ready bit: 0 = filtered texture data invalid, 1 = filtered texture data valid
◦ Filtered texture data: texture data for a completed texture mapping
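The RB's IO-completion property can be sketched as a FIFO whose head is forwarded only once its ready bit is set: results may arrive out of order, but entries leave in input order. The class and method names are assumptions for this sketch.

```python
# Sketch of the retry buffer (RB): a FIFO of fragment entries. Entries
# may become ready out of order (OOO execution), but they retire only
# from the head, which guarantees in-order (IO) completion.
from collections import deque

class RetryBuffer:
    def __init__(self):
        self.fifo = deque()                      # entries in fragment input order

    def push(self, screen_xy, texture_request):
        # The valid bit is implicit: present in the deque = occupied.
        self.fifo.append({"xy": screen_xy, "req": texture_request,
                          "ready": False, "filtered": None})

    def complete(self, screen_xy, filtered):
        # OOO: any entry may be marked ready when its texels arrive.
        for e in self.fifo:
            if e["xy"] == screen_xy and not e["ready"]:
                e["ready"] = True
                e["filtered"] = filtered
                return

    def drain(self):
        # IO completion: forward entries to the shading unit only while
        # the head of the FIFO is ready.
        out = []
        while self.fifo and self.fifo[0]["ready"]:
            out.append(self.fifo.popleft()["filtered"])
        return out

rb = RetryBuffer()
rb.push((0, 0), "reqA")
rb.push((1, 0), "reqB")
rb.complete((1, 0), "texB")      # the second fragment finishes first (OOO)
print(rb.drain())                # → [] : head not ready, order preserved
rb.complete((0, 0), "texA")
print(rb.drain())                # → ['texA', 'texB']
```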
Waiting List Buffer: Texture Requests

Features
◦ The waiting list buffer (WLB) is similar to the inverted MSHR of [Farkas and Jouppi 1994]
  The WLB stores information for both missed and hit addresses
  A texture address in the WLB plays a role similar to a register in the inverted MSHR

Data format of each WLB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Texture ID: ID number of a texture request
◦ Filtering information: the information needed to accomplish the texture mapping
◦ Texel addr N (0..7): the texture addresses of the required texture data
◦ Texel data N (0..7): the texel data for texel addr N
◦ Ready bit N (0..7): 0 = texel data N invalid, 1 = texel data N valid
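A WLB entry as described above becomes eligible for filtering only when all of its texel slots are filled. A minimal sketch (class and field names are assumptions) of that per-entry bookkeeping:

```python
# Sketch of one waiting list buffer (WLB) entry: up to eight texel
# slots, each with its own ready bit. The parked texture request is
# released for filtering only when every slot has been filled.

class WLBEntry:
    def __init__(self, texture_id, filtering_info, texel_addrs):
        self.texture_id = texture_id
        self.filtering_info = filtering_info         # e.g. "bilinear"
        self.texel_addrs = list(texel_addrs)         # texel addr 0..7
        self.texel_data = [None] * len(self.texel_addrs)
        self.ready = [False] * len(self.texel_addrs)

    def fill(self, addr, data):
        # One loaded block may satisfy several slots of the same entry.
        for i, a in enumerate(self.texel_addrs):
            if a == addr and not self.ready[i]:
                self.texel_data[i] = data
                self.ready[i] = True

    def is_ready(self):
        return all(self.ready)

e = WLBEntry(texture_id=7, filtering_info="bilinear",
             texel_addrs=[0x100, 0x104, 0x100, 0x108])
e.fill(0x100, "t0")              # fills slots 0 and 2 at once
e.fill(0x104, "t1")
print(e.is_ready())              # → False (0x108 still outstanding)
e.fill(0x108, "t2")
print(e.is_ready())              # → True
```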
Block Address Buffer: Texel Requests

Features
◦ The block address buffer (BAB) issues DRAM accesses sequentially for the texel requests that caused cache misses
  The BAB removes duplicate DRAM requests
  When the data are loaded, every request that was removed as a duplicate is satisfied as well
  The BAB is designed as a FIFO

<Request address queue: block address | block address | … | miss address>
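The BAB's duplicate removal can be sketched as a FIFO paired with a set of pending block addresses; the class and method names are assumptions for this illustration.

```python
# Sketch of the block address buffer (BAB): a FIFO of missed block
# addresses on their way to the L2 cache or DRAM, with duplicate
# requests suppressed so each block is fetched once.
from collections import deque

class BlockAddressBuffer:
    def __init__(self):
        self.queue = deque()     # request address queue, FIFO order
        self.pending = set()     # block addresses already enqueued

    def request(self, block_addr):
        # Duplicate removal: enqueue each missed block address only once.
        if block_addr not in self.pending:
            self.pending.add(block_addr)
            self.queue.append(block_addr)

    def issue(self):
        # Issue the next DRAM access sequentially.
        addr = self.queue.popleft()
        self.pending.discard(addr)
        return addr

bab = BlockAddressBuffer()
for a in [0x40, 0x80, 0x40, 0x40, 0xC0]:     # three misses map to block 0x40
    bab.request(a)
print(len(bab.queue))                        # → 3 : duplicates removed
bab.issue()                                  # first DRAM access fetches 0x40
```

When the single fetch of 0x40 returns, the WLB's parallel compare (previous slides) satisfies all three requests that mapped to it, which is why the bandwidth overhead stays low.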
Execution Flow of our NBTC

Start → execute the RB lookup → generate texture addresses → execute tag compare with the texel requests → all hits: hit handling case / miss occurred: miss handling case
Execution Flow: Hit Handling Case

Hit handling case: read the texel data from the L1 cache → input the texel data to the texture mapping unit via the MUX → execute texture mapping → update the RB
Execution Flow: Miss Handling Case
Miss handling case (concurrent execution):
◦ Read the hit texel data from the L1 cache
◦ Input the missed texture requests to the WLB
◦ Input the missed texel requests to the BAB, removing duplicate texel requests
◦ Process the next texture request
Execution Flow: Miss Handling Case (cont.)
Miss handling case, after the memory request completes:
◦ Complete the memory request and forward the loaded data to the WLB and the cache
◦ Determine the ready entry in the WLB and invalidate it
◦ Input its texel data to the texture mapping unit via the MUX
◦ Execute texture mapping
◦ Update the RB
Execution Flow: Update Retry Buffer
Update RB:
◦ Determine the ready entry in the RB
◦ Determine whether IO completion holds
◦ Forward the ready entry to the shading unit
◦ Process the next fragment information
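Putting the hit, miss, and update-RB cases together, the per-request flow of the preceding slides can be sketched end to end. Everything below is a simplified stand-in, assumed names and data structures, not the authors' hardware: the cache is a set of block addresses, the WLB a dict of outstanding texel sets, and the BAB a deduplicated list.

```python
# End-to-end sketch of the NBTC flow: hits go straight to filtering,
# misses park in the WLB while the BAB fetches their blocks, and the
# RB releases results to the shading unit strictly in input order.
from collections import deque

class SimpleRB:
    """Minimal retry buffer: FIFO of [screen_xy, ready] pairs."""
    def __init__(self):
        self.fifo = deque()
    def push(self, xy):
        self.fifo.append([xy, False])
    def complete(self, xy):
        for e in self.fifo:
            if e[0] == xy and not e[1]:
                e[1] = True
                return
    def drain(self):
        out = []
        while self.fifo and self.fifo[0][1]:     # IO completion at the head
            out.append(self.fifo.popleft()[0])
        return out

def handle_texture_request(xy, texel_addrs, cache, rb, wlb, bab):
    rb.push(xy)                                  # reserve the IO-completion slot
    missing = {a for a in texel_addrs if a not in cache}
    if not missing:
        rb.complete(xy)                          # hit handling case
    else:
        wlb[xy] = missing                        # miss handling case: park in WLB
        for a in missing:
            if a not in bab:                     # BAB removes duplicate requests
                bab.append(a)

def handle_dram_fill(addr, cache, rb, wlb):
    cache.add(addr)                              # update the L1 cache
    for xy in list(wlb):                         # forward loaded data to the WLB
        wlb[xy].discard(addr)
        if not wlb[xy]:                          # entry ready: filter and retire
            del wlb[xy]
            rb.complete(xy)

cache, rb, wlb, bab = {0x00}, SimpleRB(), {}, []
handle_texture_request((0, 0), [0x40], cache, rb, wlb, bab)   # miss
handle_texture_request((1, 0), [0x00], cache, rb, wlb, bab)   # hit behind it
print(rb.drain())                # → [] : the hit waits behind the miss (IO)
handle_dram_fill(0x40, cache, rb, wlb)
print(rb.drain())                # → [(0, 0), (1, 0)]
```

The demo shows the scheme's two halves at once: the later hit does not stall the pipeline (OOO), yet it is not forwarded to the shading unit before the earlier miss resolves (IO completion).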
Experimental Environment

Simulator configuration
◦ mRPsim: announced by SAIT [Yoo et al. 2010]
  An execution-driven, cycle-accurate simulator for an SRP-based GPU
  Modified texture mapping unit; eight pixel processors
  DRAM access latencies: 50, 100, 200, and 300 cycles
◦ Benchmark: Taiji, which has nearest, bi-linear, and tri-linear filtering modes

Cache configuration
◦ Four-way set associative, eight-word block size, 32 KByte cache size
◦ Number of entries in each buffer: 32
Pixel Shader Execution Cycle

Pixel shader cycles per frame
◦ PS run cycles: running cycles
◦ PS stall cycles: stall cycles
◦ NBTC stall cycles: stall cycles due to the WLB being full
◦ The pixel shader's execution cycle decreased by 12.47% (latency 50) up to 41.64% (latency 300)

<Chart: total PS cycles (run, stall, and NBTC stall) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Cache Miss Rate

Cache miss rates
◦ The NBTC's cache miss rate is slightly higher than the BTC's
  The NBTC can handle subsequent cache accesses while a cache update is still in flight

<Chart: miss rate (%) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Memory Bandwidth Requirement

Memory bandwidth requirement
◦ The memory bandwidth requirement of the NBTC increased by up to 11% over that of the BTC
  Since the block address buffer removes duplicate DRAM requests, the increase in the memory bandwidth requirement stayed relatively low

<Chart: memory bandwidth (MBytes) for BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Conclusion & Future Work

A non-blocking texture cache (NBTC) improves texture cache performance
◦ Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate
◦ Three buffers support the non-blocking scheme:
  The retry buffer: IO completion
  The waiting list buffer: tracking miss information
  The block address buffer: removing duplicate block addresses

We also plan to implement the proposed NBTC architecture in hardware and then measure both its power consumption and its hardware area