-
Computer Architecture Lecture 3: Memory Hierarchy Design (Chapter 2, Appendix B)
Chih-Wei Liu, National Chiao Tung University, [email protected]
-
Since 1980, CPU has outpaced DRAM
Gap grew 50% per year:
CPU: 60% per year (2X in 1.5 yrs)
DRAM: 9% per year (2X in 10 yrs)
Introduction
CA-Lec3 [email protected]
-
Introduction
Programmers want unlimited amounts of memory with low latency
Fast memory technology is more expensive per bit than slower memory
Solution: organize the memory system into a hierarchy
Entire addressable memory space available in the largest, slowest memory
Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
Temporal and spatial locality ensure that nearly all references can be found in smaller memories
Gives the illusion of a large, fast memory being presented to the processor
-
Memory Hierarchy Design
Memory hierarchy design becomes more crucial with recent multicore processors: aggregate peak bandwidth grows with # cores:
Intel Core i7 can generate two references per core per clock
Four cores and 3.2 GHz clock:
25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references = 409.6 GB/s!
DRAM bandwidth is only 6% of this (25 GB/s)
Requires:
Multiport, pipelined caches
Two levels of cache per core
Shared third-level cache on chip
-
Memory Hierarchy
Take advantage of the principle of locality to:
Present as much memory as in the cheapest technology
Provide access at the speed offered by the fastest technology

Level | Speed (ns) | Size (bytes)
Registers (processor: control + datapath) | 1s | 100s
On-chip cache / second-level cache (SRAM) | 10s-100s | Ks-Ms
Main memory (DRAM/FLASH/PCM) | 100s | Ms
Secondary storage (Disk/FLASH/PCM) | 10,000,000s (10s ms) | Gs
Tertiary storage (Tape/Cloud storage) | 10,000,000,000s (10s sec) | Ts
-
Multicore Architecture
[Figure: multiple processing nodes, each containing a CPU and a local memory hierarchy (of optimal fixed size), connected by an interconnection network]
-
The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
HW relies on locality for speed
-
Memory Hierarchy Basics
When a word is not found in the cache, a miss occurs:
Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
The lower level may be another cache or the main memory
Also fetch the other words contained within the block
Takes advantage of spatial locality
Place the block into the cache in any location within its set, determined by the address:
block address MOD number of sets
-
Hit and Miss
Hit: data appears in some block in the upper level (e.g., Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 − (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
-
Cache Performance Formulas
T_acc = T_hit + f_miss × T_miss
(Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
The times T_acc, T_hit, and T_miss can each be either:
Real time (e.g., nanoseconds)
Or, number of clock cycles — in contexts where the cycle time is known to be a constant
Important: T_miss means the extra (not total) time for a miss, in addition to T_hit, which is incurred by all accesses
[Figure: CPU — (hit time) — Cache — (miss penalty) — lower levels of hierarchy]
-
Four Questions for Memory Hierarchy
Consider any level in a memory hierarchy. Remember a block is the unit of data transfer between the given level and the levels below it.
The level design is described by four behaviors:
Block Placement: where could a new block be placed in the level?
Block Identification: how is a block found if it is in the level?
Block Replacement: which existing block should be replaced if necessary?
Write Strategy: how are writes to the block handled?
-
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative
S.A. mapping = block number modulo number of sets
[Figure: 32-block memory (blocks 0-31) mapped into an 8-block cache]
Fully mapped: block 12 can go anywhere
Direct mapped: (12 mod 8) = block 4
2-way set associative: (12 mod 4) = set 0
-
Q2: How is a block found if it is in the upper level?
Index used to look up candidates — the index identifies the set in the cache
Tag used to identify the actual copy — if no candidates match, then declare a cache miss
Block is the minimum quantum of caching
Data select field used to select data within the block — many caching applications don't have a data select field
Larger block size has distinct hardware advantages:
less tag overhead
exploits fast burst transfers from DRAM / over wide busses
Disadvantages of larger block size?
Fewer blocks → more conflicts; can waste bandwidth
Address fields: Block Address = [Tag | Index (set select)], followed by Block Offset (data select)
-
Review: Direct Mapped Cache
Direct mapped 2^N-byte cache:
The uppermost (32 − N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)
Example: 1 KB direct mapped cache with 32 B blocks
Index chooses a potential block
Tag checked to verify the block
Byte select chooses the byte within the block
Address (bits 31..0): Cache Tag (bits 31-10, ex: 0x50) | Cache Index (bits 9-5, ex: 0x01) | Byte Select (bits 4-0, ex: 0x00)
[Figure: 32-entry cache, each entry holding a valid bit, a cache tag (e.g., 0x50), and 32 bytes of cache data (Byte 0 ... Byte 1023)]
-
Direct Mapped Cache Architecture
[Figure: address split into Tag | Frame# | Offset; decode & row select indexes the tag and block-frame arrays; the stored tag is compared with the address tag to produce Hit; the offset drives the mux select that picks the data word]
-
Review: Set Associative Cache
N-way set associative: N entries per cache index
N direct mapped caches operate in parallel
Example: two-way set associative cache
Cache index selects a set from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag result
[Figure: address = Cache Tag | Cache Index (bits 8-5) | Byte Select (bits 4-0); two banks, each with valid bit, cache tag, and cache data (cache block 0); two comparators feed an OR gate to produce Hit and the Sel1/Sel0 mux that picks the cache block]
-
Review: Fully Associative Cache
Fully associative: any block frame can hold any line
Address does not include a cache index
Compare the cache tags of all cache entries in parallel
Example: Block Size = 32 B blocks
We need N 27-bit comparators
Still have byte select to choose from within the block
[Figure: address = Cache Tag (27 bits, bits 31-5) | Byte Select (bits 4-0, ex: 0x01); every entry's valid bit and tag feed a comparator in parallel]
-
Concluding Remarks
Direct mapped cache = 1-way set associative cache
Fully associative cache: there is only 1 set
-
Cache Size Equation
Simple equation for the size of a cache:
(Cache size) = (Block size) × (Number of sets) × (Set associativity)
Can relate to the size of various address fields:
(Block size) = 2^(# of offset bits)
(Number of sets) = 2^(# of index bits)
(# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits)
Memory address = [Tag | Index | Offset]
-
Q3: Which block should be replaced on a miss?
Easy for direct mapped cache — only one choice
Set associative or fully associative:
LRU (least recently used): appealing, but hard to implement for high associativity
Random: easy, but how well does it work?
First in, first out (FIFO)
-
Q4: What happens on a write?

                                   Write-Through                 Write-Back
Policy                             Data written to the cache     Write data only to the cache;
                                   block is also written to      update the lower level when
                                   lower-level memory            the block falls out of the cache
Debug                              Easy                          Hard
Do read misses produce writes?     No                            Yes
Do repeated writes make it
to the lower level?                Yes                           No

Additional option: let writes to an un-cached address allocate a new cache line (write-allocate).
-
Write Buffers
[Figure: Processor → Cache; a Write Buffer between them and Lower-Level Memory holds data awaiting write-through to lower-level memory]
Q. Why a write buffer?
A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.
-
More on Cache Performance Metrics
Can split access time into instructions & data:
Avg. mem. access time = (% instruction accesses) × (inst. mem. access time) + (% data accesses) × (data mem. access time)
Another formula from Chapter 1:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time
Useful for exploring ISA changes
Can break stalls into reads and writes:
Memory stall cycles = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty)
-
Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
Cold fact of life: not a whole lot you can do about it
Note: if you are going to run billions of instructions, compulsory misses are insignificant
Capacity:
Cache cannot contain all blocks accessed by the program
Solution: increase cache size
Conflict (collision): multiple memory locations mapped to the same cache location
Solution 1: increase cache size
Solution 2: increase associativity
Coherence (invalidation): other process (e.g., I/O) updates memory
-
Memory Hierarchy Basics
Six basic cache optimizations:
Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty
Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption
Higher associativity: reduces conflict misses; increases hit time, increases power consumption
Higher number of cache levels: reduces overall memory access time
Giving priority to read misses over writes: reduces miss penalty
Avoiding address translation in cache indexing: reduces hit time
-
1. Larger Block Sizes
Larger block size ⇒ smaller number of blocks
Obvious advantage: reduces compulsory misses
The reason is spatial locality
Obvious disadvantage:
Higher miss penalty: a larger block takes longer to move
May increase conflict misses and capacity misses if the cache is small
Don't let the increase in miss penalty outweigh the decrease in miss rate
-
2. Large Caches
Larger cache size ⇒ lower miss rate, but longer hit time
Helps with both conflict and capacity misses
May need longer hit time AND/OR higher HW cost
Popular in off-chip caches
-
3. Higher Associativity
Reduces conflict misses
2:1 cache rule of thumb on miss rate:
A 2-way set associative cache of size N/2 has about the same miss rate as a direct mapped cache of size N (holds for cache sizes up to 128 KB)
-
4. Multi-Level Caches
Two-level cache example:
AMAT (L1 only) = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
AMAT (with L2) = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
Probably the best miss-penalty reduction method
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 × Miss rate_L2)
Global miss rate is what matters
-
Multi-Level Caches (Cont.)
Advantages:
Capacity misses in L1 end up with a significant penalty reduction
Conflict misses in L1 similarly get supplied by L2
Holding the size of the 1st-level cache constant:
Decreases the miss penalty of the 1st-level cache
Or, increases the average global hit time a bit: hit time_L1 + miss rate_L1 × hit time_L2, but decreases the global miss rate
Holding total cache size constant:
Global miss rate and miss penalty about the same
Decreases average global hit time significantly! New L1 much smaller than old L1
-
Miss Rate Example
Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
Miss rate for the first-level cache = 40/1000 (4%)
Local miss rate for the second-level cache = 20/40 (50%)
Global miss rate for the second-level cache = 20/1000 (2%)
Assume miss penalty_L2 is 200 CC, hit time_L2 is 10 CC, hit time_L1 is 1 CC, and 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2) = 1 + 4% × (10 + 50% × 200) = 5.4 CC
Average memory stalls per instruction = Misses per instruction_L1 × Hit time_L2 + Misses per instruction_L2 × Miss penalty_L2 = (40 × 1.5/1000) × 10 + (20 × 1.5/1000) × 200 = 6.6 CC
Or (5.4 − 1.0) × 1.5 = 6.6 CC
-
5. Giving Priority to Read Misses Over Writes
In write-through, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss
RAW conflicts with main memory reads on cache misses
Read miss waits until the write buffer is empty ⇒ increases the read miss penalty
Instead, check write buffer contents before the read; if no conflicts, let the memory access continue
Example (all three references map to cache index 0 — does the read of 512 return R3?):
SW R3, 512(R0)   ; cache index 0
LW R1, 1024(R0)  ; cache index 0
LW R2, 512(R0)   ; cache index 0
R2 == R3? Requires read priority over write
Write-back?
Read miss replacing a dirty block
Normal: write the dirty block to memory, then do the read
Instead, copy the dirty block to a write buffer, then do the read, and then do the write
CPU stalls less since it restarts as soon as the read is done
-
6. Avoiding Address Translation during Indexing of the Cache
Virtually addressed caches
[Figure: three organizations ($ means cache):
Conventional organization: CPU → (VA) → TLB → (PA) → $ → (PA) → MEM; address translation precedes cache indexing with the physical address
Virtually addressed cache: CPU → (VA) → $ → on a miss (VA) → TLB → (PA) → MEM; translate only on a miss; VA tags; synonym (alias) problem
Overlapped organization: CPU → (VA) → $ indexed while the TLB translates in parallel; PA tags, L2 $; requires the cache index to remain invariant across translation]
-
Why not Virtual Cache?
A task switch causes the same VA to refer to different PAs
Hence, the cache must be flushed
Huge task-switch overhead
Also creates huge compulsory miss rates for the new process
Synonym or alias problem: different VAs may map to the same PA
Two copies of the same data in a virtual cache
Anti-aliasing HW mechanism is required (complicated); SW can help
I/O (always uses PA)
Requires mapping to VA to interact with a virtual cache
-
Advanced Cache Optimizations
Reducing hit time:
1. Small and simple caches
2. Way prediction
Increasing cache bandwidth:
3. Pipelined caches
4. Multibanked caches
5. Nonblocking caches
Reducing miss penalty:
6. Critical word first
7. Merging write buffers
Reducing miss rate:
8. Compiler optimizations
Reducing miss penalty or miss rate via parallelism:
9. Hardware prefetching
10. Compiler prefetching
-
1. Small and Simple L1 Caches
Critical timing path in a cache: addressing the tag memory, then comparing tags, then selecting the correct set
Indexing the tag memory and then comparing takes time
Direct mapped caches can overlap tag compare and transmission of data — since there is only one choice
Lower associativity reduces power because fewer cache lines are accessed
Advanced Optimizations
-
L1 Size and Associativity
[Figure: access time vs. size and associativity]
-
L1 Size and Associativity
[Figure: energy per read vs. size and associativity]
-
2. Fast Hit Times via Way Prediction
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Way prediction: keep extra bits in the cache to predict the way, or block within the set, of the next cache access
The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
Miss ⇒ first check the other blocks for matches in the next clock cycle
Accuracy ≈ 85%
Drawback: the CPU pipeline is hard if a hit takes 1 or 2 cycles
Used for instruction caches vs. data caches
[Timing: way hit = hit time; way-miss = hit time + an extra hit time; miss = miss penalty]
-
Way Prediction
To improve hit time, predict the way to preset the mux
A misprediction gives a longer hit time
Prediction accuracy:
> 90% for two-way
> 80% for four-way
I-cache has better accuracy than D-cache
First used on the MIPS R10000 in the mid-90s
Used on the ARM Cortex-A8
Extend to predict the block as well: "way selection"; increases the misprediction penalty
-
3. Increasing Cache Bandwidth by Pipelining
Pipeline cache access to improve bandwidth
Examples:
Pentium: 1 cycle
Pentium Pro - Pentium III: 2 cycles
Pentium 4 - Core i7: 4 cycles
Makes it easier to increase associativity
But, pipelining the cache increases the access latency
More clock cycles between the issue of the load and the use of the data
Also increases the branch misprediction penalty
-
4. Increasing Cache Bandwidth: Non-Blocking Caches
Non-blocking cache or lockup-free cache: allows the data cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multibank memories
"hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
"hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise it cannot be supported)
Pentium Pro allows 4 outstanding memory misses
-
Non-Blocking Cache Performance
[Figure: performance of non-blocking caches]
L2 must support this
In general, processors can hide the L1 miss penalty but not the L2 miss penalty
-
5. Increasing Cache Bandwidth via Multiple Banks
Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
E.g., the T1 (Niagara) L2 has 4 banks
Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
A simple mapping that works well is sequential interleaving:
Spread block addresses sequentially across banks
E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, ...
-
5. Increasing Cache Bandwidth via Multibanked Caches (cont.)
Organize the cache as independent banks to support simultaneous access (rather than as a single monolithic block)
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 and 8 banks for L2
Banking works best when the accesses naturally spread themselves across the banks
Interleave banks according to block address
-
6. Reduce Miss Penalty: Critical Word First and Early Restart
The processor usually needs one word of the block at a time
Do not wait for the full block to be loaded before restarting the processor
Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Early restart: as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution
Benefits of critical word first and early restart depend on:
Block size: generally useful only with large blocks
Likelihood of another access to the portion of the block that has not yet been fetched
Spatial locality problem: we tend to want the next sequential word, so it is not clear there is a benefit
-
7. Merging Write Buffer to Reduce Miss Penalty
The write buffer allows the processor to continue while waiting for the write to memory
If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry
Increases the block size of the write for a write-through cache with writes to sequential words/bytes, since multiword writes are more efficient to memory
-
Merging Write Buffer
When storing to a block that is already pending in the write buffer, update the write buffer
Reduces stalls due to a full write buffer
[Figure: write-buffer contents without and with write merging]
-
8. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software
Instructions:
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)
Data:
Loop interchange: swap nested loops to access data in the order stored in memory (in sequential order)
Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
Instead of accessing entire rows or columns, subdivide matrices into blocks
Requires more memory accesses but improves the locality of the accesses
-
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
-
Loop Fusion Example
Perform different computations on the common data in two loops ⇒ fuse the two loops

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves temporal locality
-
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };

Two inner loops:
Read all N×N elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity misses are a function of N & cache size:
2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
Idea: compute on a B×B submatrix that fits
-
Snapshot of x, y, z when N = 6, i = 1
[Figure: white = not yet touched, light = older access, dark = newer access. Before blocking.]
-
Blocking Example (cont.)

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };

B is called the blocking factor
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
Conflict misses too?
-
The Age of Accesses to x, y, z when B = 3
[Figure: note, in contrast to the previous figure, the smaller number of elements accessed]
-
9. Hardware Prefetching of Instructions and Data
Prefetching relies on having extra memory bandwidth that can be used without penalty
Instruction prefetching:
Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into the instruction stream buffer
Data prefetching:
The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
Prefetching is invoked on 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes
-
10. Reducing Miss Penalty or Miss Rate by Compiler-Controlled Prefetching of Data
A prefetch instruction is inserted before the data is needed
Data prefetch:
Register prefetch: load data into a register (HP PA-RISC loads)
Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
Special prefetching instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time
Is the cost of prefetch issues < the savings in reduced misses?
-
Summary
[Figure: summary table of the ten advanced cache optimizations]
-
Memory Technology
Performance metrics:
Latency is the concern of cache
Bandwidth is the concern of multiprocessors and I/O
Access time: time between the read request and when the desired word arrives
Cycle time: minimum time between unrelated requests to memory
DRAM used for main memory, SRAM used for cache
Memory Technology
-
Memory Technology (cont.)
SRAM: static random access memory
Requires low power to retain bits, since no refresh
But requires 6 transistors/bit (vs. 1 transistor/bit)
DRAM:
One transistor/bit
Must be rewritten after being read
Must also be periodically refreshed — every ~8 ms; each row can be refreshed simultaneously
Address lines are multiplexed:
Upper half of address: row access strobe (RAS)
Lower half of address: column access strobe (CAS)
-
DRAM Technology
Emphasis on cost per bit and capacity
Multiplex address lines ⇒ cuts the # of address pins in half
Row access strobe (RAS) first, then column access strobe (CAS)
Memory as a 2D matrix: rows go to a buffer; a subsequent CAS selects a sub-row
Uses only a single transistor to store a bit
Reading that bit can destroy the information
Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
Keep the refresh time to less than 5% of the total time
DRAM capacity is 4 to 8 times that of SRAM
-
DRAM Logical Organization (4 Mbit)
[Figure: 2,048 × 2,048 memory array; 11 address lines A0-A10 (square root of bits per RAS/CAS); the row select drives a word line to the storage cells; sense amps & I/O feed the column decoder, producing data D/Q]
-
DRAM Technology (cont.)
DIMM: dual inline memory module
DRAM chips are commonly sold on small boards called DIMMs
DIMMs typically contain 4 to 16 DRAMs
Slowing down in DRAM capacity growth:
Four times the capacity every three years, for more than 20 years
New chips only double capacity every two years, since 1998
DRAM performance is growing at a slower rate:
RAS (related to latency): 5% per year
CAS (related to bandwidth): 10%+ per year
-
RAS Improvement
[Figure: DRAM row-access-strobe improvement over time]
-
Quest for DRAM Performance
1. Fast page mode
Add timing signals that allow repeated accesses to the row buffer without another row access time
Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
Add a clock signal to the DRAM interface, so that the repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double Data Rate (DDR SDRAM)
Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts + offers higher clock rates: up to 400 MHz
DDR3 drops to 1.5 volts + higher clock rates: up to 800 MHz
DDR4 drops to 1.2 volts, clock rate up to 1600 MHz
Improved bandwidth, not latency
-
DRAM name based on peak chip transfers/sec; DIMM name based on peak DIMM MBytes/sec

Standard | Clock rate (MHz) | M transfers/s | DRAM name | MBytes/s/DIMM | DIMM name
DDR   | 133 | 266  | DDR266    | 2128  | PC2100
DDR   | 150 | 300  | DDR300    | 2400  | PC2400
DDR   | 200 | 400  | DDR400    | 3200  | PC3200
DDR2  | 266 | 533  | DDR2-533  | 4264  | PC4300
DDR2  | 333 | 667  | DDR2-667  | 5336  | PC5300
DDR2  | 400 | 800  | DDR2-800  | 6400  | PC6400
DDR3  | 533 | 1066 | DDR3-1066 | 8528  | PC8500
DDR3  | 666 | 1333 | DDR3-1333 | 10664 | PC10700
DDR3  | 800 | 1600 | DDR3-1600 | 12800 | PC12800

(M transfers/s = clock × 2; MBytes/s/DIMM = M transfers/s × 8.) Fastest for sale 4/06 ($125/GB)
-
DRAM Performance
[Figure: DRAM performance trends]
-
Graphics Memory
GDDR5 is graphics memory based on DDR3
Graphics memory:
Achieves 2-5X bandwidth per DRAM vs. DDR3
Wider interfaces (32 vs. 16 bit)
Higher clock rate — possible because they are attached via soldering instead of socketed DIMM modules
-
Memory Power Consumption
[Figure: memory power consumption]
-
SRAM Technology
Cache uses SRAM: Static Random Access Memory
SRAM uses six transistors per bit to prevent the information from being disturbed when read ⇒ no need to refresh
SRAM needs only minimal power to retain the charge in standby mode ⇒ good for embedded applications
No difference between access time and cycle time for SRAM
Emphasis on speed and capacity
SRAM address lines are not multiplexed
SRAM speed is 8 to 16× that of DRAM
-
ROM and Flash
Embedded processor memory
Read-only memory (ROM):
Programmed at the time of manufacture
Only a single transistor per bit to represent 1 or 0
Used for the embedded program and for constants
Non-volatile and indestructible
Flash memory:
Must be erased (in blocks) before being overwritten
Non-volatile, but allows the memory to be modified
Reads at almost DRAM speeds, but writes 10 to 100 times slower
DRAM capacity per chip and MB per dollar is about 4 to 8 times greater than flash
Cheaper than SDRAM, more expensive than disk
Slower than SRAM, faster than disk
-
Memory Dependability
Memory is susceptible to cosmic rays
Soft errors: dynamic errors
Detected and fixed by error correcting codes (ECC)
Hard errors: permanent errors
Use spare rows to replace defective rows
Chipkill: a RAID-like error recovery technique
-
Virtual Memory?
The limits of physical addressing:
All programs share one physical address space
Machine language programs must be aware of the machine organization
No way to prevent a program from accessing any machine resource
Recall: many processes use only a small portion of their address space
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
With virtual memory, the processor produces virtual addresses that are translated by a combination of HW and SW to physical addresses (called memory mapping or address translation)
-
Virtual Memory: Add a Layer of Indirection
[Figure: CPU (A0-A31, D0-D31) → Address Translation (virtual → physical addresses) → Memory (A0-A31, D0-D31)]
User programs run in a standardized virtual address space
Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
Hardware supports modern OS features: protection, translation, sharing
-
Virtual Memory
[Figure: virtual pages mapped to physical memory by a page table]
-
Virtual Memory (cont.)
Permits applications to grow bigger than the main memory size
Helps with multiple process management:
Each process gets its own chunk of memory
Permits protection of one process's chunks from another
Mapping of multiple chunks onto shared physical memory
Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
Application and CPU run in virtual space (logical memory, 0 - max)
Mapping onto physical space is invisible to the application
Cache vs. virtual memory:
Block becomes a page or segment
Miss becomes a page fault or address fault
-
3 Advantages of VM
Translation:
A program can be given a consistent view of memory, even though physical memory is scrambled
Makes multithreading reasonable (now used a lot!)
Only the most important part of the program (the working set) must be in physical memory
Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
Protection:
Different threads (or processes) are protected from each other
Different pages can be given special behavior (read-only, invisible to user programs, etc.)
Kernel data protected from user programs
Very important for protection from malicious programs
Sharing:
Can map the same physical page to multiple users (shared memory)
-
Virtual Memory
Protection via virtual memory: keeps processes in their own memory space
Role of architecture:
Provide user mode and supervisor mode
Protect certain aspects of CPU state
Provide mechanisms for switching between user mode and supervisor mode
Provide mechanisms to limit memory accesses
Provide a TLB to translate addresses
Virtual Memory and Virtual Machines
-
Page Tables Encode Virtual Address Spaces
A virtual address space is divided into blocks of memory called pages
A machine usually supports pages of a few sizes (MIPS R4000)
A valid page table entry codes the physical memory frame address for the page
A page table is indexed by a virtual address
The OS manages the page table for each ASID
[Figure: a virtual address indexes the page table, which maps pages to frames in the physical memory space]
-
Details of Page Table
The page table maps virtual page numbers to physical frames (PTE = Page Table Entry)
Virtual memory ⇒ treat main memory as a cache for disk
Virtual address = [virtual page no. | 12-bit offset]; physical address = [physical page no. | 12-bit offset]
The Page Table Base Register points to the page table, which is located in physical memory; the virtual page number indexes into the page table; each entry holds V (valid), access rights, and the PA
[Figure: virtual address → page table → physical address; frames in the physical memory space]
-
Page Table Entry (PTE)?
What is in a Page Table Entry (or PTE)?
Pointer to the next-level page table or to the actual page
Permission bits: valid, read-only, read-write, write-only
Example: Intel x86 architecture PTE:
Address in the same format as the previous slide (10, 10, 12-bit offset)
Intermediate page tables are called Directories
Layout (bits 31..0): Page Frame Number (physical page number, bits 31-12) | Free (OS, bits 11-9) | L | D | A | PCD | PWT | U | W | P
P: Present (same as the valid bit in other architectures)
W: Writeable
U: User accessible
PWT: Page write transparent: external cache write-through
PCD: Page cache disabled (page cannot be cached)
A: Accessed: page has been accessed recently
D: Dirty (PTE only): page has been modified recently
L: L=1 ⇒ 4 MB page (directory only); the bottom 22 bits of the virtual address serve as the offset
-
Cache vs. Virtual Memory
Replacement:
Cache miss handled by hardware
Page fault usually handled by the OS
Addresses:
Virtual memory space is determined by the address size of the CPU
Cache size is independent of the CPU address size
Lower-level memory:
For caches, the main memory is not shared by something else
For virtual memory, most of the disk contains the file system
The file system is addressed differently, usually in I/O space
Virtual memory's lower level is usually called SWAP space
-
The Same 4 Questions for Virtual Memory
Block placement:
Choice: lower miss rates and complex placement, or vice versa
The miss penalty is huge, so choose a low miss rate ⇒ place anywhere
Similar to the fully associative cache model
Block identification — both use an additional data structure:
Fixed-size pages: use a page table
Variable-sized segments: segment table
Block replacement — LRU is the best:
However, true LRU is a bit complex, so use an approximation
The page table contains a use tag, and on access the use tag is set
The OS checks them every so often, records what it sees in a data structure, then clears them all
On a miss the OS decides which has been used the least and replaces that one
Write strategy — always write back:
Due to the access time of the disk, write-through is silly
Use a dirty bit to write back only the pages that have been modified
-
Techniques for Fast Address Translation
The page table is kept in main memory (kernel memory)
Each process has a page table
Every data/instruction access requires two memory accesses:
One for the page table and one for the data/instruction
Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs)
If locality applies, then cache the recent translations
TLB = translation look-aside buffer
TLB entry: virtual page no., physical page no., protection bit, use bit, dirty bit
-
Translation Look-Aside Buffers (TLB)
A cache on translations: fully associative, set associative, or direct mapped
TLBs are:
Small — typically not more than 128-256 entries
Fully associative
[Figure: translation with a TLB — the CPU sends the VA to the TLB; on a TLB hit, the PA goes to the cache and, on a cache miss, to main memory; on a TLB miss, the translation is performed and then cached]
-
The TLB Caches Page Table Entries
V=0 pages either reside on disk or have not yet been allocated
The OS handles V=0 as a page fault
Physical and virtual pages must be the same size!
The TLB caches page table entries for an ASID
[Figure: a virtual address (page, offset) is looked up in the TLB, which holds (virtual page, physical frame) pairs cached from the page table, yielding a physical address (frame, offset)]
-
Caching Applied to Address Translation
[Figure: the CPU issues a virtual address; if cached in the TLB, the physical address goes straight to physical memory; otherwise the MMU translates and the result is cached in the TLB. Data reads/writes pass through untranslated]
-
Virtual Machines
Support isolation and security
Sharing a computer among many unrelated users
Enabled by the raw speed of processors, making the overhead more acceptable
Allows different ISAs and operating systems to be presented to user programs
System Virtual Machines:
SVM software is called a virtual machine monitor or hypervisor
Individual virtual machines run under the monitor are called guest VMs
-
Impact of VMs on Virtual Memory
Each guest OS maintains its own set of page tables
The VMM adds a level of memory between physical and virtual memory, called "real memory"
The VMM maintains a shadow page table that maps guest virtual addresses to physical addresses
Requires the VMM to detect the guest's changes to its own page table
Occurs naturally if accessing the page table pointer is a privileged operation