Optimizing NUCA Organizations and Wiring Alternatives for Large
- Slides: 30
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
University of Utah & HP Labs
Large Caches
- Cache hierarchies will dominate chip area
- 3D stacked processors with an entire die for on-chip cache could be common
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
[Figure: Intel Montecito die with cache regions labeled]
Wire Delay/Power
- Wire delays are costly for performance and power
  - Latencies of 60 cycles to reach the ends of a chip at 32 nm (@ 5 GHz)
  - 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
- CACTI* access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology
* version 4
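The cycle counts on this slide can be turned into absolute time with a one-line conversion. The sketch below is only that arithmetic; the helper name `cycles_to_ns` is hypothetical and not part of CACTI.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds at a clock frequency given in GHz."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / freq_ghz


# Figures quoted on the slide, both at 5 GHz.
cross_chip_ns = cycles_to_ns(60, 5.0)    # reaching the ends of the chip at 32 nm
cacti4_24mb_ns = cycles_to_ns(90, 5.0)   # CACTI 4 access time for a 24 MB cache
```

At 5 GHz each cycle is 0.2 ns, so the two figures correspond to 12 ns and 18 ns respectively.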
Contribution
- Support for various interconnect models
  - Improved design space exploration
- Support for modeling Non-Uniform Cache Access (NUCA)
Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]
Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays. Labels: wordline & bitline delay, decoder delay]
- Decoder delay = H-tree delay + logic delay
Power/Delay Overhead of Wires
- H-tree delay increases with cache size
- H-tree power continues to dominate
- Bitlines are the other major contributor to total power
Motivation
- The dominant role of interconnect is clear
- The lack of a tool to model interconnect in detail can impede progress
- Current solutions have limited wire options
  - Orion, CACTI
    - Weak wire model
    - No support for modeling multi-megabyte caches
CACTI 6.0 Enhancements
- Incorporation of
  - Different wire models
  - Different router models
  - Grid topology for NUCA
  - Shared bus for UCA
  - Contention values for various cache configurations
- Methodology to compute optimal NUCA organization
- Improved interface that enables trade-off analysis
- Validation analysis
Full-swing Wires
[Figure: labels X, Y, Z]
Full-swing Wires II
[Figure: three different design points at 10%, 20%, and 30% delay penalty; axis: repeater size]
- Caveat: repeater sizing and spacing cannot always be controlled precisely
Full-Swing Wires
+ Fast and simple
  - Delay proportional to sqrt(RC) rather than RC
+ High bandwidth
  - Can be pipelined
- Requires silicon area
- High energy
  - Quadratic dependence on voltage
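The scaling claims on this slide can be illustrated with textbook first-order wire models. This is a minimal sketch under stated assumptions, not CACTI's wire model: the function names, the 0.38 distributed-RC coefficient, the omitted constant factor in the repeated case, and every numeric value are illustrative.

```python
import math


def unrepeated_delay(r_per_mm, c_per_mm, length_mm):
    """First-order delay of a distributed RC wire without repeaters.

    Delay is proportional to total R times total C, so it grows with length squared.
    """
    return 0.38 * r_per_mm * c_per_mm * length_mm ** 2


def repeated_delay(r_per_mm, c_per_mm, length_mm, gate_tau):
    """Scaling of an optimally repeated wire (constant factor omitted).

    Delay grows linearly with length and with sqrt(r * c) per unit length.
    gate_tau is the intrinsic RC time constant of a repeater (assumed value).
    """
    return length_mm * math.sqrt(r_per_mm * c_per_mm * gate_tau)


def switching_energy(c_total, vdd):
    """Dynamic energy of a full-swing transition, up to an activity factor."""
    return c_total * vdd ** 2


# Illustrative numbers only: ohm/mm, F/mm, seconds.
r, c, tau = 1000.0, 0.2e-12, 10e-12
for length in (1.0, 2.0, 4.0):
    print(length, unrepeated_delay(r, c, length), repeated_delay(r, c, length, tau))
```

Doubling the wire length quadruples the unrepeated delay but only doubles the repeated delay, while halving the supply voltage cuts the full-swing switching energy to a quarter.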
Low-swing Wires
[Figure: differential wires around 400 mV, one with a 50 mV rise and the other with a 50 mV drop]
Differential Low-swing
+ Very low power, can be routed over other modules
- Relatively slow, low bandwidth, high area requirement, requires a special transmitter and receiver
- Bitlines are a form of low-swing wire
  - Optimized for speed and area rather than power
  - Driver and pre-charger employ the full Vdd voltage
Delay Characteristics
[Figure annotation: quadratic increase in delay]
Energy Characteristics
Search Space of CACTI-5
- Design space with global wires optimized for delay
Search Space of CACTI-6
- Design space with global and low-swing wires
[Figure: design points labeled least delay, 30% delay penalty, and low-swing]
CACTI - Another Limitation
- Access delay is equal to the delay of the slowest subarray
  - Very high hit time for large caches
  - Potential solution: NUCA. Extend CACTI to model NUCA
- Employs a separate bus for each cache bank for multi-banked caches
  - Not scalable
  - Exploit different wire types and network design choices to improve the search space
Non-Uniform Cache Access (NUCA)*
- Large cache is broken into a number of small banks
- Employs an on-chip network for communication
- Access delay ∝ distance between bank and cache controller
[Figure: CPU & L1 next to a grid of cache banks]
* (Kim et al., ASPLOS 02)
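The distance dependence can be sketched with a simple hop-count model on a grid of banks. Everything below is an assumption made for illustration: the controller position, the per-hop cost, and the function names are hypothetical and do not come from the slides or from CACTI.

```python
def bank_latency(row, col, ctrl_col, bank_cycles, hop_cycles):
    """Round-trip latency to one bank in a grid NUCA cache (toy model).

    Assumes the cache controller attaches to the router at (0, ctrl_col) and
    that requests and replies each traverse the Manhattan distance once.
    """
    hops = row + abs(col - ctrl_col)
    return bank_cycles + 2 * hops * hop_cycles


def latency_map(rows, cols, ctrl_col, bank_cycles, hop_cycles):
    """Latency of every bank, as a list of rows."""
    return [
        [bank_latency(r, c, ctrl_col, bank_cycles, hop_cycles) for c in range(cols)]
        for r in range(rows)
    ]


# Illustrative 4x4 grid: 3-cycle banks, 4 cycles per hop (router plus link).
grid = latency_map(4, 4, ctrl_col=0, bank_cycles=3, hop_cycles=4)
for row in grid:
    print(row)
```

In this toy grid the bank next to the controller answers in 3 cycles while the far corner takes 51, which is the non-uniformity that gives NUCA its name.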
Extension to CACTI
- On-chip network
  - Wire model based on ITRS 2005 parameters
  - Grid network
  - 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off
  - Iterate over different bank sizes
  - Calculate the average network delay based on the number of banks and bank sizes
  - Consider contention values for different cache configurations
- Similarly, we also consider the power consumed for each organization
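The search over bank sizes described above can be sketched as a small loop. This is a minimal illustration under loud assumptions, not CACTI 6.0's methodology: the bank and link latency functions are toy stand-ins, the controller is assumed to sit at a grid corner, and all names (`bank_access_cycles`, `avg_network_cycles`, `best_organization`) are hypothetical.

```python
import math


def bank_access_cycles(bank_kb):
    """Toy stand-in for a per-bank delay model: larger banks are slower."""
    return 2 + math.ceil(bank_kb / 256)


def link_cycles(bank_kb):
    """Toy link delay: links get longer as banks grow physically larger."""
    return max(1, math.ceil(math.sqrt(bank_kb) / 32))


def avg_network_cycles(n_banks, bank_kb, router_cycles=3):
    """Average round-trip network delay on a square grid (assumed layout).

    With the controller at a corner, the mean Manhattan distance to a bank
    in a side x side grid is side - 1 hops.
    """
    side = math.isqrt(n_banks)
    avg_hops = side - 1
    return 2 * avg_hops * (router_cycles + link_cycles(bank_kb))


def best_organization(cache_kb, bank_counts, contention=None):
    """Return (total_cycles, n_banks, bank_kb) with the lowest average latency."""
    contention = contention or {}  # optional per-configuration contention cycles
    candidates = []
    for n in bank_counts:
        bank_kb = cache_kb // n
        total = (bank_access_cycles(bank_kb)
                 + avg_network_cycles(n, bank_kb)
                 + contention.get(n, 0))
        candidates.append((total, n, bank_kb))
    return min(candidates)


# 32 MB cache, square bank counts only.
print(best_organization(32 * 1024, [4, 16, 64, 256]))
```

With these toy functions the minimum falls at an intermediate bank count: few large banks pay in bank access time, many small banks pay in network hops. Passing a `contention` dictionary shifts the optimum in the same way the slide's contention values would.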
Trade-off Analysis (32 MB Cache), 16-core CMP
Effect of Core Count
Power Centric Design (32 MB Cache)
Validation
- HSPICE tool
- Predictive Technology Model (PTM), 65 nm technology
- Analytical model that employs PTM parameters compared against HSPICE
- Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
  - Verified to be within 12%
Case Study: Heterogeneous D-NUCA
- Dynamic-NUCA
  - Reduces access time by dynamic data movement
  - Nearby banks are accessed more frequently
- Heterogeneous banks
  - Nearby banks are made smaller and hence faster
  - Access to nearby banks consumes less power
  - Other banks can be made larger and more power efficient
Access Frequency
- % of requests satisfied by x KB of cache
Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2]
Other Applications
- Exposing wire properties
  - Novel cache pipelining
    - Early lookup, aggressive lookup (ISCA 07)
  - Flit-reservation flow control (Peh et al., HPCA 00)
  - Novel topologies
    - Hybrid network (ISCA 07)
Conclusion
- Network parameters and contention play a critical role in deciding NUCA organization
- Wire choices have a significant impact on cache properties
- CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/