
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
University of Utah & HP Labs

Large Caches
- Cache hierarchies will dominate chip area
- 3D stacked processors with an entire die for on-chip cache could be common
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
[Figure: Intel Montecito, with cache regions labeled]

Wire Delay/Power
- Wire delays are costly for performance and power
  - Latencies of 60 cycles to reach ends of a chip at 32 nm (@ 5 GHz)
  - 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
- CACTI* access time for 24 MB cache is 90 cycles @ 5 GHz, 65 nm tech
*version 4
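
As a quick sanity check on the figures above, a cycle count converts to wall-clock time by dividing by the clock frequency. The sketch below is only arithmetic on the numbers quoted on this slide; the helper name `cycles_to_ns` is ours and is not part of CACTI.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at the given clock (GHz)."""
    return cycles / freq_ghz

# Numbers quoted on the slide, both at a 5 GHz clock.
cross_chip_ns = cycles_to_ns(60, 5.0)  # reaching the ends of a chip at 32 nm
cache_24mb_ns = cycles_to_ns(90, 5.0)  # CACTI 4 access time for a 24 MB cache at 65 nm

print(f"cross-chip wire latency: {cross_chip_ns} ns")
print(f"24 MB cache access time: {cache_24mb_ns} ns")
```

In other words, the 60-cycle and 90-cycle figures correspond to 12 ns and 18 ns respectively.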

Contribution
- Support for various interconnect models
  - Improved design space exploration
- Support for modeling Non-Uniform Cache Access (NUCA)

Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]

Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays. Labels: wordline & bitline delay, decoder delay]
Decoder delay = H-tree delay + logic delay

Power/Delay Overhead of Wires
- H-tree delay increases with cache size
- H-tree power continues to dominate
- Bitlines are the other major contributor to total power

Motivation
- The dominant role of interconnect is clear
- Lack of a tool to model interconnect in detail can impede progress
- Current solutions have limited wire options
  - Orion, CACTI
    - Weak wire model
    - No support for modeling multi-megabyte caches

CACTI 6.0 Enhancements
- Incorporation of
  - Different wire models
  - Different router models
  - Grid topology for NUCA
  - Shared bus for UCA
  - Contention values for various cache configurations
- Methodology to compute optimal NUCA organization
- Improved interface that enables trade-off analysis
- Validation analysis

Full-swing Wires
[Figure with labels X, Y, Z]

Full-swing Wires II
[Figure: three different design points at 10%, 20%, and 30% delay penalty; axis label: repeater size]
- Caveat: repeater sizing and spacing cannot be controlled precisely all the time

Full-Swing Wires
- Fast and simple
  - Delay proportional to sqrt(RC) as against RC
- High bandwidth
  - Can be pipelined
- (-) Requires silicon area
- (-) High energy
  - Quadratic dependence on voltage
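
The delay claim above can be illustrated with a textbook Elmore-style model: a bare wire has a delay that grows with the square of its length, while a wire cut into repeated segments has a delay that grows roughly linearly, with a coefficient that scales as the square root of the wire and repeater RC products. The sketch below is not CACTI's wire model. The 0.38 and 0.69 factors are the standard distributed and lumped RC delay coefficients, and every parameter value is an assumption picked for illustration.

```python
# Illustrative parameters only: assumptions for this sketch, not CACTI or ITRS data.
R_WIRE = 1.0e3    # wire resistance, ohm per mm
C_WIRE = 0.2e-12  # wire capacitance, farad per mm
R_DRV = 1.0e3     # repeater output resistance, ohm
C_GATE = 10e-15   # repeater input capacitance, farad


def unrepeated_delay(length_mm: float) -> float:
    """Distributed-RC delay of a bare wire; grows with length squared."""
    return 0.38 * R_WIRE * C_WIRE * length_mm ** 2


def repeated_delay(length_mm: float, n_segments: int) -> float:
    """Delay of a wire cut into n equal segments, each driven by a repeater."""
    seg = length_mm / n_segments
    per_segment = (
        0.69 * R_DRV * (C_WIRE * seg + C_GATE)  # repeater charging wire and next gate
        + 0.38 * R_WIRE * C_WIRE * seg ** 2     # distributed RC of the segment
        + 0.69 * R_WIRE * seg * C_GATE          # wire resistance charging next gate
    )
    return n_segments * per_segment


def best_repeated_delay(length_mm: float, max_segments: int = 128) -> float:
    """Smallest delay over all segment counts from 1 to max_segments."""
    return min(repeated_delay(length_mm, n) for n in range(1, max_segments + 1))


for length in (10.0, 20.0):
    print(length, unrepeated_delay(length), best_repeated_delay(length))
```

With these assumed values, doubling the wire length quadruples the unrepeated delay but only about doubles the best repeated delay. The repeaters that buy this linear scaling are also the source of the area and energy costs listed above.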

Low-swing Wires
[Figure: differential wires. Labels: 400 mV, 50 mV raise, 50 mV drop]

Differential Low-swing
- (+) Very low-power, can be routed over other modules
- (-) Relatively slow, low-bandwidth, high area requirement, requires special transmitter and receiver
- Bitlines are a form of low-swing wire
  - Optimized for speed and area as against power
  - Driver and pre-charger employ full Vdd voltage
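
To see why reducing the swing saves so much energy, recall that the energy drawn from a supply to raise a capacitance C by a swing V_swing is C * V_swing * V_supply, which becomes C * Vdd^2 for a full-swing wire. The sketch below applies that relation. The capacitance, the 1.0 V Vdd, and the use of 400 mV as the low-swing supply are assumptions made for illustration, and transmitter and receiver overheads are ignored.

```python
def switching_energy(cap_farads: float, v_swing: float, v_supply: float) -> float:
    """Energy drawn from the supply to raise a capacitance by v_swing volts."""
    return cap_farads * v_swing * v_supply


C_LINE = 1e-12  # assumed wire capacitance (1 pF), illustrative only
VDD = 1.0       # assumed full-swing supply voltage, illustrative only

full_swing = switching_energy(C_LINE, VDD, VDD)
half_vdd = switching_energy(C_LINE, VDD / 2, VDD / 2)  # full swing at half the supply
# Assumption: a 50 mV swing supplied from a 400 mV rail (voltages as labeled on
# the low-swing figure; how they map onto the actual circuit is our guess).
low_swing = switching_energy(C_LINE, 0.05, 0.4)

print(f"full swing        : {full_swing:.2e} J")
print(f"full swing, Vdd/2 : {half_vdd:.2e} J")
print(f"low swing         : {low_swing:.2e} J")
```

Halving the supply of a full-swing wire cuts the energy to a quarter, which is the quadratic dependence noted for full-swing wires. The low-swing case is far lower still under these assumptions, at the cost of the slower, more specialized signaling listed above.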

Delay Characteristics
[Figure annotation: quadratic increase in delay]

Energy Characteristics
[Figure]

Search Space of CACTI-5
- Design space with global wires optimized for delay

Search Space of CACTI-6
[Figure: design space with global and low-swing wires. Labeled points: Least Delay, 30% Delay Penalty, Low-swing]
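
One way to read a richer search space is as a constrained choice: among all candidate designs, take the lowest-energy one whose delay stays within an allowed penalty of the fastest. The sketch below shows that selection rule only. It is not CACTI's actual optimization code, and the `Design` type, the function name, and all the numbers are made up for illustration.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    delay_ns: float
    energy_nj: float


def pick_design(candidates: List[Design], max_delay_penalty: float) -> Design:
    """Lowest-energy design whose delay is within the penalty of the fastest one."""
    best_delay = min(d.delay_ns for d in candidates)
    limit = best_delay * (1.0 + max_delay_penalty)
    feasible = [d for d in candidates if d.delay_ns <= limit]
    return min(feasible, key=lambda d: d.energy_nj)


# Made-up design points, for illustration only (not CACTI output).
candidates = [
    Design("least-delay wires", 10.0, 9.0),
    Design("10% penalty wires", 11.0, 7.0),
    Design("20% penalty wires", 12.0, 6.0),
    Design("30% penalty wires", 13.0, 5.5),
    Design("low-swing wires", 30.0, 2.0),
]

print(pick_design(candidates, 0.0).name)   # no slack: the fastest design
print(pick_design(candidates, 0.25).name)  # up to 25% slower than the fastest
print(pick_design(candidates, 2.5).name)   # very relaxed delay budget
```

A search space that only contains delay-optimized wires would offer just the first entry, so no amount of delay slack could be traded for energy.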

CACTI - Another Limitation
- Access delay is equal to the delay of slowest subarray
  - Very high hit time for large caches
  Potential solution: NUCA. Extend CACTI to model NUCA
- Employs a separate bus for each cache bank for multi-banked caches
  - Not scalable
  Exploit different wire types and network design choices to improve the search space

Non-Uniform Cache Access (NUCA)*
- Large cache is broken into a number of small banks
- Employs on-chip network for communication
- Access delay is proportional to the distance between bank and cache controller
[Figure: CPU & L1 next to a grid of cache banks]
*(Kim et al., ASPLOS 02)
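
The distance dependence can be made concrete with a small model. The sketch below assumes a square grid of identical banks, a cache controller attached at one corner, Manhattan-distance routing, and no contention. The function names and all latency values are hypothetical and are not taken from CACTI.

```python
def bank_latency(row: int, col: int, bank_access: int, hop_latency: int) -> int:
    """Round-trip latency in cycles to the bank at (row, col); controller at (0, 0)."""
    hops = row + col  # Manhattan distance on the grid
    return 2 * hops * hop_latency + bank_access


def nuca_latencies(side: int, bank_access: int, hop_latency: int) -> list:
    """Latency of every bank in a side x side grid, as a flat list."""
    return [
        bank_latency(r, c, bank_access, hop_latency)
        for r in range(side)
        for c in range(side)
    ]


# Hypothetical values: 4x4 grid, 3-cycle bank access, 4 cycles per network hop.
latencies = nuca_latencies(side=4, bank_access=3, hop_latency=4)
nearest = min(latencies)
farthest = max(latencies)
average = sum(latencies) / len(latencies)
print(nearest, farthest, average)
```

A uniform-access cache would charge every request something like the `farthest` figure, whereas NUCA lets requests to nearby banks finish much earlier, so the average sits well below the worst case.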

Extension to CACTI
- On-chip network
  - Wire model based on ITRS 2005 parameters
  - Grid network
  - 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off
  - Iterate over different bank sizes
  - Calculate the average network delay based on the number of banks and bank sizes
  - Consider contention values for different cache configurations
- Similarly, we also consider power consumed for each organization
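
The iteration described above can be sketched as a simple sweep: for each candidate bank count, add the bank access time, the average network delay, and a contention term, then keep the minimum. This is a minimal sketch of the idea, not CACTI 6.0's implementation. Only the 3-stage router comes from the slide. The bank delay model, the link latency, and the contention function are hypothetical stand-ins.

```python
import math

ROUTER_CYCLES = 3  # 3-stage router pipeline, as stated on the slide
LINK_CYCLES = 1    # assumed link traversal latency per hop


def bank_access_cycles(bank_kb: float) -> float:
    """Hypothetical stand-in for a per-bank model: delay grows with sqrt(capacity)."""
    return 2.0 + 0.05 * math.sqrt(bank_kb)


def avg_network_cycles(n_banks: int) -> float:
    """Average round-trip network delay on a square grid, controller at a corner."""
    side = math.isqrt(n_banks)  # n_banks is assumed to be a perfect square
    avg_hops = side - 1         # mean Manhattan distance from the corner
    return 2 * avg_hops * (ROUTER_CYCLES + LINK_CYCLES)


def contention_cycles(n_banks: int) -> float:
    """Hypothetical placeholder: fewer banks means more queuing per bank."""
    return 64.0 / n_banks


def sweep(total_kb: float, bank_counts: list) -> tuple:
    """Return (best bank count, {bank count: estimated average access cycles})."""
    results = {}
    for n in bank_counts:
        bank_kb = total_kb / n
        results[n] = (
            bank_access_cycles(bank_kb)
            + avg_network_cycles(n)
            + contention_cycles(n)
        )
    best = min(results, key=results.get)
    return best, results


best, results = sweep(total_kb=32 * 1024, bank_counts=[1, 4, 16, 64])
for n, cycles in results.items():
    print(f"{n:3d} banks: {cycles:6.2f} cycles")
print("best bank count:", best)
```

With these made-up models the minimum falls at an intermediate bank count: a single large bank is slow and congested, while many small banks pay too much network latency. The same loop could accumulate a power estimate per organization, as the last bullet suggests.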

Trade-off Analysis (32 MB Cache)
[Figure: 16-core CMP]

Effect of Core Count
[Figure]

Power Centric Design (32 MB Cache)
[Figure]

Validation
- HSPICE tool
- Predictive Technology Model (65 nm tech.)
- Analytical model that employs PTM parameters compared against HSPICE
- Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
  - Verified to be within 12%

Case Study: Heterogeneous D-NUCA
- Dynamic-NUCA
  - Reduces access time by dynamic data movement
  - Nearby banks are accessed more frequently
- Heterogeneous banks
  - Nearby banks are made smaller and hence faster
  - Accesses to nearby banks consume less power
  - Other banks can be made larger and more power efficient

Access Frequency
- Percentage of requests satisfied by x KB of cache

Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2]

Other Applications
- Exposing wire properties
  - Novel cache pipelining
    - Early lookup, aggressive lookup (ISCA 07)
  - Flit-reservation flow control (Peh et al., HPCA 00)
  - Novel topologies
    - Hybrid network (ISCA 07)

Conclusion
- Network parameters and contention play a critical role in deciding NUCA organization
- Wire choices have significant impact on cache properties
- CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/