Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

  • Slides: 30
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
University of Utah & HP Labs

Large Caches
• Cache hierarchies will dominate chip area
• 3D stacked processors with an entire die for on-chip cache could be common
• Montecito has two private 12 MB L3 caches (27 MB including L2)
• Long global wires are required to transmit data/address
[Figure: Intel Montecito die with cache regions labeled]

Wire Delay/Power
• Wire delays are costly for performance and power
  - Latencies of 60 cycles to reach ends of a chip at 32 nm (@ 5 GHz)
  - 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
• CACTI* access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology
*version 4
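
To put the cycle counts on this slide into absolute time, the short sketch below converts them to nanoseconds at the quoted 5 GHz clock. The helper name `cycles_to_ns` is made up for illustration. The two figures come from different technology nodes (32 nm and 65 nm), so they are shown side by side only for scale.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds (cycles / GHz = ns)."""
    return cycles / clock_ghz


# Numbers taken from the slide; both are quoted at a 5 GHz clock.
cross_chip_ns = cycles_to_ns(60, 5.0)    # wire latency to reach the ends of a chip
cache_access_ns = cycles_to_ns(90, 5.0)  # CACTI 4 access time for a 24 MB cache

print(f"cross-chip wire latency: {cross_chip_ns:.1f} ns")
print(f"24 MB cache access time: {cache_access_ns:.1f} ns")
```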

Contribution
• Support for various interconnect models
  - Improved design space exploration
• Support for modeling Non-Uniform Cache Access (NUCA)

Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output driver, data output, valid output?]

Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays. Labels: wordline & bitline delay, decoder delay]
Decoder delay = H-tree delay + logic delay

Power/Delay Overhead of Wires
• H-tree delay increases with cache size
• H-tree power continues to dominate
• Bitlines are the other major contributor to total power

Motivation
• The dominant role of interconnect is clear
• Lack of a tool to model interconnect in detail can impede progress
• Current solutions have limited wire options
  - Orion, CACTI
    - Weak wire model
    - No support for modeling multi-megabyte caches

CACTI 6.0 Enhancements
• Incorporation of
  - Different wire models
  - Different router models
  - Grid topology for NUCA
  - Shared bus for UCA
  - Contention values for various cache configurations
• Methodology to compute optimal NUCA organization
• Improved interface that enables trade-off analysis
• Validation analysis

Full-swing Wires
[Figure: labels X, Y, Z]

Full-swing Wires II
[Figure: three different design points at 10%, 20%, and 30% delay penalty; axis: repeater size]
• Caveat: repeater sizing and spacing cannot always be controlled precisely

Full-Swing Wires
• Fast and simple
  - Delay proportional to sqrt(RC) as against RC
• High bandwidth
  - Can be pipelined
- Requires silicon area
- High energy
  - Quadratic dependence on voltage
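
One way to read the "fast" bullet above is that repeaters keep wire delay growing linearly with length instead of quadratically. The sketch below illustrates that with a textbook Elmore-style RC approximation. It is not the wire model used in CACTI 6.0, and every parameter value (per-mm resistance and capacitance, repeater drive resistance and input capacitance, segment length) is an assumption chosen only to make the numbers printable.

```python
def unrepeated_delay(length_mm, r_per_mm, c_per_mm):
    """Distributed-RC delay of a bare wire: 0.38 * R_wire * C_wire."""
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def repeated_delay(length_mm, r_per_mm, c_per_mm, seg_mm, r_drv, c_in):
    """Delay of the same wire cut into equal segments, each driven by a repeater."""
    n_seg = length_mm / seg_mm
    r_seg = r_per_mm * seg_mm
    c_seg = c_per_mm * seg_mm
    seg_delay = (0.69 * r_drv * (c_seg + c_in)   # repeater driving wire + next gate
                 + 0.38 * r_seg * c_seg          # distributed wire RC
                 + 0.69 * r_seg * c_in)          # wire resistance into next gate
    return n_seg * seg_delay


# Illustrative values only (assumptions, not taken from the slides).
R_PER_MM = 1000.0    # ohm per mm
C_PER_MM = 0.2e-12   # farad per mm
SEG_MM = 0.5         # repeater spacing
R_DRV = 200.0        # repeater output resistance, ohm
C_IN = 10e-15        # repeater input capacitance, farad

for length in (5, 10, 20):
    bare = unrepeated_delay(length, R_PER_MM, C_PER_MM)
    rep = repeated_delay(length, R_PER_MM, C_PER_MM, SEG_MM, R_DRV, C_IN)
    print(f"{length:>2} mm: unrepeated {bare * 1e9:.2f} ns, repeated {rep * 1e9:.2f} ns")
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one. The repeaters themselves are where the silicon area and energy costs listed on the slide come from.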

Low-swing wires
[Figure: differential wires. Labels: 400 mV, 50 mV raise, 50 mV drop]

Differential Low-swing
+ Very low-power, can be routed over other modules
- Relatively slow, low-bandwidth, high area requirement, requires special transmitter and receiver
• Bitlines are a form of low-swing wire
  - Optimized for speed and area as against power
  - Driver and pre-charger employ full Vdd voltage
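
The "very low-power" claim follows from the quadratic dependence of switching energy on voltage noted for full-swing wires. The sketch below is a rough illustration using E = C * V^2, assuming the wire is driven from a supply equal to its swing. It ignores the second wire of a differential pair and the transmitter and receiver overhead, and the capacitance and voltage values are assumptions, not numbers from the slides.

```python
def switching_energy_joules(capacitance_f: float, swing_v: float) -> float:
    """Energy to charge a wire through swing_v volts, assuming the driver
    supply equals the swing (E = C * V^2)."""
    return capacitance_f * swing_v ** 2


# Illustrative values only (assumptions).
WIRE_CAP_F = 1e-12   # 1 pF global wire
FULL_SWING_V = 1.0   # full supply voltage
LOW_SWING_V = 0.1    # reduced signal swing

full = switching_energy_joules(WIRE_CAP_F, FULL_SWING_V)
low = switching_energy_joules(WIRE_CAP_F, LOW_SWING_V)
print(f"full swing: {full:.2e} J per transition")
print(f"low swing:  {low:.2e} J per transition")
print(f"ratio:      {full / low:.1f}x")
```

Under these assumptions a tenfold reduction in swing gives roughly a hundredfold reduction in wire switching energy, which is why the slower low-swing option is worth having in the search space.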

Delay Characteristics
[Figure: delay plot; annotation: quadratic increase in delay]

Energy Characteristics

Search Space of CACTI-5
• Design space with global wires optimized for delay

Search Space of CACTI-6
[Figure: design space with global and low-swing wires. Labels: least delay, 30% delay penalty, low-swing]
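
A larger search space only helps if there is a way to pick from it. The sketch below shows one simple selection rule: among all candidate designs, keep those whose delay is within a chosen penalty of the fastest one, then take the lowest-power survivor. The `Design` record, the function name, and all delay and power numbers are hypothetical and do not come from CACTI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    delay_ns: float
    power_w: float


def lowest_power_within(designs, delay_penalty):
    """Return the lowest-power design whose delay is at most
    (1 + delay_penalty) times the best delay in the set."""
    best_delay = min(d.delay_ns for d in designs)
    limit = best_delay * (1.0 + delay_penalty)
    feasible = [d for d in designs if d.delay_ns <= limit]
    return min(feasible, key=lambda d: d.power_w)


# Made-up candidates, one per wire type named on the slides.
candidates = [
    Design("delay-optimal wires", 10.0, 3.0),
    Design("10% delay-penalty wires", 10.9, 2.4),
    Design("20% delay-penalty wires", 11.8, 2.0),
    Design("30% delay-penalty wires", 12.9, 1.7),
    Design("low-swing wires", 16.0, 0.9),
]

for penalty in (0.0, 0.1, 0.3, 1.0):
    pick = lowest_power_within(candidates, penalty)
    print(f"allowed penalty {penalty:.0%}: {pick.name}")
```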

CACTI – Another Limitation
• Access delay is equal to the delay of the slowest subarray
  - Very high hit time for large caches
  Potential solution – NUCA
  Extend CACTI to model NUCA
• Employs a separate bus for each cache bank for multi-banked caches
  - Not scalable
  Exploit different wire types and network design choices to improve the search space

Non-Uniform Cache Access (NUCA)*
• Large cache is broken into a number of small banks
• Employs on-chip network for communication
• Access delay ∝ distance between bank and cache controller
[Figure: CPU & L1 next to a grid of cache banks]
*(Kim et al., ASPLOS 02)
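
The distance-dependent latency can be sketched for a grid of banks as below. This is a minimal illustration, assuming the cache controller sits at one corner of the grid, every hop costs the same number of cycles, and there is no contention. The function name and the cycle counts are made up.

```python
def bank_latencies(rows, cols, bank_cycles, hop_cycles, ctrl=(0, 0)):
    """Return a rows x cols table of access latencies in cycles.

    Each entry is the bank access time plus the network time, where the
    network time is hop_cycles times the Manhattan distance from the
    controller position ctrl to that bank.
    """
    return [
        [bank_cycles + hop_cycles * (abs(r - ctrl[0]) + abs(c - ctrl[1]))
         for c in range(cols)]
        for r in range(rows)
    ]


# Illustrative 4x4 grid: 3-cycle banks, 2 cycles per network hop (assumptions).
table = bank_latencies(4, 4, bank_cycles=3, hop_cycles=2)
for row in table:
    print(" ".join(f"{cycles:>2}" for cycles in row))
```

The bank next to the controller answers in the bank access time alone, while the far corner pays for six hops on top of it.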

Extension to CACTI
• On-chip network
  - Wire model based on ITRS 2005 parameters
  - Grid network
  - 3-stage speculative router pipeline
• Network latency vs. bank access latency tradeoff
  - Iterate over different bank sizes
  - Calculate the average network delay based on the number of banks and bank sizes
  - Consider contention values for different cache configurations
• Similarly, we also consider power consumed for each organization
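
The iteration described on this slide can be sketched as a sweep over grid sizes: more banks mean smaller, faster banks but more network hops. The code below is a toy version of that loop, not CACTI's implementation. `bank_access_cycles` is a placeholder formula standing in for a detailed bank model, the 1-cycle link and zero contention are assumptions, and only the 3-cycle router reflects the 3-stage pipeline named on the slide.

```python
import math


def bank_access_cycles(bank_kb):
    """Placeholder bank model (assumption): latency grows with bank size."""
    return 2 + math.ceil(math.sqrt(bank_kb) / 8)


def average_latency(cache_kb, side, router_cycles=3, link_cycles=1,
                    contention_cycles=0.0):
    """Average access latency for a side x side grid of equal banks,
    with the controller at one corner of the grid."""
    bank_kb = cache_kb / (side * side)
    avg_hops = side - 1  # mean Manhattan distance from a corner of the grid
    network = avg_hops * (router_cycles + link_cycles)
    return bank_access_cycles(bank_kb) + network + contention_cycles


def best_grid(cache_kb, sides):
    """Return the grid side length with the lowest average latency."""
    return min(sides, key=lambda s: average_latency(cache_kb, s))


CACHE_KB = 32 * 1024  # 32 MB cache
for side in (1, 2, 4, 8):
    print(f"{side * side:>2} banks: {average_latency(CACHE_KB, side):.0f} cycles on average")
print("best grid side:", best_grid(CACHE_KB, (1, 2, 4, 8)))
```

A fuller version would pass a per-configuration contention value into `average_latency` and track power alongside delay, as the last two bullets of the slide suggest.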

Trade-off Analysis (32 MB Cache)
[Figure: 16-core CMP]

Effect of Core Count

Power Centric Design (32 MB Cache)

Validation
• HSPICE tool
• Predictive Technology Model (65 nm tech.)
• Analytical model that employs PTM parameters compared against HSPICE
• Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
  - Verified to be within 12%

Case Study: Heterogeneous D-NUCA
• Dynamic-NUCA
  - Reduces access time by dynamic data movement
  - Nearby banks are accessed more frequently
• Heterogeneous banks
• Nearby banks are made smaller and hence faster
• Accesses to nearby banks consume less power
• Other banks can be made larger and more power efficient
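
The reasoning behind heterogeneous banks is a weighted average: if most accesses hit nearby banks, speeding those up matters more than slowing the rest down. The sketch below shows the calculation only. The access percentages and latencies are invented for illustration and are not results from the study; with different numbers the comparison could go the other way.

```python
def expected_latency_cycles(groups):
    """Average latency over bank groups.

    groups is a list of (percent_of_accesses, latency_cycles) pairs whose
    percentages must sum to 100.
    """
    if sum(percent for percent, _ in groups) != 100:
        raise ValueError("access percentages must sum to 100")
    return sum(percent * cycles for percent, cycles in groups) / 100


# Invented numbers: (share of accesses in percent, latency in cycles).
uniform_banks = [(70, 12), (30, 30)]        # equal-sized near and far banks
heterogeneous_banks = [(60, 6), (40, 32)]   # small fast near banks, large far banks

print("uniform:      ", expected_latency_cycles(uniform_banks), "cycles")
print("heterogeneous:", expected_latency_cycles(heterogeneous_banks), "cycles")
```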

Access Frequency
• % of requests satisfied by x KB of cache

Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2]

Other Applications
• Exposing wire properties
  - Novel cache pipelining
    - Early lookup, aggressive lookup (ISCA 07)
  - Flit-reservation flow control (Peh et al., HPCA 00)
  - Novel topologies
    - Hybrid network (ISCA 07)

Conclusion
• Network parameters and contention play a critical role in deciding NUCA organization
• Wire choices have significant impact on cache properties
• CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/