CMP Design Choices: Finding Parameters that Impact CMP Performance
Sam Koblenski and Peter McClone

Outline
• Introduction
• Assumptions
• Plackett & Burman Analysis
  ◦ Simulation methods
  ◦ Statistical Design
  ◦ Plackett & Burman Results
• Mean Value Analysis
  ◦ MVA Implementation
  ◦ MVA Results
  ◦ AMVA Implementation
  ◦ AMVA Results
• Complementary Results
• Conclusions

Introduction
• Two-part study
  ◦ The design space is huge; how can we reduce it?
• Method 1
  ◦ Plackett & Burman (PB) analysis finds the critical parameters
  ◦ The design uses extreme values of the parameters
  ◦ Detailed architecture design can then focus on a few parameters

Introduction (cont.)
• Method 2
  ◦ Mean Value Analysis model of a CMP
  ◦ Simply designed to compute throughput
  ◦ Design choices can be narrowed down quickly
  ◦ Intuition is gained and patterns/parameter relationships are identified

Assumptions - PB Design
• In-order approximated as OoO with a small window
• Die size = 300 mm² (16 MB cache @ 65 nm)
• L2 cache size expanded to fill the die
  ◦ Discrete sizes: 4, 8, 12 MB
  ◦ Associativity can be non-power-of-2
• Core size measured in Cache Byte Equivalents (CBE):
Pipeline        Width    CBE
In-Order        1        50 kB
In-Order        4        100 kB
Out-of-Order    1        75 kB
Out-of-Order    4        250 kB
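
The area accounting on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the 16 MB budget, the CBE values, and the discrete L2 sizes come from the slide, but the function names, the 1 MB = 1024 kB conversion, and the rule "pick the largest discrete L2 size that still fits" are illustrative assumptions, not the authors' tooling. L1 cache area is ignored.

```python
# Cache Byte Equivalents (kB) per core, taken from the table on the slide.
CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}

DIE_BUDGET_KB = 16 * 1024   # the die holds the equivalent of 16 MB of cache
L2_SIZES_MB = (4, 8, 12)    # discrete L2 sizes from the slide


def remaining_cache_kb(n_cores, organization, width):
    """Cache-equivalent area (kB) left after placing n_cores cores."""
    return DIE_BUDGET_KB - n_cores * CBE_KB[(organization, width)]


def l2_size_mb(n_cores, organization, width):
    """Largest discrete L2 size that fits (assumed rule); None if none fits."""
    left_kb = remaining_cache_kb(n_cores, organization, width)
    fitting = [s for s in L2_SIZES_MB if s * 1024 <= left_kb]
    return max(fitting) if fitting else None
```

Under these assumptions, `remaining_cache_kb(16, "out-of-order", 4)` leaves 12384 kB, so even sixteen of the largest cores still leave room for the 12 MB L2.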

Simulation Methodology
• Simics with Ruby & Opal
• 16P sims used cache warmup files
• 2P sims ran for more transactions
• Attempted OLTP and JBB benchmarks
Benchmark    Processors    Transactions
OLTP         2             200
OLTP         16            100
JBB          2             20000
JBB          16            10000

Plackett & Burman Design
• Motivation
  ◦ Narrow a huge design space
  ◦ Minimize simulation runs (experiments)
• Preliminaries
  ◦ Performance measure
  ◦ Extreme parameter values
  ◦ Number of parameters (N ≤ 4X − 1 for a design with 4X runs)

PB Design Example
[Table: PB design matrix for parameters A to G with a "Time" column per run; the +/- entries are not recoverable from the extracted text.]
Run times: 9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 31, 19, 33, 6, 112
Effects: A = 191, B = 19, C = 111, D = -13, E = 79, F = 55, G = 239
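
The per-parameter values in this example can be reproduced with a short script. This is a minimal sketch, assuming the standard 8-run PB matrix (cyclic shifts of the row + + + - + - - plus an all-minus run) with foldover. The sign matrix and one of the sixteen run times (the second 6) did not survive in the slide text and are assumptions here, chosen because they reproduce the seven values listed on the slide. Function names are illustrative.

```python
def pb_matrix_8():
    """8-run Plackett-Burman design for up to 7 two-level parameters."""
    row = [+1, +1, +1, -1, +1, -1, -1]   # standard generator row for 8 runs
    rows = []
    for _ in range(7):
        rows.append(row)
        row = [row[-1]] + row[:-1]       # circular right shift
    rows.append([-1] * 7)                # final run: every parameter low
    return rows


def with_foldover(rows):
    """Append the sign-flipped copy of every run (foldover)."""
    return rows + [[-s for s in r] for r in rows]


def effects(matrix, times):
    """Effect of each parameter: sum over runs of sign * measured time."""
    return [sum(r[j] * t for r, t in zip(matrix, times))
            for j in range(len(matrix[0]))]


# Run times from the example; the 11th value (6) is an assumption.
TIMES = [9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 6, 31, 19, 33, 6, 112]

design = with_foldover(pb_matrix_8())
result = dict(zip("ABCDEFG", effects(design, TIMES)))
# Only the magnitude of an effect matters when ranking parameters.
ranking = sorted(result, key=lambda p: abs(result[p]), reverse=True)
print(result)
print(ranking)
```

With these inputs the ranking starts with G, A, and C, which would be the parameters to carry into detailed simulation.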

PB Design Parameter Values
Parameter                Low Value (-)     High Value (+)
Number of Cores          2                 16
Pipeline Organization    In-Order          Out-of-Order
Pipeline Width           1                 4
L1 Cache Size            16 kB             128 kB
L1 Associativity         Direct Mapped     32-Way
L2 Cache Size            Die Area − Core Area
L2 Associativity         Direct Mapped     32-Way
L2 Banks                 2                 32
L2 Latency               50 Cycles         12 Cycles
L2 Directory Latency     25 Cycles         6 Cycles
Pin Bandwidth            400               10000
Memory Latency           300 Cycles        100 Cycles

PB Results
• Extreme values stressed the simulator
• Have not yet completed an entire set of runs
• It may be necessary to build a custom L2 network for each run

PB Results for JBB


Assumptions - MVA
• The distribution of time between memory requests is exponential
• Processor cores exhibit the same average behavior with respect to their service times and miss rates
• Doubling the size of the cache multiplies the miss rate by 1/√2
• An in-order core takes approximately the same area as 50 kB of cache
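
The cache-size assumption corresponds to the common square-root rule of thumb, in which the miss rate is proportional to 1/√size. A minimal sketch, with a hypothetical function name and parameter names:

```python
def scaled_miss_rate(base_miss_rate, base_size_kb, new_size_kb):
    """Scale a miss rate with cache size.

    Each doubling of the cache multiplies the miss rate by 1/sqrt(2),
    i.e. miss rate is proportional to size ** -0.5.
    """
    return base_miss_rate * (new_size_kb / base_size_kb) ** -0.5
```

Quadrupling a cache therefore halves the modeled miss rate: `scaled_miss_rate(0.08, 1024, 4096)` gives 0.04.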

MVA Design
• Simple closed model: [diagram]

MVA Design
• Two phases of this model design
  ◦ First: use the exact MVA equations
    ▪ Use the average time between memory accesses as an application parameter
    ▪ Solve for throughput
  ◦ Second: use Approximate MVA (AMVA)
    ▪ Use an iterative method to converge on the service time
    ▪ Solve for throughput

Exact MVA
• To solve the MVA equations, we determine the mean residence time at each service center:
  ◦ $R_p$: processor/L1 residence time
  ◦ $R_{L2}$: L2 residence time
  ◦ $R_M$: memory residence time
• The case with one core is trivial. Use this case to solve for additional cores:
  ◦ $R_{n,p} = D_p \, (1 + Q_{n-1,p})$, where $D_p$ is the service demand at the processor and $Q_{n-1,p}$ is the mean queue length there with $n-1$ cores
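
The recursion on this slide is the standard exact MVA algorithm for a closed network of queueing centers. The sketch below assumes one customer per core, no think time by default, and hypothetical service demands; the center names and numbers are illustrative, not the study's measured values.

```python
def exact_mva(demands, n_customers, think_time=0.0):
    """Exact MVA for a closed, single-class network of queueing centers.

    demands: dict mapping center name -> service demand D_k
    Returns (throughput, residence times, queue lengths) at n_customers.
    """
    queue = {k: 0.0 for k in demands}            # Q_{0,k} = 0
    throughput, residence = 0.0, dict(queue)
    for n in range(1, n_customers + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        residence = {k: d * (1.0 + queue[k]) for k, d in demands.items()}
        # X_n = n / (Z + sum_k R_{n,k})
        throughput = n / (think_time + sum(residence.values()))
        # Little's law per center: Q_{n,k} = X_n * R_{n,k}
        queue = {k: throughput * r for k, r in residence.items()}
    return throughput, residence, queue


# Hypothetical demands (cycles per memory request) for the three centers.
demands = {"processor_L1": 20.0, "L2": 12.0, "memory": 30.0}
x, r, q = exact_mva(demands, n_customers=4)
print(x, r, q)
```

With a single core the queues are empty, so each residence time equals its service demand and throughput is simply 1 / (sum of demands); every additional core builds on the queue lengths of the previous step.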

Exact MVA Results
• Throughput was calculated using data from simulation runs
  ◦ Miss rates, number of memory requests
• Results are erratic
• Not consistent with simulation results
• The source of the problem is most likely the processor service time!

Approximate MVA Design
• An iterative method can be used to converge on a service time
  ◦ Uses total R as an input parameter
• The iterative method works well with approximate MVA
• The goal is to match the total average residence time of a memory request
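
One common form of approximate MVA is the Schweitzer fixed-point iteration, which replaces the recursion over the number of customers with the estimate Q(n-1) ≈ ((n-1)/n) · Q(n) at each center. The sketch below shows that inner iteration only. The slides do not state which approximation was used, and the outer search that adjusts the service time until the total residence time matches the measured R is left out. Names, tolerance, and iteration limit are assumptions.

```python
def schweitzer_amva(demands, n_customers, think_time=0.0,
                    tol=1e-9, max_iter=10_000):
    """Approximate MVA (Schweitzer) for a closed single-class network.

    demands: dict mapping center name -> service demand D_k
    Returns (throughput, residence times, converged flag).
    """
    if n_customers < 1:
        raise ValueError("need at least one customer")
    # Initial guess: customers spread evenly over the centers.
    queue = {k: n_customers / len(demands) for k in demands}
    scale = (n_customers - 1) / n_customers
    for _ in range(max_iter):
        # R_k = D_k * (1 + ((n-1)/n) * Q_k)
        residence = {k: d * (1.0 + scale * queue[k])
                     for k, d in demands.items()}
        throughput = n_customers / (think_time + sum(residence.values()))
        new_queue = {k: throughput * r for k, r in residence.items()}
        delta = max(abs(new_queue[k] - queue[k]) for k in demands)
        queue = new_queue
        if delta < tol:
            return throughput, residence, True
    return throughput, residence, False
```

The total residence time of a request is `sum(residence.values())`, which is the quantity to compare against the measured R. The returned flag makes a failure to converge explicit instead of silently returning the last iterate.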

Approximate MVA Results
• Convergence using the AMVA equations does not always occur
• Total measured residence time cannot be reached with this model and parameter set
• Variation of input values without convergence implies flaws in the model structure
• There is a complex relationship between the memory system and the rate at which a core issues requests that must be modeled

Complementary Results
• The initial goal was to produce PB results to find the parameters to focus on for the MVA model
• Results from both approaches could cross-verify correctness

Conclusions
• Simics has a STEEP learning curve
  ◦ <5 weeks is not enough time for valid/any results
• Refinement of a PB design leads to long lead times on valid results
• CMPs complicate the relationship between cores and the memory subsystem
• Design methodologies that focus simulation runs are necessary
• More results and conclusions to follow

Questions?