Methodologies for Performance Simulation of Superscalar OOO processors

Methodologies for Performance Simulation of Super-scalar OOO Processors
Srinivas Neginhal, Anantharaman Kalyanaraman
CprE 585: Survey Project

Introduction Modeling Processor Design Simulation Performance Study

Architectural Simulators
- Explore the design space
- Evaluate existing hardware, or predict performance of proposed hardware
- Designer has control
- Functional simulators: model the architecture (programmer's focus), e.g., sim-fast, sim-safe
- Performance simulators: model the microarchitecture (designer's focus), e.g., cycle-by-cycle (sim-outoforder)

Simulation Issues
- Real applications take too long for cycle-by-cycle simulation
- Vast design space:
  - Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc.
  - Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc.
- Find design flaws and provide design improvements
- Need a “robust” simulation methodology

Two Methodologies
- HLS
  - Hybrid: statistical + symbolic
  - REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, 71-82, 2000.
- BBDA
  - Basic block distribution analysis
  - REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.

HLS: An Overview
- A hybrid processor simulator
- [Diagram: Statistical Model and Symbolic Execution combine in HLS, which produces performance contours spanned by design space parameters]
- What can be achieved? Explore design changes in architectures and compilers that would be impractical to simulate using conventional simulators

HLS: Main Idea
- Application code goes through statistical profiling, which collects:
  - Machine-independent characteristics: basic block size, dynamic instruction distance, instruction mix
  - Architecture metrics: cache behavior, branch prediction accuracy
- Synthetically generated code (instruction stream, data stream)
- Structural simulation of FUs and issue pipeline units

Statistical Code Generation
- Each “synthetic instruction” contains the following parameters, based on the statistical profile:
  - Functional unit requirements
  - Dynamic instruction distances
  - Cache behavior
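
To make the three per-instruction parameters concrete, here is a minimal Python sketch of drawing synthetic instructions from a statistical profile. The profile layout, the field names, and the numbers are all assumptions made for illustration; they are not taken from the HLS implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticInstruction:
    functional_unit: str  # functional unit the instruction needs
    dep_distance: int     # dynamic instruction distance to its producer
    l1_hit: bool          # whether its memory access hits in the L1 cache


def generate_stream(profile, count, seed=0):
    """Draw `count` synthetic instructions from a (hypothetical) profile."""
    rng = random.Random(seed)
    units, unit_weights = zip(*profile["unit_mix"].items())
    dists, dist_weights = zip(*profile["did_histogram"].items())
    stream = []
    for _ in range(count):
        stream.append(SyntheticInstruction(
            functional_unit=rng.choices(units, unit_weights)[0],
            dep_distance=rng.choices(dists, dist_weights)[0],
            l1_hit=rng.random() < profile["l1_hit_rate"],
        ))
    return stream


# Illustrative numbers only; not measured from any benchmark.
example_profile = {
    "unit_mix": {"int_alu": 0.6, "load_store": 0.3, "fp": 0.1},
    "did_histogram": {1: 0.5, 2: 0.3, 8: 0.2},
    "l1_hit_rate": 0.95,
}
stream = generate_stream(example_profile, count=1000)
```

A structural simulator would then consume `stream` in place of real application code; that part is not sketched here.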

Validation of HLS against SimpleScalar
- For varying combinations of design parameters:
  - Run original benchmark code on SimpleScalar (use sim-outoforder)
  - Run statistically generated code on HLS
  - Compare SimpleScalar IPC vs. HLS IPC

Validation: Single- and Multi-value Correlations
- IPC vs. L1-cache hit rate
- For SPECint95: HLS errors are within 5-7% of the cycle-by-cycle results

Validation: L1 Instruction Cache Miss Penalty vs. Hit Rate
- Correlation suggests that the cache hit rate should be at least 88% to dominate

HLS: Code Properties
- Basic block size vs. L1-cache hit rate
- Correlation suggests that increasing block size helps only when the L1 cache hit rate is >96% or <82%

HLS: Code Properties
- Dynamic instruction distance (DID) vs. basic block size
- Correlation suggests that moderate DID values suffice for IPC, and that large basic block sizes (>8) do not help without an increase in DID

HLS: Value Prediction
- GOAL: break true dependencies
- DID vs. value predictability
- Stall penalty for mispredict vs. value prediction knowledge

HLS: More Multi-value Correlations
- L1-cache hit rate vs. value predictability
- DID vs. superscalar issue width

HLS: Discussion
- Low error rate only on the SPECint95 benchmark suite
- High error rates on the SPECfp95 and STREAM benchmarks (findings by R. H. Bell et al., 2004)
- Reason: instruction-level granularity for workload
- Recommended improvement: basic block-level granularity

Goals
- The end of the initialization
- The period of the program
- Ideal place to simulate given a specific number of instructions one has to simulate
- Accurate confidence estimation of the simulation point
<Note> Revamp this slide.

Program Behavior
- Program behavior has ramifications for architectural techniques.
- Program behavior is different in different parts of execution:
  - Initialization
  - Cyclic behavior (periodic)

Basic Block Distribution Analysis
- Each basic block gets executed a certain number of times.
- The number of times each basic block executes gives a fingerprint.
- Use the fingerprints to find representative areas to simulate.
<Note> How does fingerprinting help?

Cyclic Behavior of Programs
- Cyclic behavior is not representative of all programs.
- It is the common case for compute-bound applications.
- The SPEC 95 wave program executes 7 billion instructions before it reaches the code that accounts for the bulk of execution.

Basic Block Vectors
- Fast profiling to determine the number of times a basic block executes.
- The behavior of the program is directly related to the code that it is executing.
- Profiling gives a basic block fingerprint for that particular interval of time.
- The full execution of the program and the interval we choose spend proportionally the same amount of time in the same code.
- Collected in intervals of 100 million instructions.

Basic Block Vector - BBV
- A BBV is a one-dimensional array.
- There is an element for each basic block in the program.
- Each element is the count of how many times a given basic block was entered during an interval.
- Varying-size intervals: a BBV collected over an interval of N times 100 million instructions is a BBV of duration N.

Basic Block Vectors
- BBV is normalized
  - Each element divided by the sum of all elements.
- Target BBV
  - BBV for the entire execution of the program.
- Objective
  - Find a BBV of small duration similar to the target BBV.
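
Normalization and the target BBV can be sketched in a few lines of Python. This assumes each BBV is a plain list of entry counts indexed by basic block; the function names are chosen here for illustration.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    if total == 0:
        return [0.0] * len(bbv)
    return [count / total for count in bbv]


def target_bbv(interval_bbvs):
    """Normalized BBV for the entire execution.

    Assumes the whole-program counts are the element-wise sum of the
    per-interval counts.
    """
    summed = [sum(column) for column in zip(*interval_bbvs)]
    return normalize(summed)


# Two intervals over a program with two basic blocks (made-up counts).
intervals = [[1, 0], [1, 2]]
target = target_bbv(intervals)
```

After normalization every BBV sums to 1, so BBVs of different durations can be compared directly.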

Basic Block Vector Difference
- Difference between BBVs
  - Element-wise subtraction, sum of absolute values.
  - A number between 0 and 2.
  - Manhattan and Euclidean distance.
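
The two distances can be written directly from their definitions. The sketch below assumes both BBVs are already normalized lists of equal length; with that assumption the Manhattan distance falls between 0 (identical) and 2 (no basic block in common).

```python
import math


def manhattan(a, b):
    """Element-wise subtraction, then the sum of absolute values."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


same = manhattan([0.5, 0.5], [0.5, 0.5])      # identical BBVs
disjoint = manhattan([1.0, 0.0], [0.0, 1.0])  # no shared basic blocks
```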

Basic Block Difference Graph
- Plot of how well each individual sample in the program compares to the target BBV.
- For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV.
- Used to:
  - Find the initialization phase
  - Find the period of the program
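
A minimal sketch of how the Y values of such a graph could be computed, assuming raw per-interval BBVs as lists of counts and the Manhattan distance as the difference measure. The helper names are illustrative.

```python
def normalize(bbv):
    total = sum(bbv)
    return [count / total for count in bbv] if total else [0.0] * len(bbv)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def difference_graph(interval_bbvs):
    """One difference value per interval: distance to the target BBV."""
    # Target BBV: whole-program counts, assumed to be the sum over intervals.
    target = normalize([sum(column) for column in zip(*interval_bbvs)])
    return [manhattan(normalize(bbv), target) for bbv in interval_bbvs]


# Three intervals, two basic blocks (made-up counts).
bbdg = difference_graph([[4, 0], [0, 4], [2, 2]])
```

Here the third interval matches the whole-program mix exactly, so its difference is 0, while the first two intervals each differ by 1.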

Basic Block Difference Graph
- Diagram and explain

Initialization
- Initialization is not trivial.
- It is important to simulate representative sections of code.
- Detecting the end of the initialization phase is important.
- Initialization Difference Graph
  - Initial Representative Signal (IRS): the first quarter of the BB difference graph (BBDG).
  - Slide it across the BB difference graph.
  - The difference is calculated at each point for the first half of the BBDG.
  - When the IRS reaches the end of the initialization stage on the BB difference graph, the difference is maximized.
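
One plain reading of the sliding procedure, as a Python sketch. It assumes the BBDG is a list of difference values, that the "difference" at each offset is the sum of absolute differences between the IRS and the window under it, and that ties go to the earliest offset. The published method may differ in these details.

```python
def end_of_initialization(bbdg):
    """Offset (in intervals) where the IRS differs most from the BBDG."""
    n = len(bbdg)
    if n < 4:
        raise ValueError("need at least 4 intervals")
    irs = bbdg[: n // 4]  # Initial Representative Signal: first quarter

    def diff_at(offset):
        window = bbdg[offset: offset + len(irs)]
        return sum(abs(a - b) for a, b in zip(irs, window))

    # Slide the IRS over the first half of the graph only.
    return max(range(n // 2), key=diff_at)


# Made-up graph: two "initialization" intervals, then steady behavior.
example = [1.0, 1.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
init_end = end_of_initialization(example)
```

In this example the difference first peaks at offset 2, the first interval after the initialization-like prefix.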

Initialization
- Diagram and explain

Period
- Period Difference Graph
  - Period Representative Signal: part of the BBDG, starting from the end of initialization to one quarter of the length of program execution.
  - Slide it across half the BBDG.
  - The distance between the minimum Y-axis points is the period.
- Using larger durations of a BBV creates a BBDG that emphasizes the larger periods.
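
The period search can be sketched the same way. This version assumes the representative signal is a quarter of the BBDG long, starting at the end of initialization, and it returns the smallest non-zero shift with the lowest difference. For a cleanly periodic graph that equals the distance between neighboring minima; noisy data would need real local-minimum detection, which is not attempted here.

```python
def estimate_period(bbdg, init_end=0):
    """Smallest non-zero shift at which the signal best matches itself."""
    n = len(bbdg)
    signal = bbdg[init_end: init_end + n // 4]  # representative signal
    max_shift = min(n // 2, n - init_end - len(signal))
    if not signal or max_shift < 1:
        raise ValueError("BBDG too short for this init_end")

    def diff_at(shift):
        start = init_end + shift
        window = bbdg[start: start + len(signal)]
        return sum(abs(a - b) for a, b in zip(signal, window))

    return min(range(1, max_shift + 1), key=diff_at)


# Made-up graph that repeats every 3 intervals.
periodic = [0.1, 0.5, 0.9] * 8
period = estimate_period(periodic)
```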

Period
- Diagram and explain

Method
- SimpleScalar modified:
  - Output and clear statistics counters every 100 million instructions committed.
  - Graphed data: IPC, % RUU occupancy, cache miss rate, etc.
- To get the most representative sample of a program, at least one full period must be simulated.

Results

Basic Block Similarity Matrix
- A phase of program behavior can be defined as all similar sections of execution, regardless of temporal adjacency.
- Similarity matrix
  - Upper-triangular N x N matrix, where N is the number of intervals in the program execution.
  - An entry at (x, y) in the matrix represents the Manhattan distance between the BBV at x and the BBV at y.
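
A small sketch of building such a matrix, assuming the interval BBVs are already normalized lists. Entries below the diagonal are left as `None` here simply to show that only the upper triangle is filled; that representation is a choice made for this example.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_matrix(bbvs):
    """Upper-triangular N x N matrix of pairwise Manhattan distances."""
    n = len(bbvs)
    matrix = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):
            matrix[x][y] = manhattan(bbvs[x], bbvs[y])
    return matrix


# Intervals 0 and 2 execute the same code; interval 1 is different.
matrix = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

The zero at (0, 2) marks two intervals that are similar even though they are not adjacent in time, which is the sense of "phase" used on this slide.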

Basic Block Similarity Matrix
- IMAGE and explain the image.

Finding Basic Block Similarity
- Many intervals of execution are similar to each other.
- It makes sense to group them together.
- Analogous to clustering.

Clustering
- The goal is to divide a set of points into groups such that points within each group are similar to one another by some metric.
- This problem arises in other fields such as computer vision, genomics, etc.
- Two types of clustering algorithms exist:
  - Partitioning
    - Choose an initial solution, then iteratively update to find a better solution
    - Linear time complexity
  - Hierarchical
    - Divisive or agglomerative
    - Quadratic time complexity

Phase Finding Algorithm
- Generate BBVs with a duration of 1.
- Reduce the dimension of the BBVs to 15.
- Apply the clustering algorithm to the BBVs.
- Score the clustering and choose the most suitable.

Random Projection
- Curse of dimensionality
- BBV dimensions
  - Number of executed basic blocks.
  - Could grow to millions.
- Dimension selection
- Dimension reduction
  - Random linear projection.
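
A random linear projection multiplies each BBV by a fixed random matrix. The sketch below uses 15 target dimensions to match the phase-finding slide; drawing the matrix entries uniformly from [-1, 1] and the function names are assumptions made for this example.

```python
import random


def make_projection(num_blocks, target_dims=15, seed=0):
    """Random num_blocks x target_dims matrix, entries uniform in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(target_dims)]
            for _ in range(num_blocks)]


def project(bbv, projection):
    """Multiply a BBV (row vector) by the projection matrix."""
    target_dims = len(projection[0])
    return [sum(value * row[j] for value, row in zip(bbv, projection))
            for j in range(target_dims)]


# A program with 1000 basic blocks, all equally weighted (made-up data).
projection = make_projection(num_blocks=1000)
reduced = project([1.0 / 1000] * 1000, projection)
```

The same matrix must be reused for every BBV of a program so that distances between the projected vectors remain comparable.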

Clustering Algorithm
- K-means algorithm
  - Iterative optimizing algorithm.
  - Two repetitive phases that converge.
WORK IN PROGRESS
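
The two alternating phases of k-means are assignment (each point goes to its nearest center) and update (each center moves to the mean of its points). The sketch below is a textbook version in plain Python. Seeding the centers with the first k points is a simplification chosen here to keep the example deterministic, not something taken from the slides.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100):
    """Return (labels, centers) after the two phases stop changing labels."""
    centers = [list(p) for p in points[:k]]  # simplification: first k points
    labels = None
    for _ in range(max_iters):
        # Phase 1: assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Phase 2: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Four made-up 2-D points forming two obvious groups.
labels, centers = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

In the phase-finding setting the points would be the dimension-reduced BBVs, and each resulting cluster would correspond to one candidate phase.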