
Register Renaming • In ASIA-II: output and anti dependencies are ignored during scheduling


Application of Instruction Analysis/Synthesis Tools to x86's Functional Unit Allocation
Ing-Jer Huang and Ping-Huei Xie
Institute of Computer & Information Engineering, National Sun Yat-sen University
Kaohsiung, Taiwan 80441, R.O.C.
ijhuang@cie.nsysu.edu.tw

Superscalar Model under Investigation
• Decoupled superscalar architecture
– register renaming
– branch prediction
• Assumptions
– no cache misses
– fast instruction fetcher and decoder
– 100% correct branch prediction
– load/store unit: 2 cycles; others: 1 cycle
– large RS and ROB
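As a rough illustration of what these assumptions imply, the sketch below computes the earliest start cycle of each micro-operation when only true data dependencies and unit latencies constrain execution. The function name `asap_schedule`, the unit letters (A for integer, M for memory, B for branch) and the four sample micro-ops are assumptions made for illustration; they are not taken from the authors' tool, and the sketch ignores dispatch order and window size.

```python
# Latencies from the slide: the load/store (memory) unit takes 2 cycles,
# every other unit takes 1 cycle.
LATENCY = {"M": 2}
DEFAULT_LATENCY = 1


def latency(unit):
    return LATENCY.get(unit, DEFAULT_LATENCY)


def asap_schedule(units, deps):
    """Earliest start cycle per micro-op, with ops listed in program order.

    units[i] is the unit class of op i; deps[i] lists the earlier ops
    whose results op i reads (true data dependencies only).
    """
    start = []
    for i in range(len(units)):
        ready = max((start[d] + latency(units[d]) for d in deps[i]), default=0)
        start.append(ready)
    return start


# Hypothetical micro-op sequence: a load, two integer ops, one branch.
units = ["M", "A", "A", "B"]
deps = [[], [0], [], [1, 2]]
start = asap_schedule(units, deps)
cycles = max(s + latency(u) for s, u in zip(start, units))
print(start, cycles)  # [0, 2, 0, 3] 4
```

With no cache misses, perfect prediction and a large RS/ROB, nothing else delays an operation in this simplified view, which is why the start cycles depend only on the dependency chain.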

The Problem
Q: How many functional units are needed in an x86-compatible superscalar core?
A: The distribution of functional unit usage in typical x86 programs
[Figure: frequency of each FU usage pattern; patterns shown: 4A 2M 1B; 3A 0M 0B; 2A 2M 1B; 2A 1M 0B; 1A 1M 1B]
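One way to picture such a distribution is to count, for every cycle of a schedule, how many units of each class are busy, and then tally how often each combination occurs. The sketch below does this for a made-up schedule; the function name, the string encoding of a cycle and the sample data are assumptions for illustration only.

```python
from collections import Counter


def fu_usage_distribution(schedule):
    """Relative frequency of each (A, M, B) usage pattern.

    schedule is a list of cycles; each cycle is a string with one letter
    per micro-op issued in that cycle (A = integer, M = memory, B = branch).
    """
    if not schedule:
        return {}
    patterns = Counter()
    for cycle in schedule:
        used = Counter(cycle)
        patterns[(used["A"], used["M"], used["B"])] += 1
    total = sum(patterns.values())
    return {pattern: n / total for pattern, n in patterns.items()}


# Hypothetical four-cycle schedule.
schedule = ["AAMB", "AM", "AAMB", "A"]
dist = fu_usage_distribution(schedule)
for pattern, freq in sorted(dist.items(), reverse=True):
    print(pattern, freq)
```

Here the pattern (2, 1, 1), meaning 2A 1M 1B, accounts for half of the cycles.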

How to Obtain FU Distribution?
• Simulation-based approaches [Shinatani, 1995], [Davidson, 1995], [Hara et al., 1996], etc.
– Running on different CPU platforms
– Slow, but can explore many configurations
• Monitoring-based approaches [Adams et al., 1989], [Bhandarkar et al., 1997], [Huang, 1997], etc.
– Directly running on the same CPU platform
– Fast, but work only for the configuration of the underlying CPU platform

A Fast Performance/Cost Approximation Environment

ASIA: Automatic Synthesis of Instruction Set Architecture
• GOAL: analyze and synthesize application-specific instruction sets for pipelined uniprocessors.
• APPROACH: a micro-operation scheduling engine based on a simulated annealing algorithm.
The superscalar core is an application-specific RISC core for x86 emulation.

ASIA-II: Extensions for Superscalar Architecture
• Register renaming
– Temporary registers are used on the fly to resolve anti and output dependencies.
• Execution window
– Instructions are dispatched sequentially.
• Branch prediction
– Effective sizes of basic blocks are enlarged.
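The effect of register renaming can be sketched as follows: every result is written to a fresh temporary register, and later reads are redirected to the most recent temporary. The helper names `rename` and `false_deps`, the `t0, t1, ...` naming scheme and the four-operation example are assumptions for illustration, not the ASIA-II implementation.

```python
def rename(ops):
    """Give every destination a fresh temporary register.

    ops is a list of (dest, [sources]) in program order.
    """
    latest = {}  # architectural register -> temporary holding its newest value
    renamed = []
    for n, (dest, srcs) in enumerate(ops):
        new_srcs = [latest.get(s, s) for s in srcs]
        tmp = f"t{n}"
        latest[dest] = tmp
        renamed.append((tmp, new_srcs))
    return renamed


def false_deps(ops):
    """Count pairwise anti (WAR) and output (WAW) register-name conflicts."""
    war = waw = 0
    for j, (dest_j, _) in enumerate(ops):
        for dest_i, srcs_i in ops[:j]:
            war += dest_j in srcs_i   # later write to a register read earlier
            waw += dest_j == dest_i   # two writes to the same register
    return war, waw


# Hypothetical micro-ops: eax is reused, which creates false dependencies.
ops = [
    ("eax", ["ebx", "ecx"]),
    ("edx", ["eax"]),
    ("eax", ["esi", "edi"]),
    ("ebx", ["eax", "edx"]),
]
renamed = rename(ops)
print(false_deps(ops), false_deps(renamed))  # (2, 1) (0, 0)
```

After renaming, only the true read-after-write chains remain (for example, the second operation still reads `t0`), so a scheduler is free to reorder the rest.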

Realistic Patterns in the Execution Window
• Balanced distribution: the objective function includes both time steps and H/W counts.
• Window effect: MOPs are displaced by a limited distance per move; long-distance displacement is possible over many iterations, as long as performance is improved.
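A minimal sketch of a simulated-annealing scheduler with these two properties is shown below. It is not the ASIA code: the cost weights, the linear cooling schedule, the acceptance rule, and all function names are assumptions chosen for illustration. The cost adds the schedule length (time steps) to the peak number of units of each class issued in one step (H/W count), and each move shifts one micro-operation by at most `window` steps.

```python
import math
import random
from collections import Counter


def serial_schedule(latency):
    """Starting point: one op after another, in program order."""
    steps, t = [], 0
    for lat in latency:
        steps.append(t)
        t += lat
    return steps


def valid(steps, deps, latency):
    """An op may start only after all of its producers have finished."""
    if min(steps) < 0:
        return False
    return all(steps[i] >= steps[d] + latency[d]
               for i, ds in enumerate(deps) for d in ds)


def cost(steps, units, latency, w_time=1.0, w_hw=1.0):
    """Weighted sum of schedule length and peak per-class unit count."""
    length = max(s + lat for s, lat in zip(steps, latency))
    issued = Counter(zip(steps, units))
    peak = Counter()
    for (_, unit), n in issued.items():
        peak[unit] = max(peak[unit], n)
    return w_time * length + w_hw * sum(peak.values())


def anneal(units, deps, latency, window=2, iters=5000, t0=2.0, seed=0):
    rng = random.Random(seed)
    steps = serial_schedule(latency)
    cur = cost(steps, units, latency)
    best, best_cost = steps[:], cur
    for k in range(iters):
        temp = t0 * (1 - k / iters) + 1e-9      # linear cooling (assumed)
        cand = steps[:]
        i = rng.randrange(len(cand))
        cand[i] += rng.randint(-window, window)  # limited displacement
        if not valid(cand, deps, latency):
            continue
        c = cost(cand, units, latency)
        if c <= cur or rng.random() < math.exp((cur - c) / temp):
            steps, cur = cand, c
            if cur < best_cost:
                best, best_cost = steps[:], cur
    return best, best_cost


# Hypothetical block: a load, two integer ops, one branch (deps point backwards).
units = ["M", "A", "A", "B"]
deps = [[], [0], [], [1, 2]]
latency = [2, 1, 1, 1]
initial_cost = cost(serial_schedule(latency), units, latency)
best, best_cost = anneal(units, deps, latency)
print(best, best_cost, initial_cost)
```

Because a single move is bounded by `window`, an operation can only travel far through a sequence of accepted small moves, which mirrors the window effect described on the slide. The sketch assumes fully pipelined units, so only the issue step of an operation counts toward the H/W total.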

Basic Block Expansion (Eblocks) Due to Branch Prediction
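The idea of an extended basic block can be sketched as chaining basic blocks along the predicted path until a size limit is reached, so that the scheduler sees one larger block. The function `build_eblock`, its limits `max_ops` and `max_blocks`, and the sample blocks are hypothetical; how the original tool bounds Eblocks is not stated in the slides.

```python
def build_eblock(blocks, predicted_next, start, max_ops, max_blocks=8):
    """Concatenate basic blocks along the predicted path.

    blocks maps a block name to its list of micro-ops;
    predicted_next maps a block name to its predicted successor (or None).
    """
    visited, ops = [], []
    name = start
    while (name is not None
           and len(visited) < max_blocks
           and len(ops) + len(blocks[name]) <= max_ops):
        ops.extend(blocks[name])
        visited.append(name)
        name = predicted_next.get(name)
    return visited, ops


# Hypothetical control flow: B1 -> B2 -> B3 along the predicted path.
blocks = {
    "B1": ["mov", "cmp", "jne"],
    "B2": ["add", "jmp"],
    "B3": ["sub", "ret"],
}
predicted_next = {"B1": "B2", "B2": "B3", "B3": None}
visited, ops = build_eblock(blocks, predicted_next, "B1", max_ops=6)
print(visited, len(ops))  # ['B1', 'B2'] 5
```

Under the 100% correct prediction assumption, the chained blocks always execute together, so scheduling across their boundaries is safe in this simplified model.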

A Small Example from Word 97

Extended Basic Blocks

Scheduled Eblocks

Description of Benchmark

Micro-operation Level Parallelism (MLP)

Functional Unit Usage
• Notation: A - Integer unit, M - Memory unit, B - Branch unit, F - Floating-point unit
• "Others" is the sum of all usage patterns whose frequency is less than 1.0%.
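The "Others" bucket can be computed by folding every pattern below the 1.0% threshold into a single entry. The sketch below uses made-up frequencies purely to show the mechanics; the function name and the data are not from the presentation.

```python
def group_rare(dist, threshold=0.01):
    """Keep patterns at or above the threshold; sum the rest into 'Others'."""
    kept = {p: f for p, f in dist.items() if f >= threshold}
    others = sum(f for f in dist.values() if f < threshold)
    if others > 0:
        kept["Others"] = others
    return kept


# Made-up frequencies for illustration only.
dist = {
    "2A 1M 0B": 0.600,
    "1A 1M 1B": 0.385,
    "4A 2M 1B": 0.008,
    "3A 0M 0B": 0.007,
}
grouped = group_rare(dist)
for pattern, freq in grouped.items():
    print(f"{pattern}: {freq:.1%}")
```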

Accumulated Coverage of Functional Unit Allocation
[Figure labels: NSC 98, IA-64, AMD K6, Pentium Pro, Base Machine]
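One plausible reading of accumulated coverage, assumed here rather than confirmed by the slides, is the total frequency of all usage patterns that a given allocation of units can serve, that is, patterns needing no more units of any class than the allocation provides. The sketch below evaluates this for a few hypothetical allocations over made-up frequencies.

```python
def coverage(dist, alloc):
    """Total frequency of patterns that fit within the allocation.

    dist maps (A, M, B) usage patterns to frequencies;
    alloc is the (A, M, B) number of units provided.
    """
    return sum(freq for pattern, freq in dist.items()
               if all(need <= have for need, have in zip(pattern, alloc)))


# Made-up distribution for illustration only.
dist = {(2, 1, 0): 0.5, (1, 1, 1): 0.3, (4, 2, 1): 0.2}
for alloc in [(1, 1, 1), (2, 1, 1), (4, 2, 1)]:
    print(alloc, round(coverage(dist, alloc), 3))
```

In this toy data, growing the allocation from (1, 1, 1) to (2, 1, 1) raises coverage from 0.3 to 0.8, while the last 0.2 requires doubling the integer and memory units, which is the kind of cost/coverage trade-off the chart compares across machines.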

Conclusions
• Synthesis/analysis tools have been used to observe functional unit usage and MLP in a superscalar core.
• Speedup over simulation is more than 600 times.
• FUTURE WORK: investigate various microarchitecture features
– register renaming vs. branch prediction
– functional unit optimization