Application of Instruction Analysis/Synthesis Tools to x86's Functional Unit Allocation
Ing-Jer Huang and Ping-Huei Xie
Institute of Computer & Information Engineering
National Sun Yat-sen University
Kaohsiung, Taiwan 80441, R.O.C.
ijhuang@cie.nsysu.edu.tw

Superscalar Model under Investigation
• Decoupled superscalar architecture
  – register renaming
  – branch prediction
• Assumptions
  – no cache misses
  – fast instruction fetcher and decoder
  – 100% branch prediction accuracy
  – load/store unit: 2 cycles; others: 1 cycle
  – large reservation stations (RS) and reorder buffer (ROB)
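The latency assumptions above can be illustrated with a minimal sketch. It computes the earliest issue cycle of each micro-operation when functional units are unlimited, so only true dependencies and latencies matter. The function name `asap_schedule` and the input encoding are hypothetical, chosen only for illustration; they are not part of the authors' tools.

```python
# Latencies taken from the slide's assumptions: load/store 2 cycles, others 1.
LATENCY = {"M": 2}  # any other unit class defaults to 1 cycle


def asap_schedule(ops):
    """Return the earliest issue cycle of each op (hypothetical helper).

    ops: list of (unit_class, [indices of earlier ops it depends on]).
    Functional units are assumed unlimited, so only dependencies delay an op.
    """
    issue = []
    for unit, deps in ops:
        # An op may issue once every producer has finished executing.
        ready = max(
            (issue[d] + LATENCY.get(ops[d][0], 1) for d in deps),
            default=0,
        )
        issue.append(ready)
    return issue


# A load, an integer op using the loaded value, an independent integer op,
# and a branch that depends on the first integer op.
ops = [("M", []), ("A", [0]), ("A", []), ("B", [1])]
print(asap_schedule(ops))  # [0, 2, 0, 3]
```

The integer op waits two cycles for the load, while the independent integer op issues immediately.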

The Problem
Q: How many functional units are needed in an x86-compatible superscalar core?
A: The distribution of functional unit usage in typical x86 programs
[Table: FU Usage vs. Frequency. Listed usage patterns: 4A, 2M, 1B; 3A, 0M, 0B; 2A, 2M, 1B; 2A, 1M, 0B; 1A, 1M, 1B]
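As a rough illustration of what such a distribution is, the sketch below counts how often each (A, M, B) usage pattern occurs across the cycles of a schedule. The input format (one list of unit classes per cycle) and the name `fu_usage_distribution` are assumptions made for this example, not the authors' data format.

```python
from collections import Counter


def fu_usage_distribution(cycles):
    """Map each (A, M, B) usage pattern to its relative frequency.

    cycles: list of per-cycle lists of unit classes, e.g. ["A", "A", "M"].
    This encoding is hypothetical and only serves the illustration.
    """
    if not cycles:
        return {}
    patterns = Counter()
    for units in cycles:
        used = Counter(units)
        patterns[(used["A"], used["M"], used["B"])] += 1
    total = sum(patterns.values())
    return {pattern: count / total for pattern, count in patterns.items()}


schedule = [["A", "A", "M"], ["A", "M", "B"], ["A", "A", "M"], ["A", "A", "A"]]
for pattern, freq in sorted(fu_usage_distribution(schedule).items()):
    print(pattern, freq)
```

For this toy schedule the pattern 2A, 1M, 0B accounts for half of the cycles.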

How to Obtain FU Distribution?
• Simulation-based approaches [Shinatani, 1995], [Davidson, 1995], [Hara et al., 1996], etc.
  – Run on a different CPU platform from the one being modeled
  – Slow, but can explore many configurations
• Monitoring-based approaches [Adams et al., 1989], [Bhandarkar et al., 1997], [Huang, 1997], etc.
  – Run directly on the same CPU platform
  – Fast, but work only for the configuration of the underlying CPU platform

A Fast Performance/Cost Approximation Environment

ASIA: Automatic Synthesis of Instruction Set Architecture
• GOAL: analyze and synthesize application-specific instruction sets for pipelined uniprocessors.
• APPROACH: a micro-operation scheduling engine based on a simulated annealing algorithm.
The superscalar core is treated as an application-specific RISC core for x86 emulation.

ASIA-II: Extensions for Superscalar Architecture
• Register renaming
  – Temporary registers are used on the fly to resolve anti and output dependencies.
• Execution window
  – Instructions are dispatched sequentially.
• Branch prediction
  – Effective sizes of basic blocks are enlarged.

Register Renaming
• In ASIA-II: output and anti dependencies are ignored during scheduling.
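To make the effect of ignoring output and anti dependencies concrete, here is a small sketch that builds the dependency edges of a micro-operation sequence with and without renaming. It is not ASIA-II's implementation; the function `dependencies` and the (destination, sources) encoding are hypothetical.

```python
def dependencies(ops, renaming=True):
    """Return dependency edges (earlier, later, kind) for a sequence of ops.

    ops: list of (dest_register, [source_registers]) in program order.
    With renaming=True only true (RAW) dependencies are kept, mirroring the
    idea of ignoring output (WAW) and anti (WAR) dependencies.
    """
    edges = set()
    last_writer = {}     # register -> index of its most recent writer
    readers_since = {}   # register -> ops that read it since that write
    for j, (dst, srcs) in enumerate(ops):
        for reg in srcs:
            if reg in last_writer:
                edges.add((last_writer[reg], j, "RAW"))
        if not renaming:
            for i in readers_since.get(dst, []):
                edges.add((i, j, "WAR"))
            if dst in last_writer:
                edges.add((last_writer[dst], j, "WAW"))
        for reg in srcs:
            readers_since.setdefault(reg, []).append(j)
        last_writer[dst] = j
        readers_since[dst] = []
    return edges


ops = [("eax", ["ebx"]), ("ecx", ["eax"]), ("eax", ["edx"])]
print(sorted(dependencies(ops, renaming=False)))
print(sorted(dependencies(ops, renaming=True)))
```

Without renaming, the third op must wait for the first two because it overwrites `eax`; with renaming only the true dependency between the first and second op remains, so the third op can be scheduled freely.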

Realistic Patterns in the Execution Window
• Balanced distribution: the objective function includes both time steps and H/W counts.
• Window effect: MOPs are displaced by a limited distance per move; a long-distance displacement is possible through many iterations, as long as performance improves.
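The two ideas above can be sketched as a toy simulated-annealing scheduler. This is only an illustration under simplifying assumptions (unit latency for every micro-operation, a linear cooling schedule, and an objective equal to schedule length plus peak functional-unit counts); the names `cost`, `legal`, and `anneal` and all parameter values are hypothetical and do not describe the actual ASIA-II engine.

```python
import math
import random


def cost(steps, units, hw_weight=1.0):
    """Objective: number of time steps plus weighted peak unit counts."""
    peak = {}
    per_step = {}
    for step, unit in zip(steps, units):
        key = (step, unit)
        per_step[key] = per_step.get(key, 0) + 1
        peak[unit] = max(peak.get(unit, 0), per_step[key])
    return (max(steps) + 1) + hw_weight * sum(peak.values())


def legal(steps, deps):
    """Every producer must be scheduled strictly before its consumer."""
    return all(steps[a] < steps[b] for a, b in deps)


def anneal(units, deps, window=1, iters=2000, t0=2.0, seed=0):
    """Displace one MOP by at most `window` steps per move (window effect)."""
    rng = random.Random(seed)
    steps = list(range(len(units)))  # sequential start; legal if a < b in deps
    best = steps[:]
    for i in range(iters):
        temp = t0 * (1 - i / iters) + 1e-9
        k = rng.randrange(len(units))
        cand = steps[:]
        cand[k] = max(0, cand[k] + rng.randint(-window, window))
        if not legal(cand, deps):
            continue
        delta = cost(cand, units) - cost(steps, units)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            steps = cand
            if cost(steps, units) < cost(best, units):
                best = steps[:]
    return best


units = ["M", "A", "A", "B"]
deps = [(0, 1), (1, 3)]
schedule = anneal(units, deps)
print(schedule, cost(schedule, units))
```

Because each move shifts a MOP by at most `window` steps, an operation travels far only through a chain of accepted moves, and including the unit counts in the objective discourages schedules that pile many operations of one class into a single step.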

Basic Block Expansion (Eblocks) Due to Branch Prediction

A Small Example from Word 97

Extended Basic Blocks

Scheduled Eblocks

Description of Benchmark

Micro-operation Level Parallelism (MLP)

Functional Unit Usage
• Notation: A - Integer unit, M - Memory unit, B - Branch unit, F - Floating-point unit
• "Others" is the sum of the usage patterns whose frequency is less than 1.0%.
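The "Others" bucket can be sketched as a simple fold over a usage distribution. The function name `fold_rare` and the sample frequencies are made up for illustration and are not measurements from the benchmarks.

```python
def fold_rare(dist, threshold=0.01):
    """Merge every pattern rarer than `threshold` into a single "Others" entry.

    dist: mapping from usage pattern to relative frequency (hypothetical data).
    """
    kept = {p: f for p, f in dist.items() if f >= threshold}
    others = sum(f for f in dist.values() if f < threshold)
    if others > 0:
        kept["Others"] = others
    return kept


# Illustrative numbers only.
dist = {"2A,1M,0B": 0.6, "1A,1M,1B": 0.385, "4A,2M,1B": 0.008, "0A,0M,1B": 0.007}
print(fold_rare(dist))
```

Here the two patterns below 1.0% are reported together, so the table stays short while the frequencies still sum to one.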

Accumulated Coverage of Functional Unit Allocation
[Figure: accumulated coverage of functional unit allocations, with labeled points for NSC 98, IA-64, AMD K6, Pentium Pro, and the Base Machine]
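One plausible reading of accumulated coverage, offered here as an assumption rather than the authors' definition, is the fraction of cycles whose usage pattern fits within a given allocation of units. The sketch below computes that quantity for (A, M, B) allocations; the function name `coverage` and the sample distribution are hypothetical.

```python
def coverage(dist, alloc):
    """Fraction of cycles whose (A, M, B) usage fits within `alloc`.

    dist: mapping from (A, M, B) usage pattern to relative frequency.
    alloc: (A, M, B) number of units provided by the core.
    """
    return sum(
        freq
        for pattern, freq in dist.items()
        if all(used <= have for used, have in zip(pattern, alloc))
    )


# Illustrative distribution, not measured data.
dist = {(2, 1, 0): 0.5, (1, 1, 1): 0.25, (3, 0, 0): 0.25}
for alloc in [(1, 1, 1), (2, 1, 1), (3, 1, 1)]:
    print(alloc, coverage(dist, alloc))
```

Under this reading, comparing allocations amounts to asking how much of the observed usage each one can serve without stalling.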

Conclusions
• Synthesis/analysis tools have been used to observe the functional unit usage and MLP in a superscalar core.
• The speedup over simulation is more than 600 times.
• FUTURE WORK: investigate various microarchitecture features
  – register renaming vs. branch prediction
  – functional unit optimization