Out-of-Order Execution Structure Optimizations
A. Moshovos © ECE 1773 - Fall ‘07 ECE Toronto

Tag Elimination

Conventional Schedulers are Overdesigned
• For a MIPS-like ISA
  – Two source tags
  – One destination tag
• Not all instructions use two source operands
  – E.g., addi $1, $2, 10
• Not all instructions produce a result that is interesting for scheduling
  – E.g., beq
• Some operands are ready when the instruction enters the scheduler
• Source: Efficient Dynamic Scheduling Through Tag Elimination, Dan Ernst and Todd Austin, ISCA 2002

Some Operands are Ready when the Instruction Enters the Scheduler

Window Specialization
• Have reservation stations with different source operand wait capabilities

Window Specialization
• At rename, check how many source operands are not ready
• If there is an appropriate slot, proceed to schedule
• If not, stall at rename
• Advantages:
  – Destination bus only runs over reservation stations with comparators
  – Load on the destination bus is reduced
• Disadvantages:
  – Stalls due to unavailability of reservation stations
  – Complexity of reservation station assignment
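
The rename-time check above can be sketched as a small allocator. This is a minimal illustration, not the design from the paper: the class name `SpecializedWindow`, the station counts, and the policy of falling back to a station with more comparators than strictly needed are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SpecializedWindow:
    """Reservation stations grouped by how many source tags they can wait on."""

    # free[k] = number of free stations with k tag comparators (k = 0, 1, 2).
    # The default counts are arbitrary illustration values.
    free: dict = field(default_factory=lambda: {0: 2, 1: 4, 2: 2})

    def allocate(self, src_ready):
        """Try to place an instruction at rename.

        src_ready: one boolean per source operand (True = already ready).
        Returns the comparator count of the chosen station, or None to stall.
        """
        waiting = sum(1 for ready in src_ready if not ready)
        # Assumed policy: take the smallest station that can hold every
        # outstanding tag, falling back to a larger one if needed.
        for k in range(waiting, 3):
            if self.free[k] > 0:
                self.free[k] -= 1
                return k
        return None  # no suitable station: rename stalls

    def release(self, k):
        """Return a station with k comparators to the free pool."""
        self.free[k] += 1
```

An instruction such as `addi $1, $2, 10` whose single register source is already ready would call `allocate([True])` and can occupy a station with no comparators at all, which is where the reduced load on the destination bus comes from.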

Window Specialization - Performance as IPC
– Actual clock frequency not considered

Window Specialization - Performance as IPC per ns

Last Tag Prediction
• Observe:
  – Instruction becomes ready after the last tag it waits for appears
• Last Tag prediction
  – Predict which of the two tags that will be
• Speculatively execute
  – Correct speculation: that was the last tag
  – Incorrect speculation:
    • Need to reschedule
    • Detection? Try to read a value that is not available

GShare-Style Last Tag Prediction
• Two-bit saturating counters
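
A gshare-style table of two-bit saturating counters for last-tag prediction might look like the sketch below. The class and method names are hypothetical, and the table size, the initial counter value, and the use of branch outcomes as the global history are assumptions for illustration; the slide only states that the predictor is gshare-style with two-bit counters.

```python
class LastTagPredictor:
    """Predicts which source operand (left or right) will arrive last."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # global history register (assumed: branch outcomes)
        # Counter >= 2 means "right operand arrives last".
        # Assumed initial state: weakly "left arrives last".
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # gshare hashing: XOR the instruction address with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return "right" if self.counters[self._index(pc)] >= 2 else "left"

    def update(self, pc, last_was_right):
        """Train with the operand that actually arrived last."""
        i = self._index(pc)
        if last_was_right:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def record_branch(self, taken):
        """Shift one branch outcome into the global history."""
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

With two-bit counters a single surprise does not flip a strongly biased prediction, which matters here because every wrong guess forces a reschedule.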

Accuracy
• Over all instructions with two outstanding operands

Window Specialization - Performance
• Predictive schedulers
• Performance as IPC – Actual clock frequency not considered

Window Specialization - Performance as IPC per ns

Prescheduling
• Source: Data-flow prescheduling for large instruction windows in out-of-order processors, Pierre Michaud, André Seznec, HPCA 2001

Prescheduling
• Predict latencies
• Put scheduled instructions into a FIFO
• Slide into a smaller window
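
The first two bullets can be illustrated with a toy model: predict when each instruction's operands become available, then order instructions by that predicted time before they reach the small issue window. The function name, the tuple layout, and the fixed per-operation latencies are assumptions for this sketch; it is not the hardware organization from the paper.

```python
from collections import defaultdict


def preschedule(instrs, latency):
    """Order instructions by predicted issue cycle.

    instrs:  list of (name, dst, srcs, op) in program order.
    latency: dict mapping op -> predicted execution latency in cycles.
    Returns a list of (predicted_issue_cycle, name) in FIFO order.
    """
    ready_at = defaultdict(int)  # register -> cycle its value is predicted ready
    slots = []
    for order, (name, dst, srcs, op) in enumerate(instrs):
        # An instruction can issue once its latest source is available.
        issue = max((ready_at[s] for s in srcs), default=0)
        if dst is not None:
            ready_at[dst] = issue + latency[op]
        slots.append((issue, order, name))
    # Sort by predicted issue cycle; program order breaks ties.
    slots.sort()
    return [(issue, name) for issue, _, name in slots]
```

For example, with a 3-cycle load feeding an add, an independent add that follows in program order is placed ahead of the dependent one, so the small window sees instructions roughly in the order they become ready.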

Prescheduling Method

Prescheduling Example

Latency Prediction

Latency Prediction Contd.

Broadcast Free Scheduler

Broadcast Free Scheduler
• Cyclone design
  – D. Ernst, A. Hamel, T. Austin
  – ISCA 2003
• Preschedule instructions
• Put them into a dual strip cyclical FIFO
• Vertical paths allow for motion between the strips

Cyclone Architecture
• Will be ready in cycle + 6

Cyclone Architecture – Cycle + 1

Cyclone Architecture – Cycle + 2

Cyclone Architecture – Cycle + 3

Cyclone Architecture – Cycle + 4

Cyclone Architecture – Cycle + 5

Cyclone Architecture – Cycle + 6

Cyclone Architecture – Cycle + 6

Cyclone Architecture – Mis-scheduling
• Estimate new latency

Pre-scheduler
• Can only do two cascaded MAX calculations
  – Due to timing considerations
• Insert instruction with predicted latency N at the front of the FIFO
• Have it switch at N/2
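
A toy model of the two ideas on this slide: the predicted latency is a MAX over the source operands' remaining times, and an instruction with predicted latency N travels N/2 slots along one strip before switching to the other so that it reaches the issue end after about N cycles. All function names are hypothetical. The sketch assumes the timing table holds cycles remaining for instructions renamed in the same cycle (per-cycle decrementing is omitted), that the switch costs no extra cycle, and that N/2 rounds down; it also does not model the two-MAX cascade limit.

```python
def predicted_latency(timing, srcs):
    """MAX over the cycles remaining until each source value is ready."""
    return max((timing.get(s, 0) for s in srcs), default=0)


def preschedule(timing, dst, srcs, exec_latency):
    """Predict the wait N for one instruction and record when dst is ready."""
    n = predicted_latency(timing, srcs)
    if dst is not None:
        timing[dst] = n + exec_latency
    return n


def position(n, t):
    """Strip and slot of an instruction with predicted latency n after t cycles.

    Slot 0 of the main strip is the issue end of the queue.
    """
    switch = n // 2  # assumed rounding for odd n
    if t <= switch:
        return ("countdown", t)  # moving away from the insertion point
    return ("main", n - t)       # moving back toward the execution units
```

With `n = 6` the instruction sits in countdown slots 0 through 3, takes the vertical path, and arrives at main slot 0 at cycle 6, matching the "will be ready in cycle + 6" walkthrough only in spirit.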

Cyclone IPC Performance

Cyclone True Performance and Area

Matrix Schedulers

Conventional Scheduler
[Figure labels: IW, WS, grants, requests]

Conventional Scheduler Timing
• Can’t pipeline without introducing bubbles between dependent instructions
• Source: A High-Speed Dynamic Instruction Scheduling Scheme for Superscalar Processors, Masahiro Goshima, Kengo Nishino, Yasuhiko Nakashima, Shin-ichiro Mori, Toshiaki Kitamura, Shinji Tomita, MICRO 2001

Towards a Matrix Scheduler
• Observe:
  – In conventional scheduling dependences are discovered twice:
    • Once at renaming
    • Once during scheduling
  – Why? Dependences are implicitly represented
    • Producer and consumer link via a name
    • This is indirect
• Matrix Scheduler idea:
  – Represent dependences explicitly

Dependence Matrix
[Figure labels: Who do I depend upon? Who am I? Left source, Right source]

Matrix Scheduler
[Figure labels: Write port, One cell, wakeup]

Inserting an entry
[Figure label: Write port]

Wakeup
[Figure label: wakeup]
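
The insert and wakeup steps can be sketched with one bit vector per scheduler entry. This is an illustrative model only: the class name is hypothetical, left and right sources are merged into a single row, and wakeup is assumed to happen at the moment the producer issues, ignoring execution latency.

```python
class MatrixScheduler:
    """Dependence matrix: bit j of rows[i] is set when entry i waits on entry j."""

    def __init__(self, size):
        self.size = size
        self.rows = [0] * size
        self.valid = [False] * size

    def insert(self, entry, producers):
        """Write one row: mark the in-flight entries this instruction depends on."""
        row = 0
        for p in producers:
            row |= 1 << p
        self.rows[entry] = row
        self.valid[entry] = True

    def ready(self):
        """Entries whose row is all zeros have no outstanding dependences."""
        return [i for i in range(self.size)
                if self.valid[i] and self.rows[i] == 0]

    def issue(self, entry):
        """Free the entry and clear its column in every row (the wakeup)."""
        self.valid[entry] = False
        mask = ~(1 << entry)
        self.rows = [r & mask for r in self.rows]
```

Note that no tags are compared anywhere: the dependence found at rename is stored directly as a matrix bit, so scheduling does not have to rediscover it through a register name.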

Misspeculation Recovery
• Do not clean up
• Use external logic to inhibit request signals

Delay
• Partial wakeup lines
• 0.18 µm, 1.8 V, 85 °C

Delay measurement points

Scheduling Priorities

Conflict Resolution
• More instructions ready than available issue slots
  – Which get to go?
• Age vs. Pseudo-Random Resolution
• Age is important
• Priority Enforcer picks the oldest
  – Complex
• Source: Matrix Scheduler Reloaded, ISCA 2007

Compacting Scheduler
• Implemented in the Alpha 21264
• Physical order within scheduler corresponds to age
• Entry freed:
  – Shift up all younger entries
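
The compacting idea can be modeled with a list whose index order is age order, so that oldest-first selection is just a scan from the top. The class and method names are hypothetical, and this sketch says nothing about how the Alpha 21264 implements the shifting in hardware.

```python
class CompactingScheduler:
    """Entries are kept in age order: index 0 is the oldest instruction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def insert(self, name, ready=False):
        """Append the youngest instruction at the bottom; fail if full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "ready": ready})
        return True

    def wake(self, name):
        """Mark an instruction as having all of its operands available."""
        for e in self.entries:
            if e["name"] == name:
                e["ready"] = True

    def select(self, issue_width):
        """Issue the oldest ready entries, then compact the queue."""
        chosen = [e for e in self.entries if e["ready"]][:issue_width]
        chosen_ids = {id(e) for e in chosen}
        # Dropping the issued entries shifts all younger entries up.
        self.entries = [e for e in self.entries if id(e) not in chosen_ids]
        return [e["name"] for e in chosen]
```

Because position encodes age, the select logic needs no explicit age comparison; the cost moves into shifting entries every time one is freed.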

Virtual Physical Registers
• Physical register names are used for two purposes
  – Scheduling
  – Communicating
• A physical register is allocated much earlier than needed
  – We need the register only after the value is produced
• De-couple scheduling from communication names

Used vs. Allocated Registers

Goal

Virtual Physical Registers

Deadlock
• Older instruction completes later than younger ones
  – No registers available
• Steal a register and re-execute
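
One possible reading of the deadlock rule, as a sketch: instructions receive only a virtual-physical tag at rename, a real register is claimed when the value is produced, and an older instruction that finds no free register steals one from the youngest holder, which must then re-execute. The class name, the age encoding (smaller = older), and the choice of victim are assumptions for illustration, not details taken from the slides.

```python
class VPRegisterFile:
    """Late physical register allocation with register stealing."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.next_vp = 0
        self.age_of = {}      # virtual-physical tag -> age (smaller = older)
        self.vp_to_phys = {}  # tags that currently hold a physical register

    def rename(self, age):
        """Hand out a tag for scheduling; no storage is reserved yet."""
        vp = self.next_vp
        self.next_vp += 1
        self.age_of[vp] = age
        return vp

    def complete(self, vp):
        """Claim storage when the value is produced.

        Returns (phys, victim): phys is None if the instruction must retry;
        victim is the tag that lost its register and must re-execute.
        """
        if self.free:
            phys = self.free.pop()
            self.vp_to_phys[vp] = phys
            return phys, None
        youngest = max(self.vp_to_phys, key=lambda v: self.age_of[v],
                       default=None)
        if youngest is None or self.age_of[youngest] < self.age_of[vp]:
            return None, None  # every holder is older: wait and retry
        phys = self.vp_to_phys.pop(youngest)
        self.vp_to_phys[vp] = phys
        return phys, youngest

    def release(self, vp):
        """Free the register once it is no longer needed (e.g. at commit)."""
        self.free.append(self.vp_to_phys.pop(vp))
```

Stealing only from younger instructions guarantees that the oldest instruction can always obtain a register, which is what breaks the deadlock.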

Performance vs. Physical Registers