Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Ilhyun Kim and Mikko H. Lipasti
PHARM Team, University of Wisconsin-Madison
MICRO-36, December 4, 2003 (30 slides)

It's all about granularity
• Instruction-centric hardware design
  • HW structures are built to match an instruction's specifications
  • Controls occur at every instruction boundary
  • Instruction granularity may impose constraints on the hardware design space
• Relaxing the constraints at different processing granularities
  • Finer granularity (operand): half-price architecture (ISCA '03)
  • Conventional granularity: instruction
  • Coarser granularity (macro-op): coarser-granular architecture, this work

Outline
• Scheduling loop constraints
• Overview of coarser-grained scheduling
• Macro-op scheduling implementation
• Performance evaluation
• Conclusions & future work

Scheduling loop constraints
• Loops in out-of-order execution (pipeline: Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit)
  • Load latency resolution loop, scheduling loop (wakeup/select), and execution loop (bypass)
• Scheduling atomicity: wakeup and select within a single cycle
  • Essential for back-to-back execution of dependent instructions
  • Hard to pipeline in conventional designs
• Poor scalability
  • Extractable ILP is a function of window size
  • Complexity increases exponentially as the window size grows
  • Increasing pressure due to deeper pipelining and a slower memory system
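To make the atomicity point concrete, here is a minimal timing sketch (not from the talk; the function name and model are illustrative) of how the wakeup/select loop latency governs issue timing for a chain of dependent single-cycle operations:

```python
def issue_cycles(chain_length, sched_loop_latency):
    """Cycle at which each instruction in a chain of dependent
    single-cycle ops issues, given the wakeup->select loop latency
    (1 cycle = atomic scheduling, 2 cycles = pipelined scheduling)."""
    cycles = []
    t = 0
    for _ in range(chain_length):
        cycles.append(t)
        # a dependent op can be selected only after its producer's
        # tag broadcast completes the scheduling loop
        t += sched_loop_latency
    return cycles

# Atomic scheduling: dependent ops issue back-to-back.
print(issue_cycles(4, 1))  # [0, 1, 2, 3]
# Pipelined 2-cycle scheduling: one bubble per dependence edge.
print(issue_cycles(4, 2))  # [0, 2, 4, 6]
```

This is why pipelining the scheduler naively costs IPC: every producer-consumer edge in the chain pays the full loop latency.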

Related work
• Scheduling atomicity
  • Speculation & pipelining: grandparent scheduling [Stark], select-free scheduling [Brown]
• Poor scalability
  • Low-complexity scheduling logic: FIFO-style windows [Palacharla, H. Kim], data-flow based windows [Canal, Michaud, Raasch, ...]
  • Judicious window scaling: segmented windows [Hrishikesh], WIB [Lebeck], ...
• Issue queue entry sharing
  • AMD K7 (MOPs), Intel Pentium M (uop fusion)
• All still based on instruction-centric scheduler designs
  • A scheduling decision is made at every instruction boundary
  • Atomicity and scalability are overcome only in isolation

Source of the atomicity constraint
• Minimal execution latency of an instruction
  • Many ALU operations have single-cycle latency, and the schedule must keep up with execution: 1-cycle instructions need 1-cycle scheduling
  • Multi-cycle operations do not need atomic scheduling
• Relax the constraint by increasing the size of the scheduling unit
  • Combine multiple instructions into one multi-cycle-latency unit
  • Scheduling decisions then occur at multiple-instruction boundaries
  • Attacks both the atomicity and scalability constraints

Macro-op scheduling overview
• Pipeline: Fetch / Decode / Rename, Queue, Scheduling, Disp, RF / EXE / MEM / WB / Commit
• Only the scheduler (issue queue insert, wakeup, select) operates at the coarser MOP granularity; fetch through rename and RF through commit remain instruction-grained
• New components: MOP detection (builds MOP pointers from dependence and wakeup-order information, fed back to the I-cache), MOP formation before issue queue insert, a payload RAM behind the pipelined wakeup/select, and sequencing logic that re-expands MOPs into the original instructions for execution

MOP scheduling (2x) example
• Atomic scheduling of the 16-instruction dependence graph: 9 cycles, 16 queue entries (wakeup n / select n+1 collapsed into one cycle)
• MOP scheduling of the same graph: 10 cycles, 9 queue entries, with pipelined select (n+1) / wakeup (n)
• Pipelined instruction scheduling of multi-cycle MOPs
  • Still issues the original instructions consecutively
• Larger instruction window
  • Multiple original instructions logically share a single issue queue entry

Outline
• Scheduling loop constraints
• Overview of coarser-grained scheduling
• Macro-op scheduling implementation
• Performance evaluation
• Conclusions & future work

Issues in grouping instructions
• Candidate instructions
  • Single-cycle instructions: integer ALU, control, and store-agen operations
  • Multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
• The number of source operands
  • Grouping two dependent instructions yields up to 3 source operands
  • Allow up to 2 source operands (conventional) or no restriction (wired-OR)
• MOP size
  • Bigger MOP sizes may be more beneficial; 2 instructions in this study
• MOP formation scope
  • Instructions are processed in order before insertion into the issue queue
  • Candidate instructions need to be captured within a reasonable scope
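The grouping criteria above can be sketched as a small greedy pairing pass. This is an illustrative model, not the paper's detector: the instruction record layout, field names, and greedy first-match policy are assumptions.

```python
def form_pairs(insts, scope=8, max_srcs=2):
    """insts: program-ordered list of dicts {idx, single_cycle, srcs, dest}.
    Returns (head_idx, tail_idx) pairs where the tail depends on the head,
    both are single-cycle candidates, the distance is within the scope,
    and the pair's distinct external sources fit the operand budget."""
    pairs, used = [], set()
    for head in insts:
        if not head["single_cycle"] or head["idx"] in used:
            continue
        for tail in insts:
            d = tail["idx"] - head["idx"]
            if not (0 < d < scope):
                continue
            if (tail["single_cycle"] and tail["idx"] not in used
                    and head["dest"] in tail["srcs"]):
                # the operand fed by the head is satisfied inside the
                # MOP, so only the remaining sources count
                srcs = set(head["srcs"]) | (set(tail["srcs"]) - {head["dest"]})
                if len(srcs) <= max_srcs:
                    pairs.append((head["idx"], tail["idx"]))
                    used.update({head["idx"], tail["idx"]})
                    break
    return pairs

stream = [
    {"idx": 0, "single_cycle": True,  "srcs": ["r2", "r3"], "dest": "r1"},
    {"idx": 1, "single_cycle": False, "srcs": ["r3"],       "dest": "r4"},  # load
    {"idx": 2, "single_cycle": True,  "srcs": ["r1", "r2"], "dest": "r5"},
]
print(form_pairs(stream))  # [(0, 2)]: the load is skipped as a candidate
```

Note how the load is excluded up front, matching the candidate rule above, and how the shared operand makes the pair fit the 2-source budget.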

Dependence edge distance (instruction count)
• (Figure: per-benchmark distributions of dependence edge distance as a percentage of total instructions; MOP potential within an 8-instruction scope ranges from roughly 27.8% to 56.3%)
• 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
• An 8-instruction scope captures many dependent pairs
• Distances vary across benchmarks (e.g. gap vs. vortex); remember this for later
• Our configuration: group 2 single-cycle instructions within an 8-instruction scope

MOP detection
• Finds groupable instruction pairs
  • Dependence matrix-based detection (detailed in the paper)
  • Performance is insensitive to detection latency, since pointers are reused repeatedly: a pessimistic 100-cycle latency loses 0.22% of IPC
• Generates MOP pointers
  • 4 bits per instruction, stored in the IL1 cache
  • A MOP pointer represents a groupable instruction pair

MOP detection – avoiding cycle conditions
• Cycle conditions between MOPs lead to deadlocks (figure: two groupings whose cross-MOP dependences point in opposite directions each wait on the other)
• Conservative cycle detection heuristic
  • Precise detection is hard, requiring multiple levels of dependence tracking
  • Instead, assume a cycle if both outgoing and incoming edges are detected
  • Captures over 90% of MOP opportunities compared to precise detection
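One plausible reading of the conservative heuristic, sketched below as a filter over candidate pairs (the data layout and the exact rejection rule are assumptions; the paper's matrix implementation differs):

```python
def filter_pairs(candidate_pairs):
    """candidate_pairs: (head, tail) tuples of instruction indices.
    Conservative cycle avoidance: drop any pair whose head also has an
    incoming groupable edge AND whose tail also has an outgoing one,
    since both directions being present could close a cycle between
    MOPs (precise transitive tracking would be needed to be sure)."""
    outgoing, incoming = {}, {}
    for h, t in candidate_pairs:
        outgoing.setdefault(h, set()).add(t)
        incoming.setdefault(t, set()).add(h)
    kept = []
    for h, t in candidate_pairs:
        if h in incoming and t in outgoing:
            continue  # both incoming and outgoing edges: assume a cycle
        kept.append((h, t))
    return kept

# The middle pair of a chain is conservatively rejected; the outer
# pairs, which cannot be part of a two-way cycle, survive.
print(filter_pairs([(0, 1), (1, 2), (2, 3)]))  # [(0, 1), (2, 3)]
```

Like the slide's heuristic, this errs on the side of dropping legal groupings rather than risking a deadlock.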

MOP formation
• MOP pointers are fetched along with instructions
• Locates MOP pairs using the MOP pointers
• Converts register dependences to MOP dependences
  • Architected register IDs are translated to MOP IDs
  • Identical to register renaming, except that a single ID is assigned to two groupable instructions
  • Reflects the fact that two instructions are grouped into one scheduling unit
  • The two instructions are later inserted into one issue queue entry

Scheduling MOPs
• Instructions in a MOP are scheduled as a single unit
  • A MOP is a non-pipelined, 2-cycle operation from the scheduler's perspective
  • Issued when all source operands are ready; incurs one tag broadcast
• Wakeup/select timings for the chain 1 feeding 2 and 3, which feed 4 (MOPs: 1+3 and 2+4):
  • Atomic scheduling: select 1 (cycle n); wakeup 2,3 and select 2,3 (n+1); wakeup 4 and select 4 (n+2)
  • 2-cycle scheduling: select 1 (n); wakeup 2,3 (n+1); select 2,3 (n+2); wakeup 4 (n+3); select 4 (n+4)
  • 2-cycle MOP scheduling: select MOP(1,3) (n); wakeup 2,4 (n+1); select 2,4 (n+2), so dependent instructions still issue consecutively
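The three timing diagrams can be reproduced with a small scheduling model. This is an illustrative sketch under simplifying assumptions (single-cycle ops, unlimited issue width, tails executing one cycle after their head), not the paper's simulator:

```python
def schedule(deps, loop_latency, mops=()):
    """deps: {inst: set(producer insts)}. Returns {inst: issue cycle}.
    loop_latency is the wakeup->select loop in cycles (1 = atomic).
    mops: (head, tail) pairs scheduled as one non-pipelined 2-cycle
    unit; the tail executes one cycle after the head, and dependents
    see a single tag broadcast timed from the MOP's select."""
    tail_of = dict(mops)
    head_of = {t: h for h, t in mops}
    issue = {}

    def broadcast_ok(src, t):
        h = head_of.get(src, src)  # a MOP broadcasts from its head's slot
        return h in issue and issue[h] + loop_latency <= t

    remaining, t = set(deps), 0
    while remaining:
        for i in sorted(remaining):
            if i not in remaining or i in head_of:
                continue  # tails issue implicitly with their head
            unit = {i, tail_of[i]} if i in tail_of else {i}
            srcs = set().union(*(deps[u] for u in unit)) - unit
            if all(broadcast_ok(s, t) for s in srcs):
                issue[i] = t
                if i in tail_of:
                    issue[tail_of[i]] = t + 1
                remaining -= unit
        t += 1
    return issue

deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}}
print(schedule(deps, 1))                         # atomic:  1@0, 2@1, 3@1, 4@2
print(schedule(deps, 2))                         # 2-cycle: 1@0, 2@2, 3@2, 4@4
print(schedule(deps, 2, mops=((1, 3), (2, 4))))  # MOP:     1@0, 3@1, 2@2, 4@3
```

The last line shows the slide's key result: with MOPs, a pipelined 2-cycle scheduler still drains the dependent chain in 4 cycles, matching atomic scheduling.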

Sequencing instructions
• A MOP is converted back into its two original instructions at dispatch
  • The dual-entry payload RAM sends the two original instructions
  • The original instructions execute sequentially within the MOP's 2 cycles
  • Register values are accessed using physical register IDs
• The ROB commits the original instructions separately, in order
  • MOPs do not affect precise exceptions or branch misprediction recovery

Outline
• Scheduling loop constraints
• Overview of coarser-grained scheduling
• Macro-op scheduling implementation
• Performance evaluation
• Conclusions & future work

Machine parameters
• SimpleScalar-Alpha-based, 4-wide out-of-order + speculative scheduling with selective replay, 14 pipeline stages
  • Ideally pipelined scheduler
  • 128-entry ROB, unrestricted / 32-entry issue queue
  • 4 ALUs, 2 memory ports, 16K IL1 (2-cycle), 16K DL1 (2-cycle), 256K L2 (8-cycle), memory (100-cycle)
  • Combined branch prediction; fetch until the first taken branch
• MOP scheduling (conceptually equivalent to atomic scheduling + 1 extra stage)
  • 2-cycle (pipelined) scheduling + 2x MOP technique
  • 2 (conventional) or 3 (wired-OR) source operands
  • MOP detection scope: 2 cycles (4-wide x 2 cycles = up to 8 instructions)
• SPEC2K INT, reduced input sets
  • Reference input sets for crafty, eon, gap (up to 3B instructions)

Number of grouped instructions (2-src vs. 3-src operand policies)
• 28~46% of total instructions are grouped
  • 14~23% reduction in the instruction count seen by the scheduler
• Dependent MOP cases enable consecutive issue of dependent instructions

MOP scheduling performance (relaxed atomicity constraint only)
• Unrestricted issue queue / 128-entry ROB
• 2-cycle scheduling loses up to ~19% of IPC
• MOP scheduling restores the loss
  • Enables consecutive issue of dependent instructions
  • Achieves 97.2% of atomic scheduling performance on average

Insight into MOP scheduling
• The performance loss of 2-cycle scheduling is correlated with dependence edge distance
  • With short dependence edges (e.g. gap), the instruction window fills up with chains of dependent instructions, and a 2-cycle scheduler cannot find enough ready instructions to issue
• MOP scheduling captures short-distance dependent instruction pairs
  • Those are exactly the important ones
  • Low MOP coverage due to long dependence edges does not matter, because a 2-cycle scheduler can already find many instructions to issue there (e.g. vortex)
• MOP scheduling complements 2-cycle scheduling
  • Overall performance becomes less sensitive to code layout

MOP scheduling performance (relaxed atomicity + scalability constraints)
• 32-entry issue queue / 128-entry ROB
• Benefits from both the relaxed atomicity and the relaxed scalability constraints
• Pipelined 2-cycle MOP scheduling performs comparably to, or better than, atomic scheduling

Conclusions & future work
• Changing the processing granularity can relax the constraints imposed by instruction-centric designs
• Constraints in the instruction scheduling loop: scheduling atomicity and poor scalability
  • Macro-op scheduling relaxes both constraints at a coarser granularity
  • Pipelined, 2-cycle macro-op scheduling can perform comparably to, or even better than, atomic scheduling
• Potential for narrow-bandwidth microarchitectures
  • Extending the MOP idea to the whole pipeline (dispatch, RF, bypass)
  • e.g. achieving 4-wide machine performance using 2-wide bandwidth

Questions?

Select-free (Brown et al.) vs. MOP scheduling
• 32-entry issue queue / 128-entry ROB, no extra stage for MOP formation
• MOP scheduling achieves 4.1% better IPC on average than select-free scheduling with a scoreboard (best case 8.3%)
• Select-free scheduling cannot outperform atomic scheduling
  • It is speculative and requires recovery operations
  • MOP scheduling is non-speculative, which brings many advantages

MOP detection – MOP pointer generation
• Finding dependent pairs
  • Dependence matrix-based detection (detailed in the MICRO paper)
  • Insensitive to detection latency, since pointers are reused repeatedly: a pessimistic 100-cycle latency loses 0.22% of IPC
  • Similar to instruction preprocessing in trace cache lines
• MOP pointers: 4 bits per instruction, a control bit plus an offset
  • Control bit (1 bit): captures up to 1 control discontinuity
  • Offset bits (3 bits): instruction count from head to tail
• Example:
  0 011: add r1 <- r2, r3   (tail 3 instructions ahead: bez r1)
  0 000: lw  r4 <- 0(r3)
  1 010: and r5 <- r4, r2   (tail 2 ahead, across the taken branch: sub r6)
  0 000: bez r1, 0xff       (taken)
  0 000: sub r6 <- r5, 1
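The 4-bit pointer format is simple enough to sketch directly. A small encode/decode pair (helper names are illustrative) makes the slide's two example encodings explicit:

```python
def encode_mop_pointer(control, offset):
    """Pack a MOP pointer: 1 control bit (crosses up to one control
    discontinuity) in the high position, 3 offset bits (instruction
    count from head to tail, 0..7) in the low positions."""
    assert control in (0, 1) and 0 <= offset <= 7
    return (control << 3) | offset

def decode_mop_pointer(p):
    """Unpack a 4-bit MOP pointer into (control, offset)."""
    return (p >> 3) & 1, p & 0b111

# "0 011" on the add: tail 3 instructions ahead, no taken branch crossed.
print(bin(encode_mop_pointer(0, 3)))   # 0b11
# "1 010" on the and: tail 2 ahead, across one taken branch.
print(decode_mop_pointer(0b1010))      # (1, 2)
```

The 3-bit offset is what limits the formation scope to 8 instructions, matching the configuration chosen earlier in the talk.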

MOP formation – MOP dependence translation
• Assigns a single ID to two MOPable instructions
  • Reflects the fact that the two instructions are grouped into one unit
  • The process and required structure are identical to register renaming
  • Register values are still accessed using the original register IDs
• Example (tables): register renaming maps the destinations of I1-I4 to physical registers p5-p8, one fresh register each; the MOP translation table instead maps them to m5, m5, m6, m6, allocating a single MOP ID to each grouped pair
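The table example can be sketched as a rename-style pass. This is an illustrative model (instruction indices, register names, and the starting IDs are arbitrary); the point it demonstrates is that renaming allocates one fresh physical register per destination, while MOP translation allocates one ID per scheduling unit:

```python
def rename_and_translate(dests, pairs):
    """dests: destination logical registers, one per instruction, in
    program order (assumed distinct for this sketch).
    pairs: set of (head_idx, tail_idx) groups.
    Returns (reg_map, mop_map): logical reg -> physical reg / MOP ID."""
    head_of = {t: h for h, t in pairs}
    reg_map, mop_map = {}, {}
    next_preg = next_mop = 5          # arbitrary starting IDs, as on the slide
    for i, dest in enumerate(dests):
        reg_map[dest] = next_preg     # renaming: fresh physical reg per dest
        next_preg += 1
        if i in head_of:              # tail: reuse its head's MOP ID
            mop_map[dest] = mop_map[dests[head_of[i]]]
        else:
            mop_map[dest] = next_mop
            next_mop += 1
    return reg_map, mop_map

regs, mops = rename_and_translate(["r1", "r2", "r3", "r4"], {(0, 1), (2, 3)})
print(regs)  # r1..r4 -> p5..p8, one each
print(mops)  # r1,r2 -> m5 and r3,r4 -> m6: one ID per grouped pair
```

Two grouped instructions sharing one MOP ID is what lets them share one issue queue entry downstream.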

Inserting MOPs into the issue queue
• Instructions from different insertion groups can be paired: a head whose tail has not yet arrived is held in a pending slot rather than inserted immediately
• (Figure: over cycles n through n+2, instructions 1-8 are inserted; a pending head either merges with its arriving tail into a single issue queue entry, or has its MOP pointer invalidated and is inserted alone)

Performance considerations
• Independent MOPs
  • Group independent instructions that share the same source dependences
  • No direct performance benefit, but reduces queue contention
• Last-arriving operands in tail instructions
  • A late tail operand unnecessarily delays the head (figure: a head ready at CLK 10 is held until the tail's other operand arrives at CLK 15 or CLK 17)
  • The MOP detection logic filters out such harmful groupings and creates an alternative pair if one exists
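The harmful-grouping filter can be sketched numerically. The functions and the delay threshold below are illustrative assumptions, not the paper's exact filter; they capture the idea that a MOP issues only when every source operand of both instructions is ready:

```python
def grouping_delay(head_ready, tail_other_ready):
    """Extra cycles the head waits because of grouping: the MOP cannot
    be selected before the tail's other operand is also ready."""
    return max(0, tail_other_ready - head_ready)

def keep_group(head_ready, tail_other_ready, max_delay=1):
    # reject groupings that would delay the head beyond the threshold
    return grouping_delay(head_ready, tail_other_ready) <= max_delay

# Slide-style scenario: head ready at CLK 10, tail's other operand at CLK 17.
print(grouping_delay(10, 17))  # 7 cycles of needless head delay
print(keep_group(10, 17))      # False: detection should pick another pair
print(keep_group(10, 10))      # True: operands arrive together, group it
```

With this filter in place, detection only emits pairs whose tail operands arrive close to the head's, which is why grouping rarely hurts in the results above.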

MOP detection – dependence matrix example
• STEP 1: as instructions arrive (clk n, n+1, n+2), build the dependence matrix from the original data dependence graph, marking head and tail candidates; non-groupable instructions are invalidated, and possible cycles are conservatively rejected
• STEP 2: when a head has multiple tail candidates, a priority decoder picks one
• STEP 3: surviving pairs are emitted as MOP pointers (here 3:5, 7:8, 9:10, 11:12); some pointers are only detected after step 3