
ECE/CS 552: Introduction to Superscalar Processors
Instructor: Mikko H. Lipasti
Fall 2010, University of Wisconsin-Madison
Lecture notes partially based on notes by John P. Shen

Limitations of Scalar Pipelines
• Scalar upper bound on throughput: IPC <= 1 (equivalently, CPI >= 1)
• Inefficient unified pipeline: long latency for each instruction
• Rigid pipeline stall policy: one stalled instruction stalls all newer instructions
© Shen, Lipasti 2

Parallel Pipelines © Shen, Lipasti 3

Intel Pentium Parallel Pipeline © Shen, Lipasti 4

Diversified Pipelines © Shen, Lipasti 5

Power4 Diversified Pipelines
[Block diagram: PC and I-Cache feed a Fetch Q; BR Scan and BR Predict steer Decode; instructions dispatch to an FP Issue Q (FP1/FP2 units), FX/LD1 and FX/LD2 Issue Qs (FX1/FX2 and LD1/LD2 units), and a BR/CR Issue Q (BR and CR units); all units write the Reorder Buffer, with a Store Q in front of the D-Cache.]
© Shen, Lipasti 6

Rigid Pipeline Stall Policy
[Figure: bypassing of a stalled instruction is not allowed, so the stall propagates backward to all newer instructions.]
© Shen, Lipasti 7

Dynamic Pipelines © Shen, Lipasti 8

Interstage Buffers © Shen, Lipasti 9

Superscalar Pipeline Stages
[Figure: fetch and decode proceed in program order, execution proceeds out of order, and completion returns to program order.]
© Shen, Lipasti 10

Limitations of Scalar Pipelines
• Scalar upper bound on throughput: IPC <= 1 (CPI >= 1)
  – Solution: wide (superscalar) pipeline
• Inefficient unified pipeline: long latency for each instruction
  – Solution: diversified, specialized pipelines
• Rigid pipeline stall policy: one stalled instruction stalls all newer instructions
  – Solution: out-of-order execution, distributed execution pipelines
© Shen, Lipasti 11

Impediments to High IPC © Shen, Lipasti 12

Superscalar Pipeline Design
• Instruction Fetching Issues
• Instruction Decoding Issues
• Instruction Dispatching Issues
• Instruction Execution Issues
• Instruction Completion & Retiring Issues
© Shen, Lipasti 13

Instruction Fetch
• Objective: fetch multiple instructions per cycle
• Challenges:
  – Branches: control dependences
  – Branch target misalignment
  – Instruction cache misses
• Solutions:
  – Alignment hardware
  – Prediction/speculation
[Figure: PC indexes the instruction memory; 3 instructions fetched per cycle.]
© Shen, Lipasti 14

Fetch Alignment © Shen, Lipasti 15

Branches – MIPS: 6 Types of Branches
• Jump (unconditional, no save PC, immediate)
• Jump and link (unconditional, save PC, immediate)
• Jump register (unconditional, no save PC, register)
• Jump and link register (unconditional, save PC, register)
• Branch (conditional, no save PC, PC+immediate)
• Branch and link (conditional, save PC, PC+immediate)
© Shen, Lipasti 16

Disruption of Sequential Control Flow © Shen, Lipasti 17

Branch Prediction
• Target address generation (target speculation)
  – Access register: PC, general-purpose register, link register
  – Perform calculation: +/- offset, autoincrement
• Condition resolution (condition speculation)
  – Access register: condition code register, general-purpose register
  – Perform calculation: comparison of data register(s)
© Shen, Lipasti 18

Target Address Generation © Shen, Lipasti 19

Condition Resolution © Shen, Lipasti 20

Branch Instruction Speculation © Shen, Lipasti 21

Static Branch Prediction
• Single-direction
  – Always not-taken: Intel i486
• Backwards Taken / Forwards Not-taken
  – Loop-closing branches have negative offset
  – Used as backup in Pentium Pro, III, 4
© Shen, Lipasti 22
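The backwards-taken/forwards-not-taken heuristic reduces to a single address comparison. A minimal sketch (the function name and PC values are illustrative):

```python
def predict_btfn(branch_pc, target_pc):
    """Backwards Taken / Forwards Not-taken static heuristic:
    a branch whose target is at a lower address (negative offset,
    typically a loop-closing branch) is predicted taken."""
    return target_pc < branch_pc

# Loop-closing branch jumps backwards -> predicted taken
assert predict_btfn(0x1040, 0x1000) is True
# Forward branch (e.g., skipping an if-body) -> predicted not-taken
assert predict_btfn(0x1000, 0x1040) is False
```

Because the sign of the offset is available right in the instruction encoding, this prediction needs no state at all, which is why it serves as a backup when dynamic structures miss.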

Static Branch Prediction: Profile-based
1. Instrument program binary
2. Run with representative (?) input set
3. Recompile program:
   a. Annotate branches with hint bits, or
   b. Restructure code to match predict not-taken
• Performance: 75-80% accuracy
  – Much higher for “easy” cases
© Shen, Lipasti 23

Dynamic Branch Prediction
• Main advantages:
  – Learn branch behavior autonomously: no compiler analysis, heuristics, or profiling
  – Adapt to changing branch behavior: program phase changes branch behavior
• First proposed in 1980
  – US Patent #4,370,711, “Branch predictor using random access memory,” James E. Smith
• Continually refined since then
© Shen, Lipasti 24

Smith Predictor Hardware
• J. E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981.
• Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.
© Shen, Lipasti 25
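The Smith predictor's core is a table of 2-bit saturating counters indexed by low PC bits. A behavioral sketch (table size, initial counter value, and word-aligned indexing are illustrative assumptions, not parameters from the paper):

```python
class SmithPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries          # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # drop byte offset of aligned PCs

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # MSB of the counter

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The saturating counter provides hysteresis: a loop-closing branch that is taken many times in a row survives the single not-taken outcome at loop exit without flipping its prediction.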

Two-level Branch Prediction
• BHR adds global branch history
  – Provides more context
  – Can differentiate multiple instances of the same static branch
  – Can correlate behavior across multiple static branches
© Shen, Lipasti 26
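Mixing the BHR into the table index is what lets the predictor distinguish instances of the same static branch. A sketch of a global two-level scheme; the gshare-style XOR indexing, history length, and table size are illustrative choices, not the only way to combine PC and history:

```python
class TwoLevelPredictor:
    """Global two-level predictor: a Branch History Register (BHR)
    of recent outcomes is XORed with the PC to index a table of
    2-bit saturating counters (gshare-style)."""

    def __init__(self, history_bits=8, entries=1024):
        self.bhr = 0
        self.history_bits = history_bits
        self.entries = entries
        self.table = [1] * entries

    def _index(self, pc):
        return ((pc >> 2) ^ self.bhr) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the outcome into the global history
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

A strictly alternating branch, which a lone 2-bit counter mispredicts constantly, becomes perfectly predictable here: the two history patterns map to two different counters, each of which learns one phase of the pattern.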

Combining or Hybrid Predictors
• Select “best” history
• Reduce interference with partial updates
• Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993.
© Shen, Lipasti 27
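The selection mechanism in McFarling's scheme is itself a table of 2-bit counters that learns, per branch, which component predictor to trust. A sketch with two trivially simple static components standing in for real predictors (the component classes and chooser sizing are assumptions for illustration):

```python
class AlwaysTaken:
    def predict(self, pc): return True
    def update(self, pc, taken): pass

class AlwaysNotTaken:
    def predict(self, pc): return False
    def update(self, pc, taken): pass

class HybridPredictor:
    """Combining predictor: a PC-indexed table of 2-bit 'chooser'
    counters selects between two component predictors."""

    def __init__(self, pred_a, pred_b, entries=1024):
        self.a, self.b = pred_a, pred_b
        self.entries = entries
        self.chooser = [1] * entries      # < 2 favors A, >= 2 favors B

    def predict(self, pc):
        use_b = self.chooser[(pc >> 2) % self.entries] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        i = (pc >> 2) % self.entries
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        # move the chooser toward whichever component was right
        if a_ok and not b_ok:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        elif b_ok and not a_ok:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        self.a.update(pc, taken)
        self.b.update(pc, taken)
```

The chooser only moves when exactly one component was correct, so branches that both components handle (or both miss) do not disturb the selection state.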

Branch Target Prediction
• Partial tags sufficient in BTB
© Shen, Lipasti 28
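Partial tags work in a BTB because a false hit is merely a misprediction, not an architectural error, so the structure can trade tag bits for capacity. A direct-mapped sketch (entry count, tag width, and bit-field split are illustrative assumptions):

```python
class BTB:
    """Direct-mapped Branch Target Buffer with partial tags.
    A rare false tag match only causes a branch misprediction,
    which the normal recovery machinery already handles."""

    def __init__(self, entries=512, tag_bits=8):
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1
        self.table = [None] * entries       # (partial_tag, target) or None

    def _split(self, pc):
        idx = (pc >> 2) % self.entries
        tag = (pc >> 11) & self.tag_mask    # a few bits above the index field
        return idx, tag

    def lookup(self, pc):
        idx, tag = self._split(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]                 # predicted target address
        return None                         # miss: fetch falls through to pc + 4

    def install(self, pc, target):
        idx, tag = self._split(pc)
        self.table[idx] = (tag, target)
```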

Return Address Stack
• For each call/return pair:
  – Call: push return address onto hardware stack
  – Return: pop return address from hardware stack
© Shen, Lipasti 29
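The push/pop discipline above can be sketched directly; the fixed depth and the drop-oldest overflow policy are illustrative assumptions (real designs vary), and the 4-byte return-address offset assumes fixed-length instructions:

```python
class ReturnAddressStack:
    """Fixed-depth hardware return-address stack.  On overflow the
    oldest entry is lost; on underflow the prediction is simply
    wrong -- safe, because it is only a prediction."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)               # drop oldest entry on overflow
        self.stack.append(call_pc + 4)      # push the return address

    def on_return(self):
        # pop the predicted return target; None models an empty-stack miss
        return self.stack.pop() if self.stack else None
```

A RAS beats a BTB for returns because the same return instruction jumps to a different target on every call, so a single cached target is usually stale.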

Branch Speculation
• Leading Speculation
  – Typically done during the Fetch stage
  – Based on potential branch instruction(s) in the current fetch group
• Trailing Confirmation
  – Typically done during the Branch Execute stage
  – Based on the next branch instruction to finish execution
© Shen, Lipasti 30

Branch Speculation
• Leading Speculation
  1. Tag speculative instructions
  2. Advance branch and following instructions
  3. Buffer addresses of speculated branch instructions
• Trailing Confirmation
  1. When branch resolves, remove/deallocate speculation tag
  2. Permit completion of branch and following instructions
© Shen, Lipasti 31

Branch Speculation
• Start new correct path
  – Must remember the alternate (non-predicted) path
• Eliminate incorrect path
  – Must ensure that the mis-speculated instructions produce no side effects
© Shen, Lipasti 32

Mis-speculation Recovery
• Start new correct path
  1. Update PC with computed branch target (if predicted not-taken)
  2. Update PC with sequential instruction address (if predicted taken)
  3. Can begin speculation again at next branch
• Eliminate incorrect path
  1. Use tag(s) to deallocate resources occupied by speculative instructions
  2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations
© Shen, Lipasti 33

Summary: Instruction Fetch
• Fetch group alignment
• Target address generation
  – Branch target buffer
  – Return address stack
• Target condition generation
  – Static prediction
  – Dynamic prediction
• Speculative execution
  – Tagging/tracking instructions
  – Recovering from mispredicted branches
© Shen, Lipasti 34

Issues in Decoding
• Primary tasks
  – Identify individual instructions (!)
  – Determine instruction types
  – Determine dependences between instructions
• Two important factors
  – Instruction set architecture
  – Pipeline width
© Shen, Lipasti 35

Pentium Pro Fetch/Decode © Shen, Lipasti 36

Predecoding in the AMD K5 © Shen, Lipasti 37

Dependence Checking
[Figure: comparators match each trailing instruction’s source registers (Src0, Src1) against the destination registers (Dest) of the leading instructions.]
• Trailing instructions in fetch group
  – Check for dependence on leading instructions
© Shen, Lipasti 38
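The comparator network above can be modeled as nested loops over the fetch group; in hardware all comparisons happen in parallel, and the (dest, src0, src1) tuple encoding here is an illustrative assumption:

```python
def check_fetch_group(group):
    """For each trailing instruction, compare its source registers
    against the destination registers of all leading instructions
    in the same fetch group.  Instructions are (dest, src0, src1)
    register-number tuples."""
    deps = []
    for j, (_, s0, s1) in enumerate(group):
        for i in range(j):                  # every leading instruction
            dest = group[i][0]
            if dest in (s0, s1):
                deps.append((i, j))         # instruction j has a RAW on i
    return deps

# add r3, r2, r1 ; sub r4, r3, r1 ; and r5, r4, r2
group = [(3, 2, 1), (4, 3, 1), (5, 4, 2)]
assert check_fetch_group(group) == [(0, 1), (1, 2)]
```

The quadratic loop structure mirrors why wide decode is expensive: an n-wide machine needs on the order of n² comparators per source operand.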

Instruction Dispatch and Issue
• Parallel pipeline
  – Centralized instruction fetch
  – Centralized instruction decode
• Diversified pipeline
  – Distributed instruction execution
© Shen, Lipasti 39

Necessity of Instruction Dispatch © Shen, Lipasti 40

Centralized Reservation Station © Shen, Lipasti 41

Distributed Reservation Station © Shen, Lipasti 42

Issues in Instruction Execution
• Parallel execution units
  – Bypassing is a real challenge
• Resolving register data dependences
  – Want out-of-order instruction execution
• Resolving memory data dependences
  – Want loads to issue as soon as possible
• Maintaining precise exceptions
  – Required by the ISA
© Shen, Lipasti 43

Bypass Networks
[Figure: the Power4 pipeline diagram again — issue queues feeding FP, FX, LD, BR, and CR units, reorder buffer, store queue, D-cache.]
• O(n²) interconnect from/to FU inputs and outputs
• Associative tag-match to find operands
• Solutions (hurt IPC, help cycle time)
  – Use RF only (IBM Power4) with no bypass network
  – Decompose into clusters (Alpha 21264)
© Shen, Lipasti 44

The Big Picture
Instruction processing constraints:
• Resource contention (structural dependences)
• Code dependences
  – Control dependences
  – Data dependences
    · True dependences (RAW)
    · Storage conflicts: anti-dependences (WAR) and output dependences (WAW)
© Shen, Lipasti 45

Register Data Dependences
• Program data dependences cause hazards
  – True dependences (RAW)
  – Antidependences (WAR)
  – Output dependences (WAW)
• When are registers read and written?
  – Out of program order!
  – Hence, any/all of these can occur
• Solution to all three: register renaming
© Shen, Lipasti 46

Register Renaming: WAR/WAW
• Widely employed (Core i7, Athlon/Phenom, …)
• Resolving WAR/WAW:
  – Each register write gets a unique “rename register”
  – Writes are committed in program order at Writeback
  – WAR and WAW are not an issue
    · All updates to “architected state” delayed till writeback
    · Writeback stage always later than read stage
    · Reorder Buffer (ROB) enforces in-order writeback
Example:
  Add R3 <= …  →  P32 <= …
  Sub R4 <= …  →  P33 <= …
  And R3 <= …  →  P35 <= …
© Shen, Lipasti 47

Register Renaming: RAW
• In order, at dispatch:
  – Source registers checked to see if “in flight”
    · Register map table keeps track of this
    · If not in flight, can be read from the register file
    · If in flight, look up “rename register” tag (IOU)
  – Then, allocate new register for register write
Example:
  Add R3 <= R2 + R1  →  P32 <= P2 + P1
  Sub R4 <= R3 + R1  →  P33 <= P32 + P1
  And R3 <= R4 & R2  →  P35 <= P33 + P2
© Shen, Lipasti 48
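The map-table lookup-then-allocate sequence can be sketched directly. This is a simplification under stated assumptions: the identity initial mapping and strictly sequential physical-register numbering are illustrative (which is why the third instruction lands in P34 here rather than the slide's P35, where other in-flight instructions presumably consumed a number):

```python
class Renamer:
    """Map-table register renaming for RAW dependences."""

    def __init__(self, arch_regs=32):
        # architectural register -> current physical register
        self.map = {r: r for r in range(arch_regs)}
        self.next_phys = arch_regs

    def rename(self, dest, src0, src1):
        # 1. sources read the current mapping, picking up in-flight tags
        p0, p1 = self.map[src0], self.map[src1]
        # 2. destination gets a freshly allocated rename register
        pd = self.next_phys
        self.next_phys += 1
        self.map[dest] = pd
        return pd, p0, p1

r = Renamer()
assert r.rename(3, 2, 1) == (32, 2, 1)    # Add R3 <= R2 + R1
assert r.rename(4, 3, 1) == (33, 32, 1)   # Sub R4 <= R3 + R1 (R3 in flight)
assert r.rename(3, 4, 2) == (34, 33, 2)   # And R3 <= R4 & R2
```

Note how the Sub's source R3 resolves to P32, the Add's rename register, not the architected R3: that is exactly the RAW "IOU" the slide describes.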

Register Renaming: RAW
• Advance instruction to reservation station
  – Wait for rename register tag to trigger issue
• Reservation station enables out-of-order issue
  – Newer instructions can bypass stalled instructions
© Shen, Lipasti 49

“Dataflow Engine” for Dynamic Execution
[Figure: Dispatch Buffer → Dispatch (reads Reg. File and Rename Registers, allocates Reorder Buffer entries) → Reservation Stations → Branch, Integer, Float. Point, and Load/Store units → Reorder Buffer → Complete. Results are forwarded to reservation stations and rename registers; the ROB is managed as a queue and maintains the sequential order of all instructions in flight.]
© Shen, Lipasti 50

Instruction Processing Steps
• DISPATCH:
  – Read operands from Register File (RF) and/or Rename Register File (RRF)
  – Rename destination register and allocate RRF entry
  – Allocate Reorder Buffer (ROB) entry
  – Advance instruction to appropriate Reservation Station (RS)
• EXECUTE:
  – RS entry monitors bus for register tag(s) to latch in pending operand(s)
  – When all operands ready, issue instruction into Functional Unit (FU) and deallocate RS entry (no further stalling in execution pipe)
  – When execution finishes, broadcast result to waiting RS entries, RRF entry, and ROB entry
• COMPLETE:
  – Update architected register from RRF entry, deallocate RRF entry, and if it is a store instruction, advance it to the Store Buffer
  – Deallocate ROB entry; the instruction is considered architecturally completed
© Shen, Lipasti 51

Physical Register File
[Figure: pipeline stages Fetch → Decode → Rename → RF Read → Issue → Execute (ALU) / Agen-D$ (load) → RF Write, with the map table (R0 => P7, R1 => P3, …, R31 => P39) pointing into a single physical register file.]
• Used in MIPS R10000, Pentium 4, AMD Bulldozer
• All registers in one place
  – Always accessed right before EX stage
  – No copying to real register file at commit
© Shen, Lipasti 52

Managing Physical Registers
Map table: R0 => P7, R1 => P3, …, R31 => P39
  Add R3 <= R2 + R1  →  P32 <= P2 + P1
  Sub R4 <= R3 + R1  →  P33 <= P32 + P1
  …
  And R3 <= R4 & R2  →  P35 <= P33 + P2   ← release P32 (the previous R3) when this instruction completes execution
• What to do when all physical registers are in use?
  – Must release them somehow to avoid stalling
  – Maintain free list of “unused” physical registers
• Release when no more uses are possible
  – Sufficient: next write commits
© Shen, Lipasti 53
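The free-list discipline above can be sketched as follows: renaming a destination remembers the *previous* physical mapping, which is safely recycled once the next write to that architectural register commits. The register-file sizes and FIFO free list are illustrative assumptions:

```python
class PhysRegManager:
    """Renaming with a free list of physical registers."""

    def __init__(self, arch_regs=32, phys_regs=48):
        self.map = {r: r for r in range(arch_regs)}
        self.free = list(range(arch_regs, phys_regs))

    def rename_dest(self, dest):
        if not self.free:
            # all physical registers in use: rename must stall
            raise RuntimeError("no free physical registers")
        prev = self.map[dest]           # freed when this write commits
        self.map[dest] = self.free.pop(0)
        return self.map[dest], prev

    def commit_write(self, prev_phys):
        # the previous mapping can never be read again: recycle it
        self.free.append(prev_phys)

m = PhysRegManager()
p_new, p_prev = m.rename_dest(3)      # Add writes R3
assert (p_new, p_prev) == (32, 3)
p2_new, p2_prev = m.rename_dest(3)    # And writes R3 again
assert (p2_new, p2_prev) == (33, 32)
m.commit_write(p2_prev)               # And commits: P32 is dead, release it
```

Waiting for the next write to commit is sufficient (not necessary): once a younger write to R3 is architecturally committed, no instruction can still need the older physical copy.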

Memory Data Dependences
[Figure: Load/Store RS → Agen → Mem, with a Store Queue and the Reorder Buffer alongside.]
• WAR/WAW: stores commit in order
  – Hazards not possible. Why?
• RAW: loads must check pending stores
  – Store queue keeps track of pending store addresses
  – Loads check against these addresses
  – Similar to register bypass logic
  – Comparators are 32 or 64 bits wide (address size)
• Major source of complexity in modern designs
  – Store queue lookup is position-based
  – What if store address is not yet known?
© Shen, Lipasti 54
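The position-based store queue lookup can be sketched as a scan from youngest to oldest pending store: a load must get the value of the youngest *older* store to the same address, or the cache if none matches. Word-granularity addresses and a dict-backed cache are simplifying assumptions (real queues also handle partial overlaps and unknown store addresses):

```python
class StoreQueue:
    """Position-based store-to-load forwarding."""

    def __init__(self):
        self.stores = []                    # pending stores, program order

    def store(self, addr, value):
        self.stores.append((addr, value))

    def load(self, addr, cache):
        # scan from youngest to oldest pending store
        for st_addr, value in reversed(self.stores):
            if st_addr == addr:
                return value                # forward from the store queue
        return cache.get(addr, 0)           # no match: read the D-cache

    def commit_oldest(self, cache):
        addr, value = self.stores.pop(0)    # stores commit in program order
        cache[addr] = value

sq = StoreQueue()
cache = {0x100: 7}
sq.store(0x100, 1)
sq.store(0x100, 2)
assert sq.load(0x100, cache) == 2   # youngest matching older store wins
assert sq.load(0x104, cache) == 0   # no pending store, cold cache line
sq.commit_oldest(cache)             # in-order commit writes value 1 first
```

In-order commit out of this queue is also what makes memory WAR/WAW hazards impossible: memory is only updated in program order, regardless of execution order.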

Increasing Memory Bandwidth © Shen, Lipasti 55

Issues in Completion/Retirement
• Out-of-order execution
  – ALU instructions
  – Load/store instructions
• In-order completion/retirement
  – Precise exceptions
• Solutions
  – Reorder buffer retires instructions in order
  – Store queue retires stores in order
  – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ
© Shen, Lipasti 56

A Dynamic Superscalar Processor © Shen, Lipasti 57

Superscalar Summary © Shen, Lipasti 58

[John DeVale & Bryan Black, 2005] © Shen, Lipasti 59

Superscalar Summary
• Instruction flow
  – Branches, jumps, calls: predict target, direction
  – Fetch alignment
  – Instruction cache misses
• Register data flow
  – Register renaming: RAW/WAR/WAW
• Memory data flow
  – In-order stores: WAR/WAW
  – Store queue: RAW
  – Data cache misses: missed load buffers
© Shen, Lipasti 60