ECECS 552 Introduction to Superscalar Processors and the
- Slides: 45
ECE/CS 552: Introduction to Superscalar Processors and the MIPS R 10000 © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith
Superscalar Processors • Challenges & Impediments – Instruction flow – Register data flow – Memory data flow • MIPS R 10000 reading – [Kenneth C. Yeager. The MIPS R 10000 Superscalar Microprocessor, IEEE Micro, April 1996] – Exemplifies the techniques described in this lecture 2
Limitations of Scalar Pipelines • Scalar upper bound on throughput – IPC <= 1 or CPI >= 1 • Inefficient unified pipeline – Long latency for each instruction • Rigid pipeline stall policy – One stalled instruction stalls all newer instructions 3
Parallel Pipelines 4
Intel Pentium Parallel Pipeline 5
Diversified Pipelines 6
IBM Power 4 Diversified Pipelines PC I-Cache FP Issue Q FP 1 Unit FP 2 Unit Fetch Q BR Scan Decode BR Predict FX/LD 1 Issue Q FX 1 Unit FX/LD 2 Issue Q LD 1 Unit LD 2 Unit FX 2 Unit BR/CR Issue Q CR Unit Reorder Buffer BR Unit St. Q D-Cache 7
Rigid Pipeline Stall Policy Bypassing of Stalled Instruction Not Allowed Backward Propagation of Stalling Stalled Instruction 8
Dynamic Pipelines 9
Superscalar Pipeline Stages In Program Order Out of Order In Program Order 10
Limitations of Scalar Pipelines • Scalar upper bound on throughput – IPC <= 1 or CPI >= 1 – Solution: wide (superscalar) pipeline • Inefficient unified pipeline – Long latency for each instruction – Solution: diversified, specialized pipelines • Rigid pipeline stall policy – One stalled instruction stalls all newer instructions – Solution: Out-of-order execution, distributed execution pipelines 11
Impediments to High IPC 12
Instruction Flow Objective: Fetch multiple instructions per cycle • Challenges: – Branches: control dependences – Branch target misalignment – Instruction cache misses PC Instruction Memory • Solutions – Alignment hardware – Prediction/speculation – Nonblocking cache 3 instructions fetched 13
Disruption of Sequential Control Flow 14
Branch Prediction • Target address generation Target Speculation – Access register: • PC, General purpose register, Link register – Perform calculation: • +/- offset, autoincrement • Condition resolution Condition speculation – Access register: • Condition code register, General purpose register – Perform calculation: • Comparison of data register(s) 15
Target Address Generation 16
Condition Resolution 17
Branch Instruction Speculation 18
Static Branch Prediction • Single-direction – Always not-taken: Intel i 486 • Backwards Taken/Forward Not Taken – Loop-closing branches have negative offset – Used as backup in Pentium Pro, III, 4 19
Static Branch Prediction Profile-based • 1. Instrument program binary 2. Run with representative (? ) input set 3. Recompile program a. Annotate branches with hint bits, or b. Restructure code to match predict not-taken Performance: 75 -80% accuracy • – Much higher for “easy” cases 20
Dynamic Branch Prediction • Main advantages: – Learn branch behavior autonomously • No compiler analysis, heuristics, or profiling – Adapt to changing branch behavior • Program phase changes branch behavior • First proposed in 1980 – US Patent #4, 370, 711, Branch predictor using random access memory, James. E. Smith • Continually refined since then 21
Smith Predictor Hardware • Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135 -148, May 1981 • Widely employed: Intel Pentium, Power. PC 604, MIPS R 10000, etc. • More sophisticated predictors used in more recent designs – Covered in detail in 752 22
Branch Target Prediction • Does not work well for function/procedure returns – Special return address stack (RAS) structure covered in 752 23
Branch Speculation • Leading Speculation – Typically done during the Fetch stage – Based on potential branch instruction(s) in the current fetch group • Trailing Confirmation – Typically done during the Branch Execute stage – Based on the next Branch instruction to finish execution 24
Branch Speculation • Start new correct path – Must remember the alternate (non-predicted) path • Eliminate incorrect path – Must ensure that the mis-speculated instructions produce no side effects 25
Mis-speculation Recovery • Start new correct path 1. Update PC with computed branch target (if predicted NT) 2. Update PC with sequential instruction address (if predicted T) 3. Can begin speculation again at next branch • Eliminate incorrect path 1. Use tag(s) to deallocate resources occupied by speculative instructions 2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations 26
Parallel Decode • Primary Tasks – Identify individual instructions (!) – Determine instruction types – Determine dependences between instructions • Two important factors – Instruction set architecture – Pipeline width 27
Pentium Pro Fetch/Decode 28
Dependence Checking Dest Src 0 Src 1 ? = ? = ? = • Trailing instructions in fetch group – Check for dependence on leading instructions 29
Summary: Instruction Flow • Fetch group alignment • Target address generation – Branch target buffer • Target condition generation – Static prediction – Dynamic prediction • Speculative execution – Tagging/tracking instructions – Recovering from mispredicted branches • Parallel decode 30
Register Data Flow • Parallel pipeline – Centralized instruction fetch – Centralized instruction decode • Diversified pipeline – Distributed instruction execution • Data dependence linking – Register renaming to eliminate WAR/WAW – Issue logic to support out-of-order issue – Reorder buffer to maintain precise state 31
MIPS R 10000 Issue Queues 32
Register Data Dependences • Program data dependences cause hazards – True dependences (RAW) – Antidependences (WAR) – Output dependences (WAW) • When are registers read and written? – Out of program order! – Hence, any/all of these can occur • Solution to all three: register renaming 33
Register Renaming: WAR/WAW • Widely employed (Core i 7, MIPS R 10000, …) • Resolving WAR/WAW: – Each register write gets unique “rename register” – Writes are committed in program order at Writeback – WAR and WAW are not an issue • All updates to “architected state” delayed till writeback • Writeback stage always later than read stage – Reorder Buffer (ROB) enforces in-order writeback Add R 3 <= … Sub R 4 <= … And R 3 <= … P 32 <= … P 33 <= … P 35 <= … 34
Register Renaming: RAW • In order, at dispatch: – Source registers checked to see if “in flight” • Register map table keeps track of this • If not in flight, can be read from the register file • If in flight, look up “rename register” tag (IOU) – Then, allocate new register for register write Add R 3 <= R 2 + R 1 Sub R 4 <= R 3 + R 1 And R 3 <= R 4 & R 2 P 32 <= P 2 + P 1 P 33 <= P 32 + P 1 P 35 <= P 33 + P 2 35
Register Renaming: RAW • Advance instruction to instruction queue – Wait for rename register tag to trigger issue • Issue queue/reservation station enables out-of -order issue – Newer instructions can bypass stalled instructions 36
Load RF Write ALU RF Write D$ Execute Agen-D$ Issue RF Read Decode Rename Fetch Physical Register File Map Table R 0 => P 7 R 1 => P 3 Physical Register File … R 31 => P 39 • Used in MIPS R 10000, Core i 7, AMD Bulldozer • All registers in one place – Always accessed right before EX stage – No copying to real register file at commit 37
Managing Physical Registers Map Table R 0 => P 7 R 1 => P 3 … R 31 => P 39 Add R 3 <= R 2 + R 1 Sub R 4 <= R 3 + R 1 … … And R 3 <= R 4 & R 2 P 32 <= P 2 + P 1 P 33 <= P 32 + P 1 P 35 <= P 33 + P 2 Release P 32 (previous R 3) when this instruction completes execution • What to do when all physical registers are in use? – Must release them somehow to avoid stalling – Maintain free list of “unused” physical registers • Release when no more uses are possible – Sufficient: next write commits 38
Memory Data Flow • Resolve WAR/WAW/RAW memory dependences – MEM stage can occur out of order • Provide high bandwidth to memory hierarchy – Non-blocking caches 39
Memory Data Dependences • WAR/WAW: stores commit in order Load/Store RS Agen Mem – Hazards not possible. Why? Store Queue • RAW: loads must check pending stores – Store queue keeps track of pending store addresses – Loads check against these addresses – Similar to register bypass logic – Comparators are 64 bits wide (address size) Reorder Buffer • Major source of complexity in modern designs – Store queue lookup is position-based – What if store address is not yet known? 40
Increasing Memory Bandwidth 41
Maintaining Precise State • Out-of-order execution – ALU instructions – Load/store instructions • In-order completion/retirement – Precise exceptions • Solutions – Reorder buffer retires instructions in order – Store queue retires stores in order – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ 42
A Dynamic Superscalar Processor 43
MIPS R 10000 • Details in paper [Kenneth C. Yeager. The MIPS R 10000 Superscalar Microprocessor, IEEE Micro, April 1996] • Clean, straightforward implementation of many of the concepts presented in this lecture 44
Superscalar Summary 45
- Ececs
- Ece 552
- Omd552
- Maquinas sin guardas
- Sec552
- +1 (617) 552-2015
- Cis552
- Ece 552
- Quiz
- Vliw processors rely on
- Pipelining and superscalar techniques
- Paralleism
- Disadvantages of vliw
- Superscalar vs superpipelined
- Vliw vs superscalar
- Superscalar simulator
- Intel pentium
- Superscalar pipeline
- Superscalar execution
- In order issue in order completion example
- Which pipeline is linear
- Disadvantages of amd processor
- Microcontrollers and embedded processors
- Language and processors for requirement
- Programming massively parallel processors
- Aicarm
- The history of cpu
- Handlers classification in parallel computing
- Digital camera processors
- Embeded processors
- Comparison of word processors
- Characteristics of query processor
- Parallel processors from client to cloud
- Programming massively parallel processors
- Massively parallel processing ppt
- Gas processors association
- Beagleboard embedded processors
- Ece 526
- Features of macro processor
- Macro instruction
- Difference lan wan
- Programming massively parallel processors, kirk et al.
- Hát kết hợp bộ gõ cơ thể
- Slidetodoc
- Bổ thể
- Tỉ lệ cơ thể trẻ em