Superscalar Organization ECECS 752 Fall 2017 Prof Mikko
- Slides: 73
Superscalar Organization ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison
CPU, circa 1986 Stage Phase Function performed IF φ1 Translate virtual instr. addr. using TLB φ2 Access I-cache φ1 Return instruction from I-cache, check tags & parity φ2 Read RF; if branch, generate target φ1 Start ALU op; if branch, check condition φ2 Finish ALU op; if ld/st, translate addr φ1 Access D-cache φ2 Return data from D-cache, check tags & parity φ1 Write RF RD ALU MEM WB φ2 Source: https: //imgtec. com • MIPS R 2000, ~“most elegant pipeline ever devised” J. Larus • Enablers: RISC ISA, pipelining, on-chip cache memory Mikko Lipasti-University of Wisconsin 2
Iron Law Time Processor Performance = -------Program = Instructions Program (code size) X Cycles X Instruction (CPI) Time Cycle (cycle time) Architecture --> Implementation --> Realization Compiler Designer Processor Designer Mikko Lipasti -- University of Wisconsin Chip Designer 3
Limitations of Scalar Pipelines • Scalar upper bound on throughput – IPC <= 1 or CPI >= 1 • Rigid pipeline stall policy – One stalled instruction stalls entire pipeline • Limited hardware parallelism – Only temporal (across pipeline stages) Mikko Lipasti-University of Wisconsin 4
Superscalar Proposal • Fetch/execute multiple instructions per cycle • Decouple stages so stalls don’t propagate • Exploit instruction-level parallelism (ILP)
Limits on Instruction Level Parallelism (ILP) Weiss and Smith [1984] 1. 58 Sohi and Vajapeyam [1987] 1. 81 Tjaden and Flynn [1970] 1. 86 (Flynn’s bottleneck) Tjaden and Flynn [1973] 1. 96 Uht [1986] 2. 00 Smith et al. [1989] 2. 00 Jouppi and Wall [1988] 2. 40 Johnson [1991] 2. 50 Acosta et al. [1986] 2. 79 Wedig [1982] 3. 00 Butler et al. [1991] 5. 8 Melvin and Patt [1991] 6 Wall [1991] 7 (Jouppi disagreed) Kuck et al. [1972] 8 Riseman and Foster [1972] 51 (no control dependences) Nicolau and Fisher [1984] 90 (Fisher’s optimism)
High-IPC Processor Evolution Desktop/Workstation Market Scalar RISC Pipeline 1980 s: MIPS SPARC Intel 486 2 -4 Issue In-order Early 1990 s: IBM RIOS-I Intel Pentium Limited Outof-Order Mid 1990 s: Power. PC 604 Intel P 6 Large ROB Out-of-Order 2000 s: DEC Alpha 21264 IBM Power 4/5 AMD K 8 1985 – 2005: 20 years, 100 x frequency Mobile Market Scalar RISC Pipeline 2002: ARM 11 2 -4 Issue In-order Limited Outof-Order Large ROB Out-of-Order 2005: Cortex A 8 2009: Cortex A 9 2011: Cortex A 15 2002 – 2011: 10 years, 10 x frequency Mikko Lipasti-University of Wisconsin 7
What Does a High-IPC CPU Do? 1. Fetch and decode 2. Construct data dependence graph (DDG) 3. Evaluate DDG 4. Commit changes to program state Source: [Palacharla, Jouppi, Smith, 1996] Mikko Lipasti-University of Wisconsin 8
A Typical High-IPC Processor 1. Fetch and Decode 2. Construct DDG 3. Evaluate DDG 4. Commit results Mikko Lipasti-University of Wisconsin 9
Power Consumption ARM Cortex A 15 [Source: NVIDIA] Core i 7 [Source: Intel] • Actual computation overwhelmed by overhead of aggressive execution pipeline Mikko Lipasti-University of Wisconsin 10
Limitations of Scalar Pipelines • Scalar upper bound on throughput – IPC <= 1 or CPI >= 1 • Inefficient unified pipeline – Long latency for each instruction • Rigid pipeline stall policy – One stalled instruction stalls all newer instructions
Parallel Pipelines
Intel Pentium Parallel Pipeline
Diversified Pipelines
Power 4 Diversified Pipelines PC I-Cache FP Issue Q FP 1 Unit FP 2 Unit Fetch Q BR Scan Decode BR Predict FX/LD 1 Issue Q FX 1 Unit FX/LD 2 Issue Q LD 1 Unit LD 2 Unit St. Q D-Cache FX 2 Unit BR/CR Issue Q CR Unit BR Unit Reorder Buffer
Rigid Pipeline Stall Policy Bypassing of Stalled Instruction Not Allowed Backward Propagation of Stalling Stalled Instruction
Dynamic Pipelines
Interstage Buffers
Superscalar Pipeline Stages In Program Order Out of Order In Program Order
Limitations of Scalar Pipelines • Scalar upper bound on throughput – IPC <= 1 or CPI >= 1 – Solution: wide (superscalar) pipeline • Inefficient unified pipeline – Long latency for each instruction – Solution: diversified, specialized pipelines • Rigid pipeline stall policy – One stalled instruction stalls all newer instructions – Solution: Out-of-order execution, distributed execution pipelines
High-IPC Processor Mikko Lipasti-University of Wisconsin 21
Instruction Flow Objective: Fetch multiple instructions per cycle • Challenges: – Branches: unpredictable – Branch targets misaligned – Instruction cache misses PC Instruction Cache • Solutions – Prediction and speculation – High-bandwidth fetch logic – Nonblocking cache and prefetching Mikko Lipasti-University of Wisconsin only 3 instructions fetched 22
I-Cache Organization TAG TAG 1 cache line = 1 physical row Line • • • TAG Cache TAG • • • R o w D e c o d er Cache Line R o w D e c o d er Address TAG 1 cache line = 2 physical rows SRAM arrays need to be square to minimize delay Mikko Lipasti-University of Wisconsin 23
Fetch Alignment Mikko Lipasti-University of Wisconsin 24
IBM RIOS-I Fetch Hardware Mikko Lipasti-University of Wisconsin 25
Disruption of Instruction Flow Mikko Lipasti-University of Wisconsin 26
Branch Prediction • Target address generation Target speculation – Access register: • PC, General purpose register, Link register – Perform calculation: • +/- offset, autoincrement • Condition resolution Condition speculation – Access register: • Condition code register, General purpose register – Perform calculation: • Comparison of data register(s) Mikko Lipasti-University of Wisconsin 27
Target Address Generation Mikko Lipasti-University of Wisconsin 28
Branch Condition Resolution Mikko Lipasti-University of Wisconsin 29
Branch Instruction Speculation Mikko Lipasti-University of Wisconsin 30
Hardware Smith Predictor • Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135 -148, May 1981 • Widely employed: Intel Pentium, Power. PC 604, MIPS R 10000, etc. Mikko Lipasti-University of Wisconsin 31
Cortex A 15: Bi-Mode Predictor 15% of A 15 Core Power! • PHT partitioned into T/NT halves – Selector chooses source • Reduces negative interference, since most entries in PHT 0 tend towards NT, and most entries in PHT 1 tend towards T Mikko Lipasti-University of Wisconsin 32
Branch Target Prediction • Does not work well for function/procedure returns • Does not work well for virtual functions, switch statements Mikko Lipasti-University of Wisconsin 33
Branch Speculation • Leading Speculation – Done during the Fetch stage – Based on potential branch instruction(s) in the current fetch group • Trailing Confirmation – Done during the Branch Execute stage – Based on the next Branch instruction to finish execution Mikko Lipasti-University of Wisconsin 34
Branch Speculation • Start new correct path – Must remember the alternate (non-predicted) path • Eliminate incorrect path – Must ensure that the mis-speculated instructions produce no side effects Mikko Lipasti-University of Wisconsin 35
Mis-speculation Recovery • Start new correct path 1. Update PC with computed branch target (if predicted NT) 2. Update PC with sequential instruction address (if predicted T) 3. Can begin speculation again at next branch • Eliminate incorrect path 1. Use tag(s) to deallocate resources occupied by speculative instructions 2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations Mikko Lipasti-University of Wisconsin 36
Parallel Decode • Primary Tasks – Identify individual instructions (!) – Determine instruction types – Determine dependences between instructions • Two important factors – Instruction set architecture – Pipeline width Mikko Lipasti-University of Wisconsin 37
Intel P 6 Fetch/Decode Mikko Lipasti-University of Wisconsin 38
Predecoding in the AMD K 5 • Now commonly employed in loop buffers, decoded instruction caches (uop caches)
Dependence Checking Dest Src 0 Src 1 ? = ? = ? = • Trailing instructions in fetch group – Check for dependence on leading instructions Mikko Lipasti-University of Wisconsin 40
Summary: Instruction Flow • Fetch group alignment • Target address generation – Branch target buffer • Branch condition prediction • Speculative execution – Tagging/tracking instructions – Recovering from mispredicted branches • Decoding in parallel Mikko Lipasti-University of Wisconsin 41
High-IPC Processor Mikko Lipasti-University of Wisconsin 42
Register Data Flow • Parallel pipelines – Centralized instruction fetch – Centralized instruction decode • Diversified execution pipelines – Distributed instruction execution • Data dependence linking – Register renaming to resolve true/false dependences – Issue logic to support out-of-order issue – Reorder buffer to maintain precise state Mikko Lipasti-University of Wisconsin 43
Issue Queues and Execution Lanes ARM Cortex A 15 Source: theregister. co. uk Mikko Lipasti-University of Wisconsin 44
Necessity of Instruction Dispatch
Centralized Reservation Station
Distributed Reservation Station
Issues in Instruction Execution • Current trends – More parallelism bypassing very challenging – Deeper pipelines – More diversity • Functional unit types – – – Integer => short vector Floating point => vector (longer and longer) Load/store most difficult to make parallel Branch Specialized units (media) • Very wide datapaths (256 bits/register or more)
Bypass Networks PC I-Cache FP Issue Q FP 1 Unit FP 2 Unit Fetch Q BR Scan Decode BR Predict FX/LD 1 Issue Q FX 1 Unit FX/LD 2 Issue Q LD 1 Unit LD 2 Unit FX 2 Unit BR/CR Issue Q CR Unit Reorder Buffer BR Unit St. Q D-Cache • O(n 2) interconnect from/to FU inputs and outputs • Associative tag-match to find operands • Solutions (hurt IPC, help cycle time) – Use RF only (IBM Power 4) with no bypass network – Decompose into clusters (Alpha 21264)
Specialized units l Intel Pentium 4 staggered adders – Fireball l l Run at 2 x clock frequency Two 16 -bit bitslices Dependent ops execute on half-cycle boundaries Full result not available until full cycle later
Specialized units l FP multiplyaccumulate R = (A x B) + C l l Doubles FLOP/instruction Lose RISC instruction format symmetry: – 3 source operands l Widely used
Media Data Types a b c d e f g h • Subword parallel vector extensions – Media data (pixels, quantized datum) often 1 -2 bytes – Several operands packed in single 32/64 b register {a, b, c, d} and {e, f, g, h} stored in two 32 b registers – Vector instructions operate on 4/8 operands in parallel – New instructions, e. g. sum of abs. differences (SAD) me = |a – e| + |b – f| + |c – g| + |d – h| • Substantial throughput improvement – Usually requires hand-coding of critical loops – Shuffle ops (gather/scatter of vector elements)
Program Data Dependences • True dependence (RAW) – j cannot execute until i produces its result • Anti-dependence (WAR) – j cannot write its result until i has read its sources • Output dependence (WAW) – j cannot write its result until i has written its result Mikko Lipasti-University of Wisconsin 53
Pipeline Hazards • Necessary conditions: – WAR: write stage earlier than read stage • Is this possible in IF-RD-EX-MEM-WB ? – WAW: write stage earlier than write stage • Is this possible in IF-RD-EX-MEM-WB ? – RAW: read stage earlier than write stage • Is this possible in IF-RD-EX-MEM-WB? • If conditions not met, no need to resolve • Check for both register and memory 54
Pipeline Hazard Analysis • Memory hazards WAR: write stage earlier than read? WAW: write stage earlier than write? RAW: read stage earlier than write? – WAR: Yes/No? – WAW: Yes/No? – RAW: Yes/No? • Register hazards – WAR: Yes/No? – WAW: Yes/No? – RAW: Yes/No? 55
Register Data Dependences • Program data dependences cause hazards – True dependences (RAW) – Antidependences (WAR) – Output dependences (WAW) • When are registers read and written? – Out of program order! – Hence, any and all of these can occur • Solution to all three: register renaming Mikko Lipasti-University of Wisconsin 56
Register Renaming: WAR/WAW • Widely employed (Core i 7, Cortex A 15, …) • Resolving WAR/WAW: – Each register write gets unique “rename register” – Writes are committed in program order at Writeback – WAR and WAW are not an issue • All updates to “architected state” delayed till writeback • Writeback stage always later than read stage – Reorder Buffer (ROB) enforces in-order writeback Add R 3 <= … Sub R 4 <= … And R 3 <= … P 32 <= … P 33 <= … P 35 <= … Mikko Lipasti-University of Wisconsin 57
Register Renaming: RAW • In order, at dispatch: – Source registers checked to see if “in flight” • Register map table keeps track of this • If not in flight, can be read from the register file • If in flight, look up “rename register” tag (IOU) – Then, allocate new register for register write Add R 3 <= R 2 + R 1 Sub R 4 <= R 3 + R 1 And R 3 <= R 4 & R 2 P 32 <= P 2 + P 1 P 33 <= P 32 + P 1 P 35 <= P 33 + P 2 Mikko Lipasti-University of Wisconsin 58
Register Renaming: RAW • Advance instruction to instruction queue – Wait for rename register tag to trigger issue • Issue queue/reservation station enables out-of -order issue – Newer instructions can bypass stalled instructions Mikko Lipasti-University of Wisconsin Source: theregister. co. uk 59
Instruction scheduling • A process of mapping a series of instructions into execution resources – Decides when and where an instruction is executed n Data dependence graph n Mapped to two FUs FU 0 FU 1 1 2 5 3 n 1 n+1 2 3 n+2 5 4 n+3 6 4 6 Mikko Lipasti-University of Wisconsin 60
Instruction scheduling • A set of wakeup and select operations – Wakeup • Broadcasts the tags of parent instructions selected • Dependent instruction gets matching tags, determines if source operands are ready • Resolves true data dependences – Select • Picks instructions to issue among a pool of ready instructions • Resolves resource conflicts – Issue bandwidth – Limited number of functional units / memory ports Mikko Lipasti-University of Wisconsin 61
Scheduling loop • Basic wakeup and select operations broadcast the tag of the selected inst scheduling loop …… grant n request n Select logic selected OR = = ready. L tag. L = = tag. R OR ready. R … grant 1 request 1 … tag 1 … grant 0 request 0 tag W OR = = ready. L tag. L = = tag. R OR ready. R issue to FU ready - request Wakeup logic Mikko Lipasti-University of Wisconsin 62
Wakeup and Select FU 0 1 2 3 5 6 4 FU 1 n+1 2 3 n+2 5 4 n+3 6 Ready inst to issue Mikko Lipasti-University of Wisconsin Wakeup / select 1 Select 1 Wakeup 2, 3, 4 Select 2, 3 Wakeup 5, 6 4, 5 6 Select 4, 5 Wakeup 6 Select 6 63
High-IPC Processor Mikko Lipasti-University of Wisconsin 64
Memory Data Flow • Resolve WAR/WAW/RAW memory dependences – MEM stage can occur out of order • Provide high bandwidth to memory hierarchy – Non-blocking caches Mikko Lipasti-University of Wisconsin 65
Memory Data Dependences • WAR/WAW: stores commit in order Load/Store RS Agen Mem – Hazards not possible. • RAW: loads must check pending stores – – – Store queue keeps track of pending stores Loads check against these addresses Reorder Buffer Similar to register bypass logic Comparators are 64 bits wide Must consider position (age) of loads and stores Store Queue • Major source of complexity in modern designs – Store queue lookup is position-based – What if store address is not yet known? Mikko Lipasti-University of Wisconsin 66
Increasing Memory Bandwidth Expensive to duplicate Mikko Lipasti-University of Wisconsin 67
True Multiporting of SRAM “Bit” Lines -carry data in/out “Word” Lines -select a row Mikko Lipasti-University of Wisconsin 68
Increasing Memory Bandwidth Complex, concurrent FSMs Mikko Lipasti-University of Wisconsin 69
Miss Status Handling Register Address Victim Ld. Tag State V[0: 3] Data • Each MSHR entry keeps track of: – Address: miss address – Victim: set/way to replace – Ld. Tag: which load (s) to wake up – State: coherence state, fill status – V[0: 3]: subline valid bits – Data: block data to be filled into cache Mikko Lipasti-University of Wisconsin 70
Maintaining Precise State ROB • Out-of-order execution – ALU instructions – Load/store instructions Head • In-order completion/retirement – Precise exceptions • Solutions – Reorder buffer retires instructions in order – Store queue retires stores in order – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ Mikko Lipasti-University of Wisconsin Tail 71
Summary: A High-IPC Processor Mikko Lipasti-University of Wisconsin 72
Superscalar Overview • Instruction flow – Branches, jumps, calls: predict target, direction – Fetch alignment – Instruction cache misses • Register data flow – Register renaming: RAW/WAR/WAW – Instruction scheduling: wakeup & select • Memory data flow – In-order stores: WAR/WAW – Store queue: RAW – Data cache misses
- Ececs
- Opwekking 752
- Fhy 752
- California department of human resources
- 26345 chia 752
- Vliw
- Scalar pipeline vs vector pipeline
- Superscalar simulator
- In order issue in order completion example
- Superscalar execution
- Superpipelined processor
- Superscalar architecture diagram
- Vliw vs superscalar
- Superscalar pipeline
- Superscalar machine
- Pipelining and superscalar techniques
- 2017 dvhimss annual fall conference
- Mikko bentlin
- Mikko h. lipasti
- Mikko lindeman
- Mikko vienonen
- Sirpa kesonen
- System by mikko
- Mikko nieminen
- Mikko lappalainen
- Mikko ranta njs
- Mikko juusela
- Mikko karikytö
- Mikko huovila
- Mikko lipasti
- Mikko h. lipasti
- Modulaarinen tuotekehitys
- Mikko posti
- Mikko lipasti
- Mikko h. lipasti
- Mikko ulander
- Mikko manka tampereen yliopisto
- Mikko h. lipasti
- Mikko routala
- Mikko h. lipasti
- Mikko karppinen
- Mikko keränen kamk
- Mikko häikiö
- Ked ferno
- Mikko lipasti
- Mikko h. lipasti
- Potentiometric accelerometer
- Hh embryo
- Mikko lipasti
- Mikko tiira
- Mikko lipasti
- Mikko mäkelä metropolia
- Joannaseppa
- Ecc syndrome
- Process organization in computer organization
- Block essay
- Ayer o anteayer 2017
- Compositional modes for digital and social media
- Cs form no. 33-b revised 2018
- God with us 2017
- Odyssey of the mind 2017
- 2017 pearson education inc
- Usaha salon berdikari neraca saldo
- Comunicado cg 1182/2017
- Jacqueline saburido
- Contoh kod pejabat perakaunan
- Spears & associates oilfield market report
- Critère de mcdonald
- Penn state football parking map 2017
- Adahi qurbani 2017
- X nxn matrix matlab plot 2017
- Ecbc 2017
- Pnab 2017 resumo
- Elly dia