EECS 470 Lecture 9: MIPS R10000 Case Study
Winter 2021
Jon Beaumont
http://www.eecs.umich.edu/courses/eecs470
[Figure: Multiprocessor SGI Origin using MIPS R10K]
Many thanks to Profs. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.

Announcements
• HW #2 due 2/17 (tomorrow)
• P3 due 2/25 (next Thursday)
  • Pushed back a day for wellness day
  • Make sure you are using the starting code linked in the spec (not the one incorrectly listed on the website)
• Midterm on Thursday 3/4 @ 6-8 pm (2.5 weeks away)
  • Email me if you have a conflict
  • Covers everything in lecture and lab
  • Logistics coming soon
• Final Project spec coming later this week
  • 5-person teams (use Piazza or Discord if you are looking for a team or extra members)
  • You will sign up for meeting slots to go over a proposal in the next couple of weeks

Readings
For Today:
• H&P 3.11
• Yeager, “MIPS R10K”

Last Time
• How to ensure precise state?
• P6 case study

Today
• MIPS R10K case study
• Alternate way of implementing precise state in an out-of-order machine

EECS 470 Roadmap
• Speedup Programs
  • Reduce Instruction Latency
  • Reduce number of instructions
  • Parallelize
    • Instruction Level Parallelism: Pipelining, Superscalar Execution, Dynamic Scheduling (Scoreboarding, Tomasulo’s Algorithm), Register Renaming, Precise State (P6, R10K)
    • Thread, Process, etc. Level Parallelism: Programmability
Poll: Which locations can store values in P6?

The Problem with P6
[Figure: P6 datapath. Map Table (T+), Regfile (value), RS (T, T1, T2, V1, V2), ROB (R, value; Head/Retire, Tail/Dispatch), FU, and CDB carrying both tag (CDB.T) and value (CDB.V)]
Problem for high performance implementations
• Too much value movement (regfile/ROB → RS → ROB → regfile)
• Multi-input muxes, long buses complicate routing and slow clock

MIPS R10K: Alternative Implementation
[Figure: R10K datapath. Map Table (T+), Arch. Map, Free List, RS (T, T1+, T2+), ROB (T, Told; Head/Retire, Tail/Dispatch), FU, and CDB carrying tags only (CDB.T)]
• One big physical register file holds all data - no copies
  + Register file close to FUs → small fast data path
• ROB and RS “on the side” used only for control and tags

Register Renaming in R10K
Architectural register file? Gone
Physical register file holds all values
• #physical registers = #architectural registers + #ROB entries
• Map architectural registers to physical registers
• Removes WAW, WAR hazards (physical registers replace RS copies)
Fundamental change to map table
• Mappings cannot be 0 (there is no architectural register file)
Free list keeps track of unallocated physical registers
• ROB is responsible for returning physical registers to free list
Conceptually, this is “true register renaming”
• Have already seen an example
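The sizing rule and the "never empty" map table can be illustrated with a small Python sketch. The helper name `init_rename_state` and the `p1`, `p2`, ... register names are assumptions made for illustration only; they are not taken from the R10K itself.

```python
from collections import deque

def init_rename_state(arch_regs, rob_entries):
    """Build an initial map table and free list (illustrative sketch only)."""
    # Sizing rule from the slide: #physical = #architectural + #ROB entries.
    num_phys = len(arch_regs) + rob_entries
    phys = [f"p{i}" for i in range(1, num_phys + 1)]
    # Every architectural register always maps to some physical register,
    # so the map table is never empty.
    map_table = dict(zip(arch_regs, phys))
    # Whatever is left over is unallocated and sits on the free list.
    free_list = deque(phys[len(arch_regs):])
    return map_table, free_list

map_table, free_list = init_rename_state(["r1", "r2", "r3"], rob_entries=4)
print(map_table)
print(list(free_list))
```

With three architectural registers and four ROB entries this yields seven physical registers, which matches the parameters of the renaming example on the next slide.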

Register Renaming Example
Parameters
• Names: r1, r2, r3
• Locations: p1, p2, p3, p4, p5, p6, p7
• Original mapping: r1 → p1, r2 → p2, r3 → p3; p4–p7 are “free”
MapTable
r1   p4   p6
r2   p2   p2
r3   p3   p5
FreeList: p4, p5, p6, p7
Raw insns         Renamed insns
add r2, r3, r1    add p2, p3, p4
sub r2, r1, r3    sub p2, p4, p5
mul r2, r3, r1    mul p2, p5, p6
div …, r3, r2     div …, p5, p7
Poll: In P6, when was extra storage freed?
Question: how is the insn after div renamed?
• We are out of free locations (physical registers)
• Real question: how/when are physical registers freed?
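The renaming walk-through above can be replayed with a short Python sketch. The `rename` function and the tuple layout `(op, src1, src2, dest)` are hypothetical, and the first source operand of `div` is not recoverable from the slide, so `r1` is assumed here purely for illustration.

```python
from collections import deque

def rename(insns, map_table, free_list):
    """Rename (op, src1, src2, dest) tuples against a map table (sketch)."""
    renamed = []
    for op, src1, src2, dest in insns:
        # Sources read the current mappings before the destination is remapped.
        p1, p2 = map_table[src1], map_table[src2]
        if not free_list:
            raise RuntimeError(f"no free physical register for {op}: stall")
        pd = free_list.popleft()   # allocate a new physical register
        map_table[dest] = pd
        renamed.append((op, p1, p2, pd))
    return renamed

map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6", "p7"])
insns = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r1"),
    ("div", "r1", "r3", "r2"),  # ASSUMPTION: first source r1 is a guess
]
renamed = rename(insns, map_table, free_list)
for insn in renamed:
    print(insn)
```

After `div` the free list is empty, so renaming a fifth instruction in this sketch raises the "stall" error, which is the situation the question on the slide is pointing at.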

Freeing Registers in P6 and R10K
P6
• No need to free storage for speculative (“in-flight”) values explicitly
• Temporary storage comes with ROB entry
• Retire: copy speculative value from ROB to register file, free ROB entry
R10K
• Can’t free physical register when insn retires
  • No architectural register to copy value to
• But… Can free physical register previously mapped to same logical register
  • Why? All insns that will ever read its value have retired

Freeing Registers in R10K
MapTable
r1   p4   p6
r2   p2   p2
r3   p3   p5
FreeList: p4, p5, p6, p7
Raw insns         Renamed insns
add r2, r3, r1    add p2, p3, p4
sub r2, r1, r3    sub p2, p4, p5
mul r2, r3, r1    mul p2, p5, p6
div …, r3, r2     div …, p5, p7
• When add retires, free p1
• When sub retires, free p3
• When mul retires, free ?
• When div retires, free ?
• See the pattern?
Poll: Which registers are freed for each instruction?
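One way to see the pattern is to record, at dispatch, the physical register that each destination was previously mapped to (Told), and return it to the free list at retire. The Python sketch below does that for the four instructions above; only their destination registers matter for freeing, so sources are left out. The variable names are illustrative assumptions.

```python
from collections import deque

map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6", "p7"])
rob = deque()  # in-order entries: (op, T, Told)

# Dispatch: remember the old mapping (Told) before installing the new one (T).
for op, dest in [("add", "r1"), ("sub", "r3"), ("mul", "r1"), ("div", "r2")]:
    t_old = map_table[dest]
    t = free_list.popleft()
    map_table[dest] = t
    rob.append((op, t, t_old))

# Retire in order: the register freed is Told, not the insn's own T.
freed = []
while rob:
    op, t, t_old = rob.popleft()
    free_list.append(t_old)
    freed.append((op, t_old))

for op, preg in freed:
    print(f"when {op} retires, free {preg}")
```

Under these assumptions the sketch frees p1 for add and p3 for sub, as on the slide, then p4 for mul and p2 for div: each retiring instruction frees the register that held the previous value of its own destination.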

R10K Data Structures
New tags (again)
• P6: ROB# → R10K: PR#
ROB
• T: physical register corresponding to insn’s logical output
• Told: physical register previously mapped to insn’s logical output
RS
• T, T1, T2: output, input physical registers
Map Table
• T+: PR# (never empty) + “ready” bit
Architectural Map Table
• T: PR# (never empty)
• Can be used to restore state after branch mispredict or exception
Free List
• T: PR#
No values in ROB, RS, or on CDB
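The structures listed above can be sketched as plain Python dataclasses. The class and field names (`MapEntry`, `RobEntry`, `RsEntry`, `t_old`, and so on) are assumptions chosen to mirror the slide's T / Told / T1 / T2 labels; the point of the sketch is that none of them has a value field.

```python
from collections import deque
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class MapEntry:
    preg: int            # T: physical register number (never empty)
    ready: bool = True   # the "+" ready bit

@dataclass
class RobEntry:
    insn: str
    t: Optional[int] = None      # preg for the insn's logical output
    t_old: Optional[int] = None  # preg previously mapped to that output

@dataclass
class RsEntry:
    op: str
    t: Optional[int] = None      # output preg tag
    t1: Optional[int] = None     # first input preg tag
    t1_ready: bool = False
    t2: Optional[int] = None     # second input preg tag
    t2_ready: bool = False

# Initial state used in the cycle-by-cycle example that follows.
map_table = {"f0": MapEntry(1), "f1": MapEntry(2),
             "f2": MapEntry(3), "r1": MapEntry(4)}
arch_map = {"f0": 1, "f1": 2, "f2": 3, "r1": 4}   # tags only, no ready bit
free_list = deque([5, 6, 7, 8])

# Tags and control only: no structure carries a data value.
stored = {f.name for cls in (MapEntry, RobEntry, RsEntry) for f in fields(cls)}
print(sorted(stored))
```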

R10K Data Structures
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1
    2  mulf f0, f1, f2
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#2+     f1   PR#2
f2   PR#3+     f2   PR#3
r1   PR#4+     r1   PR#4
Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)
Notice I: no values anywhere
Notice II: MapTable is never empty

R10K Pipeline
R10K pipeline structure: F, D, S, X, C, R
• D (dispatch)
  • Structural hazard (RS, ROB, LSQ, physical registers) → stall
  • Allocate RS, ROB, LSQ entries and new physical register (T)
  • Record previously mapped physical register (Told)
• C (complete)
  • Write destination physical register
• R (retire)
  • ROB head not complete → stall
  • Handle any exceptions
  • Store: write LSQ head to D$
  • Free ROB, LSQ entries
  • Free previous physical register (Told)
  • Record committed physical register (T)

R10K Dispatch (D)
[Figure: R10K datapath with the dispatch paths highlighted (Map Table, Free List, RS, ROB)]
• Read preg (physical register) tags for input registers, store in RS
• Read preg tag for output register, store in ROB (Told)
• Allocate new preg (free list) for output register, store in RS, ROB, Map Table
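The three dispatch steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `dispatch` function and its dictionary layout are hypothetical, and it checks only the physical-register structural hazard, ignoring RS, ROB, and LSQ capacity.

```python
from collections import deque

def dispatch(op, srcs, dest, map_table, ready, free_list, rob, rs):
    """Dispatch one insn; returns False on a structural hazard (stall)."""
    if dest is not None and not free_list:
        return False  # no free physical register: stall
    # Read preg tags (and their ready bits) for the input registers.
    src_tags = [(map_table[r], ready[map_table[r]]) for r in srcs]
    t = t_old = None
    if dest is not None:
        t_old = map_table[dest]   # previous mapping is kept in the ROB (Told)
        t = free_list.popleft()   # new preg comes from the free list
        map_table[dest] = t
        ready[t] = False          # its value has not been produced yet
    rob.append({"op": op, "T": t, "Told": t_old})
    rs.append({"op": op, "T": t, "srcs": src_tags})
    return True

map_table = {"f0": 1, "f1": 2, "f2": 3, "r1": 4}
ready = {1: True, 2: True, 3: True, 4: True}
free_list = deque([5, 6, 7, 8])
rob, rs = [], []
dispatch("ldf", ["r1"], "f1", map_table, ready, free_list, rob, rs)
dispatch("mulf", ["f0", "f1"], "f2", map_table, ready, free_list, rob, rs)
dispatch("stf", ["f2", "r1"], None, map_table, ready, free_list, rob, rs)
print(rob)
```

The three calls correspond to the first three instructions of the cycle-by-cycle example: ldf gets PR#5 with Told PR#2, mulf gets PR#6 with Told PR#3, and the store allocates no physical register.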

R10K Complete (C)
[Figure: R10K datapath with the complete paths highlighted (CDB.T to Map Table and RS)]
• Set insn’s output register ready bit in map table
• Set ready bits for matching input tags in RS
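A minimal Python sketch of the complete stage, assuming a hypothetical `complete` function and dictionary-based map table and RS entries. Only a tag is broadcast; no value travels on the CDB.

```python
def complete(tag, map_table, rs_entries):
    """Broadcast a completing insn's destination tag on the CDB (tag only)."""
    # Map table: set the ready bit if this preg is still the current mapping.
    for entry in map_table.values():
        if entry["preg"] == tag:
            entry["ready"] = True
    # RS: set ready bits of matching input tags and collect newly ready insns.
    woken = []
    for e in rs_entries:
        hit = False
        for src in e["srcs"]:
            if src["preg"] == tag and not src["ready"]:
                src["ready"] = True
                hit = True
        if hit and all(s["ready"] for s in e["srcs"]):
            woken.append(e["op"])
    return woken

map_table = {
    "f0": {"preg": 1, "ready": True},
    "f1": {"preg": 5, "ready": False},
    "f2": {"preg": 6, "ready": False},
}
rs = [
    {"op": "stf", "srcs": [{"preg": 6, "ready": False},
                           {"preg": 4, "ready": True}]},
    {"op": "mulf", "srcs": [{"preg": 1, "ready": True},
                            {"preg": 5, "ready": False}]},
]
woken = complete(5, map_table, rs)
print(woken)
```

Broadcasting PR#5 marks f1 ready in the map table and wakes mulf, while stf keeps waiting on PR#6.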

R10K Retire (R)
[Figure: R10K datapath with the retire paths highlighted (ROB head to Free List and Arch. Map)]
• Return Told of ROB head to free list
• Record T of ROB head in architectural map table
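The retire steps can be sketched in Python as below. The `retire` function and the ROB entry fields are assumptions for illustration; exception handling and the store write to D$ are left out.

```python
from collections import deque

def retire(rob, arch_map, free_list):
    """Retire the ROB head if it is complete; returns the entry or None."""
    if not rob or not rob[0]["complete"]:
        return None  # ROB head not complete: stall
    head = rob.popleft()
    if head["T"] is not None:                # stores have no destination preg
        free_list.append(head["Told"])       # old mapping can never be read again
        arch_map[head["dest"]] = head["T"]   # record the committed mapping
    return head

rob = deque([
    {"op": "ldf", "dest": "f1", "T": 5, "Told": 2, "complete": True},
    {"op": "mulf", "dest": "f2", "T": 6, "Told": 3, "complete": False},
])
arch_map = {"f0": 1, "f1": 2, "f2": 3, "r1": 4}
free_list = deque([8])
first = retire(rob, arch_map, free_list)    # ldf retires
second = retire(rob, arch_map, free_list)   # mulf is not complete yet
print(first, second, arch_map, list(free_list))
```

Retiring ldf returns PR#2 to the free list and records PR#5 for f1 in the architectural map; the second call stalls because mulf has not completed.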

R10K: Cycle 0
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1
    2  mulf f0, f1, f2
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#2+     f1   PR#2
f2   PR#3+     f2   PR#3
r1   PR#4+     r1   PR#4
Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)

R10K: Cycle 1
ROB
ht  #  Insn              T      Told   S    X    C
ht  1  ldf  X(r1), f1    PR#5   PR#2
    2  mulf f0, f1, f2
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   yes   ldf   PR#5          PR#4+
3  ST   no
4  FP1  no
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#5      f1   PR#2
f2   PR#3+     f2   PR#3
r1   PR#4+     r1   PR#4
Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)
• Allocate new preg (PR#5) to f1
• Remember old preg mapped to f1 (PR#2) in ROB

R10K: Cycle 2
ROB
ht  #  Insn              T      Told   S    X    C
h   1  ldf  X(r1), f1    PR#5   PR#2   c2
t   2  mulf f0, f1, f2   PR#6   PR#3
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   yes   ldf   PR#5          PR#4+
3  ST   no
4  FP1  yes   mulf  PR#6   PR#1+  PR#5
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#5      f1   PR#2
f2   PR#6      f2   PR#3
r1   PR#4+     r1   PR#4
Free List: PR#6, PR#7, PR#8
CDB T: (empty)
• Allocate new preg (PR#6) to f2
• Remember old preg mapped to f2 (PR#3) in ROB

R10K: Cycle 3
ROB
ht  #  Insn              T      Told   S    X    C
h   1  ldf  X(r1), f1    PR#5   PR#2   c2   c3
    2  mulf f0, f1, f2   PR#6   PR#3
t   3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  yes   mulf  PR#6   PR#1+  PR#5
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#5      f1   PR#2
f2   PR#6      f2   PR#3
r1   PR#4+     r1   PR#4
Free List: PR#7, PR#8, PR#9
CDB T: (empty)
• Stores are not allocated pregs
• LD reservation station is free

R10K: Cycle 4
ROB
ht  #  Insn              T      Told   S    X    C
h   1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
    2  mulf f0, f1, f2   PR#6   PR#3   c4
    3  stf  f2, Z(r1)
t   4  addi r1, 4, r1    PR#7   PR#4
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  yes   mulf  PR#6   PR#1+  PR#5+
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#5+     f1   PR#2
f2   PR#6      f2   PR#3
r1   PR#7      r1   PR#4
Free List: PR#7, PR#8, PR#9
CDB T: PR#5
• ldf completes: set MapTable ready bit
• Match PR#5 tag from CDB & issue

R10K: Cycle 5
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
h   2  mulf f0, f1, f2   PR#6   PR#3   c4   c5
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1    PR#7   PR#4   c5
t   5  ldf  X(r1), f1    PR#8   PR#5
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   yes   ldf   PR#8          PR#7
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no
Map Table      Arch. Map
Reg  T+        Reg  T
f0   PR#1+     f0   PR#1
f1   PR#8      f1   PR#5
f2   PR#6      f2   PR#3
r1   PR#7      r1   PR#4
Free List: PR#8, PR#2, PR#9
CDB T: (empty)
• ldf retires
  • Return PR#2 to free list
  • Record PR#5 in Arch. Map
• FP1 reservation station is free

Precise State in R10K
• Problem with R10K design? Precise state has more overhead
• Keep second (non-speculative) map table (architectural map table) which is only updated on retirement
  • On exception or mispredict, copy architectural map table into map table
  • Also need architectural free list?
• Alternatively, serially rollback using T, Told ROB fields
  ± Slow, but less hardware
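The first recovery option, copying the architectural map table back, might look like the Python sketch below. The function name `recover_from_arch_map` is hypothetical, and the way the free list is rebuilt here is an assumption (one possible answer to the "architectural free list?" question on the slide), not something the slide specifies.

```python
def recover_from_arch_map(arch_map, num_pregs):
    """Rebuild speculative rename state from committed state after a flush."""
    # Committed values are already written, so every restored mapping is ready.
    map_table = {reg: (preg, True) for reg, preg in arch_map.items()}
    # ASSUMPTION: with nothing left in flight, any preg not named by the
    # architectural map holds no live value and can be treated as free.
    in_use = set(arch_map.values())
    free_list = [p for p in range(1, num_pregs + 1) if p not in in_use]
    return map_table, free_list

# Committed state after the first ldf has retired (f1 now lives in PR#5).
arch_map = {"f0": 1, "f1": 5, "f2": 3, "r1": 4}
map_table, free_list = recover_from_arch_map(arch_map, num_pregs=8)
print(map_table)
print(free_list)
```

This restores everything in one step but discards all in-flight work, which is why it suits exceptions at the ROB head better than early recovery from a mispredicted branch.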

R10K: Cycle 5 (with precise state)
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
h   2  mulf f0, f1, f2   PR#6   PR#3   c4   c5
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1    PR#7   PR#4   c5
t   5  ldf  X(r1), f1    PR#8   PR#5
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   yes   ldf   PR#8          PR#7
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no
Map Table
Reg  T+
f0   PR#1+
f1   PR#8
f2   PR#6
r1   PR#7
Free List: PR#8, PR#2, PR#9
CDB T: (empty)
• Undo insns 3-5 (doesn’t matter why)
• Use serial rollback

R10K: Cycle 6 (with precise state)
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
h   2  mulf f0, f1, f2   PR#6   PR#3   c4   c5
    3  stf  f2, Z(r1)
t   4  addi r1, 4, r1    PR#7   PR#4   c5
    5  ldf  X(r1), f1    PR#8   PR#5
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no
Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#7
Free List: PR#2, PR#8, PR#9
CDB T: (empty)
Undo ldf (ROB#5)
1. Free RS
2. Free T (PR#8), return to FreeList
3. Restore MT[f1] to Told (PR#5)
4. Free ROB#5
Insns may execute during rollback (not shown)

R10K: Cycle 7 (with precise state)
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
h   2  mulf f0, f1, f2   PR#6   PR#3   c4   c5
t   3  stf  f2, Z(r1)
    4  addi r1, 4, r1    PR#7   PR#4   c5
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no
Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+
Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)
Undo addi (ROB#4)
1. Free RS
2. Free T (PR#7), return to FreeList
3. Restore MT[r1] to Told (PR#4)
4. Free ROB#4

R10K: Cycle 8 (with precise state)
ROB
ht  #  Insn              T      Told   S    X    C
    1  ldf  X(r1), f1    PR#5   PR#2   c2   c3   c4
ht  2  mulf f0, f1, f2   PR#6   PR#3   c4   c5
    3  stf  f2, Z(r1)
    4  addi r1, 4, r1
    5  ldf  X(r1), f1
    6  mulf f0, f1, f2
    7  stf  f2, Z(r1)
Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no
Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+
Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)
Undo stf (ROB#3)
1. Free RS
2. Free ROB#3
3. No registers to restore/free
4. How is D$ write undone?
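The serial rollback walked through in cycles 6 to 8 can be sketched in Python as below. The function `serial_rollback` and its data layout are assumptions for illustration; the sketch undoes one ROB entry per loop iteration, youngest first, and omits ready bits and the LSQ.

```python
def serial_rollback(rob, rs_busy, map_table, free_list, keep_through):
    """Undo every ROB entry younger than ROB# keep_through, youngest first."""
    while rob and rob[-1]["n"] > keep_through:
        e = rob.pop()                  # free the ROB entry (tail moves back)
        rs_busy.discard(e["n"])        # free its RS entry, if it still has one
        if e["T"] is not None:         # stores have no registers to restore
            free_list.append(e["T"])   # return T to the free list
            map_table[e["dest"]] = e["Told"]  # restore the previous mapping

# State at cycle 5: insn 1 has retired, insns 2-5 are in flight.
rob = [
    {"n": 2, "op": "mulf", "dest": "f2", "T": 6, "Told": 3},
    {"n": 3, "op": "stf", "dest": None, "T": None, "Told": None},
    {"n": 4, "op": "addi", "dest": "r1", "T": 7, "Told": 4},
    {"n": 5, "op": "ldf", "dest": "f1", "T": 8, "Told": 5},
]
rs_busy = {3, 4, 5}                       # ROB numbers that still hold an RS
map_table = {"f0": 1, "f1": 8, "f2": 6, "r1": 7}
free_list = [2, 9]
serial_rollback(rob, rs_busy, map_table, free_list, keep_through=2)
print(map_table, free_list)
```

After undoing insns 3-5 the map table points f1 back at PR#5 and r1 back at PR#4, and PR#8 and PR#7 are free again, matching the cycle 8 state; the order of entries within the free list is not meaningful in this sketch.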

Can we do better?
• Early Branch Resolution
  • Recover from branch mispredicts before retirement
• Maintain a stack of “map-table-checkpoints” for each branch, or “branch stack”
  • Keeps track of architectural state before branch executes
  • New structural hazard if checkpoint space runs out
• Discuss more in a few weeks
[Figure: Branch Stack entry with fields Recovery PC, T+, ROB&LSQ tail, BP repair, Free list]
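A branch stack can be sketched in Python as a bounded list of checkpoints. The class `BranchStack` and its method names are hypothetical; the sketch keeps only the recovery PC, map table, free list, and ROB tail from the figure, and leaves out the LSQ tail, branch predictor repair state, and the freeing of checkpoints for correctly predicted branches.

```python
import copy

class BranchStack:
    """Bounded stack of map-table checkpoints, one per in-flight branch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def push(self, recovery_pc, map_table, free_list, rob_tail):
        """Checkpoint at branch dispatch; returns an index, or None to stall."""
        if len(self.entries) >= self.capacity:
            return None  # structural hazard: no checkpoint space left
        self.entries.append({
            "recovery_pc": recovery_pc,
            "map_table": copy.deepcopy(map_table),
            "free_list": list(free_list),
            "rob_tail": rob_tail,
        })
        return len(self.entries) - 1

    def recover(self, index):
        """Mispredict: return the checkpoint, drop it and all younger ones."""
        snapshot = self.entries[index]
        del self.entries[index:]
        return snapshot

stack = BranchStack(capacity=2)
map_table = {"r1": 4, "f1": 2}
free_list = [5, 6, 7]
idx = stack.push(0x400, map_table, free_list, rob_tail=3)
map_table["f1"] = free_list.pop(0)   # speculative rename past the branch
snap = stack.recover(idx)            # branch turns out to be mispredicted
map_table, free_list = snap["map_table"], snap["free_list"]
print(map_table, free_list)
```

Recovery here is a single copy rather than a walk over the ROB, at the cost of storing one full map table per unresolved branch.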

P6 vs. R10K (Renaming)
Feature                  P6                         R10K
Value storage            ARF, ROB, RS               PRF
Register read            @D: ARF/ROB → RS           @S: PRF → FU
Register write           @R: ROB → ARF              @C: FU → PRF
Speculative value free   @R: automatic (ROB)        @R: overwriting insn
Data paths               ARF/ROB → RS               PRF → FU
                         RS → FU                    FU → PRF
                         FU → ROB
                         ROB → ARF
Precise state            Simple: clear everything   Complex: serial/checkpoint
• R10K-style became popular in late 90’s, early 00’s
  • E.g., MIPS R10K (duh), DEC Alpha 21264, Intel Pentium 4
• P6-style is perhaps making a comeback
  • Why? Frequency (power) is on the retreat, simplicity is important

Summary
• Modern dynamic scheduling must support precise state
  • A software sanity issue, not a performance issue
• Strategy: Writeback → Complete (OoO) + Retire (iO)
• As an added benefit, we can do speculative execution with same mechanism
• Two basic designs
  • P6: Tomasulo + re-order buffer, copy based register renaming
    ± Precise state is simple, but fast implementations are difficult
  • R10K: implements true register renaming
    ± Easier fast implementations, but precise state is more complex

P6 or R10K
• You probably won't have an inherent advantage with either in your final project
  • Area and routing aren't modeled in our simulations
  • Individual coding and optimization styles are probably much more important

Dynamic Scheduling Summary
• Out-of-order execution: a performance technique
  • Easier/more effective in hardware than software (isn’t everything?)
  • Idea: make scheduling transparent to software
• Feature I: Dynamic scheduling (iO → OoO)
  • “Performance” piece: re-arrange insns into high-performance order
  • Decode (iO) → dispatch (iO) + issue (OoO)
  • Two algorithms: Scoreboard, Tomasulo
• Feature II: Precise state (OoO → iO)
  • “Correctness” piece: put insns back into program order
  • Writeback (OoO) → complete (OoO) + retire (iO)
  • Two designs: P6, R10K
• Next: memory scheduling

Next Time
• How to handle loads/stores out of order
• Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oXr4Ah