Slides: 53

Data Hazard Solution 2: Data Forwarding
- Our naïve pipeline would experience many data stalls
  - Register isn't written until completion of write-back stage
  - Source operands are read from register file in decode stage
    - Values need to be in register file at start of stage
  - Leads to many more stall cycles than necessary
- Key observation
  - The value we want is generated in execute or memory stage
  - It is "available" 1-2 cycles before write-back
- Trick: go get it!
  - Pass value directly from stage of generating instruction to decode stage
  - Must be available before end of decode stage to avoid stall

Detecting Stall Condition

    # demo-h2.ys                  1  2  3  4  5  6  7  8  9  10 11
    0x000: irmovl $10, %edx       F  D  E  M  W
    0x006: irmovl $3, %eax           F  D  E  M  W
    0x00c: nop                          F  D  E  M  W
    0x00d: nop                             F  D  E  M  W
           bubble                                   E  M  W
    0x00e: addl %edx, %eax                    F  D  D  E  M  W
    0x010: halt                                  F  F  D  E  M  W

- Cycle 6
  - W: W_dstE = %eax, W_valE = 3
  - D: srcA = %edx, srcB = %eax

Data Forwarding Example

    # demo-h2.ys                  1  2  3  4  5  6  7  8  9  10
    0x000: irmovl $10, %edx       F  D  E  M  W
    0x006: irmovl $3, %eax           F  D  E  M  W
    0x00c: nop                          F  D  E  M  W
    0x00d: nop                             F  D  E  M  W
    0x00e: addl %edx, %eax                    F  D  E  M  W
    0x010: halt                                  F  D  E  M  W

- Cycle 6
  - W: R[%eax] ← 3, W_dstE = %eax, W_valE = 3
  - D: srcA = %edx, valA ← R[%edx] = 10
  - D: srcB = %eax, valB ← W_valE = 3
- irmovl in write-back stage
- Destination value in W pipeline register
- "Forward" as valB for decode stage
- addl instruction can proceed without stalling
- When do we actually know the values of %eax and %edx?

Data Forwarding Example
- Register %edx
  - Generated by ALU during previous cycle
  - Forward from memory as valA
- Register %eax
  - Value just generated by ALU
  - Forward from execute as valB
- When do we actually know the values of %eax and %edx?

Forwarding Hardware
[Figure: PIPE datapath with forwarding. The signals e_valE, m_valM, M_valE, W_valM, and W_valE feed the "Sel+Fwd A" and "Fwd B" blocks in the decode stage.]
- Feedback paths from E, M, and W registers to decode stage
- Logic blocks to select source for valA and valB in decode stage
- Note: we either do forwarding or stall on data hazards
  - Forwarding has better performance, higher cost

Forwarding Control
[Figure: PIPE datapath with the forwarding paths into the "Sel+Fwd A" and "Fwd B" blocks.]

    ## Actions of "Sel+Fwd A" block
    ## Pick the correct A value
    ## Order is important!
    int new_E_valA = [
        # Use incremented PC
        D_icode in { ICALL, IJXX } : D_valP;
        # Forward valE from execute
        d_srcA == E_dstE : e_valE;
        # Forward valM from memory
        d_srcA == M_dstM : m_valM;
        # Forward valE from memory
        d_srcA == M_dstE : M_valE;
        # Forward valM from write back
        d_srcA == W_dstM : W_valM;
        # Forward valE from write back
        d_srcA == W_dstE : W_valE;
        # Use value read from register file
        1 : d_rvalA;
    ];
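
The priority order in the "Sel+Fwd A" block can be sketched in software. The following is a minimal Python model, not the simulator's code: the dictionary layout, the string register names, and the helper `base_state` are assumptions made for illustration, while the signal names and the order of the checks follow the HCL above.

```python
# Hypothetical model of the "Sel+Fwd A" selection; signal names follow the HCL,
# the dict-based state and string register names are illustrative assumptions.
RNONE = "none"  # stands in for the "no register" ID


def base_state():
    """Return a pipeline snapshot in which nothing is being forwarded."""
    return {
        "D_icode": "IOPL", "D_valP": 0,
        "d_srcA": RNONE, "d_rvalA": 0,
        "E_dstE": RNONE, "e_valE": 0,
        "M_dstM": RNONE, "m_valM": 0,
        "M_dstE": RNONE, "M_valE": 0,
        "W_dstM": RNONE, "W_valM": 0,
        "W_dstE": RNONE, "W_valE": 0,
    }


def sel_fwd_a(s):
    """Pick the value for E_valA, checking sources from youngest to oldest."""
    if s["D_icode"] in ("ICALL", "IJXX"):
        return s["D_valP"]               # use incremented PC
    candidates = [
        (s["E_dstE"], s["e_valE"]),      # execute: value just produced by ALU
        (s["M_dstM"], s["m_valM"]),      # memory: value just read from memory
        (s["M_dstE"], s["M_valE"]),      # memory: ALU result from previous cycle
        (s["W_dstM"], s["W_valM"]),      # write back: pending memory value
        (s["W_dstE"], s["W_valE"]),      # write back: pending ALU result
    ]
    for dst, val in candidates:
        if s["d_srcA"] == dst:           # first match wins, as in the HCL case list
            return val
    return s["d_rvalA"]                  # no hazard: use register file value


if __name__ == "__main__":
    s = base_state()
    s.update(d_srcA="%eax", W_dstE="%eax", W_valE=3)
    print(sel_fwd_a(s))                  # value forwarded from write back
    s.update(E_dstE="%eax", e_valE=7)
    print(sel_fwd_a(s))                  # a younger writer in execute takes priority
```

Checking the execute stage first matters when several in-flight instructions write the same register: the youngest one holds the value a sequential machine would see.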

Forwarding Example
[Figure: PIPE datapath with the forwarding paths into the "Sel+Fwd A" and "Fwd B" blocks.]
- At clock cycle 4, fill in: d_srcA = ___, d_srcB = ___, E_dstE = ___, e_valE = ___, M_dstE = ___, M_valE = ___
- What are the forwarding conditions?
- Highlight the data path in D, E, M stages.

Forwarding Example
[Figure: PIPE datapath with the forwarding paths into the "Sel+Fwd A" and "Fwd B" blocks.]
- At clock cycle 4: d_srcA, d_srcB, M_dstE, M_valE, E_dstE, e_valE
- Values shown: ecx, edx, 128, ecx, 3
- What are the forwarding conditions?

Limitation of Forwarding
- Load-use dependency
  - Value needed by end of decode stage in cycle 7
  - Value read from memory in memory stage of cycle 8
- Terminology
  - This is a load hazard
  - Only hardware solution is a load stall
  - Preferred: have the compiler avoid it

Load/Use Hazard: Desired Behavior
- Best we can do in hardware
  - Stall reading instruction for one cycle
  - Then forward value from memory stage
- Better yet
  - Have compiler avoid in code it generates

    mrmovl 0(%edx), %eax
    irmovl $10, %ebx
    addl %ebx, %eax

Addressing Load/Use Hazard
[Figure: PIPE datapath with the forwarding paths into decode.]
- Detection
  - Previous instr. is loading from memory to src register
  - dstM in E register matches srcA or srcB (and not 0xF)
- Action
  - Stall instruction in decode
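
The detection rule above can be written as a small predicate. This is a sketch in Python rather than HCL; the function name, the string register names, and the `RNONE` stand-in for 0xF are assumptions for illustration.

```python
# Hypothetical load/use detector following the detection bullets above.
RNONE = "none"                          # stands in for register ID 0xF
LOAD_ICODES = {"IMRMOVL", "IPOPL"}      # instructions that load from memory


def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """True when the instruction in execute loads a register that decode needs."""
    return (
        E_icode in LOAD_ICODES
        and E_dstM != RNONE
        and E_dstM in (d_srcA, d_srcB)
    )


if __name__ == "__main__":
    # mrmovl 0(%edx), %eax in execute, addl %ebx, %eax in decode: must stall.
    print(load_use_hazard("IMRMOVL", "%eax", "%ebx", "%eax"))
    # With irmovl $10, %ebx scheduled in between, irmovl is in execute instead.
    print(load_use_hazard("IIRMOVL", RNONE, "%ebx", "%eax"))
```

The second call mirrors the compiler-scheduled sequence from the previous slide: once an independent instruction separates the load from its use, the loaded value can be forwarded and no stall is needed.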

Interrupts and Exceptions
- Basic interrupt mechanism
  - CPU running current process: ... instr i+1, instr i+2, instr i+3, instr i+4, instr i+5, instr i+6 ...
  - Event occurs that needs attention
  - HW asserts CPU interrupt line (e.g., disk read finishes)
  - Control transferred to interrupt handler (think HW-induced function call): instr 1, instr 2, instr 3 ...
- How is state of interrupted process saved?
- How is location of handler determined?

Interrupt Handling
- Calling handler
  - Save return address (PC) on stack
    - Address of next instruction to be executed for this process
    - Depending on event, either current or next instruction
    - PC usually passed through pipeline along with instruction
    - Precise exception: all instructions to PC executed, none past ("Clean Break")
  - Jump to handler address
    - Usually obtained from table stored at fixed memory address
    - Index to table entry determined by exception type/interrupt priority level
    - Interrupt vector table written by software, accessed by hardware
- Implementation
  - Critical for real hardware
  - Seldom implemented in simulators: no OS running to pass control to!

Exceptions
- Events occurring within processor under which pipeline cannot continue normal operation
- Possible causes
  - Halt instruction executed
  - Bad address for instruction or data
  - Invalid instruction
  - Pipeline control error
  - System calls, page faults, math errors (not in Y86)
- Desired action
  - Complete instructions to specific point
    - Either current or previous (depends on exception type)
  - Discard instructions that follow
  - Transfer control to exception handler in OS
    - Save return address, get handler address from table

Exception Examples
- Detect in fetch stage

    jmp $-1          # Invalid jump target
    .byte 0xFF       # Invalid instruction code
    halt             # Halt instruction

- Detect in memory stage

    irmovl $100, %eax
    rmmovl %eax, 0x10000(%eax)   # Invalid address (for Y86 tools)

Exceptions in Pipeline Processor (#1)

    # demo-exc1.ys                      1  2  3  4  5
    0x000: irmovl $100, %eax            F  D  E  M  W
    0x006: rmmovl %eax, 0x10000(%eax)      F  D  E  M
    0x00c: nop                                F  D  E
    0x00d: .byte 0xFF                            F  D

- rmmovl: invalid address, exception detected in memory stage
- .byte 0xFF: invalid instruction code, exception detected in fetch stage
- Desired behavior
  - rmmovl should cause exception (1st in sequential machine)
  - Tricky because invalid instruction code detected first

Exceptions in Pipeline Processor (#2)

    # demo-exc2.ys
    0x000: xorl %eax, %eax      # Set condition codes
    0x002: jne t                # Not taken
    0x007: irmovl $1, %eax
    0x00d: irmovl $2, %edx
    0x013: halt
    0x014: t: .byte 0xFF        # Target

                                  1  2  3  4  5  6  7  8  9
    0x000: xorl %eax, %eax        F  D  E  M  W
    0x002: jne t                     F  D  E  M  W
    0x014: t: .byte 0xFF                F  D  E  M  W
    0x???: (I'm lost!)                     F  D  E  M  W
    0x007: irmovl $1, %eax                    F  D  E  M  W

- Exception detected when .byte 0xFF at the branch target is fetched
- Desired behavior
  - No exception should occur
  - Must match behavior and results of sequential execution

Correct Exception Handling

    W:  stat icode valE valM dstE dstM
    M:  stat icode Cnd valE valA dstE dstM
    E:  stat icode ifun valC valA valB dstE dstM srcA srcB
    D:  stat icode ifun rA rB valC valP
    F:  predPC

- Challenges: respond to exceptions in program order, and only those that "really" occur
- Motivation for exception status field (stat) in pipeline registers
  - Fetch stage sets to either "AOK," "ADR" (bad fetch address), or "INS" (illegal instruction)
  - Decode & execute stages pass values through
  - Memory stage either passes through or sets to "ADR"
  - CPU responds to exception only when instruction reaches write back
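
To see why carrying stat through the pipeline registers restores program order, here is a minimal Python sketch under simplifying assumptions: no stalls, bubbles, or branches, and each instruction is described only by the status it would get in fetch and in memory. The function `run` and the tuple format are hypothetical.

```python
# Hypothetical 5-stage model that only tracks the stat field of each instruction.
def run(program):
    """program: list of (name, fetch_stat, mem_stat).

    Returns (detections, reported): every (cycle, name, stat) as it is first
    detected, and the exception the CPU responds to once it reaches write back.
    """
    regs = {"D": None, "E": None, "M": None, "W": None}
    detections = []
    pc = 0
    cycle = 0
    while True:
        cycle += 1
        # Fetch: stat is set to AOK, ADR, or INS here.
        f_out = None
        if pc < len(program):
            name, fetch_stat, mem_stat = program[pc]
            f_out = {"name": name, "stat": fetch_stat, "mem_stat": mem_stat}
            if fetch_stat != "AOK":
                detections.append((cycle, name, fetch_stat))
            pc += 1
        # Memory: pass stat through, or set it to ADR on a bad data address.
        m_out = regs["M"]
        if m_out is not None and m_out["stat"] == "AOK" and m_out["mem_stat"] != "AOK":
            m_out = dict(m_out, stat=m_out["mem_stat"])
            detections.append((cycle, m_out["name"], m_out["stat"]))
        # Write back: the only place where the CPU reacts to an exception.
        w = regs["W"]
        if w is not None and w["stat"] != "AOK":
            return detections, (cycle, w["name"], w["stat"])
        # Rising clock edge: decode and execute pass stat through unchanged.
        regs = {"D": f_out, "E": regs["D"], "M": regs["E"], "W": m_out}
        if pc >= len(program) and all(v is None for v in regs.values()):
            return detections, None


if __name__ == "__main__":
    demo_exc1 = [
        ("irmovl", "AOK", "AOK"),
        ("rmmovl", "AOK", "ADR"),      # invalid data address
        ("nop", "AOK", "AOK"),
        (".byte 0xFF", "INS", "AOK"),  # invalid instruction code
    ]
    detections, reported = run(demo_exc1)
    print(detections)
    print(reported)
```

For the demo-exc1 sequence, the invalid instruction is detected one cycle earlier than the bad address, yet the exception reported at write back belongs to rmmovl, the earlier instruction in program order.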

Avoiding Side Effects

    # demo-exc3.ys                      1  2  3  4  5
    0x000: irmovl $100, %eax            F  D  E  M  W
    0x006: rmmovl %eax, 0x10000(%eax)      F  D  E  M
    0x00c: addl %eax, %eax                    F  D  E

- rmmovl: invalid address, exception detected in memory stage
- addl: sets condition codes in execute stage
- Desired behavior
  - rmmovl should cause exception
  - No following instruction should change any state
- Note special challenge of condition codes!

Avoiding Side Effects
- Exception should disable state update for following instructions
  - When exception detected in memory stage
    - Disable condition code setting in execute
    - Must happen in same clock cycle
  - When exception passes to write-back stage
    - Disable memory write in memory stage
    - Disable condition code setting in execute stage
- Let's see how these are handled in PIPE processor

PIPE: Fetch Details
- Main points
  - Branch prediction
  - Branch misprediction recovery
  - Return handling
  - Stat initialization
[Figure: fetch stage. Blocks: Select PC, Instruction memory, Split, Align, Instr valid, Need regids, Need valC, PC incr., Predict PC. Inputs to Select PC: predPC (F register), M_icode, M_Cnd, M_valA, W_icode, W_valM. Outputs to D register: stat, icode, ifun, rA, rB, valC, valP.]

PIPE: Decode and Write-back
- Main points
  - Forwarding logic, paths
  - Forwarding priority
  - valA and valP merged
[Figure: decode and write-back stages. Blocks: Register file (ports A, B, M, E), "Sel+Fwd A", "Fwd B", dstE, dstM, srcA, srcB. Forwarding inputs: e_dstE, e_valE, M_dstE, M_valE, M_dstM, m_valM, W_dstM, W_valM, W_dstE, W_valE. Register file outputs: d_rvalA, d_rvalB.]

PIPE: Execute
- Main points
  - CC update inhibited by prior exceptions
  - Values forwarding
  - Special handling for dstE (?)
[Figure: execute stage. Blocks: ALU A, ALU B, ALU fun., ALU, Set CC, CC, cond, dstE. Inputs W_stat and m_stat feed Set CC. Outputs: e_valE, e_dstE, e_Cnd.]

PIPE: Memory and Write-back
- Main points
  - Values forwarding
  - Stat update logic
  - Feedback for branch misprediction recovery
[Figure: memory and write-back stages. Blocks: Data memory (Addr, data in, data out, read, write), Mem. read, Mem. write, Stat. Signals: dmem_error, m_stat, m_valM, M_icode, M_Cnd, M_valA, M_valE, M_dstE, M_dstM, W_icode, W_valE, W_valM, W_dstE, W_dstM.]

Stall & Bubble
[Figure: simplified PIPE structure with stages Fetch, Decode, Execute, Memory, and Write back, pipeline registers F, D, E, M, and W, and clock cycles 1 to 6.]

Pipeline Control: Register Modes
(Before the rising clock edge: Input = y, Output = x)

    Mode     stall  bubble  Output after rising clock
    Normal   0      0       y
    Stall    1      0       x
    Bubble   0      1       nop
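
The three register modes can be modeled in a few lines. This is an illustrative Python sketch; the class name `PipeReg` and the use of the string "nop" for a bubble are assumptions. Raising an error when both signals are asserted follows the later slide on Control Combination B, which treats that case as a pipeline error.

```python
# Hypothetical model of one pipeline register with stall and bubble inputs.
NOP = "nop"


class PipeReg:
    def __init__(self, value=NOP):
        self.output = value

    def clock(self, inp, stall=False, bubble=False):
        """Apply one rising clock edge and return the new output."""
        if stall and bubble:
            raise ValueError("pipeline error: stall and bubble both asserted")
        if stall:
            pass                  # stall: keep the old output
        elif bubble:
            self.output = NOP     # bubble: replace contents with a nop
        else:
            self.output = inp     # normal: load the input
        return self.output


if __name__ == "__main__":
    reg = PipeReg("x")
    print(reg.clock("y"))               # normal
    print(reg.clock("z", stall=True))   # stall keeps the previous value
    print(reg.clock("z", bubble=True))  # bubble injects a nop
```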

PIPE Control Logic
[Figure: "Pipe control logic" block. Inputs: W_stat, m_stat, M_icode, e_Cnd, E_dstM, E_icode, d_srcA, d_srcB, D_icode. Outputs: W_stall, M_bubble, set_cc, E_bubble, D_bubble, D_stall, F_stall.]
- Handles special cases
  - Handles ret, load/use hazards, misprediction recovery, exceptions
  - Existing PIPE logic handles forwarding, branch prediction

PIPE: Actual ret handling
- Simplified view

    # prog7                     1  2  3  4  5  6  7  8  9  10 11
    0x000: irmovl Stack, %edx   F  D  E  M  W
    0x006: call proc               F  D  E  M  W
    0x020: ret                        F  D  E  M  W
           bubble                        F  D  E  M  W
           bubble                           F  D  E  M  W
           bubble                              F  D  E  M  W
    0x00b: irmovl $10, %edx                       F  D  E  M  W   # Return point

- What hardware actually does

    # prog7                     1  2  3  4  5  6  7  8  9  10 11
    0x000: irmovl Stack, %edx   F  D  E  M  W
    0x006: call proc               F  D  E  M  W
    0x020: ret                        F  D  E  M  W
    0x021: rrmovl %edx, %ebx             F                        # Not executed
           bubble                           D  E  M  W
    0x021: rrmovl %edx, %ebx                F                     # Not executed
           bubble                              D  E  M  W
    0x021: rrmovl %edx, %ebx                   F                  # Not executed
           bubble                                 D  E  M  W
    0x00b: irmovl $10, %edx                       F  D  E  M  W   # Return point

PIPE: Actual exception handling
- Scenario: pushl uses bad memory address
  - Actions: disable CC, inject bubbles into memory stage, stall write-back

    # prog10                    1  2  3  4  5  6  7  8  9  10
    0x000: irmovl $1, %eax      F  D  E  M  W
    0x006: xorl %esp, %esp         F  D  E  M  W                  # CC = 100
    0x008: pushl %eax                 F  D  E  M  W  W  W  W
    0x00a: addl %eax, %eax               F  D  E
    0x00c: irmovl $2, %eax                  F  D  E

- Cycle 6
  - M: mem_error = 1
  - E: New CC = 000, set_cc ← 0

Special Control Cases: Exceptions
- Detection

    Condition   Trigger
    Exception   m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT }

- Action (on next cycle)

    Condition   F       D       E       M       W
    Exception   normal  normal  normal  bubble  stall

- Also: disable setting of condition codes in execute in current cycle
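
The exception actions can be expressed as three control signals. The sketch below is a Python approximation, not the course's HCL. Two details are assumptions that the slide does not spell out: set_cc is only considered for an `IOPL` instruction in execute, and W_stall is keyed to W_stat alone so that the excepting instruction can still move from memory into write back.

```python
# Hypothetical exception control signals; names follow the slide where possible.
EXCEPTION_STATS = {"SADR", "SINS", "SHLT"}


def exception_control(E_icode, m_stat, W_stat):
    """Return set_cc, M_bubble, and W_stall for the current cycle."""
    exc_in_memory = m_stat in EXCEPTION_STATS
    exc_in_write_back = W_stat in EXCEPTION_STATS
    return {
        # Condition codes must not change once an earlier instruction has excepted.
        "set_cc": E_icode == "IOPL" and not exc_in_memory and not exc_in_write_back,
        # Instructions after the excepting one are turned into bubbles.
        "M_bubble": exc_in_memory or exc_in_write_back,
        # Assumption: hold the excepting instruction in write back.
        "W_stall": exc_in_write_back,
    }


if __name__ == "__main__":
    # Cycle 6 of prog10: pushl in memory reports SADR, addl is in execute.
    print(exception_control("IOPL", "SADR", "SAOK"))
    # Cycle 7: pushl has reached write back.
    print(exception_control("IIRMOVL", "SAOK", "SADR"))
```

In the prog10 trace this reproduces the slide's cycle 6: the addl in execute is prevented from setting the condition codes in the same cycle in which pushl's bad address is detected.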

Special Control Cases: Non-exceptions
- Detection

    Condition            Trigger
    Processing ret       IRET in { D_icode, E_icode, M_icode }
    Load/Use Hazard      E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
    Mispredicted Branch  E_icode == IJXX && !e_Cnd

- Action (on next cycle)

    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Load/Use Hazard      stall   stall   bubble  normal  normal
    Mispredicted Branch  normal  bubble  bubble  normal  normal

Pipeline Control, rev. 1.0

    bool F_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool D_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool E_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

- How do we know this works?

Analysis: Control Combinations
- Special cases that can arise on same clock cycle
- Combination A
  - Not-taken branch
  - ret instruction at branch target
- Combination B
  - Instruction that reads from memory to %esp
  - Followed by ret instruction

Control Combination A
[Figure: pipeline states. Mispredict: JXX in E. ret 1: ret in D. Combination A: JXX in E with ret in D.]

    Condition            F       D       E       M       W
    Processing ret       stall   bubble  normal  normal  normal
    Mispredicted branch  normal  bubble  bubble  normal  normal
    Combination          stall   bubble  bubble  normal  normal

- Should be handled as mispredicted branch
- Combination will also stall F pipeline register
- But PC selection logic will be using M_valA anyway
- Correct action taken!

Control Combination B
[Figure: pipeline states. Load/use: load in E, use in D. ret 1: ret in D. Combination B: load in E with ret in D.]

    Condition         F       D               E       M       W
    Processing ret    stall   bubble          normal  normal  normal
    Load/use hazard   stall   stall           bubble  normal  normal
    Combination       stall   bubble + stall  bubble  normal  normal

- Would assert both bubble and stall for pipeline register D
- Should be signaled by processor as pipeline error
- Combination not handled correctly in control code 1.0
- But it passed many simulation tests; caught only with systematic analysis

Control Combination B: Correct Handling
[Figure: pipeline states. Load/use: load in E, use in D. ret 1: ret in D. Combination B: load in E with ret in D.]

    Condition         F       D       E       M       W
    Processing ret    stall   bubble  normal  normal  normal
    Load/use hazard   stall   stall   bubble  normal  normal
    Combination       stall   stall   bubble  normal  normal

- Load/use hazard should get priority
- ret instruction should be held in decode stage for additional cycle

Corrected Pipeline Control Logic

    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode }
        # but not condition for a load/use hazard (new)
        && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });

    Condition         F       D       E       M       W
    Processing ret    stall   bubble  normal  normal  normal
    Load/use hazard   stall   stall   bubble  normal  normal
    Combination       stall   stall   bubble  normal  normal

- Load/use hazard should get priority
- ret instruction should be held in decode stage for additional cycle
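
The difference between the rev. 1.0 and corrected D_bubble logic can be checked by evaluating both on Combination B. The following Python sketch transcribes the HCL conditions; the dictionary-based pipeline state, the string register names, and the `corrected` flag are assumptions made for this illustration.

```python
# Hypothetical transcription of the pipeline control conditions into Python.
LOAD_ICODES = {"IMRMOVL", "IPOPL"}


def control(s, corrected=True):
    """Return F_stall, D_stall, D_bubble, E_bubble for pipeline state s."""
    ret = "IRET" in (s["D_icode"], s["E_icode"], s["M_icode"])
    load_use = s["E_icode"] in LOAD_ICODES and s["E_dstM"] in (s["d_srcA"], s["d_srcB"])
    mispredict = s["E_icode"] == "IJXX" and not s["e_Bch"]
    if corrected:
        d_bubble = mispredict or (ret and not load_use)  # load/use gets priority
    else:
        d_bubble = mispredict or ret                     # rev. 1.0
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        "D_bubble": d_bubble,
        "E_bubble": mispredict or load_use,
    }


if __name__ == "__main__":
    # Combination B: a load into %esp is in execute while ret is in decode.
    combo_b = {
        "D_icode": "IRET", "E_icode": "IMRMOVL", "M_icode": "INOP",
        "E_dstM": "%esp", "d_srcA": "%esp", "d_srcB": "%esp", "e_Bch": False,
    }
    print(control(combo_b, corrected=False))  # D_stall and D_bubble both asserted
    print(control(combo_b, corrected=True))   # only D_stall asserted
```

Enumerating state combinations this way is a small-scale version of the systematic analysis the slides credit with finding the bug that simulation tests missed.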

Lesson Learned
- Extensive and thorough testing is good, but it can't prove a design correct
- Formal verification is important, but the field is not mature enough for large-scale designs
  - Important and active research area

Performance Metrics
- Clock rate
  - Measured in megahertz or gigahertz
  - Function of stage partitioning and circuit design
    - To increase: keep amount of work per stage small
- Rate at which instructions executed
  - CPI: cycles per instruction
  - On average, how many clock cycles does each instruction require (after completion of previous instruction)?
  - CPI a function of pipeline design and the program
    - How frequently are branches mispredicted?
    - How frequent are load stalls?
    - How frequent are ret instructions?

CPI for PIPE
- Ideal CPI = 1.0
  - Fetch instruction each clock cycle
  - Process new instruction every cycle
    - Although each individual instruction has latency of 5 cycles
- Actual CPI > 1.0
  - Due to pipeline stalls, branch mispredictions
- Computing CPI
  - C clock cycles
  - I instructions executed to completion
  - B bubbles injected (C = I + B)

    CPI = C/I = (I + B)/I = 1.0 + B/I

  - B/I represents average penalty (per instruction) due to bubbles

CPI for PIPE (Cont.)

    B/I = LP + MP + RP

- LP: Penalty due to load/use hazard stalling
  - Fraction of instructions that are loads (typical value: 0.25)
  - Fraction of load instructions requiring stall (typical value: 0.20)
  - Number of bubbles injected each time (typical value: 1)
  - LP = 0.25 * 0.20 * 1 = 0.05
- MP: Penalty due to mispredicted branches
  - Fraction of instructions that are cond. jumps (typical value: 0.20)
  - Fraction of cond. jumps mispredicted (typical value: 0.40)
  - Number of bubbles injected each time (typical value: 2)
  - MP = 0.20 * 0.40 * 2 = 0.16
- RP: Penalty due to ret instructions
  - Fraction of instructions that are returns (typical value: 0.02)
  - Number of bubbles injected each time (typical value: 3)
  - RP = 0.02 * 3 = 0.06
- Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
  - CPI = 1.27 (Not bad! Assumes perfect memories.)
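
The CPI arithmetic above is easy to parameterize, which helps when exploring how a better branch predictor or fewer load stalls would change the result. This is a small Python sketch; the function and parameter names are made up for illustration, and the default bubble counts are the typical values from the slide.

```python
# Hypothetical helper that evaluates CPI = 1.0 + LP + MP + RP.
def pipe_cpi(load_frac, load_stall_frac, cond_jump_frac, mispredict_frac, ret_frac,
             load_bubbles=1, mispredict_bubbles=2, ret_bubbles=3):
    lp = load_frac * load_stall_frac * load_bubbles             # load/use penalty
    mp = cond_jump_frac * mispredict_frac * mispredict_bubbles  # misprediction penalty
    rp = ret_frac * ret_bubbles                                 # ret penalty
    return 1.0 + lp + mp + rp


if __name__ == "__main__":
    # Typical values from the slide.
    print(round(pipe_cpi(0.25, 0.20, 0.20, 0.40, 0.02), 2))
    # Same program, but with only 10% of conditional jumps mispredicted.
    print(round(pipe_cpi(0.25, 0.20, 0.20, 0.10, 0.02), 2))
```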

State-of-the-Art Pipelining
- What have we ignored in our Y86 implementation?
  - Balancing delay in each stage
    - Which stage is longest, how might we speed it up?
  - Multicycle instructions
  - Realistic memory systems

Fetch Logic Revisited
- During fetch cycle
  1. Select PC
  2. Read bytes from instruction memory
  3. Examine icode to determine instruction length
  4. Increment PC
- Timing
  - Steps 2 & 4 require significant amount of time
[Figure: fetch stage. Blocks: Select PC, Instruction memory, Split, Align, Instr valid, Need regids, Need valC, PC incr., Predict PC.]

Standard Fetch Timing
[Figure: within 1 clock cycle, in sequence: Select PC, Mem. Read, need_regids and need_valC, Increment.]
- Must perform everything in sequence
  - Can't compute incremented PC until we know value to increment it with
- Why is increment slow?
- How could we speed this up?

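A Fast PC Increment Circuit
[Figure: the PC is split into its high-order 29 bits and low-order 3 bits. A 3-bit adder (fast) takes the low-order bits together with need_regids and need_ValC and produces a carry. A 29-bit incrementer (slow) adds 1 to the high-order bits. A MUX controlled by the carry selects the high-order 29 bits of incrPC.]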

Modified Fetch Timing
[Figure: within 1 clock cycle: Select PC, then Mem. Read with the Incrementer running in parallel, then need_regids and need_valC, 3-bit add, and MUX. Shown against the standard cycle.]
- 29-bit incrementer
  - Acts as soon as PC selected
  - Output not needed until final MUX
  - Works in parallel with memory read
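
The split increment can be checked against an ordinary 32-bit addition. The Python sketch below is only a functional model of the circuit, not a timing model. It assumes a 32-bit PC and the Y86 instruction length of 1 byte plus 1 for register IDs plus 4 for a constant word; the function name is hypothetical.

```python
# Hypothetical functional model of the fast PC increment circuit.
def fast_pc_increment(pc, need_regids, need_valC):
    """Increment a 32-bit PC using a 3-bit add plus a precomputed high part."""
    incr = 1 + (1 if need_regids else 0) + (4 if need_valC else 0)  # 1 to 6 bytes
    high = pc >> 3                        # high-order 29 bits
    low = pc & 0x7                        # low-order 3 bits
    # Slow path: starts as soon as the PC is selected, independent of the instruction.
    high_plus_1 = (high + 1) & ((1 << 29) - 1)
    # Fast path: small add that has to wait for need_regids and need_valC.
    total = low + incr
    carry = total >> 3
    new_low = total & 0x7
    # Final MUX: the carry picks which high-order value to use.
    new_high = high_plus_1 if carry else high
    return (new_high << 3) | new_low


if __name__ == "__main__":
    print(hex(fast_pc_increment(0x006, True, True)))    # 6-byte instruction at 0x006
    print(hex(fast_pc_increment(0x00c, False, False)))  # 1-byte instruction at 0x00c
```

Because the increment is at most 6, the low-order sum can carry into the high-order bits at most once, so choosing between the high-order bits and the high-order bits plus 1 is sufficient.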

State-of-the-Art Pipelining
- Other issues to consider
  - More complex instructions: consider FP divide, sqrt
    - Take many cycles to execute
    - Forwarding can't resolve hazards: more data stalls
    - Important for compiler to schedule code
  - Deeper pipelines to allow faster cycle times
    - Increased penalty from misprediction, data hazard stalls, etc.
    - Increased emphasis on branch prediction
  - Actual memory hierarchy issues (will increase CPI)
    - Difficult to complete memory access in one cycle!
    - Possibility of cache misses, TLB misses, page faults
  - Superscalar/VLIW: process multiple instructions/cycle
  - Dynamic scheduling (discussed in Chapter 5)
    - Scheduling = determining instruction execution order
    - Hardware decides based on data dependencies, resources

Pentium 4 Pipeline
- Very deep pipeline
  - Enables very high clock rates, but 20+ cycle branch penalty
  - Slower than Pentium III for a given clock rate

    Stage(s)  Name
    1-2       TC Nxt IP
    3-4       TC Fetch
    5         Drive
    6         Alloc
    7-8       Rename
    9         Que
    10-12     Sch
    13-14     Disp
    15-16     RF
    17        Ex
    18        Flgs
    19        Br Ck
    20        Drive

Multicycle FP Operations
- Multiple functional units: one approach
  - Special-purpose hardware for FP operations
  - Increased latency causes more frequent stalls
[Figure: F and D stages feed parallel functional units, which feed the M and W stages. Units: single cycle integer unit, fully pipelined multiplier, fully pipelined FP adder, non-pipelined divider.]
- Non-pipelined divider can cause "structural" hazards

Dynamic scheduling
- Out-of-order execution engine: one view (Pentium 4)
[Figure: Pentium 4 out-of-order execution engine. Labels: "Fetching, decoding, translation of x86 instrs to uops"; "to support precise exceptions and recovery from mispredicted branches". Image from www.xbitlabs.com]

Branch Prediction: Simplistic
- Branch history
  - Encode information about the prior history of each individual branch instruction; store as hash table on instr. address
  - Use history to predict branch outcome
[Figure: four-state machine, left to right: Yes!, Yes?, No?, No!. Edges labeled T (taken) lead left, edges labeled NT (not taken) lead right.]
- State machine stores history
  - Each time branch taken, move to left
  - Each time branch not taken, move to right
  - In state Yes*, predict taken; in state No*, predict not taken
  - Can be encoded using 2 bits per table entry
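
The four-state machine is a 2-bit saturating counter per branch. The Python sketch below keeps the counters in a dictionary keyed by instruction address. The class name, the numeric state encoding, and the choice of "No?" as the initial state are assumptions for illustration; the slide does not specify them.

```python
# Hypothetical 2-bit branch predictor: states 0..3 stand for No!, No?, Yes?, Yes!.
class TwoBitPredictor:
    def __init__(self, initial_state=1):
        self.initial_state = initial_state   # assumption: start in "No?"
        self.table = {}                       # instruction address -> state

    def predict(self, addr):
        """Predict taken in the Yes* states (2 and 3)."""
        return self.table.get(addr, self.initial_state) >= 2

    def update(self, addr, taken):
        """Move toward Yes! when taken, toward No! when not taken, saturating."""
        state = self.table.get(addr, self.initial_state)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.table[addr] = state


def count_mispredictions(predictor, addr, outcomes):
    """Feed a sequence of outcomes for one branch and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            wrong += 1
        predictor.update(addr, taken)
    return wrong


if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through once.
    loop = [True] * 9 + [False]
    p = TwoBitPredictor()
    print(count_mispredictions(p, 0x02a, loop))  # first execution of the loop
    print(count_mispredictions(p, 0x02a, loop))  # second execution of the loop
```

Under these assumptions, the second pass through the loop mispredicts only the final fall-through, because a single not-taken outcome moves the counter from Yes! to Yes? without flipping the prediction.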

Branch Prediction: Realistic
- Alpha 21264 "tournament" predictor
  - Predictor selection: 4K entries, each 2 bits, to select predictor (8K bits)
  - Global predictor
    - 12-bit shift register of last branch outcomes globally
    - 4K entries, standard 2-bit branch predictors (8K bits)
  - Local predictor, indexed by 10 bits of branch address
    - 1K entries, each 10 bits, history of behavior of this branch (10K bits)
    - 1K entries, each a 3-bit saturation counter (3K bits)
- Predictor size: 8K + 8K + 10K + 3K = 29K bits!

Discussion Questions
- Data hazards on register values can be dealt with by stalling or by forwarding.
  - Can hazards occur on condition codes?
  - Can hazards occur on data memory accesses?
- Can software be responsible for pipeline correctness?
  - Schedule instructions, use nops
  - DSP chips historically have had exposed pipelines
- Relationship of hazards and dependencies
  - Does every data dependence cause a data hazard?
  - Is every data hazard caused by a data dependence?