PowerPC 604 Superscalar Microprocessor: IBM, Motorola, Apple


PPC 604e Overview
• Member of the RISC PowerPC family
• PowerPC architecture:
  – 32-bit effective (logical) addresses
  – Integer data types of 8, 16, and 32 bits
  – Floating-point data types of 32 and 64 bits (single- and double-precision, respectively)
• Superscalar processor: can issue four instructions per clock cycle
• Up to seven instructions can execute in parallel

Overview: the 604e has seven execution units
The 604e has seven independent execution units that operate in parallel:
• Floating-point unit (FPU)
• Branch processing unit (BPU)
• Condition register unit (CRU)
• Load/store unit (LSU)
• Three integer units (IUs):
  – Two single-cycle integer units (SCIUs)
  – One multiple-cycle integer unit (MCIU)

Three-stage pipelined floating-point unit (FPU)
• Fully IEEE 754 compliant FPU
• Supports non-IEEE mode for time-critical operations
• Fully pipelined, single-pass double-precision design
• Two-entry reservation station to minimize stalls
• Thirty-two 64-bit FPRs for single- or double-precision operands

BPU & CRU
• Branch processing unit (BPU) with dynamic branch prediction
  – Two-entry reservation station
  – Out-of-order execution through two branches
  – 64-entry fully associative branch target address cache (BTAC) and 512-entry branch history table (BHT)
  – Two prediction bits per entry
• Condition register unit (CRU)
  – Two-entry reservation station

Condition resolution takes time

Solution: Branch speculation

Branch History Table (BHT): table of predictors
• Each branch is given a predictor
• The BHT is a table of predictors (Predictor 0, Predictor 1, ..., Predictor 7 in the figure)
  – Each predictor could be 1 bit or more
  – Indexed by the PC address of the branch
• Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
  – A misprediction flushes the reorder buffer
• In the fetch stage of a branch:
  – Use the predictor to make the prediction
• When the branch completes:
  – Update the corresponding predictor
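The 2-bit predictor scheme described on this slide can be sketched in Python as a table of saturating counters. This is a minimal illustration, not the 604's actual logic: the names `BranchHistoryTable` and `count_mispredictions`, the index function (word address modulo table size), the initial counter value, and the default of 512 entries are all assumptions made for the example.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by branch PC (illustrative sketch)."""

    def __init__(self, entries=512):
        self.entries = entries
        # Counter states 0..3: 0-1 predict not taken, 2-3 predict taken.
        # Starting every counter at 1 (weakly not taken) is an assumption.
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumed indexing: drop the two byte-offset bits, then wrap to the table size.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Predict at fetch, update at completion, and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# A loop branch that is taken nine times and then falls through, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
mispredictions = count_mispredictions(BranchHistoryTable(), 0x1000, loop_outcomes)
```

Under these assumptions the loop branch mispredicts once while the counter warms up and once at each loop exit. The second loop entry is predicted correctly because a single not-taken outcome only moves the counter from 3 to 2.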

BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch address (if taken)
• Each entry holds a branch PC, a predicted PC, and prediction state bits
• The PC of the instruction in FETCH is compared (=?) with the stored branch PCs:
  – Yes: the instruction is a branch; use the predicted PC as the next PC
  – No: branch not predicted; proceed normally (next PC = PC + 4)
• Only predicted-taken branches and jumps are held in the BTB
• The next PC is determined before the branch is fetched and decoded
• Later: check the prediction; if wrong, kill the instruction and update BPb
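The lookup on this slide can be sketched as a small Python class. It is only an illustration under stated assumptions: a dictionary stands in for the associative tag match and has unlimited capacity, and the names `BranchTargetBuffer`, `next_pc`, and `resolve` are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its predicted target (sketch)."""

    def __init__(self):
        # branch PC -> predicted target PC; a dict models the "=?" tag compare.
        self.targets = {}

    def next_pc(self, pc):
        """Hit: use the predicted PC. Miss: proceed sequentially (PC + 4)."""
        return self.targets.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Called once the branch outcome is known, to keep the table up to date."""
        if taken:
            self.targets[pc] = target      # hold only predicted-taken branches
        else:
            self.targets.pop(pc, None)     # drop the entry for a not-taken branch


btb = BranchTargetBuffer()
first_guess = btb.next_pc(0x100)      # miss: falls through to 0x104
btb.resolve(0x100, taken=True, target=0x200)
second_guess = btb.next_pc(0x100)     # hit: redirected to 0x200
```

Because the lookup uses only the fetch address, the next PC is available before the branch itself has been decoded, which is the point the slide makes.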


PPC 604 Pipeline

PowerPC 604 Pipeline Overview
• Instruction fetch (IF): loads the decode queue (DEQ) with instructions from the I-cache and determines the next instruction address
• Instruction decode (ID): performs time-critical decoding on instructions in the dispatch queue (DISQ)
• Instruction dispatch (DS):
  – Dispatches at most four instructions per cycle, in order
  – One instruction per functional unit
  – Performs non-time-critical instruction decoding
  – Determines when an instruction can be dispatched to the execution units
  – At the end of DS, instructions and their operands are latched into the execution input latches or into the unit's reservation station
  – Rename registers and reorder buffer entries are allocated

Execute (E), Complete (C), Writeback (W)
• Execute (E)
  – Instruction flow is split among six execution units; instructions enter execute from dispatch or from a reservation station
  – Results are written into a rename buffer entry, and the complete stage is notified
• Complete (C)
  – Ensures that the correct machine state is maintained; monitors instructions in the complete and execute stages
  – Instructions are removed from the reorder buffer (ROB) when they complete
  – Results are written back from the rename buffers to the registers at complete or at writeback
• Writeback (W)
  – Writes back results from the rename buffers that were not written back during complete

604 Block Diagram – Internal Data Paths

Reservation Stations & Result Buses

Execution Latencies

PPC 604e Unit Pipeline Stages

Example 1: Instruction Timing for Cache Hit


Example 1: Instruction Timing for Cache Hit
Clock cycles run from 0 to 11. Each row lists the stage the instruction occupies in successive cycles, starting with its fetch cycle.

  #   Instr   Stages
  0   AND     Fet DQ DS EX C/WB
  1   OR      Fet DQ DS EX C/WB
  2   FADD    Fet DQ DS EX EX EX C/WB
  3   FSUB    Fet DQ DS DS EX EX EX C/WB
  4   ADDC    Fet DQ DS EX C C C/WB
  5   SUBFC   Fet DQ DS EX C C C/WB
  6   FMADD   Fet DQ DS EX EX EX C/WB
  7   FMSUB   Fet DQ DS DS EX EX EX C/WB
  8   XOR     Fet DQ DS DS EX C C C/WB
  9   NEG     Fet DQ DS DS EX C C C/WB
  10  FADDS   Fet DQ DQ DS EX EX EX C/WB
  11  FSUBS   Fet DQ DQ DS DS EX EX EX C/WB
  12  ADD     Fet DQ DQ DS DS EX C C C/WB
  13  SUB     Fet DQ DQ DS DS EX C C C/WB


Example 2: Branch Taken with BTAC Hit
• No branch penalty; instruction 4 (OR) is from the target stream
• Instruction stream: 0 AND, 1 LD, 2 ADD, 3 BC (taken), 4 OR, 5 CMP, 6 LD, 7 MULLI
• Cycle 2: instructions 4–7 are fetched from the target, based on the address from the BTAC hit
• Cycle 5: instructions 2–3 wait for LD to retire (WB) and retire with it
[Timing diagram: clock cycles 0 to 10; stages Fet, DQ, DS, EX, C, C/WB]

Example 2: Branch taken with BTAC hit, no penalty

Example 3: Branch taken, BTAC hit, I-cache miss

Example 4: Branch taken, BTAC miss, corrected at Decode stage
• One-clock penalty, to fetch the target group (2, 3, 4, 5)
• Correction at Decode includes branches on CR (flags) and LR

Example 5: Branch taken, BTAC miss, corrected at Dispatch stage: 2-clock branch penalty

Example 6: Branch taken, BTAC miss, corrected at Execute: 3-clock penalty
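Examples 2 and 4 through 6 give a penalty of 0, 1, 2, or 3 clocks depending on where a taken branch is resolved. The short Python sketch below combines those four figures into an average penalty per taken branch. The key names, the function `average_penalty`, and the branch counts in the usage line are hypothetical and are not measurements of any real workload.

```python
# Penalty in clocks for a taken branch, keyed by where the fetch address is corrected.
# The four values come from Examples 2, 4, 5, and 6; the key names are made up here.
PENALTY_CLOCKS = {
    "btac_hit": 0,   # target supplied by the BTAC at fetch
    "decode": 1,     # BTAC miss, corrected at decode
    "dispatch": 2,   # BTAC miss, corrected at dispatch
    "execute": 3,    # BTAC miss, corrected at execute
}


def average_penalty(branch_counts):
    """Average penalty per taken branch for a mix of resolution points."""
    total = sum(branch_counts.values())
    if total == 0:
        raise ValueError("branch_counts must contain at least one branch")
    weighted = sum(PENALTY_CLOCKS[where] * n for where, n in branch_counts.items())
    return weighted / total


# Hypothetical mix: 6 BTAC hits, 2 decode fixes, 1 dispatch fix, 1 execute fix.
example = average_penalty({"btac_hit": 6, "decode": 2, "dispatch": 1, "execute": 1})
```

For this made-up mix the weighted sum is 0 + 2 + 2 + 3 = 7 clocks over 10 branches.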

Class Example – real dependencies
Instructions (destination register first, e.g. ADD R1, R2, R3 ; R1 = R2 + R3):
  ADD   R1, R2, R3
  OR    R2, R1, R4
  SUB   R3, R2, R3
  FMUL  F7, F5, F6
  FSUB  F8, F10, F7
  AND   R4, R1, R3

Timing (clock cycles 0 to 10; each row lists the stage occupied in successive cycles):
  #   Instr   Stages
  1   ADD     Fet DQ DS EX C/WB
  2   ADD     Fet DQ DS EX EX C/WB
  3   OR      Fet DQ DS DS EX C/WB
  4   SUB     Fet DQ DS DS EX EX C/WB
  5   FMUL    Fet DQ DS EX EX EX C/WB
  6   FSUB    Fet DQ DS DS EX EX EX C/WB
  7   AND     Fet DQ DQ DS EX EX C C/WB
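The real (read-after-write) dependencies in the class example can be found mechanically. The Python sketch below uses the six instructions whose operands are listed on the slide and assumes the destination-first operand order implied by the comment "R1 = R2 + R3". The function name `raw_dependencies` and the tuple layout are illustrative choices.

```python
# (mnemonic, destination, source 1, source 2), destination first as on the slide.
program = [
    ("ADD",  "R1", "R2",  "R3"),
    ("OR",   "R2", "R1",  "R4"),
    ("SUB",  "R3", "R2",  "R3"),
    ("FMUL", "F7", "F5",  "F6"),
    ("FSUB", "F8", "F10", "F7"),
    ("AND",  "R4", "R1",  "R3"),
]


def raw_dependencies(instructions):
    """Return (producer index, consumer index, register) for each RAW dependency."""
    last_writer = {}   # register name -> index of the most recent instruction writing it
    deps = []
    for i, (_, dest, *sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        # Record the write only after checking the sources, so an instruction
        # that reads and writes the same register does not depend on itself.
        last_writer[dest] = i
    return deps


deps = raw_dependencies(program)
for producer, consumer, reg in deps:
    print(f"{program[consumer][0]} needs {reg} from {program[producer][0]}")
```

Each dependency found this way is a candidate for a stall: the consumer cannot begin executing until the producer's result is available.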


PPC 604 Pipeline

Pipeline Details: Fetch Stage
• Fetches instructions from the I-cache and loads the decode queue (DEQ)
• Determines the address of the next instruction to be fetched
• Keeps the queue supplied with instructions for dispatch
• Instructions are fetched from the I-cache in groups of four, from a single cache block
• If only two instructions remain in the cache block, only two instructions are fetched
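The group-of-four rule can be expressed as a one-line calculation. The Python sketch below assumes a 32-byte cache block and 4-byte instructions; neither figure is stated on the slide, and the function name `fetch_group_size` is hypothetical. It models only the block-boundary limit, not any other alignment constraint of the real fetcher.

```python
def fetch_group_size(fetch_addr, block_bytes=32, instr_bytes=4, max_fetch=4):
    """Number of instructions fetched this cycle, limited by the end of the cache block."""
    offset = fetch_addr % block_bytes                    # byte position inside the block
    remaining = (block_bytes - offset) // instr_bytes    # instructions left in the block
    return min(max_fetch, remaining)


sizes = [fetch_group_size(addr) for addr in (0x00, 0x10, 0x18, 0x1C)]
```

With the assumed 32-byte block, a fetch that starts two instructions before the end of the block returns two instructions, matching the slide's last bullet.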

Next instruction fetch address
• Each stage offers a candidate address to be fetched; the latest stage has the highest priority
• As a block is prefetched, the branch target address cache (BTAC) and branch history table (BHT) are searched with the fetch address
• If the address is in the BTAC, the next instruction is fetched from the target address held there
• DECODE may indicate, based on the BHT or on decoding an unconditional branch, that the earlier BTAC prediction was incorrect
• The BPU can indicate that a previous branch prediction, from the BTAC or DECODE, was incorrect
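The priority rule for the next fetch address can be sketched as a simple selection over candidates. In the Python below, the source names and their exact ordering are assumptions made for illustration; the slide states only that the latest stage wins. The function `select_next_fetch` is hypothetical.

```python
# Candidate sources ordered from latest pipeline stage (highest priority) to earliest.
# The names and the exact list are assumptions made for this sketch.
PRIORITY = ["execute", "dispatch", "decode", "btac", "sequential"]


def select_next_fetch(candidates):
    """Pick the candidate address offered by the latest stage that has one."""
    for source in PRIORITY:
        addr = candidates.get(source)
        if addr is not None:
            return source, addr
    raise ValueError("no candidate fetch address offered")


# The BTAC predicts a target, but decode then corrects that prediction.
btac_only = select_next_fetch({"sequential": 0x110, "btac": 0x200})
with_decode = select_next_fetch({"sequential": 0x110, "btac": 0x200, "decode": 0x300})
```

A correction from a later stage overrides the earlier prediction because the later stage has more information about the branch.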

Decode Stage
• Handles time-critical decoding of instructions in the instruction buffer
• Contains a four-instruction buffer (DEQ); shifts one or two pairs of instructions into the dispatch buffer as space becomes available
• Branch correction predicts branches whose target is taken from the CTR or LR; this occurs only if no CTR or LR updates are pending

Dispatch Stage
• Performs non-time-critical decoding of instructions supplied by decode
• Determines which instructions can be dispatched
• Source operands are read from the register file and dispatched to the execution units
• Dispatched instructions and their operands are latched into reservation stations or execution unit input latches
• Each dispatched instruction is issued a position in the 16-entry completion buffer
• A rename buffer is allocated to the instruction if needed

Execute Stage
• Instructions are passed to the appropriate execution unit after fetch, decode, and dispatch; the execution units have different latencies
• The floating-point unit is a fully pipelined, three-stage execution unit
• Execution units write results into the appropriate rename buffer and notify the complete stage

Branch Mispredict / Exceptions
• What if a branch instruction was mispredicted in an earlier stage?
  – Instructions from the mispredicted path are flushed
  – Fetching resumes at the correct address
• If an instruction causes an exception, the execution unit reports the exception to the complete stage and continues executing instructions

Complete Stage
• Maintains the correct architectural machine state
• As instructions finish execution, their status is recorded in their completion buffer (FIFO) entry
• Entries are examined in the order in which the instructions were dispatched
• This retains program order and ensures that instructions complete in order
• Four entries are examined during each cycle for writeback
• The completion buffer is used to ensure a precise exception model
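In-order completion from a FIFO can be sketched in a few lines of Python. The class below is an illustration only: the names `CompletionBuffer`, `dispatch`, `finish`, and `retire` are hypothetical, instructions are identified by arbitrary tags, and exception handling is left out. The 16-entry capacity and the four entries examined per cycle are the figures given on the slides.

```python
from collections import deque


class CompletionBuffer:
    """FIFO of dispatched instructions that retires them in program order (sketch)."""

    def __init__(self, capacity=16, retire_width=4):
        self.capacity = capacity
        self.retire_width = retire_width
        self.entries = deque()   # each entry is [tag, finished], oldest at the left

    def dispatch(self, tag):
        """Allocate an entry at dispatch; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append([tag, False])
        return True

    def finish(self, tag):
        """Record that the instruction has finished executing."""
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True
                return
        raise KeyError(tag)

    def retire(self):
        """One cycle: retire finished instructions from the head, oldest first."""
        retired = []
        while (self.entries and len(retired) < self.retire_width
               and self.entries[0][1]):
            retired.append(self.entries.popleft()[0])
        return retired


cb = CompletionBuffer()
for tag in ("LD", "ADD", "BC"):
    cb.dispatch(tag)
cb.finish("ADD")
cb.finish("BC")
blocked = cb.retire()    # nothing retires: the older LD has not finished
cb.finish("LD")
together = cb.retire()   # all three retire in program order
```

The usage lines mirror the behaviour noted in Example 2, where younger instructions that have finished wait for an older load and then retire with it.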

Write-Back Stage
• Writes back results from the rename buffers that were not written back by the complete stage
• Each rename buffer has two read ports for write-back, corresponding to the two write-back ports provided for the GPRs, FPRs, and CR
• Two results can be copied from the write-back buffers to registers per clock cycle