The Basics: Pipelining
J. Nelson Amaral, University of Alberta

The Pipeline Concept (Bauer p. 32)

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
Consider the pipeline above with the indicated delays. We want to find the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
Pipeline throughput: how often is an instruction completed?
Pipeline latency: how long does it take to execute an instruction in the pipeline?
Is this right?

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
Simply adding the stage delays to compute the pipeline latency works only for an isolated instruction:
I1: IF ID EX MEM WB, L(I1) = 28 ns
I2: IF ID EX MEM WB, L(I2) = 33 ns
I3: IF ID EX MEM WB, L(I3) = 38 ns
I4: IF ID EX MEM WB, L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
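The growing latencies above can be reproduced with a small simulation. This is only a sketch under one assumption that the slide does not state: an instruction waiting for a busy stage can wait in a buffer without blocking the stage behind it. The function name `unbalanced_latencies` is made up for this illustration.

```python
DELAYS_NS = [5, 4, 5, 10, 4]  # IF, ID, EX, MEM, WB


def unbalanced_latencies(delays, n_instructions):
    """Return the latency (ns) of each instruction in an unclocked pipeline.

    ASSUMPTION: unlimited buffering between stages, so a stage starts an
    instruction as soon as both the stage and the instruction are ready.
    """
    stage_free = [0] * len(delays)  # time at which each stage becomes free
    latencies = []
    for _ in range(n_instructions):
        start = stage_free[0]       # the instruction enters IF when IF is free
        t = start
        for s, delay in enumerate(delays):
            t = max(t, stage_free[s]) + delay  # wait for the stage, then use it
            stage_free[s] = t
        latencies.append(t - start)
    return latencies


print(unbalanced_latencies(DELAYS_NS, 4))  # [28, 33, 38, 43]
```

Each later instruction waits 5 ns longer for the 10 ns MEM stage, which is why the latency keeps growing.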

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
The slowest pipeline stage also limits the latency!
[Timing diagram, time axis from 0 to 60 ns in 10 ns steps: I1 starts IF at 0 ns, I2 at 10 ns, I3 at 20 ns, I4 at 30 ns; every instruction spends 10 ns in each stage.]
L(I1) = L(I2) = L(I3) = L(I4) = 50 ns

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)
How long would it take using the same modules without pipelining?
What is the speedup due to pipelining?
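One way to work through these three questions is sketched below. It assumes the pipeline is clocked at the slowest stage delay (10 ns) and that, without pipelining, each instruction takes the sum of the stage delays (28 ns). The function names are illustrative only.

```python
DELAYS_NS = [5, 4, 5, 10, 4]  # IF, ID, EX, MEM, WB
N_INSTRUCTIONS = 20000


def pipelined_time_ns(delays, n):
    """Fill the pipeline once, then complete one instruction per cycle."""
    cycle = max(delays)                 # ASSUMPTION: clock set by slowest stage
    return (len(delays) + n - 1) * cycle


def unpipelined_time_ns(delays, n):
    """ASSUMPTION: each instruction runs through all modules back to back."""
    return n * sum(delays)


t_pipe = pipelined_time_ns(DELAYS_NS, N_INSTRUCTIONS)
t_seq = unpipelined_time_ns(DELAYS_NS, N_INSTRUCTIONS)
print(t_pipe)                    # 200040
print(t_seq)                     # 560000
print(round(t_seq / t_pipe, 2))  # 2.8
```

Under these assumptions the speedup is far below the ideal factor of 5, because every stage is stretched to the 10 ns MEM delay.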

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM (10 ns), WB (4 ns)
The speedup that we got from the pipeline is:
How can we improve this pipeline design?
We need to reduce the imbalance between stages to increase the clock speed.

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1, MEM2, WB (4 ns)
Now we have one more pipeline stage.
What is the throughput now?
What is the new latency for a single instruction?

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1, MEM2, WB (4 ns)
[Timing diagram: instructions I1 through I7 enter the pipeline one after another, each passing through IF, ID, EX, MEM1, MEM2, and WB.]

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1, MEM2, WB (4 ns)
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)
What is the speedup that we get from pipelining?

Pipeline Throughput and Latency
Stages: IF (5 ns), ID (4 ns), EX (5 ns), MEM1, MEM2, WB (4 ns)
What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline.
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N × max(delay), where N is the number of stages in the pipeline.
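The two formulas can be compared across both designs with a short sketch. The MEM1 and MEM2 delays are not legible in the slide, so the code assumes that the 10 ns MEM stage was split into two 5 ns stages; treat that value as an assumption, not as the author's number.

```python
def throughput_per_ns(delays):
    """Instructions completed per nanosecond: 1 / max(delay)."""
    return 1 / max(delays)


def latency_ns(delays):
    """Latency of one instruction: N * max(delay), N = number of stages."""
    return len(delays) * max(delays)


FIVE_STAGE = [5, 4, 5, 10, 4]     # IF, ID, EX, MEM, WB
SIX_STAGE = [5, 4, 5, 5, 5, 4]    # ASSUMPTION: MEM1 = MEM2 = 5 ns

for name, delays in [("five-stage", FIVE_STAGE), ("six-stage", SIX_STAGE)]:
    print(name, throughput_per_ns(delays), latency_ns(delays))
```

With the assumed split, the six-stage design doubles the throughput and also shortens the latency from 50 ns to 30 ns, even though it has one more stage.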

Execution Snapshot (Bauer p. 33)

Pipeline with Control Unit (Bauer p. 34)

Data Hazards and Forwarding
Example 1:
i: R7 ← R12 + R15
i+1: R8 ← R7 – R12
i+2: R15 ← R8 + R7
Read-After-Write (RAW) dependencies (true dependencies)
Write-After-Read (WAR) dependencies (anti-dependencies)
(Bauer p. 35)
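The dependencies in Example 1 can be enumerated mechanically. The sketch below uses a made-up encoding of each instruction as `(destination, [sources])` and a naive pairwise check that ignores intervening redefinitions of a register, which is sufficient for this three-instruction example.

```python
# Example 1, encoded as (destination register, source registers).
PROGRAM = [
    ("R7", ["R12", "R15"]),   # i:   R7  <- R12 + R15
    ("R8", ["R7", "R12"]),    # i+1: R8  <- R7 - R12
    ("R15", ["R8", "R7"]),    # i+2: R15 <- R8 + R7
]


def find_dependencies(program):
    """Return (raw, war) lists of (earlier index, later index, register)."""
    raw, war = [], []
    for j, (dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i in srcs_j:      # later instruction reads an earlier result
                raw.append((i, j, dest_i))
            if dest_j in srcs_i:      # later instruction overwrites an earlier source
                war.append((i, j, dest_j))
    return raw, war


raw, war = find_dependencies(PROGRAM)
print("RAW:", raw)  # RAW: [(0, 1, 'R7'), (0, 2, 'R7'), (1, 2, 'R8')]
print("WAR:", war)  # WAR: [(0, 2, 'R15')]
```

Index 0 stands for instruction i, 1 for i+1, and 2 for i+2.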

Data Hazards and Forwarding (Bauer p. 36)

Forwarding (Bauer p. 37)

Load-ALU RAW Dependency
Example 2:
i: R6 ← Mem[R2]
i+1: R7 ← R6 + R4
The data from the load is not available until the Mem/WB of instruction i, but it is needed at the ID/EX of instruction i+1. It cannot be forwarded back in time!
(Bauer p. 36)
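A hazard-detection check for this case might look like the sketch below. The `Instr` record and the function `needs_load_bubble` are hypothetical names for illustration; `srcs` is assumed to list only the registers an instruction needs at the start of its EX stage.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    op: str                   # "load" or "alu" in this sketch
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # ASSUMPTION: registers needed at the start of EX


def needs_load_bubble(prev: Instr, cur: Instr) -> bool:
    """True if cur must stall one cycle because prev is a load it depends on."""
    return prev.op == "load" and prev.dest is not None and prev.dest in cur.srcs


load = Instr("load", "R6", ("R2",))        # i:   R6 <- Mem[R2]
add = Instr("alu", "R7", ("R6", "R4"))     # i+1: R7 <- R6 + R4
print(needs_load_bubble(load, add))        # True
```

An ALU instruction followed by a dependent ALU instruction returns `False` here, since forwarding from the EX/Mem register covers that case.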

Bubble because of load (Bauer p. 38)

Priority on Forwarding
Example:
i: R10 ← R4 + R5
i+1: R10 ← R4 – R10
i+2: R8 ← R10 + R7
The RAW from i+1 to i+2 must take priority over the RAW from i to i+2.
(Bauer p. 38)
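The priority rule can be sketched as a selection function for one ALU operand. The function name, the `(dest, value)` tuples standing in for pipeline registers, and the register values are all invented for this illustration.

```python
def select_operand(src, regfile_value, ex_mem, mem_wb):
    """Pick the value for source register `src` of the instruction in EX.

    ex_mem and mem_wb are (dest, value) tuples, or None if the instruction
    in that pipeline register writes no register.
    """
    if ex_mem is not None and ex_mem[0] == src:
        return ex_mem[1], "EX/Mem"         # youngest producer wins
    if mem_wb is not None and mem_wb[0] == src:
        return mem_wb[1], "Mem/WB"
    return regfile_value, "register file"


# Hypothetical values: R4 = 10, R5 = 3.
# i:   R10 <- R4 + R5  = 13   (now in Mem/WB)
# i+1: R10 <- R4 - R10 = -3   (now in EX/Mem)
# i+2: R8  <- R10 + R7        (now in EX, reads R10)
value, source = select_operand("R10", 0, ex_mem=("R10", -3), mem_wb=("R10", 13))
print(value, source)  # -3 EX/Mem
```

Checking the EX/Mem register first is what gives the result of i+1 priority over the older result of i.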

Forwarding from Mem/WB to Mem
Example:
i: R5 ← Mem[R6]
i+1: Mem[R8] ← R5
After the load, the contents of the Mem/WB register must be forwarded to be written to memory (not only to R5).
(Bauer p. 39)

Pipelining with Forwarding and Stall (Bauer p. 38)

Control Hazards (branches) (Bauer p. 40)

Control Hazards: Exceptions and Interruptions
• Exceptions can occur in any stage (except WB)
  – IF: page faults
  – ID: illegal opcodes
  – EX: arithmetic exceptions
  – Mem: illegal address, page faults
• Interruptions:
  – I/O termination, time-outs
  – Power failures
(Bauer p. 40)

Handling Exceptions/Interruptions
[Flowchart boxes: Save the Process State; Clear Exception Condition; decision (?); Abort Program; “Correct” Exception; Perform Unrelated Task; Schedule Process Restart]
(Bauer p. 41)

Precise Exceptions in a Pipeline
• If an exception happens in instruction i:
• Instructions i-1, i-2, … complete normally and contribute to the saved state of the process
• Instructions i, i+1, i+2, … become no-ops
• After the exception is handled, execution restarts at instruction i
  – The PC saved is the PC of instruction i.
[Diagram: … i-2, i-1 complete normally; the exception happens at i; i, i+1, i+2, … become no-ops; execution restarts at i.]
(Bauer p. 41)

Implementing Precise Exceptions in the Pipeline
1. Flag the pipeline register at the right of the stage where the exception was detected
   – This flag moves along the pipeline
2. Set all control lines at a stage with the flag to transform the instruction into a no-op
3. Stop instruction fetching
4. When the flag reaches the Mem/WB stage, save the PC of that instruction as the exception PC
(Bauer p. 41)

Program Order vs. Temporal Order
[Figure labels: divide-by-zero exception, page-fault exception]
Which exception occurs first in time? Which exception should be handled first?
(Bauer p. 41)
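The difference between the two orders can be illustrated with a small sketch. The original figure is not preserved, so the scenario is an assumption: instruction i raises a divide-by-zero in EX while instruction i+1 raises a page fault in IF, with one instruction fetched per cycle and no stalls. The sketch also assumes that exceptions are acted on only when the flagged instruction reaches Mem/WB, so they are handled in program order.

```python
STAGES = ["IF", "ID", "EX", "Mem", "WB"]

# (instruction index in program order, stage of detection, kind); hypothetical.
EXCEPTIONS = [
    (0, "EX", "divide-by-zero"),   # instruction i
    (1, "IF", "page fault"),       # instruction i+1
]


def detection_cycle(exc):
    """Cycle of detection, assuming instruction k enters IF in cycle k."""
    index, stage, _ = exc
    return index + STAGES.index(stage)


def first_in_time(exceptions):
    return min(exceptions, key=detection_cycle)


def first_handled(exceptions):
    """Flags are acted on at Mem/WB, which instructions reach in program order."""
    return min(exceptions, key=lambda exc: exc[0])


print(first_in_time(EXCEPTIONS)[2])   # page fault
print(first_handled(EXCEPTIONS)[2])   # divide-by-zero
```

In this scenario the page fault is detected one cycle earlier, yet the divide-by-zero belongs to the earlier instruction and is the one handled first.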

Design Issues:
Can’t avoid Load/ALU instr. bubble
Branch resolution in EX stage → two-cycle branch penalty
Mem stage unused for ALU instr.
(Bauer p. 38)

Alternative Pipelining Design: Avoiding the load latency penalty
Example:
i: R4 ← Mem[R8]
i+1: R7 ← R4 + R5
(Bauer p. 43)

Avoiding the load latency penalty
Example:
i: R4 ← Mem[R8]
i+1: R7 ← R4 + R5
(Bauer p. 43)

Address Generation Latency Penalty
Example:
i: R5 ← R6 + R7
i+1: R9 ← Mem[R5]
Can’t forward from the future. Has to stall.
(Bauer p. 43)

Other changes
AG used for branch resolution
AG unused for ALU operations
(Bauer p. 43)

Tradeoffs:
Avoids load/ALU bubble vs. additional ALU unit
Move branch resolution to AG → same penalty
AG stage unused for ALU operations
Stalls for ALU/Store instr. dependency
(Bauer p. 43)
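The two designs can be compared by counting stall cycles for a producer and a dependent consumer. This is a simplified model, not the book's: it assumes full forwarding, one instruction issued per cycle, and that a result produced at the end of stage p can be used by a stage that starts in the following cycle. The stage name `MEM/EX` for the alternative design and the dictionary keys are invented for this sketch.

```python
CLASSIC = {
    "stages": ["IF", "ID", "EX", "MEM", "WB"],
    "alu_result": "EX",        # stage at whose end an ALU result exists
    "load_result": "MEM",      # stage at whose end loaded data exists
    "alu_operand": "EX",       # stage at whose start an ALU operand is needed
    "address": "EX",           # stage at whose start address registers are needed
}

ALTERNATIVE = {
    "stages": ["IF", "ID", "AG", "MEM/EX", "WB"],  # hypothetical stage name
    "alu_result": "MEM/EX",
    "load_result": "MEM/EX",
    "alu_operand": "MEM/EX",
    "address": "AG",
}


def stalls(design, producer, consumer_use, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after the producer."""
    stages = design["stages"]
    p = stages.index(design[producer])
    c = stages.index(design[consumer_use])
    return max(0, p + 1 - distance - c)


for name, design in [("classic", CLASSIC), ("alternative", ALTERNATIVE)]:
    print(name,
          "load->ALU:", stalls(design, "load_result", "alu_operand"),
          "ALU->address:", stalls(design, "alu_result", "address"))
```

Under this model the classic design stalls one cycle on load followed by ALU and not on ALU followed by address use, while the alternative design shows the opposite pattern, matching the tradeoff listed above.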

Which one is better: MIPS or Intel 486? (Bauer p. 44)

Pipelining Functional Units: the EX stage
• Parameters of interest:
  – number of stages
  – minimum number of cycles before two independent (no RAW) instructions of the same type can enter the functional unit
(Bauer p. 44)

Single-Precision Floating Point Representation
Most standard floating point representations use:
1 bit for the sign (positive or negative)
8 bits for the range (exponent field)
23 bits for the precision (fraction field)
Layout: S (sign, 1 bit) | E (exponent, 8 bits) | F (fraction, 23 bits)
(P-H. p. 245; Bauer p. 45; from Patt and Patel, pp. 33)
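The three fields can be extracted from a real value with Python's standard `struct` module. This is a minimal sketch; the helper name `float_fields` is made up here, and the exponent it returns is the raw 8-bit field as stored.

```python
import struct


def float_fields(x):
    """Split a value, stored as a 32-bit float, into (S, E, F)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, as stored
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction


s, e, f = float_fields(-6.5)
print(s, e, hex(f))  # 1 129 0x500000
```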

Special Floating Point Representations
In the 8-bit exponent field we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254. What is the value represented when the exponent is 255 (i.e., 11111111₂)?
An exponent equal to 255 = 11111111₂ in a floating point representation indicates a special value.
When the exponent is equal to 255 = 11111111₂ and the fraction is 0, the value represented is infinity.
When the exponent is equal to 255 = 11111111₂ and the fraction is non-zero, the value represented is Not a Number (NaN).
(P-H. p. 246; Bauer p. 45; Hen/Patt, pp. 301)
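The two rules can be checked against Python's own float handling. In this sketch the function `classify` and the label "other" (anything whose exponent field is not 255) are made up for illustration.

```python
import math
import struct


def classify(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 255:
        return "other"
    return "infinity" if fraction == 0 else "NaN"


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(classify(0x7F800000), bits_to_float(0x7F800000))  # infinity inf
print(classify(0xFF800000), bits_to_float(0xFF800000))  # infinity -inf
print(classify(0x7FC00000), math.isnan(bits_to_float(0x7FC00000)))  # NaN True
```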

Floating Point Addition
Inputs: (S1, E1, F1) and (S2, E2, F2)
Stage 1:
  If E1 < E2, swap operands
  Insert 1 to the left of F1 and to the left of F2
  If S1 ≠ S2, replace F2 by its 2's complement
  D = E1 – E2
  F2 ← F2 >> D
Stages 2-3: add mantissas
Stage 4: normalize and round off
(Bauer p. 46)
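The steps above can be followed in a simplified software sketch. It is not the hardware algorithm from the figure: it uses Python signed integers in place of an explicit 2's complement, truncates instead of rounding, and assumes normalized, finite inputs and results (no zero, denormal, infinity, NaN, overflow, or underflow handling on input). All function names are invented for this illustration.

```python
import struct

FRACTION_BITS = 23
HIDDEN_ONE = 1 << FRACTION_BITS


def to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]


def from_bits(bits):
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def split(bits):
    return bits >> 31, (bits >> FRACTION_BITS) & 0xFF, bits & (HIDDEN_ONE - 1)


def fp_add(a, b):
    """Add two normalized single-precision values step by step (truncating)."""
    s1, e1, f1 = split(to_bits(a))
    s2, e2, f2 = split(to_bits(b))
    # Stage 1: make operand 1 the one with the larger exponent, then align.
    if e1 < e2:
        s1, e1, f1, s2, e2, f2 = s2, e2, f2, s1, e1, f1
    m1 = HIDDEN_ONE | f1           # insert the hidden 1
    m2 = HIDDEN_ONE | f2
    m2 >>= e1 - e2                 # shift the smaller operand right by D
    # Stages 2-3: add mantissas (signed ints stand in for 2's complement).
    total = (-m1 if s1 else m1) + (-m2 if s2 else m2)
    if total == 0:
        return 0.0
    sign = 1 if total < 0 else 0
    mag, exp = abs(total), e1
    # Stage 4: normalize (rounding is omitted; extra bits are truncated).
    while mag >= (HIDDEN_ONE << 1):
        mag >>= 1
        exp += 1
    while mag < HIDDEN_ONE:
        mag <<= 1
        exp -= 1
    return from_bits((sign << 31) | (exp << FRACTION_BITS) | (mag & (HIDDEN_ONE - 1)))


print(fp_add(1.5, 2.25))   # 3.75
print(fp_add(1.0, -0.25))  # 0.75
```

The examples use values that are exactly representable, so truncation does not affect the results.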