Pipelining: Appendix A and Chapter 3

Pipelining: It's Natural!
• Laundry example
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run one after another, each taking 30, 40, and 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped, with stage times 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
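The 6-hour and 3.5-hour figures can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and the pipelined formula assumes every load is identical and each machine handles one load at a time.

```python
# Stage times in minutes, taken from the laundry example.
STAGES = (30, 40, 20)   # washer, dryer, "folder"
LOADS = 4

def sequential_minutes(stages, loads):
    # Each load runs through all stages before the next load starts.
    return loads * sum(stages)

def pipelined_minutes(stages, loads):
    # The first load passes through every stage; each later load
    # finishes one slowest-stage period after the load before it.
    return sum(stages) + (loads - 1) * max(stages)

print(sequential_minutes(STAGES, LOADS) / 60)  # hours, sequential
print(pipelined_minutes(STAGES, LOADS) / 60)   # hours, pipelined
```

The 40-minute dryer sets the rate: once the pipeline is full, one load finishes every 40 minutes regardless of how fast the washer and folder are.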

Key Definitions
Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.
A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.
The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.

Pipeline Stages
We can divide the execution of an instruction into the following 5 “classic” stages:
• IF: Instruction Fetch
• ID: Instruction Decode, register fetch
• EX: Execution
• MEM: Memory Access
• WB: Register Write Back

Pipeline Throughput and Latency
Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns
Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.
• Pipeline throughput: instructions completed per second.
• Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Pipeline Throughput and Latency
Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns
• Pipeline throughput: how often an instruction is completed.
• Pipeline latency: how long it takes to execute an instruction in the pipeline.
Is this right?

Pipeline Throughput and Latency
Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns
Simply adding the stage delays to compute the pipeline latency works only for an isolated instruction:
I1: IF ID EX MEM WB    L(I1) = 28 ns
I2: IF ID EX MEM WB    L(I2) = 33 ns
I3: IF ID EX MEM WB    L(I3) = 38 ns
I4: IF ID EX MEM WB    L(I4) = 43 ns
We are in trouble! The latency is not constant. A new instruction can enter every 5 ns, but the 10 ns MEM stage releases only one instruction every 10 ns, so each instruction waits 5 ns longer than the one before it. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
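The growing latencies above can be reproduced with a short simulation. This is a sketch under stated assumptions, not a model of any real processor: the function name is hypothetical, an instruction enters a stage as soon as it has left the previous stage and the stage is free, and instructions may wait between stages without blocking the ones behind them.

```python
def instruction_latencies(stage_delays, count):
    """Latency of each of `count` instructions, measured from fetch start."""
    prev_done = [0] * len(stage_delays)   # when the previous instruction left each stage
    latencies = []
    for _ in range(count):
        done = []
        arrive = 0          # time this instruction is ready for the next stage
        fetch_start = 0
        for s, delay in enumerate(stage_delays):
            start = max(arrive, prev_done[s])   # wait until the stage is free
            if s == 0:
                fetch_start = start
            arrive = start + delay
            done.append(arrive)
        latencies.append(done[-1] - fetch_start)
        prev_done = done
    return latencies

unbalanced = [5, 4, 5, 10, 4]   # IF, ID, EX, MEM, WB delays in ns
balanced = [10] * 5             # every stage stretched to the longest one

print(instruction_latencies(unbalanced, 4))
print(instruction_latencies(balanced, 4))
```

With the balanced delays every instruction sees the same latency, at the cost of a longer latency for an isolated instruction (50 ns instead of 28 ns).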

Food for thought?
• What is the impact of latency when we have synchronous pipelines?
• A synchronous pipeline is one where, even if there are non-uniform stages, each stage has to wait until all the stages have finished
• Assess the impact of clock skew on synchronous pipelines, if any.

Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM; loads A, B, C, D in task order, with stage times 30, 40, 40, 20 minutes]
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
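The effect of fill and drain time on speedup can be made concrete with a short derivation. The symbols are introduced here for illustration and are not from the slides: n tasks, k perfectly balanced stages, each of length t, and no pipelining overhead.

```latex
\begin{align*}
T_{\text{seq}}  &= n \, k \, t \\
T_{\text{pipe}} &= (k + n - 1)\, t \\
\text{Speedup}  &= \frac{T_{\text{seq}}}{T_{\text{pipe}}}
                 = \frac{n k}{k + n - 1}
                 \;\longrightarrow\; k \quad \text{as } n \to \infty
\end{align*}
```

The k - 1 extra stage times in the denominator are the fill and drain cost. They matter for short workloads and fade as n grows, which is why the potential speedup equals the number of pipe stages only in the limit.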

Other Definitions
• Pipe stage or pipe segment
  – A decomposable unit of the fetch-decode-execute paradigm
• Pipeline depth
  – Number of stages in a pipeline
• Machine cycle
  – Clock cycle time
• Latch
  – Per phase/stage local information storage unit

Design Issues
• Balance the length of each pipeline stage
  Throughput = (Depth of the pipeline) / (Time per instruction on unpipelined machine)
• Problems
  – Usually, stages are not balanced
  – Pipelining overhead
  – Hazards (conflicts)
• Performance (throughput, CPU performance equation)
  – Decrease of the CPI
  – Decrease of cycle time

MIPS Instruction Formats
I-type:  opcode (bits 0-5) | rs1 (6-10) | rd (11-15) | immediate (16-31)
R-type:  opcode (bits 0-5) | rs1 (6-10) | rs2 (11-15) | rd (16-20) | shamt/function (21-31)
J-type:  opcode (bits 0-5) | address (6-31)
Fixed-field decoding

1st and 2nd Instruction Cycles
• Instruction fetch (IF)
  IR ← Mem[PC]; NPC ← PC + 4
• Instruction decode & register fetch (ID)
  A ← Regs[IR6..10]; B ← Regs[IR11..15]; Imm ← ((IR16)^16 ## IR16..31)
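The field selections in the decode step can be sketched as bit operations. This is an illustrative sketch, assuming the instruction is a 32-bit word with bit 0 as the most significant bit, as the field positions suggest; the helper names and the sample field values are hypothetical.

```python
def bits(word, first, last):
    """Return bits first..last of a 32-bit word, where bit 0 is the most significant bit."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)

def sign_extend16(value):
    """Sign-extend a 16-bit value, i.e. the (IR16)^16 ## IR16..31 operation."""
    return value - 0x10000 if value & 0x8000 else value

# Hypothetical I-type word: opcode field 0x23, rs1 = 2, rd = 3, immediate = -4.
ir = (0x23 << 26) | (2 << 21) | (3 << 16) | 0xFFFC

opcode = bits(ir, 0, 5)
rs1 = bits(ir, 6, 10)      # register index read into A
rd = bits(ir, 11, 15)      # register index read into B
imm = sign_extend16(bits(ir, 16, 31))
print(opcode, rs1, rd, imm)
```

Because the fields sit at fixed positions, the register indices can be extracted and the registers read before the opcode has been fully decoded.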

3rd Instruction Cycle
• Execution & effective address (EX)
  – Memory reference
    ALUOutput ← A + Imm
  – Register-register ALU instruction
    ALUOutput ← A func B
  – Register-immediate ALU instruction
    ALUOutput ← A op Imm
  – Branch
    ALUOutput ← NPC + Imm; Cond ← (A op 0)

4th Instruction Cycle
• Memory access & branch completion (MEM)
  – Memory reference
    PC ← NPC
    LMD ← Mem[ALUOutput] (load)
    Mem[ALUOutput] ← B (store)
  – Branch
    if (cond) PC ← ALUOutput; else PC ← NPC

5th Instruction Cycle
• Write-back (WB)
  – Register-register ALU instruction
    Regs[IR16..20] ← ALUOutput
  – Register-immediate ALU instruction
    Regs[IR11..15] ← ALUOutput
  – Load instruction
    Regs[IR11..15] ← LMD
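The five instruction cycles can be sketched as one function that runs them in order for a single instruction. This is a simplified, unpipelined illustration: instructions are represented as Python dicts with hypothetical keys (kind, rs1, rs2, rd, imm, func) rather than encoded 32-bit words, and the branch condition is assumed to be "A equals 0" as one instance of "A op 0".

```python
import operator

def step(pc, imem, regs, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # IF: fetch the instruction and compute the sequential next PC.
    ir = imem[pc]
    npc = pc + 4
    # ID: read source registers and the immediate.
    a = regs[ir.get("rs1", 0)]
    b = regs[ir.get("rs2", 0)]
    imm = ir.get("imm", 0)
    # EX: effective address, ALU result, or branch target and condition.
    kind = ir["kind"]
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = ir["func"](a, b)
    elif kind == "alu_imm":
        alu_output = ir["func"](a, imm)
    else:  # branch (assumed condition: A == 0)
        alu_output = npc + imm
        cond = (a == 0)
    # MEM: memory access and branch completion.
    next_pc = npc
    lmd = None
    if kind == "load":
        lmd = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b
    elif kind == "branch" and cond:
        next_pc = alu_output
    # WB: write the result back to the register file.
    if kind in ("alu_rr", "alu_imm"):
        regs[ir["rd"]] = alu_output
    elif kind == "load":
        regs[ir["rd"]] = lmd
    return next_pc

# Hypothetical three-instruction program: load, add, store.
program = {
    0: {"kind": "load", "rs1": 1, "rd": 2, "imm": 4},
    4: {"kind": "alu_rr", "rs1": 2, "rs2": 2, "rd": 3, "func": operator.add},
    8: {"kind": "store", "rs1": 1, "rs2": 3, "imm": 8},
}
regs = [0] * 8
regs[1] = 100
dmem = {104: 7}
pc = 0
while pc in program:
    pc = step(pc, program, regs, dmem)
print(regs[2], regs[3], dmem[108])
```

The temporaries (ir, npc, a, b, imm, alu_output, lmd) mirror the names used in the cycle descriptions; in a pipelined design they become the contents of the registers between stages.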

5 Steps of MIPS Datapath
[Figure: datapath divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back; components include Next PC, Next SEQ PC, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend, Imm, ALU, Zero? test, data memory, LMD, WB data, and MUXes]

5 Steps of MIPS Datapath
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages]
• Data stationary control
  – Local decode for each instruction phase / pipeline stage

Steps to Execute Each Instruction Type

DETAILED IMPLEMENTATION
[Figure: detailed datapath with PC, memory (read address, write address, write data, Mem. Data), instruction register (fields [31-26], [25-0], I[25-21], I[20-16], [15-11], I[15-0]), register file (read register 1 and 2, write register, write data, read data 1 and 2), sign extend (16 to 32), shift left 2, target, multiplexers, and ALU (Zero, ALU result)]

Control
[Figure: control sequence through Step 1 to Step 5, with paths labelled RR ALU, Imm, Store, and Load]

Basic Pipeline
                 Clock number
Instr #   1    2    3    4    5    6    7    8    9
i         IF   ID   EX   MEM  WB
i+1            IF   ID   EX   MEM  WB
i+2                 IF   ID   EX   MEM  WB
i+3                      IF   ID   EX   MEM  WB
i+4                           IF   ID   EX   MEM  WB
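The staggered pattern in the basic pipeline diagram can be generated programmatically. This is a minimal sketch with a hypothetical function name, assuming one instruction issues per clock and no stalls occur.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_rows(num_instructions, stages=STAGES):
    """Return one row per instruction; column c holds the stage active in clock c + 1."""
    total_clocks = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_clocks
        for s, name in enumerate(stages):
            row[i + s] = name   # instruction i enters stage s in clock i + s + 1
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_rows(5)):
    label = "i" if i == 0 else f"i+{i}"
    print(f"{label:<5}" + "".join(f"{cell:<5}" for cell in row))
```

Five instructions through five stages occupy 5 + 5 - 1 = 9 clocks, and from clock 5 onward one instruction completes every clock.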

Pipeline Resources
[Figure: five overlapped instructions, each using IM, Reg, ALU, DM, and Reg in successive clock cycles]

Pipelined Datapath
[Figure: pipelined datapath with PC, adder (+4), instruction cache, registers, sign extend, ALU, Zero? test, data cache, and multiplexers, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Performance limitations
• Imbalance among pipe stages
  – Limits cycle time to slowest stage
• Pipelining overhead
  – Pipeline register delay
  – Clock skew
• Clock cycle > clock skew + latch overhead
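Both limitations can be seen in a quick calculation. This sketch reuses the stage delays from the throughput and latency slides; the function names and the 1 ns combined overhead (pipeline register delay plus clock skew) are assumptions chosen only for illustration.

```python
STAGE_DELAYS_NS = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}

def pipelined_cycle_ns(stage_delays, overhead_ns):
    # The clock must cover the slowest stage plus the per-stage overhead.
    return max(stage_delays.values()) + overhead_ns

def speedup(stage_delays, overhead_ns):
    # Unpipelined time per instruction is the sum of all stage delays.
    unpipelined_ns = sum(stage_delays.values())
    return unpipelined_ns / pipelined_cycle_ns(stage_delays, overhead_ns)

OVERHEAD_NS = 1  # hypothetical value

print(pipelined_cycle_ns(STAGE_DELAYS_NS, OVERHEAD_NS))
print(speedup(STAGE_DELAYS_NS, OVERHEAD_NS))
```

Even with zero overhead, the speedup here is 28 / 10 = 2.8 rather than the ideal 5, because the 10 ns MEM stage sets the cycle time; any overhead lowers it further.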