Lecture 7: An Introduction to Pipelining


Pipelining: It's Natural!
• Laundry example
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight showing loads A, B, C, and D in task order, each taking 30, 40, and 20 minutes back to back]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight showing loads A, B, C, and D overlapped in task order, with segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
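The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a minimal sketch, assuming each load moves to the next machine as soon as both the load and the machine are free; the function name `finish_time` and its parameters are illustrative and not part of the lecture.

```python
def finish_time(stage_minutes, num_loads, pipelined):
    """Return the total minutes needed to finish all loads."""
    if not pipelined:
        # Each load runs through every stage before the next load starts.
        return num_loads * sum(stage_minutes)

    stage_free = [0] * len(stage_minutes)  # time at which each machine is next free
    done = 0
    for _ in range(num_loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait for the machine if it is busy
            t = start + minutes
            stage_free[s] = t
        done = t
    return done


stages = [30, 40, 20]  # washer, dryer, "folder"
print(finish_time(stages, 4, pipelined=False) / 60)  # 6.0 hours
print(finish_time(stages, 4, pipelined=True) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the dryer, as the slowest stage, sets the pace once the pipeline is full.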

Pipelining Lessons
[Figure: pipelined laundry timeline, loads A to D in task order]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
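The lessons about balance and fill/drain time can be made concrete with a closed-form estimate. The sketch below assumes identical tasks that advance as soon as the next stage is free, so the pipelined time is one full pass plus one slowest-stage interval per additional task; `pipeline_speedup` is a hypothetical helper name.

```python
def pipeline_speedup(stage_times, num_tasks):
    """Speedup of pipelined over sequential execution for identical tasks."""
    sequential = num_tasks * sum(stage_times)
    # First task fills the pipeline; every later task finishes one
    # slowest-stage interval after the previous one.
    pipelined = sum(stage_times) + (num_tasks - 1) * max(stage_times)
    return sequential / pipelined


for n in (1, 4, 100, 10_000):
    unbalanced = pipeline_speedup([30, 40, 20], n)
    balanced = pipeline_speedup([30, 30, 30], n)
    print(f"{n:>6} tasks: unbalanced {unbalanced:.3f}, balanced {balanced:.3f}")
```

With a single task there is no speedup at all. As the number of tasks grows, the balanced three-stage pipeline approaches a speedup of 3 (the number of stages), while the unbalanced laundry stages approach only 90 / 40 = 2.25.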

Definitions
• Pipe stage or pipe segment
• Pipeline depth
• Machine cycle
• Latency
• Throughput

Design Issues
• Balance the length of each pipeline stage
  $\text{Throughput} = \dfrac{\text{Depth of the pipeline}}{\text{Time per instruction on unpipelined machine}}$
• Problems
  – Usually, stages are not balanced
  – Pipelining overhead
  – Hazards (conflicts)
• Performance (throughput, CPU performance equation)
  – Decrease of the CPI
  – Decrease of cycle time
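As a rough illustration of the throughput formula and the CPU performance equation (CPU time = instruction count × CPI × cycle time), here is a short sketch. The function names and all numbers are hypothetical, and the throughput is the ideal value for perfectly balanced stages with no overhead or hazards.

```python
def ideal_pipelined_throughput(depth, unpipelined_time_per_instr):
    """Instructions completed per unit time with perfectly balanced stages."""
    return depth / unpipelined_time_per_instr


def cpu_time(instruction_count, cpi, cycle_time):
    """CPU performance equation: instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time


# Hypothetical machine: 5 steps of 10 ns each, so 50 ns per instruction unpipelined.
print(ideal_pipelined_throughput(5, 50.0))        # instructions per ns

# Compared with a multicycle machine, pipelining lowers the CPI (5 -> 1).
print(cpu_time(1000, cpi=5, cycle_time=10))       # multicycle, ns
print(cpu_time(1000, cpi=1, cycle_time=10))       # ideal pipeline, ns

# Compared with a single long cycle, pipelining lowers the cycle time (50 -> 10).
print(cpu_time(1000, cpi=1, cycle_time=50))       # single-cycle, ns
```

Both views give the same ideal gain, a factor equal to the pipeline depth; which term shrinks depends on the unpipelined baseline being compared against.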

DLX Implementation
• Integer subset of DLX
  – load/store word
  – branch
  – integer ALU
  – NO jumps, NO FP
• Unpipelined implementation
  – maximum five cycles per instruction

Instruction Formats
Format  Fields (bit positions)
I       opcode (0..5) | rs1 (6..10) | rd (11..15) | immediate (16..31)
R       opcode (0..5) | rs1 (6..10) | rs2 (11..15) | rd (16..20) | function (21..31)
J       opcode (0..5) | name (6..31)
• Fixed-field decoding
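Fixed-field decoding means every field can be extracted before the format is known. The sketch below assumes bit 0 is the most significant bit of the 32-bit word, as the bit positions above suggest; the helper names and the example opcode value are hypothetical.

```python
def bits(word, first, last):
    """Extract bits first..last of a 32-bit word, where bit 0 is the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def decode_fields(word):
    """Extract every fixed field; the opcode later decides which ones matter."""
    return {
        "opcode": bits(word, 0, 5),
        "rs1": bits(word, 6, 10),
        "rs2_or_rd": bits(word, 11, 15),  # rs2 in R format, rd in I format
        "rd_r": bits(word, 16, 20),       # rd in R format
        "function": bits(word, 21, 31),
        "immediate": bits(word, 16, 31),  # unsigned here; sign-extended in ID
        "name": bits(word, 6, 31),
    }


# Hypothetical I-format word: opcode 0x23, rs1 = 1, rd = 2, immediate = 100.
word = (0x23 << 26) | (1 << 21) | (2 << 16) | 100
fields = decode_fields(word)
print(fields["opcode"], fields["rs1"], fields["rs2_or_rd"], fields["immediate"])
```

Because the register specifiers always sit in the same positions, the register file can be read in parallel with decoding the opcode.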

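1st and 2nd Instruction cycles
• Instruction fetch (IF)
  $\text{IR} \leftarrow \text{Mem}[\text{PC}]$; $\text{NPC} \leftarrow \text{PC} + 4$
• Instruction decode & register fetch (ID)
  $A \leftarrow \text{Regs}[\text{IR}_{6..10}]$; $B \leftarrow \text{Regs}[\text{IR}_{11..15}]$; $\text{Imm} \leftarrow ((\text{IR}_{16})^{16} \mathbin{\#\#} \text{IR}_{16..31})$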
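The IF and ID register transfers can be sketched directly in code. This is an illustration only: memory is modeled as a dictionary from byte address to 32-bit word, bit 0 is assumed to be the most significant bit, and the names `sign_extend_16` and `fetch_and_decode` are hypothetical.

```python
def sign_extend_16(value):
    """Copy the top bit of a 16-bit immediate (IR bit 16) into the upper 16 bits."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def fetch_and_decode(pc, mem, regs):
    """Perform the IF and ID steps and return the temporaries they produce."""
    # IF
    ir = mem[pc]                     # IR <- Mem[PC]
    npc = pc + 4                     # NPC <- PC + 4
    # ID
    a = regs[(ir >> 21) & 0x1F]      # A <- Regs[IR 6..10]
    b = regs[(ir >> 16) & 0x1F]      # B <- Regs[IR 11..15]
    imm = sign_extend_16(ir)         # Imm <- sign-extended IR 16..31
    return ir, npc, a, b, imm


# Hypothetical instruction word: rs1 = 1, second register field = 2, immediate = -4.
mem = {0: (0x23 << 26) | (1 << 21) | (2 << 16) | 0xFFFC}
regs = [0] * 32
regs[1], regs[2] = 100, 7
print(fetch_and_decode(0, mem, regs)[1:])
```

Both registers and the immediate are read whether or not the instruction ends up using them; that is harmless and keeps the step identical for every format.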

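3rd Instruction cycle
• Execution & effective address (EX)
  – Memory reference
    • $\text{ALUOutput} \leftarrow A + \text{Imm}$
  – Register-register ALU instruction
    • $\text{ALUOutput} \leftarrow A \ \text{func} \ B$
  – Register-immediate ALU instruction
    • $\text{ALUOutput} \leftarrow A \ \text{op} \ \text{Imm}$
  – Branch
    • $\text{ALUOutput} \leftarrow \text{NPC} + \text{Imm}$; $\text{Cond} \leftarrow (A \ \text{op} \ 0)$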

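4th Instruction cycle
• Memory access & branch completion (MEM)
  – Memory reference
    • $\text{PC} \leftarrow \text{NPC}$
    • $\text{LMD} \leftarrow \text{Mem}[\text{ALUOutput}]$ (load)
    • $\text{Mem}[\text{ALUOutput}] \leftarrow B$ (store)
  – Branch
    • if (Cond) $\text{PC} \leftarrow \text{ALUOutput}$; else $\text{PC} \leftarrow \text{NPC}$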

5th Instruction cycle
• Write-back (WB)
  – Register-register ALU instruction
    • $\text{Regs}[\text{IR}_{16..20}] \leftarrow \text{ALUOutput}$
  – Register-immediate ALU instruction
    • $\text{Regs}[\text{IR}_{11..15}] \leftarrow \text{ALUOutput}$
  – Load instruction
    • $\text{Regs}[\text{IR}_{11..15}] \leftarrow \text{LMD}$
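The EX, MEM, and WB steps can be sketched as one function over an already-decoded instruction. This is a simplified illustration, not the lecture's implementation: the `kind` strings and the `execute` signature are hypothetical, ALU operations are passed in as Python callables, and it is assumed that every non-branch instruction sets PC to NPC. Choosing the destination register (IR 16..20 or IR 11..15) is left to the caller.

```python
import operator


def execute(kind, op, a, b, imm, npc, mem):
    """Run EX, MEM, and WB; return (next_pc, value_to_write_back_or_None)."""
    # EX: execution and effective address
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm              # effective address
    elif kind == "alu_rr":
        alu_output = op(a, b)             # A func B
    elif kind == "alu_imm":
        alu_output = op(a, imm)           # A op Imm
    elif kind == "branch":
        alu_output = npc + imm            # branch target
        cond = op(a, 0)                   # A op 0
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # MEM: memory access and branch completion
    next_pc = npc
    lmd = None
    if kind == "load":
        lmd = mem[alu_output]             # LMD <- Mem[ALUOutput]
    elif kind == "store":
        mem[alu_output] = b               # Mem[ALUOutput] <- B
    elif kind == "branch" and cond:
        next_pc = alu_output              # taken branch

    # WB: write-back
    if kind == "load":
        return next_pc, lmd
    if kind in ("alu_rr", "alu_imm"):
        return next_pc, alu_output
    return next_pc, None                  # stores and branches write no register


mem = {108: 42}
print(execute("load", None, 100, 0, 8, 4, mem))           # load from address 108
print(execute("branch", operator.eq, 0, 0, 16, 4, mem))   # taken branch to 20
```

Stores and branches finish after the MEM step, which is why five cycles is a maximum rather than a fixed count in the unpipelined implementation.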

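Datapath
[Figure: unpipelined DLX datapath divided into IF, ID, EX, MEM, and WB. Labeled components: PC, adder (+4), NPC, instruction cache, IR, Regs, A, B, sign extend, Imm, multiplexers, Zero? test, Cond, ALU, ALUOutput, data cache, LMD]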

Control
[Figure: control sequence from Step 1 to Step 5, with paths labeled RR ALU, Imm, Store, and Load]

Basic Pipeline
          Clock number
Instr #   1    2    3    4    5    6    7    8    9
i         IF   ID   EX   MEM  WB
i+1            IF   ID   EX   MEM  WB
i+2                 IF   ID   EX   MEM  WB
i+3                      IF   ID   EX   MEM  WB
i+4                           IF   ID   EX   MEM  WB
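The staggered diagram follows a simple rule: instruction i+k occupies stage s during clock k + s + 1. A minimal sketch, assuming no stalls or hazards; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_table(num_instructions, stages=STAGES):
    """Return one row per instruction; each row lists the stage held in each clock."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in clock i + s + 1
        rows.append(row)
    return rows


def format_table(rows):
    """Render the rows as fixed-width text with a clock-number header."""
    header = f"{'Instr #':<10}" + "".join(f"{c:<5}" for c in range(1, len(rows[0]) + 1))
    lines = [header]
    for i, row in enumerate(rows):
        label = "i" if i == 0 else f"i+{i}"
        lines.append(f"{label:<10}" + "".join(f"{cell:<5}" for cell in row))
    return "\n".join(line.rstrip() for line in lines)


print(format_table(pipeline_table(5)))
```

For n instructions and a five-stage pipeline the table has n + 4 columns, so five instructions finish in nine clocks instead of the 25 an unpipelined five-cycle machine would need.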

Pipeline Resources
[Figure: five overlapped instructions, each using IM, Reg, ALU, DM, and Reg in successive clock cycles]

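Pipelined Datapath
[Figure: DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between stages. Labeled components: PC, adder (+4), instruction cache, Regs, sign extend, multiplexers, Zero? test, ALU, data cache]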

Performance limitations
• Imbalance among pipe stages
  – limits cycle time to slowest stage
• Pipelining overhead
  – Pipeline register delay
  – Clock skew
• Clock cycle > clock skew + latch overhead
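A rough model shows how imbalance and overhead together cap the speedup. The sketch assumes the pipelined clock cycle equals the slowest stage delay plus latch overhead plus clock skew, and that the unpipelined machine needs the sum of the stage delays per instruction; the function names and the delay values are hypothetical.

```python
def pipelined_cycle_time(stage_delays, latch_overhead, clock_skew):
    """Clock cycle: slowest stage plus the per-stage pipelining overhead."""
    return max(stage_delays) + latch_overhead + clock_skew


def speedup_with_overhead(stage_delays, latch_overhead, clock_skew):
    """Speedup over an unpipelined machine that takes sum(stage_delays) per instruction."""
    cycle = pipelined_cycle_time(stage_delays, latch_overhead, clock_skew)
    return sum(stage_delays) / cycle


delays_ns = [10, 8, 10, 10, 7]  # hypothetical per-stage delays
print(pipelined_cycle_time(delays_ns, latch_overhead=0.5, clock_skew=0.5))
print(round(speedup_with_overhead(delays_ns, latch_overhead=0.5, clock_skew=0.5), 2))
```

Even with zero stage delay the cycle could not drop below the skew plus latch overhead, so adding more and more stages eventually stops paying off.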