

CS 704 Advanced Computer Architecture
Lecture 10: Computer Hardware Design (Pipeline Datapath and Control Design)
Prof. Dr. M. Ashraf Chughtai

Recap: Lecture 9
- Single-cycle versus multi-cycle datapath
- Key components of the multi-cycle datapath
- Design and information flow in the multi-cycle datapath
- Multi-cycle control unit design
- Finite state machine-based control unit
- Microprogram-based controller
MAC/VU-Advanced Computer Architecture, Lecture 10: Computer Hardware Design (4)

What is pipelining?
- Pipelining is a fundamental concept
- It utilizes the capabilities of the datapath by

Pipelining is Natural! Laundry Example
- Four loads: A, B, C, D
- Four laundry operations: wash, dry, fold, and place into drawers
- Washer takes 30 minutes
- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: task order (loads A, B, C, D) versus time from 6 PM to 2 AM; each load goes through four 30-minute operations before the next load starts, so the four loads take 8 hours.]

Pipelined Laundry: Start work ASAP
[Figure: task order (loads A, B, C, D) versus time from 6 PM; each load starts 30 minutes after the previous one.]
Pipelined laundry takes 3.5 hours for 4 loads!
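
The two laundry timings can be checked with a few lines of Python. This is only a sketch of the arithmetic behind the slides; the function names are made up for illustration, and the pipelined model assumes every load advances in lockstep at the pace of the slowest operation.

```python
# Laundry model from the slides: 4 loads, 4 operations of 30 minutes each.
STAGE_MINUTES = [30, 30, 30, 30]  # wash, dry, fold, stash
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load finishes every operation before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """A new load enters as soon as the first operation is free.

    Assumption: all loads move forward together, one slot at a time,
    where a slot lasts as long as the slowest operation.
    """
    slot = max(stages)
    return (len(stages) + loads - 1) * slot


seq = sequential_minutes(STAGE_MINUTES, LOADS)
pipe = pipelined_minutes(STAGE_MINUTES, LOADS)
print(f"sequential: {seq / 60} hours")  # 8.0 hours
print(f"pipelined:  {pipe / 60} hours")  # 3.5 hours
```

The pipelined total is (4 operations + 4 loads - 1) slots of 30 minutes, which is where the 3.5 hours comes from.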

Features of Pipelined Processor
- All the functional units operate independently
- Multiple tasks operate simultaneously using different resources
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Potential speedup = number of pipe stages

Pipelining Lessons
- Pipeline rate is limited by the slowest pipeline stage
- Time to "fill" the pipeline and time to "drain" it reduce speedup
- Unbalanced lengths of pipe stages reduce speedup
- If the washer takes longer than the dryer, the dryer has to wait!
- Stall for dependences
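
A rough way to see these lessons in numbers is to compare a balanced and an unbalanced pipeline for different task counts. The sketch below is an illustration, not part of the lecture: the unbalanced stage times are hypothetical, and the model assumes the pipeline is clocked at the slowest stage with no stalls.

```python
def pipeline_speedup(stage_times, n_tasks):
    """Speedup of a pipeline over running the same tasks one after another.

    Assumption: the pipeline advances at the pace of its slowest stage,
    and the only losses are fill/drain time and stage imbalance.
    """
    unpipelined = n_tasks * sum(stage_times)
    pipelined = (len(stage_times) + n_tasks - 1) * max(stage_times)
    return unpipelined / pipelined


balanced = [30, 30, 30, 30]    # the laundry example from the slides
unbalanced = [45, 30, 30, 15]  # hypothetical: same total work, slow washer

for n in (4, 100, 10_000):
    print(
        f"{n:>6} tasks: balanced {pipeline_speedup(balanced, n):.2f}x, "
        f"unbalanced {pipeline_speedup(unbalanced, n):.2f}x"
    )
```

With few tasks, fill and drain time keep the speedup well below the number of stages. With many tasks, the balanced pipeline approaches 4x, while the unbalanced one is capped at (total stage time) / (slowest stage time), here 120 / 45.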

Five Steps of Datapath
1. Ins. fetch
2. Dec/Reg
3. Exec
4. Mem
5. Wr

Pipelined Processor Design
[Figure: datapath divided into five stages: Instruction Fetch, ID/Register Read, Execute/Address, Memory Rd/Wrt (Mem Access), and Write Back (Reg. Wrt). Labeled elements: PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, Equal, S, Data Mem, M, Reg. File; control blocks Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl; instruction registers IRex, IRmem, IRwb.]

Pipeline Control
- Instruction Fetch: IR <- Mem[PC]; PC <- PC + 4
- ID/Reg. Rd: A <- R[rs]; B <- R[rt]
- Exe/Address: S <- A + B; S <- A or ZX; S <- A + SX; if Cond, PC <- PC + SX
- Memory Rd/Wrt: M <- Mem[S]; Mem[S] <- B
- Reg. Wrt (WB): R[rd] <- S; R[rt] <- S; R[rd] <- M
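
To make the register transfers concrete, here is a small Python sketch that walks one instruction at a time through the steps listed on the Pipeline Control slide. It is a toy model under several assumptions: the machine state is a plain dictionary, instructions are dictionaries with made-up field names, only an R-type add and a store are handled, and the immediate is taken to be already sign-extended (SX).

```python
def step_through(state):
    """Apply the register transfers for the instruction at the current PC."""
    # Instruction Fetch: IR <- Mem[PC]; PC <- PC + 4
    state["IR"] = state["imem"][state["PC"]]
    state["PC"] += 4
    instr = state["IR"]

    # ID/Reg. Rd: A <- R[rs]; B <- R[rt]
    a = state["R"][instr["rs"]]
    b = state["R"][instr["rt"]]

    if instr["op"] == "add":
        s = a + b                    # Exe: S <- A + B
        state["R"][instr["rd"]] = s  # Reg. Wrt (WB): R[rd] <- S
    elif instr["op"] == "store":
        s = a + instr["imm"]         # Exe/Address: S <- A + SX
        state["dmem"][s] = b         # Memory Wrt: Mem[S] <- B
    else:
        raise ValueError(f"unsupported op in this sketch: {instr['op']}")
    return state


# Hypothetical two-instruction program and register contents.
state = {
    "PC": 0,
    "IR": None,
    "R": [0, 4, 7, 0],
    "imem": {
        0: {"op": "add", "rs": 1, "rt": 2, "rd": 3},
        4: {"op": "store", "rs": 1, "rt": 3, "imm": 8},
    },
    "dmem": {},
}
step_through(state)  # add:   R[3] <- R[1] + R[2]
step_through(state)  # store: Mem[R[1] + 8] <- R[3]
print(state["R"], state["dmem"], state["PC"])
```

In a real pipeline these steps overlap across instructions, and A, B, S, and M are held in registers between stages; the sketch runs them back to back only to show what each transfer does.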

Pipelined Registers Included
[Figure: the same five-stage datapath (Instruction Fetch, ID/Register Read, Execute/Address, Memory Rd/Wrt, Write Back) with the registers between stages shown: IR, A, B, S, M, and the instruction registers IRex, IRmem, IRwb feeding Dcd Ctrl, Ex Ctrl, Mem Ctrl, and WB Ctrl.]

Five Steps as Stages of Pipeline
          Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load      Ifetch    Reg/Dec   Exec      Mem       Wr

Multiple Cycle versus Pipeline: Pipeline Enhances Performance
Multiple-cycle implementation (one instruction at a time):
- Load: Ifetch, Reg, Exec, Mem, Wr
- Store: Ifetch, Reg, Exec, Mem
- R-type: begins only after Store finishes
Pipeline implementation (a new instruction starts every cycle):
          Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6   Cycle 7
Load      Ifetch    Reg       Exec      Mem       Wr
Store               Ifetch    Reg       Exec      Mem       Wr
R-type                        Ifetch    Reg       Exec      Mem       Wr
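
The cycle counts for the three-instruction program can be tallied with a short Python sketch. The per-instruction cycle counts for the multiple-cycle machine are assumptions based on the usual multi-cycle design (5 for a load, 4 for a store, 4 for an R-type); the pipelined count assumes a five-stage pipeline with no stalls.

```python
# Assumed cycles per instruction type on the multiple-cycle machine.
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "r-type": 4}
PIPELINE_DEPTH = 5  # Ifetch, Reg, Exec, Mem, Wr


def multicycle_total(program):
    """Instructions run strictly one after another."""
    return sum(MULTICYCLE_CYCLES[kind] for kind in program)


def pipelined_total(program, depth=PIPELINE_DEPTH):
    """The first instruction needs `depth` cycles; each later one
    completes one cycle after its predecessor (no stalls assumed)."""
    return depth + len(program) - 1


program = ["load", "store", "r-type"]
print("multiple cycle:", multicycle_total(program), "cycles")  # 13
print("pipelined:     ", pipelined_total(program), "cycles")   # 7
```

Note that every instruction still spends five cycles in the pipeline, so no single instruction finishes sooner; the gain comes entirely from overlap.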

Three-Instruction Program Reconsidered
- Load
- Store
- R-type (ADD)

Example
The cycle time of a single-cycle machine is 45 ns, and that of the multi-cycle and pipelined machines is 10 ns. The average CPI due to the instruction mix on the multi-cycle machine is 4.6. What is the execution time of a 100-instruction program on each type of machine?
Answer:
- Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multi-cycle machine: 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
- Pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles drain) = 1040 ns
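
The same three results can be reproduced in Python. The helper name and its `extra_cycles` parameter are made up for this sketch; the numbers are the ones given in the example.

```python
def exec_time_ns(cycle_ns, cpi, n_instr, extra_cycles=0):
    """Execution time = cycle time x (CPI x instruction count + extra cycles)."""
    return cycle_ns * (cpi * n_instr + extra_cycles)


N_INSTR = 100

single_ns = exec_time_ns(45, 1, N_INSTR)
multi_ns = exec_time_ns(10, 4.6, N_INSTR)
# The pipelined machine needs 4 extra cycles to drain after the last fetch.
pipelined_ns = exec_time_ns(10, 1, N_INSTR, extra_cycles=4)

print(f"single cycle: {single_ns:.0f} ns")
print(f"multi cycle:  {multi_ns:.0f} ns")
print(f"pipelined:    {pipelined_ns:.0f} ns")
```

The single-cycle and multi-cycle machines end up close to each other here, while the pipelined machine is more than four times faster than either.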

Another Example
Consider a multi-cycle, unpipelined processor that requires 4 cycles for ALU and branch operations and 5 cycles for memory operations. Assume the relative frequencies of these operations are 40%, 25%, and 35% respectively, and the clock cycle is 1 ns. In the pipelined implementation, clock skew and setup add 0.2 ns of overhead to the clock. Ignoring any latency impact, what is the speedup from the pipelined processor?

Solution
Unpipelined processor:
Average execution time per instruction = clock cycle x average CPI
= 1 ns x [(0.40 + 0.25) x 4 + 0.35 x 5]
= 1 ns x (0.65 x 4 + 0.35 x 5)
= 1 ns x (2.60 + 1.75)
= 4.35 ns
Pipelined processor:
Average execution time per instruction = clock cycle + overhead = 1 ns + 0.2 ns = 1.2 ns
Speedup = 4.35 / 1.2 = 3.625, or about 3.6 times
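
As a cross-check, the weighted-average CPI and the speedup can be computed directly. This is a minimal sketch; the variable names are chosen for illustration, and it assumes, as the solution does, that the pipelined machine completes one instruction per (lengthened) clock cycle.

```python
# (relative frequency, cycles) for each operation class in the example.
mix = {
    "alu": (0.40, 4),
    "branch": (0.25, 4),
    "memory": (0.35, 5),
}

CLOCK_NS = 1.0
OVERHEAD_NS = 0.2  # clock skew and setup added by pipelining

avg_cpi = sum(freq * cycles for freq, cycles in mix.values())
unpipelined_ns = CLOCK_NS * avg_cpi
pipelined_ns = CLOCK_NS + OVERHEAD_NS
speedup = unpipelined_ns / pipelined_ns

print(f"average CPI: {avg_cpi:.2f}")
print(f"unpipelined: {unpipelined_ns:.2f} ns per instruction")
print(f"pipelined:   {pipelined_ns:.2f} ns per instruction")
print(f"speedup:     {speedup:.3f}")
```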

Pipelined Execution Representation
Conventional representation: helps show the program flow vis-a-vis time
           Time ->
1st Inst.  IFetch  Dcd     Exec    Mem     WB
2nd Inst.          IFetch  Dcd     Exec    Mem     WB
3rd Inst.                  IFetch  Dcd     Exec    Mem     WB
4th Inst.                          IFetch  Dcd     Exec    Mem     WB
5th Inst.                                  IFetch  Dcd     Exec    Mem     WB

Graphical Representation
[Figure: instruction order (Instr 1 to Instr 5) versus time in clock cycles (CC 1 to CC 9); each instruction passes through I.Mem, Reg, ALU, D.Mem, and Reg in successive cycles.]

Why Pipeline? Because the resources are there!
[Figure: instruction order (Inst 0 to Inst 4) versus time in clock cycles; each instruction uses Im, Reg, ALU, Dm, and Reg in successive cycles.]
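
One way to read the figure is to ask which resources are busy in a given clock cycle. The Python sketch below tabulates that for five instructions. It is an illustration under assumptions: the register file is split into a read use (second stage) and a write use (fifth stage) so the two can be told apart, and no stalls occur.

```python
# Resource used in each of the five stages, in order.
STAGE_RESOURCES = ["Im", "Reg (read)", "ALU", "Dm", "Reg (write)"]


def busy_resources(cycle, n_instr):
    """Map each busy resource in a clock cycle (1-based) to its instruction.

    Instruction i (0-based) enters the first stage in cycle i + 1 and
    moves one stage per cycle.
    """
    busy = {}
    for instr in range(n_instr):
        stage = cycle - 1 - instr
        if 0 <= stage < len(STAGE_RESOURCES):
            busy[STAGE_RESOURCES[stage]] = f"Inst {instr}"
    return busy


for cycle in range(1, 10):
    print(f"CC {cycle}: {busy_resources(cycle, 5)}")
```

In cycle 5 all five resources are in use, each by a different instruction; in a non-pipelined machine four of them would sit idle in every cycle.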

Can pipelining get us into trouble?
- Structural hazards
- Data hazards
- Control hazards

How Do Stalls Degrade Performance?
Pipelined CPI with stalls = ideal CPI + stall clock cycles per instruction

How Do Stalls Degrade Performance?
1. Speedup w.r.t. unpipelined = CPI unpipelined / (1 + stall cycles per instruction)
2. Speedup w.r.t. pipeline depth = pipeline depth / (1 + stall cycles per instruction)
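
The stall formulas are easy to experiment with in code. The following Python sketch simply transcribes them; the function names are invented here, and the formulas assume an ideal pipelined CPI of 1 and equal clock cycle times for the pipelined and unpipelined machines.

```python
def cpi_with_stalls(ideal_cpi, stall_cycles_per_instr):
    """Pipelined CPI with stalls = ideal CPI + stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def speedup_vs_unpipelined(cpi_unpipelined, stall_cycles_per_instr):
    """Speedup = CPI unpipelined / (1 + stall cycles per instruction)."""
    return cpi_unpipelined / (1 + stall_cycles_per_instr)


def speedup_vs_depth(pipeline_depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return pipeline_depth / (1 + stall_cycles_per_instr)


# Hypothetical stall rates for a five-stage pipeline.
for stalls in (0.0, 0.25, 1.0):
    print(
        f"stalls/instr = {stalls}: CPI = {cpi_with_stalls(1, stalls)}, "
        f"speedup = {speedup_vs_depth(5, stalls)}"
    )
```

With no stalls the speedup equals the pipeline depth; an average of one stall cycle per instruction already halves it.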

Summary
- Multi-cycle datapath versus pipeline datapath
- Key components of the pipeline datapath
- Performance enhancement due to pipelining
- Hazards in the pipelined datapath

Assalam-u-Alaikum and Allah Hafiz