Pipelined Design: Recitation 9

Latency and Throughput
• Latency: total time to perform a single operation from start to end.
• Throughput: number of operations completed per unit time, often expressed in GOPS (giga-operations per second).
• The clock cycle is always bounded below by the slowest stage.
• 1 picosecond (ps) = 10^-12 s; 1 nanosecond (ns) = 10^-9 s.

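Latency and Throughput
Combinational blocks A to F, followed by a pipeline register (REG):

Block:  A      B      C      D      E      F      REG
Delay:  80 ps  30 ps  60 ps  50 ps  70 ps  10 ps  20 ps

• How to maximize throughput using one additional register? Put it between C and D.
• How to maximize throughput using 2 additional registers? Stages AB, CD, EF.
• How to maximize throughput using 3 additional registers? Stages A, BC, D, EF.
• What is the cycle time, latency and throughput in that case? Cycle time 110 ps (slowest stage BC at 90 ps plus the 20 ps register), latency 4 × 110 = 440 ps, throughput (1/110) × 1000 ≈ 9.09 GOPS.
• How to get max throughput with min stages? 5 stages (A, B, C, D, EF), since the clock cycle cannot go below 100 ps (block A at 80 ps plus the 20 ps register).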
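The answers to this exercise can be checked with a short script. The following is a minimal sketch, assuming the block delays A to F are 80, 30, 60, 50, 70 and 10 ps and each register adds 20 ps; the function names `analyze` and `best_partition` are hypothetical and not part of the slides.

```python
from itertools import combinations

# Block delays in picoseconds, as given in the exercise.
DELAYS = {"A": 80, "B": 30, "C": 60, "D": 50, "E": 70, "F": 10}
REG_DELAY = 20  # delay of each pipeline register, in ps


def analyze(stages):
    """Return (cycle_ps, latency_ps, throughput_gops) for a list of stages.

    Each stage is a string of block names, e.g. ["A", "BC", "D", "EF"].
    """
    slowest = max(sum(DELAYS[b] for b in stage) for stage in stages)
    cycle = slowest + REG_DELAY      # the clock is bounded by the slowest stage
    latency = cycle * len(stages)    # one operation passes through every stage
    throughput = 1000 / cycle        # one op per cycle; 1000 ps per ns gives GOPS
    return cycle, latency, throughput


def best_partition(extra_registers):
    """Brute-force the contiguous split of A..F with the shortest cycle time."""
    blocks = "".join(DELAYS)         # "ABCDEF"
    best = None
    for cuts in combinations(range(1, len(blocks)), extra_registers):
        bounds = (0, *cuts, len(blocks))
        stages = [blocks[i:j] for i, j in zip(bounds, bounds[1:])]
        cycle = analyze(stages)[0]
        if best is None or cycle < best[0]:
            best = (cycle, stages)
    return best[1]


if __name__ == "__main__":
    for extra in range(1, 5):
        stages = best_partition(extra)
        print(extra, stages, analyze(stages))
```

With four additional registers the search settles on A, B, C, D, EF at a 100 ps cycle; adding a fifth register cannot shorten the cycle further because block A alone already takes 80 ps.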

Pipeline registers
• Select A: chooses between valP and the register value for valA, since only jXX and call need valP at later stages (E and M, respectively).
• Predicted PC

Control Hazards
• PC prediction (fetch stage)
  • jXX, ret → ?
  • call, jmp → valC
  • Other → valP
• Branch prediction strategies
  • Always taken
  • Never taken
  • Backward taken, forward not taken
Why are forward branches taken less often? How can we deal with stack prediction (ret)?
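The prediction rules above can be written as a small function. This is an illustrative sketch only: the function name `predict_pc`, the strategy labels and the use of `None` for "no prediction" are assumptions made for the example, not part of the slides.

```python
def predict_pc(icode, val_c, val_p, pc, strategy="always"):
    """Return the predicted next PC, or None when no prediction is made.

    icode    -- instruction class: "call", "jmp", "jxx", "ret" or anything else
    val_c    -- constant encoded in the instruction (the jump/call target)
    val_p    -- address of the next sequential instruction
    pc       -- address of the current instruction
    strategy -- "always", "never" or "btfnt" (hypothetical labels)
    """
    if icode in ("call", "jmp"):
        return val_c                  # target is known at fetch time
    if icode == "ret":
        return None                   # return address is still on the stack
    if icode == "jxx":
        if strategy == "always":
            return val_c
        if strategy == "never":
            return val_p
        if strategy == "btfnt":
            # Backward taken, forward not taken: a target at a lower
            # address usually closes a loop, so predict it taken.
            return val_c if val_c < pc else val_p
        raise ValueError(f"unknown strategy: {strategy}")
    return val_p                      # all other instructions fall through


if __name__ == "__main__":
    # A conditional jump at 0x040 back to 0x010 (loop) and forward to 0x080.
    print(hex(predict_pc("jxx", 0x010, 0x045, 0x040, "btfnt")))
    print(hex(predict_pc("jxx", 0x080, 0x045, 0x040, "btfnt")))
```

Returning `None` for ret reflects that the fetch stage has nothing to predict from; the pipeline has to wait for the return address instead.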

Data Hazards
• Data dependencies in processors: the result of one instruction is used as an operand for another (nop example).
• Stalling / forwarding / load-use hazard

Data Dependencies: 2 Nops

# demo-h2.ys                 1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx       F  D  E  M  W
0x006: irmovl $3,%eax           F  D  E  M  W
0x00c: nop                         F  D  E  M  W
0x00d: nop                            F  D  E  M  W
0x00e: addl %edx,%eax                    F  D  E  M  W
0x010: halt                                 F  D  E  M  W

Cycle 6
• W: R[%eax] ← 3
• D: valA ← R[%edx] = 10
• D: valB ← R[%eax] = 0 (Error)

Stalling solution

# demo-h2.ys                 1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx       F  D  E  M  W
0x006: irmovl $3,%eax           F  D  E  M  W
0x00c: nop                         F  D  E  M  W
0x00d: nop                            F  D  E  M  W
bubble                                         E  M  W
0x00e: addl %edx,%eax                    F  D  D  E  M  W
0x010: halt                                 F  F  D  E  M

• If an instruction follows too closely after one that writes a register, slow it down
• Hold the instruction in decode
• Dynamically inject a nop into the execute stage
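The stall count can be worked out mechanically. The sketch below is a simplified model, assuming a five-stage F, D, E, M, W pipeline with no forwarding, in which a register value becomes readable in decode only in the cycle after its writer has been in write-back; the function name and the tuple layout of the program are hypothetical.

```python
def decode_exit_cycles(program):
    """Compute when each instruction leaves decode in a stall-only pipeline.

    program -- list of (name, source_registers, destination_register)
    Returns a list of (name, decode_exit_cycle, stall_cycles).
    """
    writeback_cycle = {}  # register -> cycle in which its latest writer is in W
    result = []
    prev_exit = 1         # the first instruction is fetched in cycle 1
    for name, srcs, dst in program:
        earliest = prev_exit + 1  # decode follows the previous instruction
        # Every source must have been written back in an earlier cycle.
        ready = max((writeback_cycle.get(r, 0) + 1 for r in srcs), default=0)
        exit_cycle = max(earliest, ready)
        result.append((name, exit_cycle, exit_cycle - earliest))
        if dst is not None:
            writeback_cycle[dst] = exit_cycle + 3  # D is followed by E, M, W
        prev_exit = exit_cycle
    return result


# demo-h2.ys: two nops between the irmovl instructions and the addl.
DEMO_H2 = [
    ("irmovl $10,%edx", (), "%edx"),
    ("irmovl $3,%eax", (), "%eax"),
    ("nop", (), None),
    ("nop", (), None),
    ("addl %edx,%eax", ("%edx", "%eax"), "%eax"),
    ("halt", (), None),
]

if __name__ == "__main__":
    for name, cycle, stalls in decode_exit_cycles(DEMO_H2):
        print(f"{name:<18} leaves decode in cycle {cycle}, stalled {stalls}")
```

For demo-h2.ys the model reports one stall cycle for addl, which leaves decode in cycle 7, matching the repeated D in the diagram. Removing both nops from the list raises the stall count to three.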

Data Forwarding Solution

# demo-h2.ys                 1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx       F  D  E  M  W
0x006: irmovl $3,%eax           F  D  E  M  W
0x00c: nop                         F  D  E  M  W
0x00d: nop                            F  D  E  M  W
0x00e: addl %edx,%eax                    F  D  E  M  W
0x010: halt                                 F  D  E  M  W

• irmovl in write-back stage
• Destination value in W pipeline register
• Forward as valB for decode stage

Cycle 6
• W: R[%eax] ← 3, W_dstE = %eax, W_valE = 3
• D: srcA = %edx, srcB = %eax
• D: valA ← R[%edx] = 10
• D: valB ← W_valE = 3

Limitation of Forwarding

# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx,0(%edx)                      F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  E  M  W
0x020: halt                                                 F  D  E  M  W

Load-use dependency
• Value needed by end of decode stage in cycle 7
• Value read from memory in memory stage of cycle 8

Cycle 7
• M: M_dstE = %ebx, M_valE = 10
• D: valA ← M_valE = 10
• D: valB ← R[%eax] = 0 (Error)

Cycle 8
• M: M_dstM = %eax, m_valM ← M[128] = 3

Avoiding Load/Use Hazard

# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx,0(%edx)                      F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
bubble                                                         E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  D  E  M  W
0x020: halt                                                 F  F  D  E  M

• Stall the using instruction for one cycle
• It can then pick up the loaded value by forwarding from the memory stage

Cycle 8
• W: W_dstE = %ebx, W_valE = 10
• M: M_dstM = %eax, m_valM ← M[128] = 3
• D: valA ← W_valE = 10
• D: valB ← m_valM = 3
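The detection condition behind this one-cycle stall can be sketched as follows. This is an assumption-laden illustration: the names `load_use_hazard` and `pipeline_actions` are hypothetical, and treating popl as a load alongside mrmovl follows the usual Y86 convention rather than anything stated on the slide.

```python
# Assumption: the instructions that read a value from memory into a register.
LOAD_INSTRUCTIONS = {"mrmovl", "popl"}


def load_use_hazard(e_icode, e_dst_m, d_srcs):
    """True when the instruction in execute is a load whose destination
    register is a source of the instruction currently in decode."""
    return e_icode in LOAD_INSTRUCTIONS and e_dst_m in d_srcs


def pipeline_actions(hazard):
    """Actions for the F, D and E pipeline registers in the next cycle.

    On a load/use hazard the fetch and decode stages are held in place
    and a bubble is injected into execute, as in the diagram.
    """
    if hazard:
        return {"F": "stall", "D": "stall", "E": "bubble"}
    return {"F": "normal", "D": "normal", "E": "normal"}


if __name__ == "__main__":
    # Cycle 7 of demo-luh.ys: mrmovl is in execute, addl is in decode.
    hazard = load_use_hazard("mrmovl", "%eax", ("%ebx", "%eax"))
    print(hazard, pipeline_actions(hazard))
```

Checking the instruction in execute (rather than memory) is what lets the stall take effect one cycle before the value is actually read from memory.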

Control Logic
• Processing "ret": must stall until the instruction reaches write-back
• Load/use hazard: must stall between the memory read and the use
• Mispredicted branch: must remove the wrongly fetched instructions from the pipe

• Implementing the forwarding logic
• Note: 5 forwarding sources
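One way to picture the forwarding logic is as a priority search over the forwarding sources, falling back to the register file. The sketch below is illustrative only: the function name is hypothetical, and the source list e_valE, m_valM, M_valE, W_valM, W_valE (youngest instruction first) is an assumption based on the common textbook PIPE design; the slides themselves name only some of these signals.

```python
def select_operand(src, forwarding_sources, register_file):
    """Pick the value of a decode-stage operand.

    src                -- source register name, or None if unused
    forwarding_sources -- list of (dst_register, value) pairs ordered from
                          the youngest instruction (execute) to the oldest
                          (write-back); the first match wins
    register_file      -- dict mapping register names to stored values
    """
    if src is None:
        return None
    for dst, value in forwarding_sources:
        if dst == src:
            return value          # most recent pending write to src
    return register_file[src]     # no pending write: read the register file


if __name__ == "__main__":
    # Cycle 6 of demo-h2.ys: irmovl $3,%eax is in write-back, and the two
    # nops in execute and memory write no register (dst is None).
    sources = [
        (None, None),    # e_valE
        (None, None),    # m_valM
        (None, None),    # M_valE
        (None, None),    # W_valM
        ("%eax", 3),     # W_valE
    ]
    registers = {"%edx": 10, "%eax": 0}
    print(select_operand("%edx", sources, registers))
    print(select_operand("%eax", sources, registers))
```

Searching from the execute stage backwards matters when several in-flight instructions write the same register: the operand should come from the most recent writer, which sits in the earliest pipeline stage.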