Pipelined Design 9
- Slides: 13
Recitation 9: Pipelined Design
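Latency and Throughput
Latency: the total time to perform a single operation from start to end.
Throughput: the number of operations completed per unit of time, usually given in GOPS (giga-operations per second). In a pipeline one operation completes every clock cycle, so throughput = 1 / cycle time.
The clock cycle is bounded by the slowest stage, plus the delay of the pipeline register.
Picosecond (ps): 10^-12 s
Nanosecond (ns): 10^-9 s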
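Latency and Throughput
Combinational blocks and their delays: A = 80 ps, B = 30 ps, C = 60 ps, D = 50 ps, E = 70 ps, F = 10 ps. Each register (REG) adds 20 ps.
• How to maximize throughput using one additional register? Put it between C and D.
• How to maximize throughput using 2 additional registers? Split into AB, CD, EF.
• How to maximize throughput using 3 additional registers? Split into A, BC, D, EF.
• What are the cycle time, latency and throughput in that case? The slowest stage is BC (90 ps), so the cycle time is 90 + 20 = 110 ps, the latency is 4 * 110 = 440 ps, and the throughput is (1/110) * 1000 = 9.09 GOPS.
• How to get maximum throughput with the minimum number of stages? 5 stages (A, B, C, D, EF), since block A alone takes 80 ps and we cannot go lower than 80 + 20 = 100 ps for a clock cycle.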
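The arithmetic in this exercise can be checked with a short Python sketch. The delays are the ones from the exercise; the function name `pipeline_metrics` and the string encoding of a partition (for example "BC" for a stage holding blocks B and C) are illustrative choices, not part of the slides.

```python
# Block delays in picoseconds, as given in the exercise.
BLOCK_DELAYS = {"A": 80, "B": 30, "C": 60, "D": 50, "E": 70, "F": 10}
REGISTER_DELAY = 20  # ps added by every pipeline register


def pipeline_metrics(stages):
    """Return (cycle time in ps, latency in ps, throughput in GOPS).

    `stages` is a list of strings; each string names the blocks that
    share one pipeline stage, e.g. ["A", "BC", "D", "EF"].
    """
    stage_delays = [sum(BLOCK_DELAYS[b] for b in stage) for stage in stages]
    cycle_ps = max(stage_delays) + REGISTER_DELAY  # slowest stage + register
    latency_ps = cycle_ps * len(stages)            # one cycle per stage
    throughput_gops = 1000 / cycle_ps              # 1 op per cycle, ps -> GOPS
    return cycle_ps, latency_ps, throughput_gops


if __name__ == "__main__":
    partitions = [
        ["ABC", "DEF"],
        ["AB", "CD", "EF"],
        ["A", "BC", "D", "EF"],
        ["A", "B", "C", "D", "EF"],
    ]
    for stages in partitions:
        cycle, latency, gops = pipeline_metrics(stages)
        print(stages, cycle, latency, round(gops, 2))
```

Adding stages beyond the 5-stage split cannot shorten the cycle further, because block A already occupies a stage of its own.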
Pipeline registers
Select A: valP is merged into valA in the decode stage, since only jXX and call need valP at later stages (E and M, respectively).
Predicted PC
Control Hazards
• PC prediction (Fetch stage)
  jXX, ret → ?
  call, jmp → valC
  Other → valP
• Branch prediction strategies
  • Always taken
  • Never taken
  • Backward taken, forward not taken
Why are forward branches taken less often?
How can we deal with stack prediction (ret)?
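The PC prediction rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: instruction codes are plain strings, the function name `predict_pc` and the strategy names are hypothetical, a branch is treated as "backward" when its target valC is below valP, and ret returns no prediction at all (the fetch stage would have to wait).

```python
def predict_pc(icode, val_c, val_p, strategy="always-taken"):
    """Return the predicted next PC, or None when no prediction is made.

    icode    : "call", "jmp", "jxx", "ret", or any other instruction name
    val_c    : constant encoded in the instruction (the jump/call target)
    val_p    : address of the next sequential instruction
    strategy : how to predict conditional jumps (hypothetical names)
    """
    if icode in ("call", "jmp"):
        return val_c                  # target is always known
    if icode == "ret":
        return None                   # return address is on the stack
    if icode == "jxx":
        if strategy == "always-taken":
            return val_c
        if strategy == "never-taken":
            return val_p
        if strategy == "btfnt":       # backward taken, forward not taken
            return val_c if val_c < val_p else val_p
        raise ValueError("unknown strategy: " + strategy)
    return val_p                      # all other instructions fall through


if __name__ == "__main__":
    # A loop-closing branch at 0x040 that jumps back to 0x010.
    print(hex(predict_pc("jxx", 0x010, 0x045, "btfnt")))
```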
Data Hazards
Data dependencies in processors: the result of one instruction is used as an operand by another (see the nop example).
Stalling / forwarding / load-use hazard
Data Dependencies: 2 Nops
# demo-h2.ys                  1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx        F  D  E  M  W
0x006: irmovl $3,%eax            F  D  E  M  W
0x00c: nop                          F  D  E  M  W
0x00d: nop                             F  D  E  M  W
0x00e: addl %edx,%eax                     F  D  E  M  W
0x010: halt                                  F  D  E  M  W
Cycle 6:
  W: R[%eax] ← 3
  D: valA ← R[%edx] = 10
     valB ← R[%eax] = 0   (Error)
Stalling solution
# demo-h2.ys                  1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $10,%edx        F  D  E  M  W
0x006: irmovl $3,%eax            F  D  E  M  W
0x00c: nop                          F  D  E  M  W
0x00d: nop                             F  D  E  M  W
       bubble                                   E  M  W
0x00e: addl %edx,%eax                     F  D  D  E  M  W
0x010: halt                                  F  F  D  E  M  W
• If an instruction follows too closely after one that writes a register, slow it down
• Hold the instruction in decode
• Dynamically inject a nop (bubble) into the execute stage
Data Forwarding Solution
# demo-h2.ys                  1  2  3  4  5  6  7  8  9  10
0x000: irmovl $10,%edx        F  D  E  M  W
0x006: irmovl $3,%eax            F  D  E  M  W
0x00c: nop                          F  D  E  M  W
0x00d: nop                             F  D  E  M  W
0x00e: addl %edx,%eax                     F  D  E  M  W
0x010: halt                                  F  D  E  M  W
• irmovl is in the write-back stage
• The destination value is in the W pipeline register
• Forward it as valB for the decode stage
Cycle 6:
  W: R[%eax] ← 3
     W_dstE = %eax
     W_valE = 3
  D: srcA = %edx
     srcB = %eax
     valA ← R[%edx] = 10
     valB ← W_valE = 3
Limitation of Forwarding
# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx,0(%edx)                      F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  E  M  W
0x020: halt                                                 F  D  E  M  W
Load-use dependency
• Value needed by the end of the decode stage in cycle 7
• Value read from memory in the memory stage of cycle 8
Cycle 7:
  M: M_dstE = %ebx
     M_valE = 10
  D: valA ← M_valE = 10
     valB ← R[%eax] = 0   (Error)
Cycle 8:
  M: M_dstM = %eax
     m_valM ← M[128] = 3
Avoiding Load/Use Hazard
# demo-luh.ys                             1  2  3  4  5  6  7  8  9  10 11 12
0x000: irmovl $128,%edx                   F  D  E  M  W
0x006: irmovl $3,%ecx                        F  D  E  M  W
0x00c: rmmovl %ecx,0(%edx)                      F  D  E  M  W
0x012: irmovl $10,%ebx                             F  D  E  M  W
0x018: mrmovl 0(%edx),%eax  # Load %eax               F  D  E  M  W
       bubble                                                  E  M  W
0x01e: addl %ebx,%eax  # Use %eax                        F  D  D  E  M  W
0x020: halt                                                 F  F  D  E  M  W
• Stall the using instruction for one cycle
• It can then pick up the loaded value by forwarding from the memory stage
Cycle 8:
  W: W_dstE = %ebx
     W_valE = 10
  M: M_dstM = %eax
     m_valM ← M[128] = 3
  D: valA ← W_valE = 10
     valB ← m_valM = 3
Control Logic
Processing "ret": must stall until the instruction reaches write-back.
Load/use hazard: must stall for one cycle between the instruction that reads memory and the instruction that uses the value.
Mispredicted branch: remove the wrongly fetched instructions from the pipeline.
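The three cases above can be sketched as detection conditions plus the stall and bubble signals they drive. The sketch below follows the standard Y86 PIPE control logic as commonly presented, which is an assumption since the slide does not spell it out; the `PipeState` class, the string instruction codes and the function name `control_signals` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipeState:
    """Fields of the pipeline state that the control logic inspects."""
    D_icode: str   # instruction currently in decode
    E_icode: str   # instruction currently in execute
    M_icode: str   # instruction currently in memory
    E_dstM: str    # register loaded from memory by the instruction in execute
    d_srcA: str    # source registers of the instruction in decode
    d_srcB: str
    e_Cnd: bool    # condition result of the jump in execute (True = taken)


def control_signals(s):
    """Return the stall/bubble signals for the F, D and E pipeline registers."""
    # ret is somewhere between decode and memory: next PC is still unknown.
    ret = "ret" in (s.D_icode, s.E_icode, s.M_icode)
    # A load in execute writes a register that decode wants to read.
    load_use = (s.E_icode in ("mrmovl", "popl")
                and s.E_dstM in (s.d_srcA, s.d_srcB))
    # A conditional jump was predicted taken but the condition is false.
    mispredict = s.E_icode == "jxx" and not s.e_Cnd
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        "D_bubble": mispredict or (ret and not load_use),
        "E_bubble": mispredict or load_use,
    }


if __name__ == "__main__":
    # Cycle 7 of demo-luh.ys: mrmovl in execute, addl in decode.
    state = PipeState("addl", "mrmovl", "irmovl", "%eax", "%ebx", "%eax", True)
    print(control_signals(state))
```

For the load/use case the sketch holds F and D in place and injects a bubble into E, which matches the one-cycle stall shown for demo-luh.ys.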
• Implementing the forwarding logic
• Note: 5 forwarding sources
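A minimal Python sketch of the forwarding selection is given below. It assumes the five sources are the ones used in the standard Y86 PIPE design (e_valE, m_valM, M_valE, W_valM, W_valE), checked from the youngest instruction to the oldest; the slide itself only states that there are five. The function name `forward` and the list-of-tuples encoding are hypothetical.

```python
RNONE = None  # stands for "no register"


def forward(src, reg_value, sources):
    """Pick the value for a decode-stage operand.

    src       : source register ID needed in decode (or RNONE)
    reg_value : value read from the register file for `src`
    sources   : list of (destination register, value) pairs, ordered from
                the youngest instruction to the oldest, so that the most
                recent write wins
    """
    if src is RNONE:
        return reg_value
    for dst, value in sources:
        if dst == src:
            return value          # forward the pending result
    return reg_value              # no pending write: use the register file


if __name__ == "__main__":
    # Cycle 6 of demo-h2.ys: the two nops write nothing, and
    # irmovl $3,%eax sits in write-back with W_dstE = %eax, W_valE = 3.
    sources = [
        (RNONE, 0),    # e_dstE, e_valE  (nop in execute)
        (RNONE, 0),    # M_dstM, m_valM  (nop in memory)
        (RNONE, 0),    # M_dstE, M_valE
        (RNONE, 0),    # W_dstM, W_valM
        ("%eax", 3),   # W_dstE, W_valE
    ]
    val_a = forward("%edx", 10, sources)  # read from the register file
    val_b = forward("%eax", 0, sources)   # forwarded from W_valE
    print(val_a, val_b)
```

Ordering the sources from youngest to oldest matters when two instructions in flight write the same register: the operand must see the later write.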