Lecture: Static ILP
• Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

Loop Example
LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; Int. ALU -> BR: 1 stall
Source code:
        for (i=1000; i>0; i--)
            x[i] = x[i] + s;
Assembly code:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        ADD.D   F4, F0, F2      ; add scalar
        S.D     F4, 0(R1)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        BNE     R1, R2, Loop    ; branch if R1 != R2
        NOP
10-cycle schedule:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        stall
        ADD.D   F4, F0, F2      ; add scalar
        stall
        stall
        S.D     F4, 0(R1)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        stall
        BNE     R1, R2, Loop    ; branch if R1 != R2
        stall

Smart Schedule
LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; Int. ALU -> BR: 1 stall
Original schedule (10 cycles):
Loop:   L.D     F0, 0(R1)
        stall
        ADD.D   F4, F0, F2
        stall
        stall
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        stall
        BNE     R1, R2, Loop
        stall
Re-ordered schedule (6 cycles):
Loop:   L.D     F0, 0(R1)
        DADDUI  R1, R1, #-8
        ADD.D   F4, F0, F2
        stall
        BNE     R1, R2, Loop
        S.D     F4, 8(R1)
• By re-ordering instructions, the loop takes 6 cycles per iteration instead of 10
• We were able to violate an anti-dependence easily because an immediate was involved: the S.D now executes after the DADDUI that updates R1, so its offset changes from 0 to 8 to compensate
• Loop overhead (instrs that do book-keeping for the loop): 2 instrs
  Actual work (the L.D, ADD.D, and S.D): 3 instrs
  Can we somehow get execution time to be 3 cycles per iteration?
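The 10-cycle and 6-cycle counts can be checked mechanically. The following is a minimal sketch in C, assuming a single-issue, in-order pipeline in which an instruction issues as soon as the stall requirement of its most recent producer is met. The names `Kind`, `Instr`, `stalls_between`, `count_cycles`, `naive_loop`, and `smart_loop` are hypothetical, and the zero-stall default for producer/consumer pairs not listed in the latency table is an assumption.

```c
#include <assert.h>
#include <string.h>

#define MAX_INSTRS 64

typedef enum { LD, FPALU, ST, INTALU, BR } Kind;

typedef struct {
    Kind kind;
    const char *dst;     /* register written, "" if none */
    const char *src[2];  /* registers read, "" if unused */
} Instr;

/* Stall cycles required between a producer and a dependent consumer.
   Listed pairs come from the slide; all other pairs are assumed to be 0. */
static int stalls_between(Kind producer, Kind consumer) {
    if (producer == LD) return 1;
    if (producer == FPALU) return consumer == ST ? 2 : 3;
    if (producer == INTALU && consumer == BR) return 1;
    return 0;
}

/* Returns the cycle in which the last instruction of one iteration issues.
   Dependences carried into the next iteration are not modeled. */
int count_cycles(const Instr *code, int n) {
    int issue[MAX_INSTRS];
    int cycle = 0;
    assert(n >= 0 && n <= MAX_INSTRS);
    for (int i = 0; i < n; i++) {
        int earliest = cycle + 1;              /* in-order, one issue per cycle */
        for (int s = 0; s < 2; s++) {
            if (code[i].src[s][0] == '\0') continue;
            for (int j = i - 1; j >= 0; j--) { /* find the most recent writer */
                if (strcmp(code[j].dst, code[i].src[s]) == 0) {
                    int ready = issue[j] + 1 +
                                stalls_between(code[j].kind, code[i].kind);
                    if (ready > earliest) earliest = ready;
                    break;
                }
            }
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original order, with a NOP filling the branch delay slot. */
const Instr naive_loop[] = {
    {LD,     "F0", {"R1", ""}},    /* L.D    F0, 0(R1)    */
    {FPALU,  "F4", {"F0", "F2"}},  /* ADD.D  F4, F0, F2   */
    {ST,     "",   {"F4", "R1"}},  /* S.D    F4, 0(R1)    */
    {INTALU, "R1", {"R1", ""}},    /* DADDUI R1, R1, #-8  */
    {BR,     "",   {"R1", "R2"}},  /* BNE    R1, R2, Loop */
    {INTALU, "",   {"", ""}},      /* NOP (delay slot)    */
};

/* Re-ordered loop, with the S.D in the branch delay slot. */
const Instr smart_loop[] = {
    {LD,     "F0", {"R1", ""}},    /* L.D    F0, 0(R1)    */
    {INTALU, "R1", {"R1", ""}},    /* DADDUI R1, R1, #-8  */
    {FPALU,  "F4", {"F0", "F2"}},  /* ADD.D  F4, F0, F2   */
    {BR,     "",   {"R1", "R2"}},  /* BNE    R1, R2, Loop */
    {ST,     "",   {"F4", "R1"}},  /* S.D    F4, 8(R1)    */
};
```

Calling `count_cycles(naive_loop, 6)` and `count_cycles(smart_loop, 5)` gives the per-iteration cycle counts of the two schedules. The sketch only reports when the last instruction issues; it may place the remaining stall at a different point in the listing than the slide does.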

Problem 1
LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; Int. ALU -> BR: 1 stall
Source code:
        for (i=1000; i>0; i--)
            x[i] = y[i] * s;
Assembly code:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        MUL.D   F4, F0, F2      ; multiply scalar
        S.D     F4, 0(R2)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        DADDUI  R2, R2, #-8
        BNE     R1, R3, Loop    ; branch if R1 != R3
        NOP
• How many cycles do the default and optimized schedules take?

Loop Unrolling
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDUI  R1, R1, #-32
        BNE     R1, R2, Loop
• Loop overhead: 2 instrs; Work: 12 instrs
• How long will the above schedule take to complete?

Scheduled and Unrolled Loop
LD -> any: 1 stall; FPALU -> any: 3 stalls; FPALU -> ST: 2 stalls; Int. ALU -> BR: 1 stall
Loop:   L.D     F0, 0(R1)
        L.D     F6, -8(R1)
        L.D     F10, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F4, F0, F2
        ADD.D   F8, F6, F2
        ADD.D   F12, F10, F2
        ADD.D   F16, F14, F2
        S.D     F4, 0(R1)
        S.D     F8, -8(R1)
        DADDUI  R1, R1, #-32
        S.D     F12, 16(R1)
        BNE     R1, R2, Loop
        S.D     F16, 8(R1)
• Execution time: 14 cycles, or 3.5 cycles per original iteration

Loop Unrolling
• Increases program size
• Requires more registers
• To unroll an n-iteration loop by degree k, we will need (n/k) iterations of the larger loop (with n/k rounded down), followed by (n mod k) iterations of the original loop
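The split into n/k iterations of the larger loop plus (n mod k) iterations of the original loop can be sketched in C for k = 4, the degree used in the unrolled example. This is an illustrative sketch, not compiler output: the function names are hypothetical, and the array is indexed from 0 rather than from 1 as in the slides' source loop.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration, walking downwards
   like the loop in the slides. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i - 1] += s;
}

/* Unrolled by degree k = 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t i = n;                 /* elements x[0..i-1] are still to be done */

    /* Larger loop: runs n/4 times (rounded down), four elements per trip,
       so the index update and branch are paid once per four elements. */
    for (; i >= 4; i -= 4) {
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }

    /* Original loop: runs the remaining n mod 4 times. */
    for (; i > 0; i--)
        x[i - 1] += s;
}
```

For n = 10, the larger loop runs twice and the cleanup loop runs twice. The four statements in the unrolled body are independent of each other, which is what lets a scheduler interleave their loads, adds, and stores.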

Automating Loop Unrolling
• Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered
• Determine if unrolling will help: possible only if iterations are independent
• Determine address offsets for different loads/stores
• Dependency analysis to schedule code without introducing hazards; eliminate name dependences by using additional registers
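The first step, deciding whether iterations are independent, can be illustrated with a very small distance test. The sketch below is in C and covers only an assumed special case: a loop whose body writes x[i + write_off] and reads x[i + read_off] while i takes n consecutive values. The function name and parameters are hypothetical, and real compilers use much more general dependence tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if some iteration reads or overwrites an element that a
   DIFFERENT iteration writes, for a body of the assumed form
       x[i + write_off] = f(x[i + read_off])
   executed for n consecutive values of i.
   Iterations i and j touch the same element when i - j = read_off - write_off,
   so a carried dependence needs a non-zero distance smaller than n. */
bool has_loop_carried_dependence(long write_off, long read_off, long n) {
    assert(n >= 0);
    long distance = labs(read_off - write_off);
    return distance != 0 && distance < n;
}
```

For the loop in the slides, x[i] = x[i] + s, both offsets are 0, so the test reports no carried dependence and the loads and stores of different iterations may be re-ordered. For a body such as x[i] = x[i-1] + s it reports a dependence at distance 1.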

Problem 2
LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; Int. ALU -> BR: 1 stall
Source code:
        for (i=1000; i>0; i--)
            x[i] = y[i] * s;
Assembly code:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        MUL.D   F4, F0, F2      ; multiply scalar
        S.D     F4, 0(R2)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        DADDUI  R2, R2, #-8
        BNE     R1, R3, Loop    ; branch if R1 != R3
        NOP
• How many unrolls does it take to avoid stall cycles?

Superscalar Pipelines
        Integer pipeline: handles L.D, S.D, DADDUI, BNE
        FP pipeline: handles ADD.D
• What is the schedule with an unroll degree of 4?

Superscalar Pipelines
        Integer pipeline            FP pipeline
Loop:   L.D     F0, 0(R1)
        L.D     F6, -8(R1)
        L.D     F10, -16(R1)        ADD.D   F4, F0, F2
        L.D     F14, -24(R1)        ADD.D   F8, F6, F2
        L.D     F18, -32(R1)        ADD.D   F12, F10, F2
        S.D     F4, 0(R1)           ADD.D   F16, F14, F2
        S.D     F8, -8(R1)          ADD.D   F20, F18, F2
        S.D     F12, -16(R1)
        DADDUI  R1, R1, #-40
        S.D     F16, 16(R1)
        BNE     R1, R2, Loop
        S.D     F20, 8(R1)
• Need unroll by degree 5 to eliminate stalls
• The compiler may specify instructions that can be issued as one packet
• The compiler may specify a fixed number of instructions in each packet: Very Large Instruction Word (VLIW)

Problem 3
LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; Int. ALU -> BR: 1 stall
Source code:
        for (i=1000; i>0; i--)
            x[i] = y[i] * s;
Assembly code:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        MUL.D   F4, F0, F2      ; multiply scalar
        S.D     F4, 0(R2)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        DADDUI  R2, R2, #-8
        BNE     R1, R3, Loop    ; branch if R1 != R3
        NOP
• How many unrolls does it take to avoid stalls in the superscalar pipeline?

Software Pipeline?!
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
[Diagram: the instruction stream of successive iterations, regrouped as]
        L.D  ADD.D  DADDUI  BNE
        S.D  L.D  ADD.D  DADDUI  BNE
        S.D  L.D  ADD.D  DADDUI  BNE
        …

Software Pipeline
[Figure: original iterations 1 to 4, each consisting of L.D, ADD.D, and S.D, overlapped with new iterations 1 to 4]

Software Pipelining
Original loop:
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
Software-pipelined loop:
Loop:   S.D     F4, 16(R1)
        ADD.D   F4, F0, F2
        L.D     F0, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
• Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion
  - an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state
  - a sw-pipelined loop can also be unrolled to reduce loop overhead
• Disadvantages: does not reduce loop overhead, may require more registers
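The same restructuring can be written at the source level to show where the start-up and wind-down code comes from. The following is a hedged sketch in C, not the output of any particular compiler: each trip of the steady-state loop stores the result of one original iteration, adds for the next, and loads for the one after that. The function name is hypothetical, and the array is walked upwards from index 0 rather than downwards as in the assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Computes x[i] = x[i] + s for i = 0..n-1 with a software-pipelined loop. */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    if (n == 0) return;
    if (n == 1) { x[0] += s; return; }   /* too short to fill the pipeline */

    /* Prologue: start original iterations 0 and 1. */
    double loaded = x[0];                /* L.D   for iteration 0 */
    double sum = loaded + s;             /* ADD.D for iteration 0 */
    loaded = x[1];                       /* L.D   for iteration 1 */

    /* Steady state: one instruction from each of three original iterations. */
    size_t i = 0;
    for (; i + 2 < n; i++) {
        x[i] = sum;                      /* S.D   for iteration i     */
        sum = loaded + s;                /* ADD.D for iteration i + 1 */
        loaded = x[i + 2];               /* L.D   for iteration i + 2 */
    }

    /* Epilogue: finish the two iterations still in flight. */
    x[i] = sum;                          /* S.D   for iteration n - 2 */
    sum = loaded + s;                    /* ADD.D for iteration n - 1 */
    x[i + 1] = sum;                      /* S.D   for iteration n - 1 */
}
```

Within one trip of the steady-state loop, the store, add, and load belong to different original iterations, so none of them waits on a value produced in the same trip. The cost is the extra live values (`sum` and `loaded` here), which corresponds to the register-pressure disadvantage noted above.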

Problem 4
LD -> any: 1 stall; FPMUL -> any: 5 stalls; FPMUL -> ST: 4 stalls; Int. ALU -> BR: 1 stall
Source code:
        for (i=1000; i>0; i--)
            x[i] = y[i] * s;
Assembly code:
Loop:   L.D     F0, 0(R1)       ; F0 = array element
        MUL.D   F4, F0, F2      ; multiply scalar
        S.D     F4, 0(R2)       ; store result
        DADDUI  R1, R1, #-8     ; decrement address pointer
        DADDUI  R2, R2, #-8
        BNE     R1, R3, Loop    ; branch if R1 != R3
        NOP
• Show the SW pipelined version of the code. Does it cause stalls?
