
CMPUT 680 - Winter 2006
Topic I: Superblock and Hyperblock Formation
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680
CMPUT 329 - Computer Organization and Architecture II

Instruction Level Parallelism Optimizations
The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.
Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed moderately increases, as long as the dependence height is reduced.

Speculative and Predicated Execution
Speculative Execution: execution of an instruction before knowing that its execution is required.
Superblock: structure used to implement compiler-controlled speculative execution.
Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.
If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

Trace Scheduling (Fisher, 1981)
Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path. Thus decisions should favor the more frequently executed paths to improve overall performance.
Trace scheduling divides a procedure into a set of frequently executed traces (paths).

Trace Scheduling
There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
These control-flow transitions are ignored during trace scheduling.
After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Bookkeeping for Trace Scheduling
[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 2, Instr 3, Instr 4, Instr 1, Instr 5]
What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

Bookkeeping for Trace Scheduling
[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 2, Instr 3, Instr 4, Instr 1, Instr 5, with copies of Instr 3 and Instr 4 added as bookkeeping code]

Bookkeeping for Trace Scheduling
[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 1, Instr 5, Instr 2, Instr 3, Instr 4]
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

Bookkeeping for Trace Scheduling
[Figure: original trace Instr 1, Instr 2, Instr 3, Instr 4, Instr 5; scheduled trace Instr 1, Instr 5, Instr 2, Instr 3, Instr 4, with a copy of Instr 5 added as bookkeeping code]

Superblocks
A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.
The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).

Superblock Formation (Example)
[Figure: weighted control flow graph between entry Y and exit Z, shown before and after trace selection; the selected trace runs through blocks B, E, and F]

Superblock Formation (Example)
[Figure: the weighted control flow graph with the selected trace highlighted]
Is this a superblock?
No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F.
How do we convert it into a superblock?

Superblock Formation (Example)
[Figure: the same control flow graph after tail duplication, with block F duplicated as F']
Tail duplication is the duplication of the basic blocks that appear after a side entrance. It eliminates side entrances and transforms a trace into a superblock.
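Tail duplication can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CFG is a dictionary from block name to successor list, the trace is an ordered list of block names, and the function name `tail_duplicate`, the `'` suffix for copies, and the small four-block graph are all hypothetical rather than taken from the slides. Edge weights are ignored.

```python
def tail_duplicate(succ, trace):
    """Return a new CFG in which `trace` has no side entrances.

    succ:  dict mapping block name -> list of successor block names.
    trace: ordered list of block names (the selected trace).
    """
    succ = {block: list(targets) for block, targets in succ.items()}
    pos = {block: i for i, block in enumerate(trace)}

    def is_side_entrance(pred, block):
        # An edge into a non-head trace block that does not come
        # from the block's predecessor on the trace.
        i = pos.get(block, 0)
        return i >= 1 and pred != trace[i - 1]

    side_edges = [(p, b) for p, targets in succ.items()
                  for b in targets if is_side_entrance(p, b)]
    if not side_edges:
        return succ

    # Duplicate everything from the first side entrance to the trace end.
    first = min(pos[b] for _, b in side_edges)
    tail = trace[first:]
    copy = {b: b + "'" for b in tail}
    for b in tail:
        # Edges to later tail blocks stay inside the duplicated tail;
        # all other edges keep their original targets.
        succ[copy[b]] = [copy[s] if pos.get(s, -1) > pos[b] else s
                         for s in succ.get(b, [])]
    # Redirect each side entrance to the duplicated block.
    for p, b in side_edges:
        succ[p] = [copy[b] if s == b else s for s in succ[p]]
    return succ


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
superblock_cfg = tail_duplicate(cfg, ["A", "B", "D"])
```

In this hypothetical graph the edge from C into D is the only side entrance, so D is copied and C is redirected to the copy, leaving A, B, D enterable only from the top.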

Common Subexpression Elimination in Superblocks
Original code (edge weights 1 and 99 in the figure):
    opA: mul r1, r2, 3
    opB: add r2, 1
    opC: mul r3, r2, 3
Code after superblock formation (opC duplicated as opC' after opB):
    opA: mul r1, r2, 3
    opC: mul r3, r2, 3
    opB: add r2, 1
    opC': mul r3, r2, 3
Code after common subexpression elimination:
    opA: mul r1, r2, 3
    opC: mov r3, r1
    opB: add r2, 1
    opC': mul r3, r2, 3

Operation Migration in Superblocks
Original code: … mov r0, r1 … mov r0, r2 … mov r0, r3 … add r1, 4; add r2, 4; add r3, 4
After operation migration (figure with blocks X, Y, and Z; instruction order as extracted): … mov r0, r1 … add r1, 4; add r2, 4; add r3, 4; mov r0, r2; mov r0, r3

Global Variable Migration in Superblock Loops
Original program segment (edge weights 100 and 0 in the figure):
    Op.A: ld_i r4, x, r0
    Op.B: add r4, r1
    Op.C: st_i x, r0, r4
    Op.E: add r0, 1
    Op.D: add r1, 1
The slides then step through the execution of this segment. The register and memory states shown on the successive slides are:
    step   r0   r1   r4   MEM[r0+x] (three words)
    1      1    1    -    10 20 30
    2      1    1    10   10 20 30
    3      1    1    11   10 20 30
    4      1    1    11   11 20 30
    5      1    2    11   11 20 30
    6      1    2    11   11 20 30
    7      1    2    12   11 20 30
    8      1    2    12   12 20 30
    9      2    2    12   12 20 30
    10     2    2    20   12 20 30
    11     2    2    21   12 20 30
    12     2    2    21   12 21 30

Global Variable Migration in Superblock Loops
Original program segment (edge weights 100 and 0 in the figure):
    Op.A: ld_i r4, x, r0
    Op.B: add r4, r1
    Op.C: st_i x, r0, r4
    Op.E: add r0, 1
    Op.D: add r1, 1
After variable migration (instruction order as extracted; edge weights 100 and 0):
    Op.A: ld_i r4, x, r0
    Op.B: add r4, r1
    Op.D: add r1, 1
    Op.C': st_i x, r0, r4
    Op.E: add r0, 1
    Op.C: st_i x, r0, r4

Superblock Enlarging Optimizations
By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from in each cycle.
Superblock enlarging optimizations:
  Branch target expansion
  Loop unrolling
  Loop peeling

Branch Target Expansion
Idea: expand the superblock with the target of a likely taken branch.
[Figure: code at labels L1, L2, and L3 before and after expansion; instructions as extracted: L1: blt r1, r2, L3; L2: jump L4; L3: beq r3, r4, L5; edge weights 20 and 100]

Superblock Loops
A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.
We will study the extension of some common loop optimizations to superblocks.

Dependence Removing Optimizations
The goal is to eliminate data dependences between instructions within frequently executed superblocks.
Dependence removing optimizations include:
  Register renaming
  Accumulator variable expansion
  Induction variable expansion
  Search variable expansion
  Operation combining
  Strength reduction
  Tree height reduction

Instruction Latencies for Examples

Register Renaming Example
Original loop:
    For (j=0; j<n; j++) { C(j) = A(j)+B(j) }
Assembly code:
    (a) L1: ld_f  f2, A, r1
    (b)     ld_f  f3, B, r1
    (c)     add_f f4, f2, f3
    (d)     st_f  C, r1, f4
    (e)     add   r1, 4
    (f)     blt   r1, r5, L1
For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

Register Renaming Example
[Figure: code schedule for the original loop; instruction durations in cycles: a 2, b 2, c 3, d 1, e 1, f 1]
7 cycles / 1 iteration

Register Renaming Example
After loop unrolling (three iterations per pass):
    (a) L1: ld_f  f2, A, r1
    (b)     ld_f  f3, B, r1
    (c)     add_f f4, f2, f3
    (d)     st_f  C, r1, f4
    (e)     add   r1, 4
    (f)     ld_f  f2, A, r1
    (g)     ld_f  f3, B, r1
    (h)     add_f f4, f2, f3
    (i)     st_f  C, r1, f4
    (j)     add   r1, 4
    (k)     ld_f  f2, A, r1
    (l)     ld_f  f3, B, r1
    (m)     add_f f4, f2, f3
    (n)     st_f  C, r1, f4
    (o)     add   r1, 4
    (p)     blt   r1, r5, L1

Loop Unrolling
[Figure: code schedule for instructions (a) through (p) after loop unrolling, on a 0 to 15+ cycle axis]
19 cycles / 3 iterations = 6.3 cycles / iteration

Register Renaming
After register renaming of the unrolled loop:
    (a) L1: ld_f  f21, A, r11
    (b)     ld_f  f31, B, r11
    (c)     add_f f41, f21, f31
    (d)     st_f  C, r11, f41
    (e)     add   r12, r11, 4
    (f)     ld_f  f22, A, r12
    (g)     ld_f  f32, B, r12
    (h)     add_f f42, f22, f32
    (i)     st_f  C, r12, f42
    (j)     add   r13, r12, 4
    (k)     ld_f  f23, A, r13
    (l)     ld_f  f33, B, r13
    (m)     add_f f43, f23, f33
    (n)     st_f  C, r13, f43
    (o)     add   r11, r13, 4
    (p)     blt   r11, r5, L1

Loop Unrolling and Register Renaming
[Figure: code schedule for instructions (a) through (p) after register renaming]
8 cycles / 3 iterations = 2.7 cycles / iteration

Accumulator Variable Expansion
An accumulator variable accumulates a sum or product in each iteration of a loop.
Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions).
The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
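The idea can be illustrated at the source level with a short Python sketch. It is not the compiler transformation itself, only a model of its effect: the function name `accumulate_expanded`, its parameters, and the unroll factor of 3 are assumptions made for the illustration.

```python
def accumulate_expanded(init, a, b, k=3):
    """Compute init + sum(a[i] * b[i]) using k temporary accumulators."""
    acc = [0.0] * k
    acc[0] = init                      # the first temporary carries the initial value
    n = len(a)
    main = n - n % k                   # iterations covered by the unrolled loop
    for i in range(0, main, k):
        for t in range(k):             # the k bodies of the unrolled loop
            acc[t] += a[i + t] * b[i + t]   # each body updates its own accumulator
    for i in range(main, n):           # leftover iterations
        acc[0] += a[i] * b[i]
    total = acc[0]
    for t in range(1, k):              # combine the temporaries at the loop exit
        total += acc[t]
    return total
```

Because each unrolled body writes a different temporary, the additions no longer form a single dependence chain. With floating-point data the reassociation can change rounding, which is one reason a compiler may need permission to apply it.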

Accumulator Expansion Example
Original loop:
    For (k=0; k<n; k++) { C(i, j) = C(i, j) + A(i, k) * B(k, j) }
Assembly code:
    (-)     ld_f  f1, C, r2
    (a) L1: ld_f  f3, A, r4
    (b)     ld_f  f5, B, r6
    (c)     mul_f f7, f3, f5
    (d)     add_f f1, f7
    (e)     add   r4, 4
    (f)     add   r6, r8
    (g)     blt   r4, r9, L1
    (-)     st_f  C, r2, f1
For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.

Accumulator Expansion Example
[Figure: code schedule for the original loop; instruction durations in cycles: a 2, b 2, c 3, d 3, e 1, f 1, g 1]
8 cycles / 1 iteration

Loop Unrolling and Register Renaming
After unrolling and renaming:
    (-)     ld_f  f1, C, r2
    (a) L1: ld_f  f31, A, r41
    (b)     ld_f  f51, B, r61
    (c)     mul_f f71, f31, f51
    (d)     add_f f1, f71
    (e)     add   r42, r41, 4
    (f)     add   r62, r61, r8
    (g)     ld_f  f32, A, r42
    (h)     ld_f  f52, B, r62
    (i)     mul_f f72, f32, f52
    (j)     add_f f1, f72
    (k)     add   r43, r42, 4
    (l)     add   r63, r62, r8
    (m)     ld_f  f33, A, r43
    (n)     ld_f  f53, B, r63
    (o)     mul_f f73, f33, f53
    (p)     add_f f1, f73
    (q)     add   r41, r43, 4
    (r)     add   r61, r63, r8
    (s)     blt   r41, r9, L1
    (-)     st_f  C, r2, f1

Loop Unrolling and Register Renaming
[Figure: code schedule for instructions (a) through (s) after unrolling and renaming; the three add_f instructions (d), (j), and (p) all update f1]
14 cycles / 3 iterations = 4.7 cycles / iteration

Accumulator Expansion
    (-)     ld_f  f11, C, r2
    (-)     mov_f f12, 0
    (-)     mov_f f13, 0
    (a) L1: ld_f  f31, A, r41
    (b)     ld_f  f51, B, r61
    (c)     mul_f f71, f31, f51
    (d)     add_f f11, f71
    (e)     add   r42, r41, 4
    (f)     add   r62, r61, r8
    (g)     ld_f  f32, A, r42
    (h)     ld_f  f52, B, r62
    (i)     mul_f f72, f32, f52
    (j)     add_f f12, f72
    (k)     add   r43, r42, 4
    (l)     add   r63, r62, r8
    (m)     ld_f  f33, A, r43
    (n)     ld_f  f53, B, r63
    (o)     mul_f f73, f33, f53
    (p)     add_f f13, f73
    (q)     add   r41, r43, 4
    (r)     add   r61, r63, r8
    (s)     blt   r41, r9, L1
    (-)     add_f f11, f12
    (-)     add_f f11, f13
    (-)     st_f  C, r2, f11
[Figure: code schedule for instructions (a) through (s)]
10 cycles / 3 iterations = 3.3 cycles / iteration

Induction Variable Expansion
An induction variable is used to index through loop iterations and through regular data structures, such as arrays.
Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.

Induction Variable Expansion Example
Original loop:
    For (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j+K }
Assembly code:
    (a) L1: ld_f  f3, A, r2
    (b)     ld_f  f4, B, r2
    (c)     mul_f f5, f3, f4
    (d)     st_f  C, r2, f5
    (e)     add   r2, r7
    (f)     add   r1, 1
    (g)     blt   r1, r6, L1
[Figure: code schedule; instruction durations in cycles: a 2, b 2, c 3, d 1, e 1, f 1, g 1]
6 cycles / 1 iteration

Loop Unrolling and Register Renaming
After unrolling and renaming:
    (a) L1: ld_f  f31, A, r21
    (b)     ld_f  f41, B, r21
    (c)     mul_f f51, f31, f41
    (d)     st_f  C, r21, f51
    (e)     add   r22, r21, r7
    (f)     ld_f  f32, A, r22
    (g)     ld_f  f42, B, r22
    (h)     mul_f f52, f32, f42
    (i)     st_f  C, r22, f52
    (j)     add   r23, r22, r7
    (k)     ld_f  f33, A, r23
    (l)     ld_f  f43, B, r23
    (m)     mul_f f53, f33, f43
    (n)     st_f  C, r23, f53
    (o)     add   r21, r23, r7
    (p)     add   r1, 3
    (q)     blt   r1, r6, L1

Loop Unrolling and Register Renaming
[Figure: code schedule for instructions (a) through (q) after unrolling and renaming]
8 cycles / 3 iterations = 2.7 cycles / iteration

Induction Variable Expansion
After unrolling, renaming, and induction variable expansion:
    (-)     mov   r21, r2
    (-)     add   r22, r21, r7
    (-)     add   r23, r22, r7
    (-)     mul   r71, r7, 3
    (a) L1: ld_f  f31, A, r21
    (b)     ld_f  f41, B, r21
    (c)     mul_f f51, f31, f41
    (d)     st_f  C, r21, f51
    (f)     ld_f  f32, A, r22
    (g)     ld_f  f42, B, r22
    (h)     mul_f f52, f32, f42
    (i)     st_f  C, r22, f52
    (k)     ld_f  f33, A, r23
    (l)     ld_f  f43, B, r23
    (m)     mul_f f53, f33, f43
    (n)     st_f  C, r23, f53
    (e)     add   r21, r71
    (j)     add   r22, r71
    (o)     add   r23, r71
    (p)     add   r1, 3
    (q)     blt   r1, r6, L1
[Figure: code schedule for instructions (a) through (q)]
6 cycles / 3 iterations = 2 cycles / iteration
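A source-level Python model may make the effect of induction variable expansion easier to see. It is a sketch under assumptions: the function names, the dictionary used for the output array, and the unroll factor of 3 are illustrative choices, not part of the slides.

```python
def strided_product(a, b, j0, stride, n):
    """Reference loop: n iterations of c[j] = a[j] * b[j]; j += stride."""
    c = {}
    j = j0
    for _ in range(n):
        c[j] = a[j] * b[j]
        j += stride
    return c


def strided_product_expanded(a, b, j0, stride, n, k=3):
    """Same computation with one induction variable per unrolled body."""
    c = {}
    # Pre-loop setup: k independent copies of j and the combined stride,
    # playing the role of the mov/add/mul instructions placed before L1.
    js = [j0 + t * stride for t in range(k)]
    big_stride = k * stride
    i = 0
    while i + k <= n:
        for t in range(k):             # bodies use their own index, no chain
            c[js[t]] = a[js[t]] * b[js[t]]
        for t in range(k):             # independent updates of each copy
            js[t] += big_stride
        i += k
    j = js[0]
    while i < n:                       # leftover iterations
        c[j] = a[j] * b[j]
        j += stride
        i += 1
    return c
```

In the reference loop each index depends on the previous one; in the expanded version the k indices are updated independently, so the only remaining chain is one addition per copy per pass.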

Search Variable Expansion
A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.
Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops.
Each search variable is expanded into k temporary independent variables.
At the exit of the loop, the value of the original search variable is obtained by comparing the values of the temporary search variables.
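As an illustration, a maximum search with k temporary search variables could look like the following Python sketch. The function name `max_expanded` and the unroll factor are assumptions for the example.

```python
def max_expanded(values, k=3):
    """Maximum of a non-empty sequence using k temporary search variables."""
    n = len(values)
    if n == 0:
        raise ValueError("max_expanded() needs at least one value")
    best = [float("-inf")] * k         # one temporary per unrolled body
    main = n - n % k
    for i in range(0, main, k):
        for t in range(k):             # each body compares against its own temporary
            if values[i + t] > best[t]:
                best[t] = values[i + t]
    for i in range(main, n):           # leftover iterations
        if values[i] > best[0]:
            best[0] = values[i]
    result = best[0]
    for t in range(1, k):              # compare the temporaries at the loop exit
        if best[t] > result:
            result = best[t]
    return result
```

The comparisons inside one pass of the unrolled loop no longer depend on each other; only the short final comparison at the exit is sequential.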

Superblock Scheduling
Superblock scheduling is a two-step process:
  Step 1: Build the dependence graph.
  Step 2: Perform list scheduling using the dependence graph, the instruction latencies, and the resource constraints of the processor.

List Scheduling
List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.
A node is ready if:
  (i) all its parents in the dependence graph have been scheduled;
  (ii) the result produced by each parent is available; and
  (iii) the resources required by the node are available.
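A minimal cycle-by-cycle list scheduler is sketched below in Python. Several simplifications are assumptions of this sketch and not claims about any particular compiler: the only resource constraint is a fixed issue width, the priority heuristic is the latency-weighted height of a node in the dependence graph, every dependence waits for the full latency of the parent, and the dependence graph is acyclic. The names `list_schedule`, `latency`, `deps`, and `issue_width` are hypothetical.

```python
def list_schedule(latency, deps, issue_width):
    """Return a dict mapping each operation to its issue cycle.

    latency:     dict op -> latency in cycles.
    deps:        dict op -> set of parent ops it depends on.
    issue_width: maximum number of ops issued per cycle.
    """
    children = {op: [] for op in latency}
    for op, parents in deps.items():
        for p in parents:
            children[p].append(op)

    # Heuristic priority: longest latency-weighted path from op to a leaf.
    height = {}

    def compute_height(op):
        if op not in height:
            below = max((compute_height(c) for c in children[op]), default=0)
            height[op] = latency[op] + below
        return height[op]

    for op in latency:
        compute_height(op)

    start = {}
    cycle = 0
    while len(start) < len(latency):
        # Conditions (i) and (ii): parents scheduled and results available.
        ready = [op for op in latency
                 if op not in start
                 and all(p in start and start[p] + latency[p] <= cycle
                         for p in deps.get(op, ()))]
        ready.sort(key=lambda op: (-height[op], op))
        # Condition (iii): at most issue_width ops per cycle.
        for op in ready[:issue_width]:
            start[op] = cycle
        cycle += 1
    return start


schedule = list_schedule({"a": 2, "b": 2, "c": 3, "d": 1},
                         {"c": {"a", "b"}, "d": {"c"}},
                         issue_width=2)
```

With two issue slots the two loads start together and the dependent operations follow as their inputs become available; with a single slot the loads are serialized and everything after them shifts by one cycle.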

Speculative Execution in Superblocks
To produce an efficient schedule, the compiler must be able to move instructions above and below branches.
LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken.
In the example, LIVE-OUT(S) is the set of variables that are live at point P.
[Figure: superblock SB1 contains R: x <- y+z, …, S: bnz r1; the taken edge of S passes through point P into block B2]

Speculative Execution in Superblocks
If we want to move instruction R below the branch instruction S, two situations might occur:
  1) x ∈ LIVE-OUT(S)
  2) x ∉ LIVE-OUT(S)
What is the code that the compiler should produce for each situation?
[Figure: superblock SB1 contains R: x <- y+z, …, S: bnz r1; the taken edge of S passes through point P into block B2]

Speculative Execution in Superblocks
If we want to move instruction R below the branch instruction S, two situations might occur:
  1) x ∈ LIVE-OUT(S): insert a copy of instruction R in the branch target.
  2) x ∉ LIVE-OUT(S): no compensation code is required.
[Figure: superblock SB1 contains R: x <- y+z, …, S: bnz r1; the taken edge of S passes through point P into block B2]

Speculative Execution in Superblocks
  1) x ∈ LIVE-OUT(S): must introduce R' in basic block B2.
     [Figure: SB1 contains …, S: bnz r1, …, R: x <- y+z; block B2 begins with R': x <- y+z]
  2) x ∉ LIVE-OUT(S): no compensation code is required.
     [Figure: SB1 contains …, S: bnz r1, …, R: x <- y+z; block B2 is unchanged]

Speculative Execution in Superblocks
Upward code motion is more commonly used to reduce the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).
There are two major restrictions on moving an instruction J from below to above a branch BR:
  Restriction 1: The destination of J is not in LIVE-OUT(BR).
  Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.

Speculative Execution in Superblocks
Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).
There are two extreme interpretations of restriction 2.
Restricted Speculation Model: fully enforce restriction 2. Therefore only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory loads, memory stores, integer divides, and all floating point instructions cannot be speculated).

Speculative Execution in Superblocks
General Speculation Model: completely ignore restriction 2.
This model requires that the processor provide non-excepting, or silent, versions of all potentially excepting instructions in the instruction set architecture.
If an exception occurs for a silent instruction, it is simply ignored, and garbage is written to the destination.
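The two restrictions and the two speculation models can be combined into a small decision routine. The Python sketch below is only a model: the `Instr` record, the `.silent` opcode suffix, the `fresh_name` parameter, and the function name are hypothetical, and renaming the later uses of the destination register is not shown.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    opcode: str
    dest: str
    may_except: bool      # True for loads, stores, integer divides, FP ops


def hoist_above_branch(instr, live_out, model, fresh_name):
    """Return the instruction to place above the branch, or None if illegal.

    live_out:   LIVE-OUT(BR) for the branch being crossed.
    model:      "restricted" or "general".
    fresh_name: an unused register name, used if renaming is needed.
    """
    if model not in ("restricted", "general"):
        raise ValueError("unknown speculation model: " + model)
    opcode = instr.opcode
    if instr.may_except:
        if model == "restricted":
            return None                    # restriction 2 fully enforced
        opcode = opcode + ".silent"        # general model: non-excepting version
    dest = instr.dest
    if dest in live_out:
        dest = fresh_name                  # restriction 1 removed by renaming
    return Instr(opcode, dest, False)
```

For example, a load whose destination is live on the taken path is rejected under the restricted model, and under the general model it is hoisted as a silent load into a fresh register.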

Example for Speculative Execution
C code segment:
    avg = 0;
    weight = 0;
    count = 0;
    while (prt != NULL) {
        count = count + 1;
        if (prt->wt < 0)
            weight = weight - prt->wt;
        else
            weight = weight + prt->wt;
        prt = prt->next;
    }
    if (count != 0)
        avg = weight/count;
Assembly code segment:
    (i1)      ld_i r1, prt, 0
    (i2)      mov  r7, 0          // avg
    (i3)      mov  r2, 0          // count
    (i4)      mov  r3, 0          // weight
    (i5)      beq  r1, 0, L3
    (i6)  L0: add  r2, 1
    (i7)      ld_i r4, r1, 0      // prt->wt
    (i8)      bge  r4, 0, L1
    (i9)      sub  r3, r4
    (i10)     jmp  L2
    (i11) L1: add  r3, r4
    (i12) L2: ld_i r1, r1, 4      // prt->next
    (i13)     bne  r1, 0, L0
    (i14) L3: beq  r2, 0, L4
    (i15)     div  r7, r3, r2
    (i16)     st_i avg, 0, r7
    (i17) L4:

Example for Speculative Execution
[Figure: trace selection for the loop. BB2 = {i6, i7, i8}, BB3 = {i9, i10}, BB4 = {i11}, BB5 = {i12, i13}; the edge from BB2 to BB3 has weight 10, the edge from BB2 to BB4 has weight 90, the loop backedge has weight 99, and the loop entry and exit have weight 1]

Example for Speculative Execution
[Figure, left: trace selection for the loop, with BB2 = {i6, i7, i8}, BB3 = {i9, i10}, BB4 = {i11}, BB5 = {i12, i13}]
[Figure, right: after superblock formation and branch target expansion. SB1 contains BB2, BB4, and BB5; SB2 contains BB3' = {i9, i12', i13'}; the backedge weight 99 and the exit weight 1 are split 9/10 and 1/10 between SB1 and SB2]

Example for Speculative Execution
Assembly code segment after superblock formation and branch target expansion:
               ld_i r1, prt, 0
               mov  r7, 0          // avg
               mov  r2, 0          // count
               mov  r3, 0          // weight
               beq  r1, 0, L3
    (i6)   L0: add  r2, 1
    (i7)       ld_i r4, r1, 0      // prt->wt
    (i8)       blt  r4, 0, LA
    (i11)      add  r3, r4
    (i12)      ld_i r1, r1, 4      // prt->next
    (i13)      bne  r1, 0, L0
    (i9)   LA: sub  r3, r4
    (i12')     ld_i r1, r1, 4      // prt->next
    (i13')     bne  r1, 0, L0
    (i14)  L3: beq  r2, 0, L4
    (i15)      div  r7, r3, r2
    (i16)      st_i avg, 0, r7
    (i17)  L4:
[Figure: SB1 contains BB2 = {i6, i7, i8}, BB4 = {i11}, and BB5 = {i12, i13}; SB2 contains BB3' = {i9, i12', i13'}]

Example for Speculative Execution
Superblock after loop unrolling and register renaming:
               ld_i r1, prt, 0
               mov  r7, 0          // avg
               mov  r2, 0          // count
               mov  r3, 0          // weight
               beq  r1, 0, L3
    (I1)   L0: add  r2, 1
    (I2)       ld_i r4, r1, 0      // prt->wt
    (I3)       blt  r4, 0, L1
    (I4)       add  r3, r4
    (I5)       ld_i r5, r1, 4      // prt->next
    (I6)       beq  r5, 0, L3
    (I7)       add  r2, 1
    (I8)       ld_i r6, r5, 0      // prt->wt
    (I9)       blt  r6, 0, L1'
    (I10)      add  r3, r6
    (I11)      ld_i r1, r5, 4      // prt->next
    (I12)      bne  r1, 0, L0
           L3: beq  r2, 0, L4
               div  r7, r3, r2
               st_i avg, 0, r7
           L4:
          L1': mov  r1, r5
               mov  r4, r6
           L1: sub  r3, r3, r4
               ld_i r1, r1, 4
               bne  r1, 0, L0

Hyperblocks
Suggested Reading: Scott A. Mahlke's Ph.D. Thesis, chap. 7.

Hyperblock
A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (entry block).
Control flow may leave from any number of blocks in the hyperblock.
Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.

Hyperblock Formation
A five-step procedure is used to form hyperblocks:
  1. region identification
  2. loop backedge coalescing
  3. block selection
  4. tail duplication
  5. if-conversion

Running Example: wc
As a running example, Mahlke uses the inner loop of wc, the Linux program that counts the number of characters, words, and lines in a file.

The source code
        linect = wordct = charct = token = 0;
        for ( ; ; ) {
    A:      if (--(fp)->cnt < 0)
    C:          c = filbuf(fp);
            else
    B:          c = *(fp)->ptr++;
    D:      if (c == EOF) break;
    E:      charct++;
    F:      if ((' ' < c) && (c < 0177)) {
    H:          if (!token) {
    K:              wordct++; token++;
                }
                continue;
            }
    G:      if (c == '\n')
    I:          linect++;
    J:      else if ((c != ' ') &&
    L:               (c != '\t'))
                continue;
    M:      token = 0;
        }

The Assembly Code
    LA: ld_i r98, r3, 0
        add  r27, r98, -1
        st_i r3, 0, r27
        blt  r98, 1, LC
    LB: ld_i r30, r3, 4
        add  r29, r30, 1
        st_i r3, 4, r29
        ld_c r4, r30, 0
    LD: beq  r4, -1, EXIT
    LE: ld_i r33, r73, 0
        add  r32, r33, 1
        st_i r73, 0, r32
        bge  32, r4, LG
    LF: bge  r4, 127, LG
    LH: bne  0, r2, LA
    LK: ld_i r36, r72, 0
        add  r35, r36, 1
        st_i r72, 0, r35
        add  r2, 1
        jmp  LA
    LG: beq  r4, r10, LI
    LJ: bne  r4, 32, LL
    LM: mov  r2, 0
        jmp  LA
    LI: ld_i r39, r71, 0
        add  r38, r39, 1
        st_i r71, 0, r38
        jmp  LM
    LL: bne  r4, 9, LA
        jmp  LM
    LC: mov  Parm0, r3
        jsr  filbuf
        mov  r4, Ret0
        jmp  LD

Control Flow Graph
[Figure: weighted control flow graph of the wc loop with blocks A through M and EXIT; block and edge weights range from 0 to 105K]

Statistics of the Example
wc is formed by small basic blocks with a large percentage of branches.
It contains 13 basic blocks and 34 instructions, of which 14 are branches:
  8 conditional
  5 unconditional
  1 subroutine call

Step 1: Region Identification
A region is a group of basic blocks with a single entry block that dominates all the blocks in the region. A basic block can only reside in a single region.
Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.
A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).
In wc, the entire control flow graph forms a region.

Step 2: Backedge Coalescing
If-conversion can only remove non-loop branches. Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.
To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
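Backedge coalescing is a small graph rewrite, sketched below in Python. The sketch assumes the CFG dictionary contains only the blocks of the loop region, so that every edge into the loop header is a backedge. The function name and the dictionary representation are hypothetical; the edges of the `wc_cfg` example were read off the branch targets in the wc assembly code, and the new node is called N.

```python
def coalesce_backedges(succ, header, new_node="N"):
    """Retarget every edge into `header` to a new node with one backedge.

    succ: dict block -> list of successors, restricted to the loop region,
          so that every edge into `header` is a loop backedge.
    """
    if new_node in succ:
        raise ValueError("block name already in use: " + new_node)
    result = {}
    for block, targets in succ.items():
        # Retarget existing backedges to the new node.
        result[block] = [new_node if t == header else t for t in targets]
    # The new node is the origin of the single remaining backedge.
    result[new_node] = [header]
    return result


wc_cfg = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["EXIT", "E"],
    "E": ["G", "F"], "F": ["G", "H"], "H": ["A", "K"], "K": ["A"],
    "G": ["I", "J"], "I": ["M"], "J": ["L", "M"], "L": ["A", "M"],
    "M": ["A"],
}
coalesced = coalesce_backedges(wc_cfg, "A")
```

After the rewrite, blocks H, K, L, and M all branch to N, and N holds the only edge back to A.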

CFG Before Backedge Coalescing
[Figure: weighted control flow graph of the wc loop with blocks A through M and EXIT, identical to the Control Flow Graph slide]

CFG After Backedge Coalescing
[Figure: the same control flow graph with a new node N; all former backedges now target N, and N carries the single backedge of weight 105K to block A]

Step 3: Block Selection
Two conflicting goals:
  (1) More blocks can potentially improve performance by eliminating branches among the blocks included.
  (2) Too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.

Enumerating Execution Paths
An execution path is a path of control flow from the entry block to an exit block in the region.
Mahlke assigns a priority to each execution path. This priority indicates the path's relative importance.
Mahlke also estimates the available resources and the resource use of each path.
Paths are included in the hyperblock from the highest to the lowest priority based on the available resources.

Path Priority Function
The path priority function combines four elements: (1) path execution frequency; (2) number of instructions in the path; (3) path dependence height; (4) hazard conditions on the path. Intuition: include paths with fewer instructions, with lower dependence height, that have few hazard conditions, and that are executed very often. Hazard conditions include procedure calls and unresolvable memory stores.

Path Priority Function
Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.

Path Priority Function
The constant K ensures that the path with the largest dependence height and the most operations still has a non-zero priority. Mahlke used K = 0.1.
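The formula itself does not survive in this text, so the following Python sketch rests on an assumed form: the execution ratio times the hazard multiplier times (dep_ratio + op_ratio + K), where dep_ratio and op_ratio are one minus the path's dependence height and operation count divided by the region maxima. That form is consistent with the four elements and with the role of K described above, but it should be checked against Mahlke's text. The dictionary keys and example numbers are made up.

```python
K = 0.1  # value quoted on the slide


def path_priorities(paths):
    """Assumed form: exec_ratio * hazard * (dep_ratio + op_ratio + K)."""
    max_dep = max(p["dep_height"] for p in paths)
    max_ops = max(p["num_ops"] for p in paths)
    priorities = []
    for p in paths:
        dep_ratio = 1.0 - p["dep_height"] / max_dep
        op_ratio = 1.0 - p["num_ops"] / max_ops
        hazard = 0.25 if p["has_hazard"] else 1.0
        priorities.append(p["exec_ratio"] * hazard * (dep_ratio + op_ratio + K))
    return priorities


# Hypothetical paths: a hot short path, a long path, and a path with a call.
example = [
    {"exec_ratio": 0.6, "dep_height": 4, "num_ops": 10, "has_hazard": False},
    {"exec_ratio": 0.3, "dep_height": 8, "num_ops": 20, "has_hazard": False},
    {"exec_ratio": 0.1, "dep_height": 4, "num_ops": 10, "has_hazard": True},
]
priorities = path_priorities(example)
print(priorities)
```

In this sketch the second path has both ratios equal to zero, so only K keeps its priority above zero.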

Block Selection Algorithm
ISSUE_WIDTH = 1 to 8   /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10

block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority

    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH * dep_height_1 * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅

    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) {
            continue
        }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 * MAX_DEP_GROWTH)) {
            continue
        }
        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority * MIN_PATH_PRIORITY_RATIO)) {
            continue
        }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ path_i
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
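The selection loop can be written out as a small Python sketch. It assumes that the operators lost in the slide text are multiplications (for the resource and dependence bounds) and a set union (for the selected paths), and that each path is given as a dictionary with precomputed `priority`, `num_ops`, and `dep_height`. The field names, the choice ISSUE_WIDTH = 4, and the three example paths are hypothetical.

```python
ISSUE_WIDTH = 4              # any value from 1 to 8, per the machine description
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10


def block_selection(paths):
    """Greedy path selection; returns the set of blocks on selected paths."""
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    if not ordered:
        return set()
    first_dep = ordered[0]["dep_height"]   # dep_height of the best path
    avail_resources = ISSUE_WIDTH * first_dep * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_blocks = set()
    for p in ordered:
        # Not enough resources left for this path.
        if p["num_ops"] + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if p["dep_height"] > first_dep * MAX_DEP_GROWTH:
            continue
        # Priority too small relative to the last included path.
        if p["priority"] < last_priority * MIN_PATH_PRIORITY_RATIO:
            continue
        selected_blocks.update(p["blocks"])
        used_resources += p["num_ops"]
        last_priority = p["priority"]
    return selected_blocks


# Hypothetical paths with precomputed priorities.
example_paths = [
    {"blocks": ["A", "B", "D"], "priority": 0.66, "num_ops": 10, "dep_height": 4},
    {"blocks": ["A", "C", "D"], "priority": 0.03, "num_ops": 20, "dep_height": 8},
    {"blocks": ["A", "B", "E"], "priority": 0.20, "num_ops": 12, "dep_height": 5},
]
selected = block_selection(example_paths)
print(sorted(selected))
```

With these numbers the first and third paths are taken, and the second is rejected by the resource check (42 operations against a budget of 32), so block C stays out of the hyperblock.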

Block Selection
[Figure: CFG after backedge coalescing, annotated with profile weights.]
Execution paths in the region:
1. A-B-D-E-F-H-N
2. A-B-D-E-F-H-K-N
3. A-B-D-E-G-J-M-N
4. A-B-D-E-G-J-L-M-N
5. A-B-D-E-G-I-M-N
6. A-B-D-E-G-J-L-N
7. A-B-D
8. A-C-D-E-F-H-N
9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N

Path Selection
Some paths that are not selected by the block selection algorithm are also included in the hyperblock because all of their blocks belong to selected paths. An alternative procedure could have eliminated these paths from the path set before the selection, but the cost of such elimination would be higher than that of maintaining these extra paths in the set.

Step 4: Tail Duplication
To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated. The tail duplication algorithm first marks all blocks that have side entry points. Then the algorithm marks all blocks that can be reached from marked blocks. All marked blocks form the tails that must be duplicated.
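The two marking passes can be sketched in Python as follows. The CFG edges are an assumption inferred from the path list on the Block Selection slide, and the sketch further assumes that C is the only unselected block, as the duplicated blocks D' through N' in the figures suggest. The function name `blocks_to_duplicate` is illustrative; the backedge into the entry block is not followed.

```python
from collections import deque


def blocks_to_duplicate(succs, selected, entry):
    """Return the selected blocks that form the tails to be duplicated."""
    # Pass 1: mark selected blocks entered from a non-selected block.
    marked = set()
    for src, targets in succs.items():
        if src in selected:
            continue
        for t in targets:
            if t in selected and t != entry:
                marked.add(t)
    # Pass 2: mark every selected block reachable from a marked block.
    work = deque(marked)
    while work:
        block = work.popleft()
        for t in succs.get(block, []):
            if t in selected and t != entry and t not in marked:
                marked.add(t)
                work.append(t)
    return marked


# CFG after backedge coalescing (edges inferred, see lead-in).
succs = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "EXIT"],
    "E": ["F", "G"], "F": ["H", "G"], "G": ["J", "I"],
    "H": ["N", "K"], "I": ["M"], "J": ["M", "L"], "K": ["N"],
    "L": ["M", "N"], "M": ["N"], "N": ["A"], "EXIT": [],
}
selected = set(succs) - {"C", "EXIT"}
marked = blocks_to_duplicate(succs, selected, entry="A")
print(sorted(marked))
```

Under these assumptions the side entry C -> D marks D, and propagation marks every block from D through N.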

Tail Duplication
[Figure: CFG of the example region before tail duplication, annotated with profile weights.]

Tail Duplication
[Figure: CFG after tail duplication. Blocks D through N are duplicated as D' through N'; profile weights are redistributed between the originals and the copies.]

Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
This instruction assigns values to Pout1 and Pout2. The value assigned depends on:
- the result of the comparison
- the value of Pin
- the type of Pout1 and Pout2

Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<cmp> = eq | ne | gt | ge | ...
<type> = U | /U | OR | /OR | AND | /AND
Example: pge p4(OR), p2(/U), r4, 127 (p1)
cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127

Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
U or /U: always write into the destination register:
if type = U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 1
    else Pout = 0
if type = /U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 0
    else Pout = 1

Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
OR or /OR: write into the destination register only if Pin = 1 and <cmp> is true (OR) or false (/OR):
if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1
if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1
Used when the execution of a block is enabled by one of multiple conditions.
OR-type predicates must be initialized to 0 before their use.

Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
AND or /AND: write into the destination register only if Pin = 1 and <cmp> is false (AND) or true (/AND):
if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0
if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0
Used when the execution of a block requires several conditions to be true.
AND-type predicates are often initialized to 1.
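The rules for the six destination types can be collected into one small Python function. This is a sketch of the semantics as stated on these slides, not of any particular instruction set encoding; the names `pred_define` and `pge_example` are hypothetical, and complemented types are written with a leading slash.

```python
def pred_define(ptype, old, pin, cond):
    """New value of one destination predicate.

    ptype: "U", "/U", "OR", "/OR", "AND" or "/AND"
    old:   value of the destination before the instruction
    pin:   value of the input predicate Pin (0 or 1)
    cond:  result of the comparison src1 <cmp> src2
    """
    if ptype.startswith("/"):
        # A complemented type behaves like the plain type on the
        # inverted comparison result.
        cond = not cond
        ptype = ptype[1:]
    if ptype == "U":
        # Unconditional: always written, 0 whenever Pin is 0.
        return 1 if (pin and cond) else 0
    if ptype == "OR":
        # Written (set to 1) only if Pin = 1 and the condition holds.
        return 1 if (pin and cond) else old
    if ptype == "AND":
        # Written (cleared to 0) only if Pin = 1 and the condition fails.
        return 0 if (pin and not cond) else old
    raise ValueError(f"unknown predicate type {ptype!r}")


def pge_example(r4, p1, p4_old):
    """pge p4(OR), p2(/U), r4, 127 (p1): returns the pair (p4, p2)."""
    cond = r4 >= 127
    return pred_define("OR", p4_old, p1, cond), pred_define("/U", None, p1, cond)


print(pge_example(200, 1, 0))  # (1, 0)
print(pge_example(50, 1, 0))   # (0, 1)
print(pge_example(50, 0, 0))   # (0, 0)
```

Calling `pred_define` for every combination of type, Pin, and comparison result reproduces a truth table for the instruction; the `old` argument makes explicit that OR-type and AND-type destinations keep their previous value when they are not written.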

Predicate Comparison Truth Table
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
• Pin predicates the entire predicate computation instruction.
• Notice that for an unconditional type, the value 0 is written in Pout even when Pin is 0.

Predicate Comparison Truth Table
Example: pge p4(OR), p2(/U), r4, 127 (p1)

Predicate Types
Unconditional predicates are used for control dependence sets that have a single edge. OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)

Step 5: If-conversion
For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right one is the false condition.
[Figure: block G with left successor I and right successor J.]
In this control flow graph the control dependences of blocks I and J are:
I: br.G
J: /br.G

Step 5: If-conversion
[Figure: CFG after tail duplication, with the duplicated tail collapsed into a single node D'-N'.]

Step 5: If-conversion (example)
[Figure: CFG after tail duplication, with the duplicated tail collapsed into a single node D'-N'.]
Code before if-conversion:
LA: ld_i r98, r3, 0
    add  r27, r98, -1
    st_i r3, 0, r27
    blt  r98, 1, LC
LB: ld_i r30, r3, 4
    add  r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq  r4, -1, EXIT
LE: ld_i r33, r73, 0
    add  r32, r33, 1
    st_i r73, 0, r32
    bge  32, r4, LG
LF: bge  r4, 127, LG
LH: bne  0, r2, LA
LK: ld_i r36, r72, 0
    add  r35, r36, 1
    st_i r72, 0, r35
    add  r2, 1
    jmp  LA
LG: beq  r4, r10, LI
LJ: bne  r4, 32, LL
LM: mov  r2, 0
    jmp  LA
LI: ld_i r39, r71, 0
    add  r38, r39, 1
    st_i r71, 0, r38
    jmp  LM
LL: bne  r4, 9, LA
    jmp  LM
LC: mov  Parm0, r3
    jsr  filbuf
    mov  r4, Ret0
    jmp  LD

Step 5: If-conversion (example)
[Figure: CFG after tail duplication, with the duplicated tail collapsed into a single node D'-N'.]
Code during if-conversion (first predicate defines inserted):
    pclr p4, p6
    ld_i r98, r3, 0
    add  r27, r98, -1
    st_i r3, 0, r27
    blt  r98, 1, LC
    ld_i r30, r3, 4
    add  r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
    beq  r4, -1, EXIT
    ld_i r33, r73, 0
    add  r32, r33, 1
    st_i r73, 0, r32
    pge  p4(OR), p1(/U), 32, r4
    pge  p4(OR), p2(/U), r4, 127 (p1)
    peq  p3(U), -, 0, r2 (p2)
    peq  p6(OR), p5(/U), r4, r10 (p4)
    peq  p7(U), -, r4, r10 (p4)
    ...

Step 5: If-conversion (example)
[Figure: the code before if-conversion shown side by side with the partially if-converted code, next to a fragment of the CFG.]

Inner Loop After If-conversion
    pclr p4, p6
    ld_i r98, r3, 0
    add  r27, r98, -1
    st_i r3, 0, r27
    blt  r98, 1, LC
    ld_i r30, r3, 4
    add  r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
    beq  r4, -1, EXIT
    ld_i r33, r73, 0
    add  r32, r33, 1
    st_i r73, 0, r32
    pge  p4(OR), p1(/U), 32, r4
    pge  p4(OR), p2(/U), r4, 127 (p1)
    peq  p3(U), -, 0, r2 (p2)
    peq  p6(OR), p5(/U), r4, r10 (p4)
    peq  p7(U), -, r4, r10 (p4)
    peq  p6(OR), p8(/U), r4, 32 (p5)
    ld_i r36, r72, 0 (p3)
    add  r35, r36, 1 (p3)
    st_i r72, 0, r35 (p3)
    add  r2, 1 (p3)
    ld_i r39, r71, 0 (p7)
    add  r38, r39, 1 (p7)
    st_i r71, 0, r38 (p7)
    peq  p6(OR), -, r4, 9 (p8)
    mov  r2, 0 (p6)
    jmp  loop

Predicate Hierarchy Graph
The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock. The PHG is used to derive relationships among predicates. There are two types of nodes in the PHG: predicate nodes and condition nodes. Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y.

Example of PHG Construction
[Figure sequence: the PHG for the if-converted inner loop is built one predicate define at a time, starting from the root node T (true).]
Conditions associated with the predicate defines of the hyperblock:
    pge p4(OR), p1(/U), 32, r4            [c1, /c1]
    pge p4(OR), p2(/U), r4, 127 (p1)      [c2, /c2]
    peq p3(U), -, 0, r2 (p2)              [c3]
    peq p6(OR), p5(/U), r4, r10 (p4)      [c4, /c4]
    peq p7(U), -, r4, r10 (p4)            (same comparison, c4)
    peq p6(OR), p8(/U), r4, 32 (p5)       [c5, /c5]
    peq p6(OR), -, r4, 9 (p8)             [c6]
Edges of the completed PHG (source predicate, condition, destination predicate):
    T  --c1-->  p4        T  --/c1--> p1
    p1 --c2-->  p4        p1 --/c2--> p2
    p2 --c3-->  p3
    p4 --c4-->  p6        p4 --c4-->  p7        p4 --/c4--> p5
    p5 --c5-->  p6        p5 --/c5--> p8
    p8 --c6-->  p6

Purpose of PHG
The PHG is used to allow the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:
Ancestor: pi is an ancestor of pj if all conditions used to compute pj are derived from pi. The compiler can be sure that pj may be true only when pi is also true.
Control Path: There is a control path between pi and pj if there is at least one set of conditions under which both pj and pi are true. The compiler knows that pi and pj may be true at the same time.
Implies: pi implies pj if the conditions that make pi true guarantee that pj will also be true.
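These relations can be cross-checked by brute force on the example hyperblock. The Python sketch below does not traverse a PHG; it swaps in a simpler technique, writing each predicate as a Boolean function of the conditions c1 to c6 (read off the predicate defines of the inner loop) and enumerating all 64 assignments. Treating the conditions as independent Boolean variables is an assumption, and the function names are hypothetical.

```python
from itertools import product

NAMES = ["T", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]


def predicates(c1, c2, c3, c4, c5, c6):
    """Predicate values of the example hyperblock for one assignment."""
    p1 = not c1                       # pge p4(OR), p1(/U), 32, r4
    p4 = c1 or (p1 and c2)            # OR of both pge defines
    p2 = p1 and not c2                # pge p4(OR), p2(/U), r4, 127 (p1)
    p3 = p2 and c3                    # peq p3(U), -, 0, r2 (p2)
    p5 = p4 and not c4                # peq p6(OR), p5(/U), r4, r10 (p4)
    p7 = p4 and c4                    # peq p7(U), -, r4, r10 (p4)
    p8 = p5 and not c5                # peq p6(OR), p8(/U), r4, 32 (p5)
    p6 = (p4 and c4) or (p5 and c5) or (p8 and c6)
    return {"T": True, "p1": p1, "p2": p2, "p3": p3, "p4": p4,
            "p5": p5, "p6": p6, "p7": p7, "p8": p8}


ASSIGNMENTS = [predicates(*bits) for bits in product([False, True], repeat=6)]


def implies(pi, pj):
    """True if pj holds in every assignment in which pi holds."""
    return all(v[pj] for v in ASSIGNMENTS if v[pi])


def may_be_true_together(pi, pj):
    """True if some assignment makes both predicates true."""
    return any(v[pi] and v[pj] for v in ASSIGNMENTS)


print(implies("p7", "p6"))
together_with_p5 = {n for n in NAMES if may_be_true_together("p5", n)}
implied_by_p5 = {n for n in NAMES if implies("p5", n)}
print(sorted(together_with_p5), sorted(implied_by_p5))
```

For this example, the predicates that can be true together with p5 are T, p1, p4, p5, p6, and p8, and the predicates that are true whenever p5 is true are T, p4, and p5. Enumeration is exponential in the number of conditions, which is one reason a compiler would work on the graph structure instead.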

Imply Relationship
[Figure: the if-converted inner loop and its completed PHG.]
p7 implies p6.

Ancestor Relationship
[Figure: the if-converted inner loop and its completed PHG.]
Which predicate nodes are ancestors of p5? T, p4, and p5.

Control Path Relationship
[Figure: the if-converted inner loop and its completed PHG.]
Which predicate nodes are in the same control path as p5? T, p1, p4, p5, p6, and p8.

Classical/ILP Optimizations in Predicated Code
Example: Copy Propagation
Before:
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
After:
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r2, 0 (p3)
Is the copy propagation from instruction A to instruction C legal? It depends on what we know about the relationship between p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong!

Classical/ILP Optimizations in Predicated Code
Example: Copy Propagation
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
[Figure: PHG fragment with nodes p1 and pk, in which p2 and p3 are defined from pk on the conditions cm and /cm.]
For instance, if we know that: (1) p1 is an ancestor of both p2 and p3, and (2) p2 and p3 are mutually exclusive, then we can do the copy propagation safely.
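The argument can be spot-checked by running both versions of the three instructions under every predicate assignment that the stated relations allow. The Python sketch below is an illustration with made-up register values, not a proof of legality; the load is modelled as a tagged address, and the names `run` and `propagation_is_safe` are hypothetical.

```python
from itertools import product


def run(p1, p2, p3, propagated):
    """Execute A, B, C once; propagated selects the rewritten instruction C."""
    r = {"r1": 10, "r2": 20, "r3": 1, "r4": 2, "r5": None}
    if p1:
        r["r1"] = r["r2"]                # A: mov r1, r2 (p1)
    if p2:
        r["r2"] = r["r3"] + r["r4"]      # B: add r2, r3, r4 (p2)
    if p3:
        base = r["r2"] if propagated else r["r1"]
        r["r5"] = ("load", base + 0)     # C: ld_i r5, base, 0 (p3)
    return r


def propagation_is_safe(assignments):
    """True if both versions agree on every given predicate assignment."""
    return all(run(*a, False) == run(*a, True) for a in assignments)


everything = list(product([0, 1], repeat=3))
# p1 is an ancestor of p2 and p3 (each is true only when p1 is true),
# and p2 and p3 are mutually exclusive.
constrained = [(p1, p2, p3) for p1, p2, p3 in everything
               if (not p2 or p1) and (not p3 or p1) and not (p2 and p3)]

print(propagation_is_safe(constrained))  # True
print(propagation_is_safe(everything))   # False
```

Without the constraints the check fails in two ways: when p1 is false and p3 is true, r1 was never copied from r2, and when p2 and p3 are both true, instruction B has overwritten r2 before C reads it.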

Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
What are the data dependences in the code above? It depends on what we know about the relationship between p2 and p3.

Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
For instance, if we know that p2 and p3 are mutually exclusive, we have this DDG:
[Figure: PHG fragment in which p2 and p3 are defined from pk on cm and /cm; DDG with two independent chains, A -> B and C -> D.]

Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
But if p2 implies p3, then we have this DDG:
[Figure: PHG fragment in which p2 and p3 are both defined from pk on cm; DDG is the single chain A -> B -> C -> D.]
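The effect of the predicate relation on the dependence graph can be sketched as follows. This Python fragment computes conservative register dependences (flow, anti, output) between instruction pairs and skips any pair whose predicates can never be true together. It does no kill analysis, so it may report redundant edges; the tuple layout and the two relation functions are assumptions made for the illustration.

```python
def dependences(instrs, may_overlap):
    """instrs: list of (name, dest, sources, predicate) in program order.
    may_overlap(q1, q2): True if both predicates can be true together."""
    edges = []
    for i, (n1, d1, s1, q1) in enumerate(instrs):
        for n2, d2, s2, q2 in instrs[i + 1:]:
            if not may_overlap(q1, q2):
                continue  # never executed together: no dependence
            if d1 in s2:
                edges.append((n1, n2, "flow"))
            if d2 in s1:
                edges.append((n1, n2, "anti"))
            if d1 == d2:
                edges.append((n1, n2, "output"))
    return edges


code = [
    ("A", "r1", {"r2", "r3"}, "p2"),   # ld_i r1, r2, r3 (p2)
    ("B", "r4", {"r1"}, "p2"),         # add  r4, r1, 4 (p2)
    ("C", "r1", {"r5"}, "p3"),         # ld_i r1, r5, 0 (p3)
    ("D", "r6", {"r1", "r7"}, "p3"),   # mul  r6, r1, r7 (p3)
]

# p2 and p3 mutually exclusive: only equal predicates overlap.
exclusive = dependences(code, lambda a, b: a == b)
# p2 implies p3: every pair may execute together.
overlapping = dependences(code, lambda a, b: True)
print(exclusive)
print(overlapping)
```

In the mutually exclusive case only A -> B and C -> D remain, so the two chains can be scheduled in parallel. In the other case the anti dependence B -> C and the output dependence A -> C serialize the code; the extra flow edge A -> D that this conservative sketch reports is already covered by that chain.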

Predicate-Specific Optimizations
- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling

Predicate Promotion
The idea is to speculate the execution of instructions by replacing their predicate with a less constrained predecessor predicate. Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative. The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.

Conditions for Simple Predicate Promotion
The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)

Example of Predicate Promotion (qsort)
Before promotion:
LA: ld_i r20, r24, r101
    ld_i r23, r2, r102
    pge  p126(U), p127(U), r20, r23
LB: ld_i r6, r123, 0 (p126)
    add  r123, 8 (p126)
    add  r9, 1 (p126)
    add  r101, 8 (p126)
LC: ld_i r6, r124, 8 (p127)
    add  r124, 8 (p127)
    add  [operands missing in the source]
    add  r102, 8 (p127)
LD: st_i r114, 0, r23
    st_i r114, 4, r6
    add  r7, 1
    add  r114, 8
    bge  r9, r3, EXIT
LE: blt  r8, r1, LA
After promotion:
LA: ld_i r20, r24, r101
    ld_i r23, r2, r102
    pge  p126(U), p127(U), r20, r23
LB: ld_i r6, r123, 0
    add  r123, 8 (p126)
    add  r9, 1 (p126)
    add  r101, 8 (p126)
LC: ld_i r60, r124, 8
    mov  r6, r60 (p127)
    add  r124, 8 (p127)
    add  [operands missing in the source]
    add  r102, 8 (p127)
LD: st_i r114, 0, r23
    st_i r114, 4, r6
    add  r7, 1
    add  r114, 8
    bge  r9, r3, EXIT
LE: blt  r8, r1, LA

Branch Combining
Problem: too many infrequently executed branches in a hyperblock.
Example: a loop in grep
A:  bge  r1, r5, EXIT1
    ld_c r3, r1, 0
    beq  r3, 10, EXIT2
    beq  r3, 0, EXIT3
    bge  r2, r6, EXIT4
    st_c r2, 0, r3
    add  r1, 1
    add  r2, 1
    jmp  A
[Figure annotations: profile counts 14, 4035, 0, 0 shown next to the code.]

Branch Combining
Solution: replace a group of exit branches by a corresponding group of predicate define instructions. All predicate definitions write into the same predicate register using the OR-type semantics. The resultant predicate will be set to 1 if any of the exit branches were to be taken. Because not exiting the hyperblock is the most common case, the predicate will usually be false.
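The transformation can be illustrated on the exit tests of the grep loop. In the Python sketch below, each exit branch becomes an OR-type write into one combined predicate, a single test of that predicate replaces the four branches, and the exit that was actually taken is recovered afterwards by re-evaluating the conditions in their original order. That last step is an assumption about what the decode block does, and the state dictionary and function names are hypothetical. The load of r3 is treated as already performed, that is, speculated.

```python
from itertools import product

# Exit conditions of the loop, in program order.
EXIT_TESTS = [
    ("EXIT1", lambda s: s["r1"] >= s["r5"]),   # bge r1, r5, EXIT1
    ("EXIT2", lambda s: s["r3"] == 10),        # beq r3, 10, EXIT2
    ("EXIT3", lambda s: s["r3"] == 0),         # beq r3, 0, EXIT3
    ("EXIT4", lambda s: s["r2"] >= s["r6"]),   # bge r2, r6, EXIT4
]


def original_exit(s):
    """Four separate branches: the first condition that holds wins."""
    for name, test in EXIT_TESTS:
        if test(s):
            return name
    return None


def combined_exit(s):
    """One OR-type predicate, one branch, then a decode step."""
    p = 0                          # cleared before the hyperblock
    for _, test in EXIT_TESTS:
        if test(s):
            p = 1                  # OR-type define: only ever sets to 1
    if not p:
        return None                # common case: stay in the hyperblock
    return original_exit(s)        # decode: find which exit to take


states = [dict(r1=r1, r2=r2, r3=r3, r5=r5, r6=r6)
          for r1, r2, r3, r5, r6 in product(range(3), range(3),
                                            (0, 10, 65), (1, 2), (1, 2))]
agree = all(original_exit(s) == combined_exit(s) for s in states)
print(agree)
```

On the frequent path the combined version executes one branch instead of four; the extra work of decoding is paid only when the loop actually exits.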

Instructions Between Combined Branches
Instructions between combined branches are speculated. For instructions that are between combined branches but cannot be speculated, the following must be done: (1) move the instructions below the combined exit branch in the hyperblock; (2) replicate these instructions in their original position with respect to the exit branches in the decode block.

Backend Compilation with Hyperblocks
[Figure: compilation flow diagram. Main stages: Lcode generation, Classical Optimizations, Hyperblock/Superblock Formation, ILP/Predicate-Specific Optimizations, Classical Optimizations, Instruction Scheduling, Register Allocation. Supporting components: the PHG, which supplies predicate relations, and the CFG Generator and Equation Solver, which supply predicate-aware dataflow information.]