Optimizing Compilers
CISC 673, Spring 2011
Global Instruction Scheduling
John Cavazos (Ben Perry)
University of Delaware
UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT

Overview
- Introduction
- Pipelining
- Instruction Pipeline
- Execution
- Constraints and Dependences

Current Processors
- Can execute several operations in a single cycle
- "How fast can a program run on a processor with instruction-level parallelism?"
  - Potential parallelism in the program
  - Available parallelism on the processor
  - Ability to parallelize a sequential program
  - Ability to find the best schedule given the constraints

Best targets
- Programs whose operations are completely dependent on one another are poor targets
  - Focus on constraints instead of scheduling
- Numeric applications with large aggregate data structures are good targets.

Pipelines
- Instruction pipelines are found in practically every processor
- Instructions go through multiple steps in the pipeline, from fetch to write-back
  - Fetch, decode, execute, access memory, write result
- Parallelism: a new instruction can be fetched while the current instruction is still being processed.
- Each step in the pipeline takes one clock cycle

Example pipeline
Cycle | i        | i+1      | i+2      | i+3      | i+4
1     | Fetch    |          |          |          |
2     | Identify | Fetch    |          |          |
3     | Execute  | Identify | Fetch    |          |
4     | Read     | Execute  | Identify | Fetch    |
5     | Write    | Read     | Execute  | Identify | Fetch
6     |          | Write    | Read     | Execute  | Identify
7     |          |          | Write    | Read     | Execute
8     |          |          |          | Write    | Read
9     |          |          |          |          | Write
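The example pipeline above follows a simple rule: in an ideal pipeline with no stalls, instruction i+k is in stage number (c - k) during cycle c. A minimal Python sketch of that rule, assuming the five stage names used on the slide; the helper name `pipeline_table` is hypothetical:

```python
# Stage names as they appear on the slide (an assumption about the intended order).
STAGES = ["Fetch", "Identify", "Execute", "Read", "Write"]

def pipeline_table(num_instructions, stages=STAGES):
    """Return rows[cycle][k] = stage occupied by instruction i+k, or '' if none.

    Assumes an ideal pipeline: one new instruction enters per cycle, no stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for k in range(num_instructions):
            stage_index = cycle - k  # instruction i+k entered the pipeline at cycle k
            in_flight = 0 <= stage_index < len(stages)
            row.append(stages[stage_index] if in_flight else "")
        rows.append(row)
    return rows

if __name__ == "__main__":
    for number, row in enumerate(pipeline_table(5), start=1):
        print(number, " | ".join(cell.ljust(8) for cell in row))
```

Five instructions through five stages finish in 5 + 5 - 1 = 9 cycles, which is the number of rows in the slide's table.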

Pipelines – Speculative Computing
- Load the next instruction even if it may be branched over (speculative)
- When a branch is taken, the pipeline is emptied and the branch target must be fetched, which causes a delay.
- Hardware can predict which branch target to fetch, but the prediction may be wrong

Pipeline Execution
- Execution of an instruction is pipelined if succeeding instructions that do not depend on its result are allowed to proceed.
- Hardware can often detect dependences (superscalar machines) and pause execution if an operand isn't available

Pipeline Execution
- Some processors (Android phone, perhaps) leave the grouping of parallel instructions to the compiler.
- The compiler creates very long instruction words (VLIW), each indicating a batch of operations to execute in parallel.
- Advanced hardware schedulers can execute instructions out of order.

Code-scheduling Constraints
- Control-dependence constraints: all operations executed in the original program must be executed in the optimized one
- Data-dependence constraints: the optimized program must produce the same results as the original
- Resource constraints: the schedule must not oversubscribe the machine's resources

Data dependence
- X = 5; Y = 6
  Obviously, we can reorder these operations.
- X = 5; Y = X
  Obviously, we cannot reorder these.

Data dependence
- RAW: read after write. True dependence.
  - If a write is followed by a read of the same location, the read depends on the value written
- WAR: write after read. Antidependence.
  - If the write is moved before the read, the read will get the wrong value.

Dependence
- WAW: write after write. Output dependence.
  - If two writes to the same location are reordered, the location ends up holding the wrong value
- WAR and WAW dependences can be eliminated by using different locations to store different values.
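The three kinds of dependence can be checked mechanically from the sets of locations each operation reads and writes. A minimal sketch, assuming each operation is described by explicit read and write sets; the `Instr` class and `dependences` function are hypothetical names for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset   # locations the instruction reads
    writes: frozenset  # locations the instruction writes

def dependences(first, second):
    """Kinds of dependence that `second` has on `first` (first comes earlier)."""
    kinds = []
    if first.writes & second.reads:
        kinds.append("RAW")  # true dependence
    if first.reads & second.writes:
        kinds.append("WAR")  # antidependence
    if first.writes & second.writes:
        kinds.append("WAW")  # output dependence
    return kinds

x5 = Instr("X = 5", reads=frozenset(), writes=frozenset({"X"}))
y6 = Instr("Y = 6", reads=frozenset(), writes=frozenset({"Y"}))
yx = Instr("Y = X", reads=frozenset({"X"}), writes=frozenset({"Y"}))

print(dependences(x5, y6))  # [] : the two operations can be reordered
print(dependences(x5, yx))  # ['RAW'] : they cannot
```

An empty result means the pair can be reordered as far as data dependences are concerned.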

Finding dependences
- Compiler: GUILTY until proven innocent! (Always assume two operations refer to the same location unless it can be proven otherwise.)
- Pointers p and (p + 10) cannot possibly refer to the same location
- Array data-dependence analysis:
  - for i = 0 to n: a[2i] = a[2i+1]
  - No dependence within the array during this loop: writes touch only even elements, reads only odd ones
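One standard way to prove independence for subscripts of the form a*i + b is the GCD test. The slide does not name a specific test, so this is an illustration rather than the lecture's method. The function name `may_depend` is hypothetical, and the test ignores loop bounds, so a `True` answer only means a dependence has not been ruled out:

```python
from math import gcd

def may_depend(a, b, c, d):
    """GCD test for accesses A[a*i + b] and A[c*j + d] with integer i, j.

    a*i + b == c*j + d has an integer solution iff gcd(a, c) divides (d - b).
    Loop bounds are ignored, so True is conservative ("guilty until proven innocent").
    """
    g = gcd(a, c)
    if g == 0:  # both subscripts are constants
        return b == d
    return (d - b) % g == 0

# a[2i] = a[2i+1]: write subscript 2i + 0, read subscript 2j + 1
print(may_depend(2, 0, 2, 1))  # False: even and odd elements never coincide
```

Here gcd(2, 2) = 2 does not divide 1, so the write to a[2i] and the read of a[2i+1] can never touch the same element.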

Finding dependences
- Pointer alias analysis
  - Two pointers are aliased if they refer to the same object.
  - Difficult problem.
- Interprocedural analysis
  - Needed when parameters are passed by reference, or when globals are passed

Register allocation
    LD temporary_register1, a
    ST b, temporary_register1
    LD temporary_register2, c
    ST d, temporary_register2
- Two RAW dependences (each store reads the register its load wrote), but the two load/store pairs are independent of each other and can be reordered.
- If temporary_register1 and temporary_register2 get mapped to the same physical register, we create another dependence
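The effect of register allocation on reordering can be checked with the same read/write-set reasoning. A small sketch, assuming memory locations and registers are just distinct names and that `LD reg, addr` reads `addr` and writes `reg`; the helpers `ld`, `st`, `conflicts`, and `pairs_can_swap` are hypothetical:

```python
def ld(reg, addr):
    """LD reg, addr: reads memory location addr, writes register reg."""
    return ({addr}, {reg})

def st(addr, reg):
    """ST addr, reg: reads register reg, writes memory location addr."""
    return ({reg}, {addr})

def conflicts(first, second):
    """True if the two operations have any RAW, WAR, or WAW dependence."""
    first_reads, first_writes = first
    second_reads, second_writes = second
    return bool(
        first_writes & second_reads      # RAW
        or first_reads & second_writes   # WAR
        or first_writes & second_writes  # WAW
    )

def pairs_can_swap(pair_a, pair_b):
    """Two instruction groups can be reordered if no operation in one conflicts with the other."""
    return not any(conflicts(x, y) for x in pair_a for y in pair_b)

# Distinct temporaries: the two load/store pairs are independent.
virtual = ([ld("t1", "a"), st("b", "t1")], [ld("t2", "c"), st("d", "t2")])
# Both temporaries mapped to one physical register r1: new dependences appear.
physical = ([ld("r1", "a"), st("b", "r1")], [ld("r1", "c"), st("d", "r1")])

print(pairs_can_swap(*virtual))   # True
print(pairs_can_swap(*physical))  # False
```

Sharing r1 introduces WAW and WAR conflicts between the pairs, which is the extra dependence the slide warns about.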

Control dependence
- All operations in a basic block are guaranteed to execute together.
  - But basic blocks are small
  - And their operations are often highly related.
- Optimizing across basic blocks is crucial.

Control dependence
- An instruction i1 is control dependent on instruction i2 if the outcome of i2 determines whether i1 is to be executed
- Speculatively execute across different basic blocks

Speculative computing
- Prefetching
  - Bring data from memory to the cache before it is needed
- Poison bits
  - Don't throw exceptions when computing speculatively. Instead, set a poison bit on the result register. If the poisoned register is really used, then throw the exception.

Speculative computing
- Predicated execution
  - Change
      if (a == 0) b = c
    to
      st r4, r3
      movif r2, r4, r1
  - The processor supports a conditional store, enabling basic blocks to be combined

Basic Block List Scheduling
- Optimal scheduling is NP-complete, but don't give up.
- Basic blocks are typically small.
- Start with the data-dependence graph
  - Nodes are instructions with resource annotations
  - Edges are data dependences, each labeled with the delay the destination has to wait (some instructions may take 10 cycles, others only 1).

List Scheduling
- The data-dependence graph of a basic block cannot have cycles
- Build a topological ordering of the nodes
  - Several such orderings may exist, though some are better than others
  - Choose an ordering in which every node comes after all the nodes it depends on, so that no later node can create a dependence on an earlier one.

List Scheduling
  RT = an empty reservation table
  foreach n in SortedNodes:
    - Find the earliest time the instruction could begin
    - Delay the instruction until its resources are available
    - Schedule the node after all delays
    - Claim the resources in RT
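The loop above can be sketched in Python. This is a simplified illustration, not the lecture's implementation: it assumes each instruction claims one unit of a single resource in its issue cycle only, and the names `list_schedule`, `uses`, and `units` are hypothetical.

```python
from collections import defaultdict

def list_schedule(order, edges, uses, units):
    """Greedy list scheduling of one basic block.

    order: instruction names in a topological order of the dependence graph
    edges: {(pred, succ): delay} data-dependence edges with delays in cycles
    uses:  {name: resource} resource each instruction needs (assumed: one unit, one cycle)
    units: {resource: count} units of each resource available per cycle
    Returns {name: start_cycle}.
    """
    preds = defaultdict(list)
    for (pred, succ), delay in edges.items():
        preds[succ].append((pred, delay))

    reserved = defaultdict(int)  # reservation table: (cycle, resource) -> units claimed
    start = {}
    for n in order:
        # Earliest time allowed by data dependences.
        t = max((start[p] + delay for p, delay in preds[n]), default=0)
        # Delay until a unit of the needed resource is free.
        while reserved[(t, uses[n])] >= units[uses[n]]:
            t += 1
        start[n] = t
        reserved[(t, uses[n])] += 1  # claim the resource
    return start

# e = d + d: load (result available after 2 ticks), add, store
schedule = list_schedule(
    order=["ld", "add", "st"],
    edges={("ld", "add"): 2, ("add", "st"): 1},
    uses={"ld": "mem", "add": "alu", "st": "mem"},
    units={"mem": 1, "alu": 1},
)
print(schedule)  # {'ld': 0, 'add': 2, 'st': 3}
```

The quality of the result depends on the topological order passed in, since each node is placed greedily and never moved afterwards.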

List Scheduling – better topological orderings
- The longest (critical) path through the data-dependence graph bounds the shortest possible schedule.
- Available resources also constrain the schedule; the critical resource is the one with the largest ratio of uses to the number of units of that resource available.
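Both quantities are easy to compute from the dependence graph. A minimal sketch, assuming the same hypothetical inputs as a simple list scheduler (a topological `order`, an `edges` map of delays, and per-instruction resource `uses`); the function names are illustrative:

```python
from collections import Counter

def heights(order, edges):
    """Longest delay-weighted path from each node to the end of the graph."""
    height = {n: 0 for n in order}
    for n in reversed(order):  # successors are finished before their predecessors
        for (pred, succ), delay in edges.items():
            if pred == n:
                height[n] = max(height[n], delay + height[succ])
    return height

def prioritized_order(order, edges):
    """Topological order that puts nodes on long paths first.

    Python's sort is stable, so ties keep their original topological order.
    """
    height = heights(order, edges)
    return sorted(order, key=lambda n: -height[n])

def critical_resource(uses, units):
    """Resource with the largest ratio of uses to available units."""
    counts = Counter(uses.values())
    return max(counts, key=lambda r: counts[r] / units[r])

order = ["x", "ld", "add", "st"]  # "x" is an independent ALU operation
edges = {("ld", "add"): 2, ("add", "st"): 1}
uses = {"x": "alu", "ld": "mem", "add": "alu", "st": "mem"}

print(heights(order, edges))            # {'x': 0, 'ld': 3, 'add': 1, 'st': 0}
print(prioritized_order(order, edges))  # ['ld', 'add', 'x', 'st']
print(critical_resource(uses, {"mem": 1, "alu": 2}))  # mem
```

With non-negative delays a predecessor's height is never smaller than its successor's, so sorting by height keeps the order topological while starting the critical path as early as possible.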

Global Code Scheduling
- Optimize use of resources across blocks.
- Global code scheduling: moving instructions from one basic block to another
- Must respect data AND control dependences.
- All instructions executed in the original program must still be performed
- Speculatively executed instructions must not have disruptive side effects.

Global Code Scheduling example
    if (!a) {c=b;} e=d+d
- What are the data dependences?
- What are the control dependences?
- What can intuitively be run in parallel?

Global Code Scheduling Example
    if (!a) {c=b;} e=d+d
- Loads take two clock ticks and always hit.
- R1 = a, R2 = b, …
- The processor can execute two instructions per clock tick

if (!a) {c=b;} e=d+d
Block 1, Block 2, Block 3 (two issue slots per row):
    load r6, r1   | idle
    load r7, r2   | idle
    load r8, r4   | idle
    noop          | idle
    jumpz r6, b3  | idle
    store r3, r7  | idle
    add r8, r8    | idle
    st r5, r8     | idle

if (!a) {c=b;} e=d+d
Block 1:
    load r6, r1   | load r8, r4
    load r7, r2   | idle
    add r8, r8    | jumpz r6, b3
Block 2:
    st r5, r8     | st r3, r7
Block 3:
    st r5, r8     | idle

Code movement
- Definitions:
  - Dominates: A dominates B if every path from the entry to B passes through A.
  - Post-dominates: B post-dominates A if every path from A to the exit passes through B.
  - Downward: move an operation down along a control-flow path
  - Upward: move an operation up along a control-flow path
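Dominator sets can be computed with the classic iterative data-flow formulation, and post-dominators are dominators of the reversed graph. A minimal sketch, assuming the flow graph is given as a successor map in which every block appears as a key; the function names and the three-block graph (modeled on the `if (!a) {c=b;} e=d+d` example) are illustrative:

```python
def dominators(succ, entry):
    """Return dom[b] = set of blocks that dominate b (including b itself).

    succ: {block: [successor blocks]}; every block must appear as a key.
    """
    nodes = set(succ)
    preds = {n: set() for n in nodes}
    for n, successors in succ.items():
        for s in successors:
            preds[s].add(n)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates everything"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def post_dominators(succ, exit_node):
    """Post-dominators are the dominators of the reversed flow graph."""
    reverse = {n: [] for n in succ}
    for n, successors in succ.items():
        for s in successors:
            reverse[s].append(n)
    return dominators(reverse, exit_node)

# B1 tests a, B2 performs c = b, B3 performs e = d + d.
cfg = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": []}
dom = dominators(cfg, "B1")
pdom = post_dominators(cfg, "B3")
print(sorted(dom["B3"]))   # ['B1', 'B3']
print(sorted(pdom["B1"]))  # ['B1', 'B3']
```

In this graph B2 neither dominates B3 nor post-dominates B1, which is why moving code into or out of B2 is the interesting case.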

Upward Code Movement
- Moving an instruction from block src up to block dest. Block src comes after block dest in the topologically sorted flow graph. Assume no data dependences are violated.
- If dest dominates src and src postdominates dest, then we're done: no extra code is needed.

Upward Code Movement
- If src does not postdominate dest, then the moved operation is computed speculatively
  - Only desirable if the operation is cheap
  - Only useful if src is reached.
- If dest does not dominate src, copies of the instruction are needed on the other paths that reach src
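The two conditions for upward movement can be read directly off the dominator and post-dominator sets. A small sketch, assuming those sets have already been computed and are passed in as dictionaries; the function name `upward_move` and the three-block example are hypothetical:

```python
def upward_move(src, dest, dom, pdom):
    """Classify moving an operation from block src up to block dest.

    dom[b]:  set of blocks that dominate b
    pdom[b]: set of blocks that post-dominate b
    """
    return {
        # dest may run on a path that never reaches src
        "speculative": src not in pdom[dest],
        # src may be reached on a path that bypasses dest
        "needs_copies": dest not in dom[src],
    }

# Flow graph B1 -> B2 -> B3 with a second edge B1 -> B3 (sets written out by hand).
dom = {"B1": {"B1"}, "B2": {"B1", "B2"}, "B3": {"B1", "B3"}}
pdom = {"B1": {"B1", "B3"}, "B2": {"B2", "B3"}, "B3": {"B3"}}

print(upward_move("B3", "B1", dom, pdom))  # {'speculative': False, 'needs_copies': False}
print(upward_move("B2", "B1", dom, pdom))  # {'speculative': True, 'needs_copies': False}
print(upward_move("B3", "B2", dom, pdom))  # {'speculative': False, 'needs_copies': True}
```

Only the first move is free; the second executes the operation even when B2 is skipped, and the third needs a copy on the path from B1 directly to B3.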

Downward Code Movement
- Moving an instruction from block src down to block dest. Block src comes before block dest in the topologically sorted flow graph. Assume no data dependences are violated.
- If src dominates dest and dest postdominates src, we're done: no extra code is needed.

Downward Code Movement
- If src does not dominate dest,
  - Extra operations will be executed on paths that reach dest without passing through src.
  - The moved operations are often writes, which overwrite old values.
  - Replicate basic blocks and place the operation in the new copy of dest
  - Alternatively, use predicated instructions (speculative)
- If dest does not post-dominate src,
  - Compensation code is needed on the paths that leave src without reaching dest
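The downward cases mirror the upward ones with the roles of the two tests swapped. A small sketch under the same assumptions (precomputed dominator and post-dominator sets passed in as dictionaries; `downward_move` is a hypothetical name):

```python
def downward_move(src, dest, dom, pdom):
    """Classify moving an operation from block src down to block dest.

    dom[b]:  set of blocks that dominate b
    pdom[b]: set of blocks that post-dominate b
    """
    return {
        # dest can be reached without passing through src, so the
        # operation would run on paths where it did not run before
        "extra_execution": src not in dom[dest],
        # control can leave src without reaching dest, so the
        # operation must also be placed on those other paths
        "needs_compensation": dest not in pdom[src],
    }

# Flow graph B1 -> B2 -> B3 with a second edge B1 -> B3 (sets written out by hand).
dom = {"B1": {"B1"}, "B2": {"B1", "B2"}, "B3": {"B1", "B3"}}
pdom = {"B1": {"B1", "B3"}, "B2": {"B2", "B3"}, "B3": {"B3"}}

print(downward_move("B1", "B3", dom, pdom))  # {'extra_execution': False, 'needs_compensation': False}
print(downward_move("B2", "B3", dom, pdom))  # {'extra_execution': True, 'needs_compensation': False}
print(downward_move("B1", "B2", dom, pdom))  # {'extra_execution': False, 'needs_compensation': True}
```

Moving `c = b` from B2 down into B3 is the risky case for a write: it would overwrite c even when B2 was skipped, which is why block replication or predication is suggested.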

Conclusion
- Processors can execute several instructions in parallel
- We take advantage of this by moving code
- Code can be moved if no dependences are violated, but sometimes at a cost.