Processor Architectures and Program Mapping. Exploiting ILP part 2: code generation

- Slides: 46
Processor Architectures and Program Mapping
Exploiting ILP part 2: code generation
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on
9/17/2020 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
[Figure: the trajectory above, with "Error messages" and "Library code" as side boxes]
Compiler basics: structure / passes
Source code
→ Lexical analyzer: token generation
→ Parsing: check syntax, check semantics, parse tree generation
→ Intermediate code
→ Code optimization: data flow analysis, local optimizations, global optimizations
→ Code generation: code selection, peephole optimizations
→ Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
→ Sequential code
→ Scheduling and allocation: exploiting ILP
→ Object code
Compiler basics: structure
Simple compilation example

Source:
    position := initial + rate * 60
Lexical analyzer:
    id1 := id2 + id3 * 60
Syntax analyzer:
    [Figure: parse tree of id1 := id2 + id3 * 60]
Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
Code generator:
    movf id3, r2
    mulf #60, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
Compiler basics: Control flow graph (CFG)
C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }
CFG:
    1: sub t1, a, b
       bgz t1, 2, 3
    2: rem r, a, b
       goto 4
    3: rem r, b, a
       goto 4
    4: ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, each instruction consists of several transports, ...
Mapping / Scheduling: placing operations in space and time
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
[Figure: Data Dependence Graph (DDG) of the code above]
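As an illustration of how such a graph can be derived, here is a minimal sketch that extracts the flow (read-after-write) dependences between the five statements on this slide. It works at statement granularity, whereas the DDG in the figure is drawn per operation; the function name `build_ddg` and the simple regular-expression parsing are assumptions made for this example only.

```python
import re

def build_ddg(statements):
    """Return sorted flow-dependence edges (producer index, consumer index)."""
    last_def = {}   # variable name -> index of the statement that last wrote it
    edges = set()
    for idx, stmt in enumerate(statements):
        target, expr = [part.strip() for part in stmt.split("=")]
        # Every identifier read on the right-hand side may depend on a producer.
        for name in re.findall(r"[A-Za-z_]\w*", expr):
            if name in last_def:
                edges.add((last_def[name], idx))
        last_def[target] = idx
    return sorted(edges)

code = ["d = a * b", "e = a + d", "f = 2 * b + d", "r = f - e", "x = z + y"]
edges = build_ddg(code)
for producer, consumer in edges:
    print(code[producer], "->", code[consumer])
```

Statement 4 (`x = z + y`) has no edges at all, which is why a scheduler is free to place it in any cycle.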
How to map these operations?
Architecture constraints:
• One Function Unit
• All operations single cycle latency
[Figure: DDG of the example next to a schedule table for cycles 1 to 6, one operation per cycle]
How to map these operations?
Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency
[Figure: DDG of the example next to a schedule table with columns Mul and Add-sub over cycles 1 to 6]
There are many mapping solutions
[Figure: solution space plotted as execution time T against cost; each x is one mapping, and the lower-left boundary is the Pareto curve]
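To make the notion of a Pareto curve concrete, the sketch below filters a set of (cost, execution time) points down to the non-dominated ones. The design points are made-up numbers, not results from the course, and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Keep the (cost, time) points that no other point beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical design points: (cost, execution time).
designs = [(1, 6), (2, 4), (3, 4), (4, 3), (5, 5)]
front = pareto_front(designs)
print(front)
```

A point such as (3, 4) is dropped because (2, 4) is cheaper and equally fast; only the remaining points are worth considering as design candidates.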
Basic Block Scheduling
• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in first cycle with sufficient resources
Note:
– Scheduling order sequential
– Priority determined by used heuristic; e.g. slack
Basic Block Scheduling
[Figure: dependence graph over inputs A, B, C and results X, y, z. Each operation is annotated with <ASAP cycle, ALAP cycle>; the difference between the two is its slack. Labels recoverable from the figure: ADD <2, 2>, SUB <3, 3>, NEG <1, 1>, LD <2, 3>, ADD <4, 4>, ADD <1, 3>, LD, MUL <2, 4>, <1, 4>]
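The ASAP, ALAP, and slack values can be computed with one forward and one backward pass over the dependence graph. The following is a minimal sketch, assuming unit latency and using the operations of the earlier example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y) rather than the graph in the figure. The operation names (`t` stands for the intermediate product 2 * b) and the function `asap_alap` are chosen for this illustration.

```python
def asap_alap(ops, edges, latency=1):
    """ops must be listed in topological order; cycles are numbered from 1."""
    preds = {v: [] for v in ops}
    succs = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    asap = {}
    for v in ops:  # forward pass: as early as the predecessors allow
        asap[v] = max((asap[u] + latency for u in preds[v]), default=1)
    length = max(asap.values())  # minimal schedule length in cycles

    alap = {}
    for v in reversed(ops):  # backward pass: as late as the successors allow
        alap[v] = min((alap[w] - latency for w in succs[v]), default=length)

    slack = {v: alap[v] - asap[v] for v in ops}
    return asap, alap, slack

ops = ["d", "t", "e", "f", "r", "x"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
asap, alap, slack = asap_alap(ops, edges)
print(asap, alap, slack)
```

Only `x` has non-zero slack here, so a slack-based priority would schedule the five operations on the critical path first and fit `x` wherever a unit is free.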
Cycle based list scheduling
    proc Schedule(DDG = (V, E))
    beginproc
      ready  = { v | ¬∃ (u, v) ∈ E }
      ready' = ready
      sched  = ∅
      current_cycle = 0
      while sched ≠ V do
        for each v ∈ ready' do
          if ¬ResourceConfl(v, current_cycle, sched) then
            cycle(v) = current_cycle
            sched = sched ∪ {v}
          endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched ∧ ∀ (u, v) ∈ E, u ∈ sched }
        ready' = { v | v ∈ ready ∧ ∀ (u, v) ∈ E, cycle(u) + delay(u, v) ≤ current_cycle }
      endwhile
    endproc
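A small runnable version of this procedure may help. The sketch below is an assumption-laden simplification: resources are modelled as a count per unit type, the priority is simply the order of the `ops` list, and every edge has the same delay. It is applied to the earlier example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y, with `t` standing for 2 * b) under both sets of architecture constraints; the names `list_schedule`, `unit_of`, and `units` are hypothetical.

```python
def list_schedule(ops, edges, unit_of, units, delay=1):
    """Cycle-based list scheduling; returns op -> cycle (cycles start at 0)."""
    preds = {v: [u for u, w in edges if w == v] for v in ops}
    cycle = {}
    current = 0
    while len(cycle) < len(ops):
        used = {}  # unit type -> number of units already occupied this cycle
        for v in ops:  # list order acts as the priority heuristic
            if v in cycle:
                continue
            ready = all(
                u in cycle and cycle[u] + delay <= current for u in preds[v]
            )
            kind = unit_of[v]
            if ready and used.get(kind, 0) < units[kind]:
                cycle[v] = current
                used[kind] = used.get(kind, 0) + 1
        current += 1
    return cycle

ops = ["d", "t", "e", "f", "r", "x"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]

# One function unit that executes everything.
one_fu = list_schedule(ops, edges, {v: "fu" for v in ops}, {"fu": 1})

# One Mul unit and one Add-sub unit.
kinds = {"d": "mul", "t": "mul", "e": "addsub", "f": "addsub",
         "r": "addsub", "x": "addsub"}
two_fu = list_schedule(ops, edges, kinds, {"mul": 1, "addsub": 1})

print(max(one_fu.values()) + 1, "cycles with one FU")
print(max(two_fu.values()) + 1, "cycles with Mul + Add-sub")
```

With a single unit the six operations need six cycles; with separate Mul and Add-sub units the same code fits in four, because `x` and the second multiplication overlap with other work.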
Extended basic block scheduling: Code Motion
    Block A:  a) add r4, 4
              b) beq ...
    Block B:  c) add r1, r2
    Block C:  d) sub r1, r2
    Block D:  e) st r1, 8(r4)
• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A
Extended Scheduling scope
Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;
CFG (Control Flow Graph):
    A → {B, C} → D → {E, F} → G
Scheduling scopes
[Figure: four panels showing the scopes Trace, Superblock, Decision tree, and Hyperblock/region]
Code movement (upwards) within regions
[Figure: an add operation moving upwards from the source block through intermediate blocks to the destination block. Legend: copy needed; check for off-liveness; code movement]
Extended basic block scheduling: Code Motion
• A dominates B: A is always executed before B
  – Consequently: if A does not dominate B, code motion from B to A requires code duplication
• B post-dominates A: B is always executed after A
  – Consequently: if B does not post-dominate A, code motion from B to A is speculative
[Figure: CFG with blocks A to F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
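Dominance can be computed with the classic iterative set-intersection algorithm. The sketch below is one way to do it, not necessarily what the course tools use. Because the CFG of this slide's figure is not available in the text, it uses the CFG of the "Extended Scheduling scope" slide (A → {B, C} → D → {E, F} → G) instead. Post-dominators are obtained by running the same routine on the reversed graph, starting from the exit block.

```python
def dominators(cfg, entry):
    """cfg: block -> list of successors. Returns block -> set of dominators."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
dom = dominators(cfg, "A")

# Post-dominators: dominators of the reversed CFG, rooted at the exit block.
reverse = {n: [] for n in cfg}
for n, succs in cfg.items():
    for s in succs:
        reverse[s].append(n)
pdom = dominators(reverse, "G")

print("B dominates D:", "B" in dom["D"])
print("D post-dominates B:", "D" in pdom["B"])
```

In this CFG, B does not dominate D, so moving an operation from D up into B would also need a copy in C. D does post-dominate B, so an operation moved from D up into B is not speculative.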
Scheduling: Loops
Loop optimizations:
[Figure: a loop with body C between blocks A, B, and D, shown after loop peeling (one copy C' of the body placed before the loop) and after loop unrolling (copies C' and C'' of the body inside the loop)]
Scheduling: Loops
Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]
Software pipelining
• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge
Example: y = a·x, with loop body LD, ML, ST
• Original loop: 3 cycles/iteration
• Unrolling: 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration
Software pipelining (cont’d)
Basic techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  Examples:
  • Perfect pipelining (Aiken and Nicolau)
  • URPR (Su, Ding and Xia)
  • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop
    for (i = 0; i < n; i++)
        a[i+6] = 3 * a[i] - 1;
(b) Code without loop control
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
(c) Software pipeline
    [Figure: overlapped copies of the four operations, divided into Prologue, Kernel, and Epilogue]
• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
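The shape of prologue, kernel, and epilogue can be reproduced with a few lines of code. This is a sketch under simplifying assumptions that are not stated on the slide: an initiation interval of 1, unit latency for every operation, and enough function units to issue all four operations in the same cycle. The helper name `software_pipeline` is hypothetical.

```python
def software_pipeline(stages, n_iters):
    """Per-cycle list of (operation, iteration) pairs for II = 1."""
    depth = len(stages)
    rows = []
    for t in range(n_iters + depth - 1):
        # Stage s of iteration i executes in cycle i + s.
        row = [(op, t - s) for s, op in enumerate(stages)
               if 0 <= t - s < n_iters]
        rows.append(row)
    return rows

stages = ["ld", "mul", "sub", "st"]
n = 6
rows = software_pipeline(stages, n)
prologue = rows[:len(stages) - 1]     # pipeline filling up
kernel = rows[len(stages) - 1:n]      # all four operations active
epilogue = rows[n:]                   # pipeline draining

for t, row in enumerate(rows):
    print(t, " ".join(f"{op}[{i}]" for op, i in row))
```

Every kernel row contains one operation from each of four different iterations, which is where the throughput of one iteration per cycle comes from.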
Software pipelining: determine II, Initiation Interval
Cyclic data dependences in:
    for (i = 0; ...)
        A[i+6] = 3 * A[i] - 1
Operations, with dependence edges labelled (delay, distance):
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
– between each pair of consecutive operations: edges (1, 0) and (0, 1)
– from st back to ld: edge (1, 6), because the value stored to A[i+6] is loaded six iterations later
Scheduling constraint for every edge (u, v):
    cycle(v) ≥ cycle(u) + delay(u, v) - II · distance(u, v)
Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
    MII = max{ ResMII, RecMII }
Resources: [equation not present in the extracted text]
Cycles: [equation not present in the extracted text]
Therefore: [equation not present in the extracted text]
Or: [equation not present in the extracted text]
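Since the equations of this slide are missing from the text, the sketch below falls back on the usual textbook definitions, which may differ in notation from the original slide. ResMII is the largest value of ceil(uses / available units) over all resource classes. RecMII is the smallest II for which no dependence cycle has a positive sum of delay - II · distance. The edge list is read from the (delay, distance) labels of the preceding slide, with the direction of the (0, 1) edges taken to be backwards; that direction, the resource counts, and all function names are assumptions.

```python
from math import ceil

def res_mii(uses, available):
    """Resource bound: the busiest resource class, rounded up."""
    return max(ceil(uses[r] / available[r]) for r in uses)

def has_positive_cycle(nodes, edges, ii):
    """Longest-path Floyd-Warshall on edge weights delay - ii * distance."""
    neg = float("-inf")
    dist = {(u, v): neg for u in nodes for v in nodes}
    for u, v, delay, distance in edges:
        dist[u, v] = max(dist[u, v], delay - ii * distance)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] > dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return any(dist[v, v] > 0 for v in nodes)

def rec_mii(nodes, edges):
    """Smallest II >= 1 that satisfies every cyclic dependence."""
    limit = sum(delay for _, _, delay, _ in edges) + 1
    for ii in range(1, limit + 1):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle has distance 0")

nodes = ["ld", "mul", "sub", "st"]
edges = [("ld", "mul", 1, 0), ("mul", "sub", 1, 0), ("sub", "st", 1, 0),
         ("mul", "ld", 0, 1), ("sub", "mul", 0, 1), ("st", "sub", 0, 1),
         ("st", "ld", 1, 6)]

rec = rec_mii(nodes, edges)
# Hypothetical machine: ld and st share one memory unit.
res = res_mii({"mem": 2, "mul": 1, "alu": 1}, {"mem": 1, "mul": 1, "alu": 1})
mii = max(res, rec)
print("RecMII =", rec, "ResMII =", res, "MII =", mii)
```

On this hypothetical machine the single memory unit, not the recurrence, limits the loop to one iteration every two cycles; with two memory units the bound drops to one cycle per iteration.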
The Role of the Compiler
9 steps required to translate an HLL program:
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: Scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports
Division of responsibilities between hardware and compiler
[Figure: the steps Application, Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, and Execute, each marked as the responsibility of the compiler or of the hardware for the architecture classes Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, and TTA]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on
Hands-on (not this year)
• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks
Hands-on
• Let’s look at DSE: Design Space Exploration
• We will use the Imagine processor
• http://cva.stanford.edu/projects/imagine/
Mapping applications to processors
MOVE framework
[Figure: the MOVE framework. Boxes: User interaction, Optimizer, Architecture parameters, Parametric compiler, Hardware generator. The compiler produces parallel object code and the hardware generator produces the chip; together they form the TTA based system. Both send feedback to the optimizer. Inset: Pareto curve (solution space) of execution time against cost]
Code generation trajectory for TTAs
• Frontend: GCC or SUIF (adapted)
Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code
[Figure: the flow above, with the boxes Architecture description, Sequential simulation, Parallel simulation, Profiling data, and Input/Output attached to it]
Exploration: TTA resource reduction
Exploration: TTA connectivity reduction
[Figure: execution time against number of connections removed. Annotations: "Reducing bus delay", "FU stage constrains cycle time", "Critical connections disappear"]
Can we do better?
[Figure: execution time against cost]
How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors
Transforming the specification
[Figure: two arrangements of three + operations]
Based on associativity of the + operation:
    a + (b + c) = (a + b) + c
Transforming the specification
Original:
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
Transformed:
    r = 2*b - a;
    x = z + y;
[Figure: DDG of the transformed code: b << 1, minus a, gives r; z + y gives x]
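The transformation follows from substituting the definitions: r = f - e = (2*b + d) - (a + d) = 2*b - a, so the product d cancels, and 2*b can be computed as a left shift by one. The short check below confirms this for a range of integers; the function names are chosen for this illustration, and the equivalence is only claimed for integer arithmetic.

```python
def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x

def transformed(a, b, y, z):
    r = (b << 1) - a   # 2*b as a shift; the multiplication a*b is gone
    x = z + y
    return r, x

values = range(-3, 4)
same = all(
    original(a, b, y, z) == transformed(a, b, y, z)
    for a in values for b in values for y in values for z in values
)
print("equivalent on all tested inputs:", same)
```

The transformed code needs one shift, one subtraction, and one addition instead of two multiplications, two additions, one subtraction, and the final addition.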
Changing the architecture
Adding SFUs: special function units
[Figure: three + operations replaced by one 4-input adder]
Why is this faster?
Changing the architecture
Adding SFUs: special function units
In the extreme case put everything into one unit!
Spatial mapping: no control flow
However: no flexibility / programmability!
SFUs: fine grain patterns
• Why use fine-grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!
Which patterns need support?
• Detection of recurring operation patterns needed
SFUs: covering results
Exploration: resulting architecture
Architecture for image processing:
• stream input
• 4 Addercmp FUs
• 4 RFs
• 2 Multiplier FUs
• 2 Diffadd FUs
• 9 buses
• stream output
Note the reduced connectivity
Conclusions
• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs are even more flexible, scalable, and lower power
Conclusions
• Compilation for ILP architectures is maturing and is entering the commercial arena
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP
Bottom line: