Processor Architectures and Program Mapping
Exploiting ILP part 2: code generation
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on

9/17/2020 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling

Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
• The compiler produces error messages
• Library code enters at the Loader/Linker

Compiler basics: structure / passes
Source code
• Lexical analyzer: token generation
• Parsing: check syntax, check semantic, parse tree generation
Intermediate code
• Code optimization: data flow analysis, local optimizations, global optimizations
• Code generation: code selection, peephole optimizations
• Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
Sequential code
• Scheduling and allocation: exploiting ILP
Object code

Compiler basics: structure
Simple compilation example:
    position := initial + rate * 60
Lexical analyzer:
    id1 := id2 + id3 * 60
Syntax analyzer:
    [Figure: parse tree of := with operands id1 and id2 + (id3 * 60)]
Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
Code generator:
    movf id3, r2
    mulf #60, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1

Compiler basics: Control flow graph (CFG)
C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }
CFG:
    1: sub t1, a, b
       bgz t1, 2, 3
    2: rem r, a, b
       goto 4
    3: rem r, b, a
       goto 4
    4: ...
A program is a collection of functions; each function is a collection of basic blocks; each basic block contains a set of instructions; each instruction consists of several transports, ...

Mapping / Scheduling: placing operations in space and time
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
[Figure: Data Dependence Graph (DDG) of these statements]
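The dependence graph for the five statements above can be derived mechanically from definitions and uses. The following is a minimal sketch, not the course's tooling: `STATEMENTS` and `build_ddg` are hypothetical names, and the graph is built at statement granularity (the slide's DDG has one node per operation, so `f = 2 * b + d` would be two nodes there). Only read-after-write dependences are tracked.

```python
import re

# The five statements from the slide, as (destination, expression) pairs.
STATEMENTS = [
    ("d", "a * b"),
    ("e", "a + d"),
    ("f", "2 * b + d"),
    ("r", "f - e"),
    ("x", "z + y"),
]

def build_ddg(statements):
    """Return read-after-write edges (producer index, consumer index)."""
    last_writer = {}   # variable name -> index of the statement that defined it
    edges = set()
    for idx, (dest, expr) in enumerate(statements):
        # Every identifier in the expression is a use of that variable.
        for name in re.findall(r"[A-Za-z_]\w*", expr):
            if name in last_writer:
                edges.add((last_writer[name], idx))
        last_writer[dest] = idx
    return sorted(edges)

edges = build_ddg(STATEMENTS)
for producer, consumer in edges:
    print(STATEMENTS[producer][0], "->", STATEMENTS[consumer][0])
```

The statement `x = z + y` has no incoming or outgoing edge, which is why it can be placed in any cycle that has a free adder.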

How to map these operations?
Architecture constraints:
• One Function Unit
• All operations single cycle latency
[Figure: the DDG next to a schedule table with one operation per cycle, cycles 1 to 6]

How to map these operations?
Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency
[Figure: the DDG next to a schedule table with one column per unit (Mul, Add-sub) and one row per cycle]

There are many mapping solutions
[Figure: scatter plot of mapping solutions, execution time T versus cost, with the Pareto curve bounding the solution space]
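The Pareto curve contains the solutions that no other solution beats on both cost and execution time. A minimal sketch of that filter follows; the `designs` list is made-up illustrative data, not measurements from the course, and `pareto_front` is a hypothetical helper name.

```python
def pareto_front(points):
    """Keep the (cost, time) points that no other point dominates.

    A point q dominates p when q is no worse than p in both cost and time.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (cost, execution time) pairs for six mappings.
designs = [(1, 9), (2, 6), (3, 7), (4, 3), (5, 4), (6, 2)]
front = pareto_front(designs)
print(front)
```

The quadratic scan is adequate for the handful of design points a manual exploration produces; larger sweeps would sort by cost first.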

Basic Block Scheduling
• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in first cycle with sufficient resources
Note:
– Scheduling order sequential
– Priority determined by used heuristic; e.g. slack

Basic Block Scheduling
[Figure: dependence graph over inputs A, B, C and results X, y, z; each operation is labeled <ASAP cycle, ALAP cycle>, and the slack is marked on the graph]
Operation labels in the figure: NEG <1, 1>, ADD <2, 2>, SUB <3, 3>, ADD <4, 4>, LD <2, 3>, ADD <1, 3>, MUL <2, 4>, LD <1, 4>
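ASAP and ALAP cycles follow from one forward and one backward pass over the dependence graph, and slack is their difference. The sketch below assumes unit latency for every operation and a node list that is already in topological order. The small graph it runs on is hypothetical, since the edges of the slide's figure are not recoverable from the text.

```python
def asap_alap(nodes, edges):
    """nodes: operations in topological order; edges: (u, v) pairs.

    Assumes unit latency. Returns {node: (asap, alap, slack)} with
    cycles counted from 1.
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    asap = {}
    for n in nodes:                      # forward pass: as soon as possible
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    length = max(asap.values())          # minimal schedule length

    alap = {}
    for n in reversed(nodes):            # backward pass: as late as possible
        alap[n] = min((alap[s] for s in succs[n]), default=length + 1) - 1
    return {n: (asap[n], alap[n], alap[n] - asap[n]) for n in nodes}

# Hypothetical DDG: neg -> add1 -> sub -> add2, with ld feeding sub
# and mul feeding add2.
nodes = ["neg", "ld", "mul", "add1", "sub", "add2"]
edges = [("neg", "add1"), ("add1", "sub"), ("ld", "sub"),
         ("sub", "add2"), ("mul", "add2")]
info = asap_alap(nodes, edges)
for n in nodes:
    print(n, info[n])
```

Operations with slack 0 lie on the critical path, so a slack-based priority schedules them first.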

Cycle based list scheduling
proc Schedule(DDG = (V, E))
beginproc
    ready = { v | ¬∃ (u, v) ∈ E }
    ready' = ready
    sched = ∅
    current_cycle = 0
    while sched ≠ V do
        for each v ∈ ready' do
            if ¬ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched ∪ {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready = { v | v ∉ sched ∧ ∀ (u, v) ∈ E, u ∈ sched }
        ready' = { v | v ∈ ready ∧ ∀ (u, v) ∈ E, cycle(u) + delay(u, v) ≤ current_cycle }
    endwhile
endproc
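A runnable sketch of cycle based list scheduling, applied to the six operations of the earlier mapping example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y). It is a simplified rendering under stated assumptions, not the course's implementation: every dependence has the same delay, priority is simply the order in which operations are listed, and the names `list_schedule`, `mul_d`, `add_e`, and so on are invented for the example.

```python
def list_schedule(ops, edges, units, delay=1):
    """Cycle based list scheduling.

    ops:   {operation: unit type}; dict order is the priority order
    edges: (u, v) dependence pairs
    units: {unit type: number of units available per cycle}
    Returns {operation: cycle}, cycles counted from 0.
    """
    preds = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
    cycle = {}
    current = 0
    while len(cycle) < len(ops):
        free = dict(units)                     # units still free this cycle
        for v in ops:
            if v in cycle:
                continue
            ready = all(u in cycle and cycle[u] + delay <= current
                        for u in preds[v])
            if ready and free[ops[v]] > 0:     # no resource conflict
                cycle[v] = current
                free[ops[v]] -= 1
        current += 1
    return cycle

edges = [("mul_d", "add_e"), ("mul_d", "add_f"), ("mul_2b", "add_f"),
         ("add_e", "sub_r"), ("add_f", "sub_r")]
two_unit_ops = {"mul_d": "mul", "mul_2b": "mul", "add_e": "addsub",
                "add_f": "addsub", "sub_r": "addsub", "add_x": "addsub"}
one_unit_ops = {op: "fu" for op in two_unit_ops}

two = list_schedule(two_unit_ops, edges, {"mul": 1, "addsub": 1})
one = list_schedule(one_unit_ops, edges, {"fu": 1})
print(max(two.values()) + 1, "cycles with Mul + Add-sub units")
print(max(one.values()) + 1, "cycles with one function unit")
```

Under these assumptions the single function unit needs one cycle per operation, while the two-unit machine is limited by the chain of four dependent operations through `mul_2b`, `add_f`, and `sub_r`.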

Extended basic block scheduling: Code Motion
    Block A:  a) add r4, 4
              b) beq ...
    Block B:  c) add r1, r2
    Block C:  d) sub r1, r2
    Block D:  e) st r1, 8(r4)
• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A

Extended Scheduling scope
Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;
CFG (Control Flow Graph):
[Figure: A branches to B and C, which join in D; D branches to E and F, which join in G]

Scheduling scopes
[Figure: four scheduling scopes drawn on a CFG: Trace, Superblock, Decision tree, Hyperblock/region]

Code movement (upwards) within regions
[Figure: an add operation moves upward from the source block, past intermediate blocks, to the destination block. Legend: copy needed; check for off-liveness; code movement]

Extended basic block scheduling: Code Motion
• A dominates B ⇔ A is always executed before B
  – Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
• B post-dominates A ⇔ B is always executed after A
  – Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
[Figure: CFG with blocks A to F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
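Dominator sets can be computed with a simple iterative data-flow pass, and post-dominators are the dominators of the reversed CFG. The sketch below uses the CFG of the earlier "Extended Scheduling scope" slide (A; if B else C; D; if E else F; G), because the edges of the A to F figure on this slide are not recoverable from the text. It assumes every block is reachable from the entry; the function names are illustrative.

```python
def dominators(succ, entry):
    """Iterative dominator computation.

    succ: {block: [successor blocks]}; returns {block: set of dominators}.
    Assumes every block is reachable from entry.
    """
    blocks = list(succ)
    preds = {b: [] for b in blocks}
    for b, targets in succ.items():
        for t in targets:
            preds[t].append(b)
    dom = {b: set(blocks) for b in blocks}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            # A block is dominated by itself plus whatever dominates all preds.
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def post_dominators(succ, exit_block):
    """Post-dominators are the dominators of the reversed CFG."""
    reverse = {b: [] for b in succ}
    for b, targets in succ.items():
        for t in targets:
            reverse[t].append(b)
    return dominators(reverse, exit_block)

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
dom = dominators(cfg, "A")
pdom = post_dominators(cfg, "G")
print(sorted(dom["D"]), sorted(pdom["A"]))
```

In this CFG, B does not dominate D, so moving an operation from D up into B alone would also require a copy in C. B does not post-dominate A, so moving an operation from B up into A makes it speculative.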

Scheduling: Loops
Loop Optimizations:
[Figure: CFGs illustrating loop peeling and loop unrolling; the loop body C is replicated as C' and C'']

Scheduling: Loops
Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion
[Figure: resource utilization versus time, comparing basic block scheduling and unrolling with software pipelining]

Software pipelining
• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge
Example: y = a.x
[Figure: three schedules of the LD, ML, ST operations]
• 3 cycles/iteration
• Unrolling: 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration

Software pipelining (cont'd)
Basic techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  Examples:
  • Perfect pipelining (Aiken and Nicolau)
  • URPR (Su, Ding and Xia)
  • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge

Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop:
    for (i = 0; i < n; i++)
        a[i+6] = 3 * a[i] - 1;
(b) Code without loop control:
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
(c) Software pipeline:
[Figure: overlapping copies of the ld, mul, sub, st sequence, divided into Prologue, Kernel, and Epilogue]
• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
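The prologue, kernel, and epilogue can be made visible by listing which operation of which iteration runs in each cycle. This is a sketch under assumptions the slide does not state explicitly: each of the four operations takes one cycle, a new iteration starts every `ii` cycles, and the machine has enough units to run all four stages at once. `pipeline_table` is a hypothetical helper name.

```python
def pipeline_table(stages, iterations, ii=1):
    """Return rows of (cycle, [(operation, iteration), ...]).

    Iteration k starts at cycle k * ii; stage s of an iteration runs
    s cycles after that iteration starts.
    """
    last_cycle = (iterations - 1) * ii + len(stages) - 1
    rows = []
    for cycle in range(last_cycle + 1):
        active = []
        for s, op in enumerate(stages):
            start = cycle - s            # cycle in which this iteration began
            if start >= 0 and start % ii == 0 and start // ii < iterations:
                active.append((op, start // ii))
        rows.append((cycle, active))
    return rows

rows = pipeline_table(["ld", "mul", "sub", "st"], iterations=6)
for cycle, active in rows:
    print(cycle, " ".join(f"{op}[{it}]" for op, it in active))
```

Rows with fewer than four active operations at the start form the prologue, the rows with all four form the kernel, and the remaining rows form the epilogue.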

Software pipelining: determine II, Initiation Interval
    for (i = 0; ...)
        A[i+6] = 3*A[i] - 1
Cyclic data dependences, with edges labeled (delay, distance):
    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)
• Between each pair of consecutive operations: edges (1, 0) and (0, 1)
• Between st and ld: edge (1, 6)
Constraint for every edge (u, v):
    cycle(v) ≥ cycle(u) + delay(u, v) − II · distance(u, v)

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
    MII = max{ ResMII, RecMII }
Resources:
    ResMII = max over r ∈ resources of ⌈ used(r) / available(r) ⌉
Cycles, for every dependence cycle c through an operation v:
    cycle(v) ≥ cycle(v) + Σ(e ∈ c) delay(e) − II · Σ(e ∈ c) distance(e)
Therefore:
    RecMII = min{ II ∈ ℕ | ∀ c ∈ cycles: 0 ≥ Σ(e ∈ c) delay(e) − II · Σ(e ∈ c) distance(e) }
Or:
    RecMII = max over c ∈ cycles of ⌈ Σ(e ∈ c) delay(e) / Σ(e ∈ c) distance(e) ⌉
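A small numeric sketch of the two bounds, using a common formulation: the resource bound is the largest ratio of operations needing a unit type to units of that type, rounded up, and the recurrence bound is the largest ratio of total delay to total distance around any dependence cycle, rounded up. The cycle used here is ld → mul → sub → st → ld from the A[i+6] = 3*A[i] - 1 example; the machine description (one load/store unit, one ALU) is hypothetical, and cycles are passed in explicitly rather than enumerated from a graph.

```python
from math import ceil

def res_mii(uses, available):
    """uses: {unit type: operations per iteration};
    available: {unit type: number of units}."""
    return max(ceil(uses[r] / available[r]) for r in uses)

def rec_mii(cycles):
    """cycles: dependence cycles, each a list of (delay, distance) edges."""
    return max(
        ceil(sum(delay for delay, _ in c) / sum(dist for _, dist in c))
        for c in cycles
    )

# Edges of the cycle ld -> mul -> sub -> st -> ld, labeled (delay, distance).
loop_cycle = [(1, 0), (1, 0), (1, 0), (1, 6)]
rec = rec_mii([loop_cycle])              # total delay 4, total distance 6

# Hypothetical machine: ld and st share one load/store unit,
# mul and sub share one ALU.
res = res_mii({"ls": 2, "alu": 2}, {"ls": 1, "alu": 1})

mii = max(res, rec)
print(rec, res, mii)
```

On this assumed machine the resources, not the recurrence, limit the initiation interval; with four independent units the resource bound would drop to 1 as well.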

The Role of the Compiler
9 steps required to translate an HLL program:
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: Scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports

Division of responsibilities between hardware and compiler
Steps, from application to execution: Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute
The compiler is responsible for the steps up to an architecture-dependent boundary; the hardware is responsible from that step on:
• Superscalar: hardware starts at Determine Dependencies
• Dataflow: hardware starts at Binding of Operands
• Multi-threaded: hardware starts at Scheduling
• Indep. Arch: hardware starts at Binding of Operations
• VLIW: hardware starts at Binding of Transports
• TTA: hardware starts at Execute

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Hands-on

Hands-on (not this year)
• Map JPEG to a TTA processor
  – see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
• Install TTA tools (compiler and simulator)
• Go through all listed steps
• Perform DSE: design space exploration
• Add SFU
• 1 or 2 page report in 2 weeks

Hands-on
• Let's look at DSE: Design Space Exploration
• We will use the Imagine processor
• http://cva.stanford.edu/projects/imagine/

Mapping applications to processors
MOVE framework
[Figure: user interaction and an optimizer set the architecture parameters; the parametric compiler and the hardware generator return feedback to the optimizer; the explored solutions form a Pareto curve (solution space) of execution time versus cost; the framework outputs parallel object code and a chip, which together form a TTA based system]

Code generation trajectory for TTAs
• Frontend: GCC or SUIF (adapted)
[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The architecture description and profiling data feed the compiler backend. Sequential simulation (with input/output) runs the sequential code and yields the profiling data; parallel simulation (with input/output) runs the parallel code]

Exploration: TTA resource reduction

Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]

Can we do better? How?
• Transformations
• SFUs: Special Function Units
• Multiple Processors
[Figure: execution time versus cost]

Transforming the specification
[Figure: two graphs built from three + operations]
Based on associativity of + operation:
    a + (b + c) = (a + b) + c

Transforming the specification
Original:
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
Transformed:
    r = 2*b - a;
    x = z + y;
[Figure: reduced DDG: b shifted left by 1 (<<), minus a, gives r; z + y gives x]
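The simplification follows from substituting the definitions: r = f - e = (2*b + d) - (a + d) = 2*b - a, so d cancels and both multiplications disappear. The sketch below checks this on integers, with 2*b written as a shift as in the figure. The function names are illustrative, and the equality is exact only for integer arithmetic; with floating point, cancelling d can change rounding.

```python
def original(a, b, y, z):
    """The five statements of the original specification."""
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x

def transformed(a, b, y, z):
    """The specification after algebraic simplification."""
    r = (b << 1) - a      # 2*b as a shift by 1
    x = z + y
    return r, x

# Compare both versions over a small grid of integer inputs.
same = all(
    original(a, b, 1, 2) == transformed(a, b, 1, 2)
    for a in range(-4, 5)
    for b in range(-4, 5)
)
print(same)
```

The transformed graph needs one shift, one subtraction, and one addition, and its longest dependence chain is two operations instead of four.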

Changing the architecture
adding SFUs: special function units
[Figure: three + operations replaced by one 4-input adder]
Why is this faster?

Changing the architecture
adding SFUs: special function units
• In the extreme case put everything into one unit!
• Spatial mapping: no control flow
• However: no flexibility / programmability!

SFUs: fine grain patterns
• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports whole application domain!
• Which patterns need support?
• Detection of recurring operation patterns needed

SFUs: covering results

Exploration: resulting architecture
Architecture for image processing:
• stream input and stream output
• 4 Addercmp FUs
• 2 Multiplier FUs
• 2 Diffadd FUs
• 4 RFs
• 9 buses
• Note the reduced connectivity

Conclusions
• Billions of embedded processing systems
  – how to design these systems quickly, cheaply, correctly, with low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to application domain
• TTAs even more flexible, scalable, and lower power

Conclusions
• Compilation for ILP architectures is getting mature and is entering the commercial area
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques needed to exploit ILP
