Galois Performance
Mario Mendez-Lojo, Donald Nguyen

Overview
• Galois system is a test bed to explore optimizations
  – Safe but not fast out of the box
• Important optimizations
  – Select least transactional overhead
  – Select right scheduling
  – Select appropriate data structure
• Quantify optimizations on applications

Algorithms
[Figure: classification tree of irregular algorithms. Branch labels: topology (general graph, grid, tree), operator (morph, local computation, reader), ordering (unordered)]
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push

Methodology
[Figure: timeline with axes Threads and Time; regions labeled Serial, Idle, GC, Compute]
• Abort ratio: aborted iterations / total iterations
• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1

Terms
• Base
  – Default scheduling, default graph
• Serial
  – Galois classes replaced by classes without concurrency control
• Speedup
  – Measured against the best mean performance of a serial variant
• Throughput
  – # serial iterations / time
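
The abort ratio from the Methodology slide and the speedup and throughput terms above are simple ratios. A minimal sketch in Java follows; the class name `GaloisMetrics` and its method names are illustrative assumptions and are not part of Galois. Speedup is read here as the time of the best serial variant divided by the parallel time, which matches how the results slides report it.

```java
public final class GaloisMetrics {
    private GaloisMetrics() {}

    /** Abort ratio = aborted iterations / total iterations. */
    public static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0) {
            throw new IllegalArgumentException("totalIterations must be positive");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup = time of the best serial variant / parallel time. */
    public static double speedup(double bestSerialMillis, double parallelMillis) {
        if (parallelMillis <= 0) {
            throw new IllegalArgumentException("parallelMillis must be positive");
        }
        return bestSerialMillis / parallelMillis;
    }

    /** Throughput = number of serial iterations / time. */
    public static double throughput(long serialIterations, double millis) {
        if (millis <= 0) {
            throw new IllegalArgumentException("millis must be positive");
        }
        return serialIterations / millis;
    }

    public static void main(String[] args) {
        // Times taken from the Barnes-Hut results slide of this deck.
        System.out.printf("Barnes-Hut speedup: %.1fX%n", speedup(10271, 1553));
        // The iteration counts below are made up purely for illustration.
        System.out.printf("Abort ratio: %.2f%n", abortRatio(25, 100));
    }
}
```

Applying `speedup` to the serial and best parallel times on the three results slides gives values that round to the reported 6.6X, 4.5X and 3.1X.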

Numbers
• Runtime
  – Last of 5 runs in same VM
  – Ignore time to read and construct initial graph
• Other statistics
  – Last of 5 runs
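
The measurement rule above (last of 5 runs in the same VM, input construction excluded) can be sketched as a small timing harness. This is an illustration under assumptions, not the authors' harness: the class `LastOfNTimer` and its parameters are hypothetical, and the sketch uses lambdas, so it needs a newer Java than the 1.6 VM in the test environment. Running every repetition in one VM presumably lets JIT compilation warm up before the run that counts.

```java
public final class LastOfNTimer {
    private LastOfNTimer() {}

    /**
     * Runs setup + kernel `runs` times in this VM and returns the elapsed
     * nanoseconds of the last kernel run only. Setup time (for example
     * reading and constructing the initial graph) is never included.
     */
    public static long timeLastRunNanos(Runnable setup, Runnable kernel, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long lastElapsed = 0;
        for (int i = 0; i < runs; i++) {
            setup.run();                          // not timed
            long start = System.nanoTime();
            kernel.run();                         // timed
            lastElapsed = System.nanoTime() - start;
        }
        return lastElapsed;                       // earlier runs are discarded
    }

    public static void main(String[] args) {
        long[] sink = new long[1];
        long nanos = timeLastRunNanos(
                () -> sink[0] = 0,                // stand-in for graph construction
                () -> { for (int i = 0; i < 1_000_000; i++) sink[0] += i; },
                5);
        System.out.println("last run took " + nanos / 1_000_000.0 + " ms");
    }
}
```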

Test Environment
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size

BARNES-HUT
Most Distant Galaxy Candidates in the Hubble Ultra Deep Field

Barnes-Hut
• N-body algorithm
  – Oct-tree acceleration structure
  – Serial
    • Tree build, center of mass, particle update
  – Parallel
    • Force computation
• Structure
  – Reader on tree
• Variants
  – Splash2, Reader Galois

Reader Optimization
Before:
    child = octree.getNeighbor(nn, 1);
After:
    child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

ParaMeter Profile

Barnes-Hut Results
Best serial: base
Serial time: 10271 ms
Best // time: 1553 ms
Best speedup: 6.6X
100,000 points, 1 time step

Barnes-Hut Scalability

DELAUNAY MESH REFINEMENT

Delaunay Mesh Refinement
• Refine “bad” triangles
  – Maintained in worklist
• Structure
  – Cautious operator on graph
• Variants
  – Flag optimized, locallifo
base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

Cautious Optimization
• No need to save undo info
• Only check conflicts up to first write
Without flags:
    mesh.contains(item);
    ...
    mesh.remove(preNodes.get(i));
    ...
    mesh.add(node);
With flags:
    mesh.contains(item, MethodFlag.CHECK_CONFLICT);
    ...
    mesh.remove(preNodes.get(i), MethodFlag.NONE);
    ...
    mesh.add(node, MethodFlag.NONE);
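
The effect of the flag changes on this slide can be mimicked with a toy model that only counts the bookkeeping each call would trigger. Everything in the sketch is hypothetical: `ToyFlaggedSet` and `ToyFlag` are stand-ins invented for illustration and are not the Galois classes, and treating an unflagged call as doing both a conflict check and an undo record is an assumption based on the slide's two bullets.

```java
import java.util.HashSet;
import java.util.Set;

public final class ToyFlaggedSet {
    /** Hypothetical stand-in for the deck's MethodFlag; ALL models an unflagged call. */
    public enum ToyFlag { NONE, CHECK_CONFLICT, ALL }

    private final Set<Integer> items = new HashSet<>();
    private int conflictChecks = 0;
    private int undoRecords = 0;

    /** Counts the overhead a call with this flag would pay in the toy model. */
    private void payOverhead(ToyFlag flag, boolean mutates) {
        if (flag != ToyFlag.NONE) {
            conflictChecks++;             // conflict detection is performed
        }
        if (flag == ToyFlag.ALL && mutates) {
            undoRecords++;                // undo information is saved for writes
        }
    }

    public boolean contains(int x, ToyFlag flag) {
        payOverhead(flag, false);
        return items.contains(x);
    }

    public boolean add(int x, ToyFlag flag) {
        payOverhead(flag, true);
        return items.add(x);
    }

    public boolean remove(int x, ToyFlag flag) {
        payOverhead(flag, true);
        return items.remove(x);
    }

    public int conflictChecks() { return conflictChecks; }
    public int undoRecords() { return undoRecords; }

    /** Replays the slide's contains / remove / add sequence; returns {checks, undo records}. */
    public static int[] replay(ToyFlag readFlag, ToyFlag writeFlag) {
        ToyFlaggedSet mesh = new ToyFlaggedSet();
        mesh.add(1, ToyFlag.NONE);        // setup, deliberately free of overhead
        mesh.contains(1, readFlag);
        mesh.remove(1, writeFlag);
        mesh.add(2, writeFlag);
        return new int[] { mesh.conflictChecks(), mesh.undoRecords() };
    }

    public static void main(String[] args) {
        int[] plain = replay(ToyFlag.ALL, ToyFlag.ALL);
        int[] flagged = replay(ToyFlag.CHECK_CONFLICT, ToyFlag.NONE);
        System.out.println("unflagged: checks=" + plain[0] + " undo=" + plain[1]);
        System.out.println("flagged:   checks=" + flagged[0] + " undo=" + flagged[1]);
    }
}
```

In this toy model the unflagged sequence pays three conflict checks and two undo records, while the flagged sequence pays one conflict check and no undo records. That is the saving a cautious operator allows, because all of its reads happen before its first write.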

LIFO Optimization
Before:
    GaloisRuntime.foreach(..., Priority.defaultOrder());
After:
    GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
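
To make the scheduling policy on this slide concrete, here is a single-threaded toy simulation of "chunked FIFO globally, LIFO locally". It is a sketch under assumptions and not how Galois implements its worklists: the class `ToyChunkedFifoLocalLifo`, the `spawn` operator, and the idea of one worker draining each chunk are all hypothetical. Chunks leave a global queue in FIFO order, and inside a chunk the most recently added item, including newly created work, is processed first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

public final class ToyChunkedFifoLocalLifo {
    private ToyChunkedFifoLocalLifo() {}

    /** Returns the order in which items are processed under the toy policy. */
    public static List<Integer> run(List<Integer> initial, int chunkSize,
                                    Function<Integer, List<Integer>> operator) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Global level: split the initial work into chunks, served first-in first-out.
        Deque<List<Integer>> globalFifo = new ArrayDeque<>();
        for (int i = 0; i < initial.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, initial.size());
            globalFifo.addLast(new ArrayList<>(initial.subList(i, end)));
        }
        List<Integer> processed = new ArrayList<>();
        while (!globalFifo.isEmpty()) {
            // Local level: a stack, so the newest item is always taken next.
            Deque<Integer> localLifo = new ArrayDeque<>();
            for (Integer item : globalFifo.pollFirst()) {
                localLifo.push(item);
            }
            while (!localLifo.isEmpty()) {
                Integer item = localLifo.pop();
                processed.add(item);
                for (Integer created : operator.apply(item)) {
                    localLifo.push(created);      // new work stays local
                }
            }
        }
        return processed;
    }

    /** Hypothetical operator: items below 10 create one follow-up item, x * 10. */
    public static List<Integer> spawn(Integer x) {
        if (x < 10) {
            return Collections.singletonList(x * 10);
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<Integer> initial = new ArrayList<>();
        Collections.addAll(initial, 1, 2, 3, 4);
        System.out.println(run(initial, 2, ToyChunkedFifoLocalLifo::spawn));
    }
}
```

With the inputs in `main`, the processing order is [2, 20, 1, 10, 4, 40, 3, 30]: each newly created item runs immediately after the item that created it. For mesh refinement this plausibly helps locality, since new bad triangles appear next to the triangle that was just refined.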

ParaMeter Profile

DMR Results
Best serial: locallifo.flagopt
Serial time: 17002 ms
Best // time: 3745 ms
Best speedup: 4.5X
0.5M triangles, 0.25M bad triangles

PREFLOW-PUSH

Preflow-push
• Max-flow algorithm
  – Nodes push flow downhill
• Structure
  – Cautious, local computation
• Variants
  – Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)

Local Computation Optimization
    graph = ...
    b = new LocalComputationGraph.ObjectGraphBuilder();
    graph = b.from(graph).create()

ParaMeter Profile

Preflow-push Results
C: 11450 ms
Java: 30234 ms
Best serial: lc.flagopt
Serial time: 57121 ms
Best // time: 18242 ms
Best speedup: 3.1X
From challenge problem (genmf-wide)
14 linearly connected grids (194 x 194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm

Preflow-push Scalability

What performance did we expect?
[Figure: timeline with axes Threads and Time; labels: Measured, Indirectly, Error, //Compute, Serial, GC, Idle, Mis-speculation, Synchronization, …]

What performance did we expect?
• Naïve: r(x) = t_1 / x
• Amdahl: r(x) = t_p / x + t_s
  – t_1 = t_p + t_s
  – t_s = t_idle + t_gc + t_serial
• Simple: r(x) = (t_p (i_x / i_1)) / x + t_s
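
The three expected-runtime models above translate directly into code. The sketch below is illustrative only: the class name `ExpectedRuntimeModels` is an assumption, the inputs in `main` are made-up numbers rather than measurements from the deck, and i_x is read as the number of iterations executed with x threads (with i_1 the single-thread count), which the slide does not state explicitly.

```java
public final class ExpectedRuntimeModels {
    private ExpectedRuntimeModels() {}

    /** Naive model: r(x) = t_1 / x. */
    public static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: r(x) = t_p / x + t_s, where t_1 = t_p + t_s. */
    public static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /** Simple model: r(x) = (t_p * (i_x / i_1)) / x + t_s. */
    public static double simple(double tp, double ts, double ix, double i1, int x) {
        return (tp * (ix / i1)) / x + ts;
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration only (milliseconds and iteration counts).
        double tIdle = 50, tGc = 100, tSerial = 50;
        double ts = tIdle + tGc + tSerial;   // t_s = t_idle + t_gc + t_serial
        double tp = 800;
        double t1 = tp + ts;                 // t_1 = t_p + t_s
        int threads = 8;
        System.out.println("naive:  " + naive(t1, threads));
        System.out.println("amdahl: " + amdahl(tp, ts, threads));
        System.out.println("simple: " + simple(tp, ts, 1250, 1000, threads));
    }
}
```

With these inputs the naive model predicts 125 ms, Amdahl 300 ms, and the simple model 325 ms. The gap between the last two comes from the factor i_x / i_1, which scales the parallel time by the extra iterations executed at x threads.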

Barnes-Hut

Delaunay Mesh Refinement

Preflow-push

Summary
• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants
• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms
