Structure Layout Optimizations in the Open64 Compiler

Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
Gautam Chakrabarti and Fred Chow
PathScale, LLC.

Outline
  • Motivation
  • Types of structure layout optimizations
  • Criteria for structure layout optimizations
  • Implementation details
  • Performance results
  • Future work
  • Conclusion
Open64 Workshop 2008

Motivation
  • Poor data locality in many applications
  • High data cache miss rates
  • Growing gap between processor and memory speeds
Our Aim
  • Make applications more cache-friendly
Our Approach
  • Change layout of data structures
  • Requires whole-program optimization
  • Use Inter-Procedural Analysis and Optimizations (IPA)

IPA
  • Summarization
  • Analysis
  • Optimization

Types of Structure Layout Optimizations
Structure splitting:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long l;
        char c;
        struct struct_A *next;
    };
Structure peeling:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long l;
        char c;
    };

Structure Splitting Example
Before:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long l;
        char c;
        struct struct_A *next;
    };
After:
    struct new_struct_A {
        double d1;
        int i;
        long l;
        struct new_struct_A *next;
        struct cold_sub_struct_A *p;
    };
    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
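
The effect of the split on a traversal can be sketched in C as follows. The three type definitions are taken from the slide; `sum_hot` and `read_cold_f` are hypothetical helpers added only for illustration, and the assumption that `d1`, `i` and `l` are the frequently accessed fields is read off the slide's grouping rather than stated in it.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout from the slide: all fields share one node. */
struct struct_A {
    double d1; double d2; int i; float f; long l; char c;
    struct struct_A *next;
};

/* Fields moved out of line by the split. */
struct cold_sub_struct_A { double d2; float f; char c; };

/* Remaining part: keeps the list link plus a pointer to the split-off part. */
struct new_struct_A {
    double d1; int i; long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical traversal that reads only d1, i and l, so it never
   dereferences p and never loads the split-off fields. */
double sum_hot(const struct new_struct_A *node) {
    double total = 0.0;
    for (; node != NULL; node = node->next)
        total += node->d1 + (double)node->i + (double)node->l;
    return total;
}

/* Reaching a split-off field costs one extra indirection through p. */
float read_cold_f(const struct new_struct_A *node) {
    assert(node != NULL && node->p != NULL);
    return node->p->f;
}
```

The trade-off visible here is that list traversal touches smaller nodes, while any access to `d2`, `f` or `c` pays for an additional pointer dereference.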

Structure Peeling Example
Before:
    struct struct_A {
        double d1;
        double d2;
        int i;
        float f;
        long l;
        char c;
    };
After:
    struct new_struct_A {
        double d1;
        int i;
        long l;
    };
    struct cold_sub_struct_A {
        double d2;
        float f;
        char c;
    };
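
Unlike the split version, the peeled types carry no pointer between the two parts. One way this can work, sketched below under the assumption that instances live in arrays, is that element k of the original array becomes element k of each peeled array. The `peeled_array` container and both helper functions are hypothetical and not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

struct new_struct_A { double d1; int i; long l; };
struct cold_sub_struct_A { double d2; float f; char c; };

/* Hypothetical container: element k of the original array is now
   hot[k] together with cold[k]; the shared index replaces a link pointer. */
struct peeled_array {
    struct new_struct_A *hot;
    struct cold_sub_struct_A *cold;
    size_t count;
};

/* A loop over one field walks only the array that holds that field. */
long sum_l(const struct peeled_array *a) {
    long total = 0;
    for (size_t k = 0; k < a->count; ++k)
        total += a->hot[k].l;
    return total;
}

/* Fields in the other group are found at the same index in the other array. */
double read_d2(const struct peeled_array *a, size_t k) {
    assert(k < a->count);
    return a->cold[k].d2;
}
```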

Criteria for structure layout optimizations
Legality Analysis
  • Type cast
  • Address of a field is taken
  • Escaped types
  • Parameter types
  • Full visibility to IPA
  • Alignment restrictions
Profitability Analysis
  • Hotness
  • Affinity
  • Field accesses at loop level
  • Size
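
Two of the legality criteria can be illustrated with code that depends on the original layout. The functions below are hypothetical examples, not taken from the slides: the first reaches `d2` through a cast and a hard-coded offset, the second recovers the enclosing structure from the address of a field. Both would presumably stop working if `d2` were moved into a separate structure, which is one plausible reason such uses disqualify a type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct struct_A { double d1; double d2; int i; float f; long l; char c; };

/* Type cast: views the struct as raw bytes and reads the second double,
   relying on d2 being stored directly after d1. */
double d2_via_cast(const struct struct_A *a) {
    double out;
    assert(offsetof(struct struct_A, d2) == sizeof(double));
    memcpy(&out, (const char *)a + sizeof(double), sizeof out);
    return out;
}

/* Address of a field taken: steps back from a pointer to d1 to the
   enclosing struct, relying on d1's offset within struct_A. */
struct struct_A *owner_of_d1(double *d1_ptr) {
    return (struct struct_A *)((char *)d1_ptr - offsetof(struct struct_A, d1));
}
```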

Implementation Details
  • Step 1: Type information summarization (IPL)
  • Step 2: Symbol table merging (IPA)
  • Step 3: Legality and profitability analysis (IPA analysis)
  • Step 4: Transforming the program (IPA optimization)

Implementation Details: Type information summarization
  • Information summarization in IPL
  • Framework for computing static profiles using heuristics
  • New TY flag TY_NO_SPLIT
  • SUMMARY_TY_INFO
  • SUMMARY_LOOP
      • For each DO_LOOP, WHILE_DO, DO_WHILE
      • Bit-vector to track field accesses of up to N structures for each loop
      • Considers field accesses immediately inside loop
          • These fields are considered affine to each other
      • Execution count of statements immediately inside loop
          • From statically estimated profiles or from runtime feedback
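
A per-loop record of this kind might look like the following C sketch. It is a simplified stand-in, not the actual SUMMARY_LOOP definition in Open64: the names `loop_summary`, `record_field_access` and `fields_are_affine` are invented here, and the sketch tracks a single structure type with at most 32 fields instead of up to N structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-loop summary record. */
struct loop_summary {
    uint32_t field_bits;  /* bit k set: field k is accessed immediately inside the loop */
    uint64_t exec_count;  /* execution count, from a static estimate or runtime feedback */
};

/* Mark one field as accessed in this loop. */
void record_field_access(struct loop_summary *s, unsigned field_id) {
    assert(field_id < 32);
    s->field_bits |= (uint32_t)1 << field_id;
}

/* Two fields count as affine for this loop when both bits are set. */
int fields_are_affine(const struct loop_summary *s, unsigned a, unsigned b) {
    assert(a < 32 && b < 32);
    uint32_t mask = ((uint32_t)1 << a) | ((uint32_t)1 << b);
    return (s->field_bits & mask) == mask;
}
```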

Implementation Details: IPA Analysis
  • Inter-procedurally update statically estimated execution count of PUs
  • Update statically estimated loop frequencies in SUMMARY_LOOP
  • Consider SUMMARY_LOOP from the hottest P PUs
  • Determine candidates for structure-layout transformation
  • Determine new layout of structures
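
One plausible way to turn per-loop summaries into a layout decision is sketched below. This is an assumption made for illustration, not the algorithm the slides describe in detail: loops with identical field bit vectors are merged into one affinity group whose weight is the sum of their execution counts, and a field's hotness is the total count of the loops that touch it. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct loop_summary { uint32_t field_bits; uint64_t exec_count; };
struct affinity_group { uint32_t field_bits; uint64_t weight; };

/* Merge loops with identical field sets into one group each.
   Returns the number of groups written to `groups`. */
size_t build_affinity_groups(const struct loop_summary *loops, size_t n_loops,
                             struct affinity_group *groups, size_t max_groups) {
    size_t n_groups = 0;
    for (size_t i = 0; i < n_loops; ++i) {
        size_t g = 0;
        while (g < n_groups && groups[g].field_bits != loops[i].field_bits)
            ++g;
        if (g == n_groups) {               /* first loop with this field set */
            assert(n_groups < max_groups);
            groups[n_groups].field_bits = loops[i].field_bits;
            groups[n_groups].weight = 0;
            ++n_groups;
        }
        groups[g].weight += loops[i].exec_count;
    }
    return n_groups;
}

/* Hotness of one field: total execution count of the loops that access it. */
uint64_t field_hotness(const struct loop_summary *loops, size_t n_loops,
                       unsigned field_id) {
    assert(field_id < 32);
    uint64_t total = 0;
    for (size_t i = 0; i < n_loops; ++i)
        if (loops[i].field_bits & ((uint32_t)1 << field_id))
            total += loops[i].exec_count;
    return total;
}
```

A real implementation would also have to resolve fields that appear in more than one group; the sketch leaves that open.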

Implementation Details: IPA Analysis Example
[Figure: loops L1 to L5, each annotated with a bit vector (BV) over fields F1 to F4 and an execution count, shown together with affinity groups AG1 to AG3.]
Legend: Li = loops; AGk = affinity groups; Fj = fields in a struct

Implementation Details: Transforming the program
Example (peel T):
Before:
    struct S {
        // N fields
        struct T *p;
        // M fields
    };
    struct T {
        // AG1 fields
        // AG2 fields
    };
After:
    struct S {
        // N fields
        struct T1 *p1;
        struct T2 *p2;
        // M fields
    };
    struct T1 {
        // AG1 fields
    };
    struct T2 {
        // AG2 fields
    };
What is updated:
  • New type definitions
  • Field table update
  • Field access statements
  • New symbols
  • Assignment statements
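
The rewriting of field accesses and assignments can be pictured with a small hand-transformed example. The field names `id`, `a` and `b` are placeholders invented here (`a` stands for the AG1 fields, `b` for the AG2 fields), and the two functions only show what the transformed statements might look like; the "before" forms appear in the comments.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder fields: a stands in for the AG1 fields, b for the AG2 fields. */
struct T1 { int a; };
struct T2 { double b; };

struct S {
    int id;          /* stands in for the remaining fields of S */
    struct T1 *p1;   /* replaces struct T *p for AG1 accesses */
    struct T2 *p2;   /* replaces struct T *p for AG2 accesses */
};

/* Before: t = s->p; t[k].a += 1;
   After: the assignment uses a new symbol of type T1* and the access
   goes to the T1 array. */
void bump_a(struct S *s, size_t k) {
    struct T1 *t1 = s->p1;
    t1[k].a += 1;
}

/* Before: return s->p[k].b;
   After: the same index is applied to the T2 array. */
double read_b(const struct S *s, size_t k) {
    assert(s->p2 != NULL);
    return s->p2[k].b;
}
```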

Implementation Details: Transforming the program (continued)
Function calls to memory management routines
Example:
    p = (T *) malloc (N * sizeof (T));
    if (p == NULL) exit (1);
  • Detect memory management routine calls involving transformed type T
  • Replicate call, assignment statements
  • Update size of memory being allocated
  • Handle comparisons involving pointer p
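
Applied to the malloc example, the four steps might produce code along the lines of the sketch below. `T1`, `T2` and their fields are placeholders, `alloc_peeled` and `free_peeled` are hypothetical wrappers, and the function returns a status where the slide's example calls `exit (1)`, purely so the sketch is easy to exercise. Rewriting `p == NULL` as a test on both new pointers is an assumption about how the comparison could be handled.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder peeled types: a for the AG1 fields, b for the AG2 fields. */
struct T1 { int a; };
struct T2 { double b; };

/* Original: p = (T *) malloc (N * sizeof (T)); if (p == NULL) exit (1);
   Sketch of the transformed form: one call per peeled type, each with its
   own element size. Returns 1 on success, 0 on failure. */
int alloc_peeled(size_t n, struct T1 **p1, struct T2 **p2) {
    assert(p1 != NULL && p2 != NULL);
    *p1 = malloc(n * sizeof **p1);
    *p2 = malloc(n * sizeof **p2);
    /* The single comparison on p becomes a test on both new pointers. */
    if (*p1 == NULL || *p2 == NULL) {
        free(*p1);
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return 0;
    }
    return 1;
}

/* A call to free(p) would be replicated in the same way. */
void free_peeled(struct T1 *p1, struct T2 *p2) {
    free(p1);
    free(p2);
}
```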

Performance Results
Compilation options: -Ofast at 32-bit ABI
Speedup due to structure layout optimizations
Platforms:
  • Opteron: AMD Opteron™ (2.8 GHz, 4 GB, 1 MB)
  • Barcelona: AMD Barcelona (2.0 GHz, 8 GB, 512 KB)
  • EM64T: Intel® EM64T (3.4 GHz, 4 GB, 1 MB)
  • Core: Intel® Core™ (3.0 GHz, 4 GB, 4 MB)
  • MIPS: SiCortex MIPS® (500 MHz, 256 KB)

Benchmark        Opteron  Barcelona  EM64T   Core    MIPS    Geometric Mean
179.art          134%     66%        56%     47%     41%     62.5%
181.mcf          24%      23%        23%     31%     13%     22.0%
462.libquantum   32%      17%        40%     72%     62%     39.6%
Geometric Mean   46.9%    29.6%      37.2%   47.2%   32.1%   37.9%

Performance Results (continued)
Compilation options: -Ofast at 64-bit ABI
Speedup due to structure layout optimizations
Platforms:
  • Opteron: AMD Opteron™ (2.8 GHz, 4 GB, 1 MB)
  • Barcelona: AMD Barcelona (2.0 GHz, 8 GB, 512 KB)
  • EM64T: Intel® EM64T (3.4 GHz, 4 GB, 1 MB)
  • Core: Intel® Core™ (3.0 GHz, 4 GB, 4 MB)
  • MIPS: SiCortex MIPS® (500 MHz, 256 KB)

Benchmark        Opteron  Barcelona  EM64T   Core    MIPS    Geometric Mean
179.art          169%     66%        53%     60%     45%     69.3%
181.mcf          25%      35%        12%     30%     7%      18.6%
462.libquantum   82%      51%        75%     70%     69%     68.6%
Geometric Mean   70.2%    49.0%      36.3%   50.1%   27.9%   44.6%
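
The Geometric Mean entries appear to be taken over the speedup percentages themselves rather than over ratios such as 2.69x; this is inferred from the numbers and is not stated on the slide. The C sketch below recomputes the 64-bit 179.art row (169, 66, 53, 60, 45) under that assumption. `geometric_mean` is a hypothetical helper that uses Newton's iteration for the n-th root so that it needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Geometric mean of n positive values: the n-th root of their product,
   found with Newton's iteration on x^n = product. */
double geometric_mean(const double *values, size_t n) {
    assert(n > 0);
    double product = 1.0;
    for (size_t k = 0; k < n; ++k) {
        assert(values[k] > 0.0);
        product *= values[k];
    }
    /* Start at or above the root so the iteration decreases monotonically. */
    double x = product > 1.0 ? product : 1.0;
    for (int iter = 0; iter < 200; ++iter) {
        double power = 1.0;            /* becomes x^(n-1) */
        for (size_t k = 0; k + 1 < n; ++k)
            power *= x;
        x = ((double)(n - 1) * x + product / power) / (double)n;
    }
    return x;
}
```

Calling `geometric_mean` on the five 179.art percentages gives a value that rounds to the 69.3% shown in the table.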

Performance Results (continued)
Compilation options: -Ofast at 64-bit ABI
Multiple copies of 462.libquantum running on a multi-core chip
Platform: Quad-core AMD Barcelona (2.0 GHz, 8 GB, 512 KB, 2 MB); 3rd-level cache shared among 4 cores
Speedup from structure layout optimizations

Benchmark        1 copy   2 copies
462.libquantum   51%      69%        123%

Future Work
  • Tune static profile estimation
  • Fewer restrictions
  • Integrate with field reordering

Conclusion
  • A framework for performing structure layout transformations is now available in the Open64 compiler.
  • The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.
  • Substantial speedups are possible on some of the SPEC CPU2000 and CPU2006 benchmarks.
  • Structure layout optimization is a required feature for a compiler to remain competitive.