
Profile-Driven Selective Program Loading
Tugrul Ince, tugrul@cs.umd.edu
Jeff Hollingsworth
Department of Computer Science
University of Maryland, College Park, MD 20742

Motivation
• Programs are getting larger!
  – Many frameworks and libraries
• Many supercomputers lack demand-paging
  – Example: Cray XT and BlueGene series
  – Available memory is scarce
• Observation: Most programs do not use every available function!
  – Frameworks and libraries are too general
  – Code that handles errors or special cases
• Why not remove functions that are not used in the common case?

Aim
Reduce memory footprint by selectively loading parts of shared libraries

Target Platforms and Applications
• Unix/Linux systems that support ELF
  – Modifies ELF program headers
• Applications with many libraries
  – Most current reasonable applications
• Parallel programs running on multiple nodes
  – MPI etc.
• Platforms without demand-paging
  – Cray XT and BlueGene series

Architecture Overview
• Application is profiled.
• It is rewritten with
  – Modified Shared Libraries
  – A Signal Handler
• Application is executed as usual.

Profiler
• Need a list of never-called functions in each shared library
  – Profile the application several times
  – May not be perfect
• DynInst-based profiler
  – Write small program (~70 LOC)
  – Rewrite shared libraries
  – Profile as many times as necessary
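The bookkeeping behind the never-called list can be sketched as set arithmetic: take the union of the functions seen in every profiling run and subtract it from the functions the library defines. This is only an illustration of that step; the function names and run data are hypothetical, and the DynInst-based instrumentation itself is not shown.

```python
def never_called(library_functions, profile_runs):
    """Return the functions of one library that no profiling run ever called."""
    called = set()
    for run in profile_runs:
        called |= set(run)          # union of calls seen across all runs
    return set(library_functions) - called


# Hypothetical data: these function names are made up for illustration.
lib_funcs = {"solve", "setup", "handle_error", "print_debug"}
runs = [
    {"solve", "setup"},             # run 1
    {"solve", "print_debug"},       # run 2 exercises one more function
]
unused = never_called(lib_funcs, runs)
print(sorted(unused))
```

Adding more runs can only shrink the result, which is why profiling several times makes the list safer, though still not guaranteed complete.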

Rewriting
• Do not load unused functions in .text
  – Modify ELF program headers
  – Example: libpetsc.so

Program Headers (original):
  Type       Offset    VirtAddr    PhysAddr    FileSiz   MemSiz
  LOAD       0x000000  0x00000000  0x00000000  0x124584  0x124584
  LOAD       0x124584  0x00125584  0x00125584  0x013f8   0x0a434
  DYNAMIC    0x12459c  0x0012559c  0x0012559c  0x00130   0x00130
  GNU_STACK  0x000000  0x00000000  0x00000000  0x00000   0x00000
  Flg: R E, RW, RW   Align: 0x1000, 0x4

• First Loadable Section: .text, .init, .fini, .plt
• Second Loadable Section: .dynamic, .got.plt, .data, .bss

Rewriting
• Do not load unused functions in .text
  – Modify ELF program headers
  – Example: libpetsc.so

Program Headers (modified):
  Type       Offset    VirtAddr    PhysAddr    FileSiz   MemSiz
  LOAD       0x000000  0x00000000  0x00000000  0x090000  0x090000
  LOAD       0x112000  0x00112000  0x00112000  0x012584  0x012584
  LOAD       0x124584  0x00125584  0x00125584  0x013f8   0x0a434
  DYNAMIC    0x12459c  0x0012559c  0x0012559c  0x00130   0x00130
  GNU_STACK  0x000000  0x00000000  0x00000000  0x00000   0x00000
  Flg: R E, RW, RW, RW   Align: 0x1000, 0x4

• First Loadable Section: .text, .init, .fini, .plt
• Second Loadable Section: .dynamic, .got.plt, .data, .bss
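The arithmetic behind the libpetsc.so example can be checked with a short sketch. It splits one LOAD entry into two around a page-aligned hole and packs the result in Elf32_Phdr layout. The sizes and offsets are the ones on the slide; the 32-bit little-endian layout, the flags given to the second piece, and the helper name `split_load` are assumptions made for illustration, not the tool's actual code.

```python
import struct

PT_LOAD = 1
PF_X, PF_R = 1, 4
PAGE = 0x1000
# Elf32_Phdr field order: p_type, p_offset, p_vaddr, p_paddr,
# p_filesz, p_memsz, p_flags, p_align (little-endian assumed here).
PHDR = struct.Struct("<8I")


def split_load(offset, vaddr, size, hole_start, hole_end, flags):
    """Split one LOAD segment in two, leaving [hole_start, hole_end) unmapped.

    hole_start and hole_end are byte offsets from the start of the segment.
    """
    assert hole_start % PAGE == 0 and hole_end % PAGE == 0
    assert 0 < hole_start < hole_end < size
    first = (PT_LOAD, offset, vaddr, vaddr, hole_start, hole_start, flags, PAGE)
    rest = size - hole_end
    second = (PT_LOAD, offset + hole_end, vaddr + hole_end, vaddr + hole_end,
              rest, rest, flags, PAGE)
    return first, second


# Values from the slide: a 0x124584-byte text segment with a hole
# between 0x090000 and 0x112000.
first, second = split_load(0x0, 0x0, 0x124584, 0x090000, 0x112000, PF_R | PF_X)
hole_pages = (0x112000 - 0x090000) // PAGE
raw = PHDR.pack(*first) + PHDR.pack(*second)
print(hex(second[1]), hex(second[4]), hole_pages)
```

The second piece comes out at offset 0x112000 with size 0x12584, matching the modified header on the slide, and the hole covers 130 pages of 4 KiB.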

Rewriting
• Rewriter based on DynInst
• Profile data is used to create lists of Used and Unused functions
• Access / Modify symbols
• Defragment functions to maximize space savings
  – Requires moving functions inside shared libraries

Function Defragmentation
[Figure: layout of functions in a shared library; legend: Used, Unused]
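Why defragmentation saves pages can be seen in a minimal sketch: used functions scattered between unused ones touch many pages, while the same functions packed back to back touch few. The layout below is hypothetical, alignment requirements are ignored, and the helper names are made up for this example.

```python
PAGE = 0x1000


def pages_touched(funcs):
    """Set of page indices containing at least one byte of the given functions."""
    pages = set()
    for start, size in funcs:
        pages.update(range(start // PAGE, (start + size - 1) // PAGE + 1))
    return pages


def defragment(functions, used_names, base=0):
    """Pack used functions contiguously from `base`; return {name: new_start}."""
    new_addr = {}
    cursor = base
    # Keep the original relative order of the used functions.
    for name, (start, size) in sorted(functions.items(), key=lambda kv: kv[1][0]):
        if name in used_names:
            new_addr[name] = cursor
            cursor += size
    return new_addr


# Hypothetical layout: name -> (start address, size in bytes)
functions = {
    "a": (0x0000, 0x200),
    "b": (0x0200, 0x1400),   # unused
    "c": (0x1600, 0x300),
    "d": (0x1900, 0x2000),   # unused
    "e": (0x3900, 0x100),
}
used = {"a", "c", "e"}

before = pages_touched(functions[n] for n in used)
moved = defragment(functions, used)
after = pages_touched((moved[n], functions[n][1]) for n in used)
print(len(before), "->", len(after))
```

Every function that receives a new start address here is one whose incoming calls, symbols, and jump tables must then be fixed up.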

Challenges: Relative Calls
• Common way of calling functions in PIC.
• If either callee or caller is moved, their relative positioning changes.
• Offsets in such relative call instructions need to be updated
[Figure: a call to foo with displacement d; after the move the displacement becomes d']
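The displacement update can be sketched for the x86 `call rel32` encoding (opcode 0xE8 followed by a signed 32-bit displacement measured from the end of the instruction). The addresses and helper names are hypothetical, and the sketch handles only this one instruction form.

```python
import struct

CALL_LEN = 5  # x86 `call rel32`: opcode 0xE8 plus a 4-byte displacement


def encode_call(site, target):
    """Encode a `call rel32` located at `site` so that it reaches `target`."""
    disp = target - (site + CALL_LEN)   # relative to the next instruction
    return b"\xE8" + struct.pack("<i", disp)


def decode_target(site, insn):
    """Return the absolute address a `call rel32` at `site` transfers to."""
    assert len(insn) == CALL_LEN and insn[0] == 0xE8
    (disp,) = struct.unpack("<i", insn[1:])
    return site + CALL_LEN + disp


def relocate_call(old_site, insn, new_site, moved):
    """Re-encode a call after its site and/or its callee has moved.

    `moved` maps old function start addresses to new ones.
    """
    old_target = decode_target(old_site, insn)
    new_target = moved.get(old_target, old_target)
    return encode_call(new_site, new_target)


# Hypothetical addresses: the caller stays at 0x1000, foo moves 0x5000 -> 0x2000.
old = encode_call(0x1000, 0x5000)
new = relocate_call(0x1000, old, 0x1000, {0x5000: 0x2000})
print(old.hex(), "->", new.hex())
```

The same routine covers the case where only the caller moves: the target stays fixed while the call site changes, so the displacement still has to be recomputed.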

Challenges: Symbols
• Runtime linker uses symbols to resolve cross-library calls.
  – Uses procedure linkage tables (plt)
• If a function is moved, its associated symbol has to be updated.
[Figure: call foo@plt resolving to symbol foo, before the move (foo: 0xdeadbeef) and after (foo: 0xbeefdead)]
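A symbol update can be sketched as rewriting the `st_value` field of function symbols in a symbol table. The sketch assumes 32-bit little-endian Elf32_Sym records and uses the placeholder addresses from the slide; the other field values and the helper name are made up, and this is not the rewriter's actual code.

```python
import struct

# Elf32_Sym: st_name, st_value, st_size, st_info, st_other, st_shndx
SYM = struct.Struct("<IIIBBH")
STT_FUNC = 2


def update_symbols(symtab, moved):
    """Return a copy of `symtab` with moved function symbols pointing
    at their new addresses. `moved` maps old addresses to new ones."""
    out = bytearray(symtab)
    for off in range(0, len(out), SYM.size):
        name, value, size, info, other, shndx = SYM.unpack_from(out, off)
        is_func = (info & 0xF) == STT_FUNC      # low nibble holds the type
        if is_func and value in moved:
            SYM.pack_into(out, off, name, moved[value], size, info, other, shndx)
    return bytes(out)


# Two hypothetical symbols: a global function `foo` and a data object.
symtab = (SYM.pack(1, 0xDEADBEEF, 0x40, 0x12, 0, 11) +
          SYM.pack(5, 0x00002000, 4, 0x11, 0, 20))
patched = update_symbols(symtab, {0xDEADBEEF: 0xBEEFDEAD})
print(hex(SYM.unpack_from(patched, 0)[1]))
```

Because calls through foo@plt are resolved by the runtime linker from this symbol, updating the symbol is enough for cross-library callers; no caller in another library needs patching.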

Challenges: Jump Tables
• Used to represent n-way branches at machine level
• Targets are read from jump table
  – Entries are offsets of targets from the GOT address
• Becomes invalid if the function referenced in a jump table is moved
• DynInst reads jump tables to generate CFGs
• We update entries so that they can be used to point to new location of targets
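Since each entry is an offset of a target from the GOT address, moving a function by some delta shifts every entry that points into it by the same delta. The sketch below assumes signed 32-bit little-endian entries and a GOT that does not move; the addresses and the helper name are hypothetical.

```python
import struct


def update_jump_table(table, got, func_old, func_size, func_new):
    """Patch GOT-relative jump-table entries after one function has moved.

    `table` holds little-endian signed 32-bit offsets from `got`.
    """
    delta = func_new - func_old
    count = len(table) // 4
    patched = []
    for off in struct.unpack("<%di" % count, table):
        target = got + off
        if func_old <= target < func_old + func_size:
            target += delta             # target lies inside the moved function
        patched.append(target - got)
    return struct.pack("<%di" % count, *patched)


# Hypothetical case: a function at 0x50000 (size 0x400) moves to 0x10000.
got = 0x125000
targets = [0x50010, 0x50080, 0x50100]
table = struct.pack("<3i", *(t - got for t in targets))
new_table = update_jump_table(table, got, 0x50000, 0x400, 0x10000)
print([hex(got + off) for off in struct.unpack("<3i", new_table)])
```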

Unexpectedly Called Function
• Execution is not always predictable
  – Unexpected function calls
• Rewrite original executable with a Signal Handler
• Load the function upon an unexpected call
  – Signal Handler picks up page faults (SIGSEGV)
  – Loads requested page on-demand
  – Execution resumes
• User-level: No OS modifications
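The control flow of the fallback path can be illustrated with a toy model. This is not a real SIGSEGV handler: it only simulates the three steps above (fault on a missing page, load that page from the library file, resume) with an in-memory dictionary. The class and its data are hypothetical.

```python
PAGE = 0x1000


class LazyImage:
    """Toy model of a library image whose pages are loaded on first touch."""

    def __init__(self, file_bytes, resident_pages):
        self.file = file_bytes
        self.pages = {p: self._read(p) for p in resident_pages}
        self.faults = 0

    def _read(self, page):
        return self.file[page * PAGE:(page + 1) * PAGE]

    def fetch(self, addr):
        page = addr // PAGE
        if page not in self.pages:               # stands in for the page fault
            self.faults += 1
            self.pages[page] = self._read(page)  # "handler" loads the page
        return self.pages[page][addr % PAGE]     # "execution resumes"


# A three-page file; only page 0 is loaded up front.
image = LazyImage(bytes([1]) * PAGE + bytes([2]) * PAGE + bytes([3]) * PAGE, {0})
first = image.fetch(0x0010)    # resident page, no fault
second = image.fetch(0x2004)   # missing page, one fault
third = image.fetch(0x2008)    # same page again, already loaded
print(first, second, third, image.faults)
```

The cost of a wrong profile is therefore one fault per missing page rather than a crash, which is what lets the profile be imperfect.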

Experiments
• Tested on
  – PETSc ex5 in snes package
  – PETSc ex2 in ksp package
  – GS2
• Compiled with debug flag and no optimization
• Used Open MPI
• Tested on 64-node cluster at UMD
  – Dual-core x86 processors
  – Unmodified Linux kernel
• Space savings of about 82% on average

PETSc – snes (ex5)

  Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
  petsc                   260                      68               73.85
  petscdm                 161                      19               88.2
  petscksp                335                      39               88.36
  petscmat                772                      40               94.82
  petscvec                204                      52               74.51
  petscsnes                20                      20                0
  mpi_cxx                  10                       5               50
  mpi                     142                      37               73.94
  open-pal                 62                      34               45.16
  open-rte                 55                      34               38.18
  m                        28                       3               89.29
  X11                     146                       7               95.21
  lapack                  866                       2               99.77
  blas                     80                       3               96.25
  stdc++                  133                      12               90.98
  gcc_s                    12                       2               83.33
  Xau                       2                       2                0
  Xdcm                      3                       3                0
  gfortran                123                       4               96.75
  dl                        2                       2                0
  nsl                      14                       2               85.71
  util                      2                       2                0
  OVERALL                2021                     348               82.78


PETSc – ksp (ex2)

  Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
  petsc                   260                      72               72.31
  petscdm                 161                       3               98.14
  petscksp                335                      49               85.37
  petscmat                772                      49               93.65
  petscvec                204                      54               73.53
  mpi_cxx                  10                       5               50
  mpi                     142                      47               66.9
  open-pal                 62                      37               40.32
  open-rte                 55                      36               34.55
  OVERALL                2001                     352               82.41

GS2

  Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
  MdsLib                   21                       0              100
  MdsShr                   21                       0              100
  TdiShr                  220                       3               98.64
  TreeShr                  38                       0              100
  fftw                     70                      25               64.29
  rfftw                    58                       8               86.21
  mpi_f77                  13                       2               84.62
  mpi                     142                      40               71.83
  open-pal                 62                      36               41.94
  open-rte                 55                      36               34.55
  OVERALL                 700                     150               78.57
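The Reduction % column follows from the two page counts as (original - modified) / original. As a worked check, the sketch below recomputes the PETSc ksp (ex2) figures from the page counts in that table; the dictionary and function names are made up for this example.

```python
def reduction(original, modified):
    """Percentage of text pages no longer loaded, rounded to two decimals."""
    return round(100.0 * (original - modified) / original, 2)


# (original, modified) text pages from the PETSc ksp (ex2) table.
ksp = {
    "petsc": (260, 72),
    "petscdm": (161, 3),
    "petscksp": (335, 49),
    "petscmat": (772, 49),
    "petscvec": (204, 54),
    "mpi_cxx": (10, 5),
    "mpi": (142, 47),
    "open-pal": (62, 37),
    "open-rte": (55, 36),
}

total_original = sum(orig for orig, _ in ksp.values())
total_modified = sum(mod for _, mod in ksp.values())
overall = reduction(total_original, total_modified)
print(total_original, total_modified, overall)
```

The overall figure is computed from the summed page counts, not by averaging the per-library percentages, so large libraries such as petscmat dominate it.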

Running Times
• GS2 takes 5 seconds less on average
  – (36m 38s vs. 36m 33s)
• Overhead on PETSc examples
  – ex2 runs for 2.7 secs, ex5 runs for 1.05 secs.

Running Times
• Results suggest no overhead for reasonably long-running programs
  – Initial cost for signal handler registration
  – Better instruction cache and TLB performance

Summary
• Our tool reduces memory footprint of shared libraries
• Rewrite shared libraries with holes
  – Defragment functions to maximize space savings
• On-demand page loading if a not-yet-loaded function is called
• About 82% memory space savings for shared libraries
• Might improve instruction cache and TLB performance