
Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation
CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood (Intel)
Vijay Janapa Reddi (University of Colorado)
http://rogue.colorado.edu/Pin
PLDI’05

Instrumentation
• Insert extra code into programs to collect information about execution
  – Program analysis:
    • Code coverage, call-graph generation, memory-leak detection
  – Architectural study:
    • Processor simulation, fault injection
• Existing binary-level instrumentation systems:
  – Static:
    • ATOM, EEL, Etch, Morph
  – Dynamic:
    • Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO
=> Pin is a new dynamic binary instrumentation system

Advantages of Pin Instrumentation
1. Easy-to-use Instrumentation API
  – Instrumentation code written in C/C++/asm
  – ATOM-like API, based on procedure calls
2. Instrumentation tools portable across platforms
  – Same tools work on IA32, EM64T (x86-64), Itanium, ARM
  – Same tools work on Linux and Windows (ongoing work)
3. Low instrumentation overhead
  – Pin automatically optimizes instrumentation code
  – Pin can attach instrumentation to a running process
4. Robust
  – Handles mixed code and data, variable-length instructions, dynamically-generated code
5. Transparent
  – Application sees original addresses, values, and stack content

A Pintool for Tracing Memory Writes

    #include <cstdio>
    #include <iostream>
    #include "pin.H"

    FILE* trace;

    // Analysis routine: executed immediately before a write is executed
    VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) {
        fprintf(trace, "%p: W %p %d\n", ip, addr, size);
    }

    // Instrumentation routine: executed when an instruction is dynamically compiled
    VOID Instruction(INS ins, VOID* v) {
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                           IARG_INST_PTR, IARG_MEMORYWRITE_EA,
                           IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }

• Same source code works on the 4 architectures
  => Pin takes care of different addressing modes
• No need to manually save/restore application state
  => Pin does it for you automatically and efficiently
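
The split between the two routines on this slide can be illustrated without Pin itself. The following is a minimal sketch under stated assumptions: `ToyIns`, `ToyJit`, `Counts`, and `traceWrites` are hypothetical names invented for this illustration, not Pin APIs. The instrumentation callback runs once, when an instruction is first "compiled", while the analysis call it attaches runs every time that instruction executes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical toy instruction (not a Pin type).
struct ToyIns {
    bool isMemoryWrite;
    std::vector<std::function<void()>> before;  // analysis calls attached by the tool
};

// Hypothetical toy JIT: instruments each instruction once, then executes it many times.
struct ToyJit {
    std::function<void(ToyIns&)> instrument;  // plays the role of Instruction()
    std::vector<bool> compiled;

    void run(std::vector<ToyIns>& program, int iterations) {
        compiled.assign(program.size(), false);
        for (int it = 0; it < iterations; ++it) {
            for (std::size_t i = 0; i < program.size(); ++i) {
                if (!compiled[i]) {  // first execution: "compile" and instrument
                    instrument(program[i]);
                    compiled[i] = true;
                }
                for (auto& call : program[i].before) call();  // runs on every execution
            }
        }
    }
};

struct Counts {
    int instrumentCalls = 0;  // how often the instrumentation routine ran
    int analysisCalls = 0;    // how often the analysis routine ran
};

// Executes a 3-instruction loop body (one instruction writes memory) `iterations` times.
Counts traceWrites(int iterations) {
    Counts c;
    std::vector<ToyIns> program = {{false, {}}, {true, {}}, {false, {}}};
    ToyJit jit;
    jit.instrument = [&c](ToyIns& ins) {
        ++c.instrumentCalls;
        if (ins.isMemoryWrite)
            ins.before.push_back([&c] { ++c.analysisCalls; });  // like INS_InsertCall
    };
    jit.run(program, iterations);
    return c;
}
```

Calling `traceWrites(10)` in this toy model invokes the instrumentation callback three times (once per instruction) and the analysis callback ten times (once per executed write), which is the asymmetry the slide's two callouts describe.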

Dynamic Instrumentation
[Figure: original code (blocks 1 to 7) beside the code cache holding compiled traces; code-cache exits point back to Pin]
1. Pin fetches the trace starting at block 1 and starts instrumentation (code cache: 1’, 2’, 7’)
2. Pin transfers control into the code cache (block 1)
3. Pin fetches and instruments a new trace (code cache adds 3’, 5’, 6’); trace linking connects the traces
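
The fetch, compile, and link cycle shown in these frames can be sketched as a small dispatch loop. This is a toy model, not Pin's implementation: `Trace`, `ToyVM`, and `vmEntriesFor` are hypothetical names, each "trace" is a single block with one successor, and the only thing measured is how often control returns to the VM.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical compiled trace: one block with a single successor.
struct Trace {
    int origAddr;   // address of the block in the original code
    int succAddr;   // original address of the next block
    Trace* linked;  // set once the exit is patched to jump straight to the successor
};

struct ToyVM {
    std::map<int, int> program;      // original block -> successor block
    std::map<int, Trace> codeCache;  // compiled traces keyed by original address
    bool enableLinking;
    int vmEntries;                   // how often control comes back to the VM

    ToyVM(std::map<int, int> p, bool link)
        : program(std::move(p)), enableLinking(link), vmEntries(0) {}

    // Exit stub target: find the trace in the code cache or compile it.
    Trace* lookupOrCompile(int addr) {
        ++vmEntries;
        auto it = codeCache.find(addr);
        if (it == codeCache.end())
            it = codeCache.emplace(addr, Trace{addr, program.at(addr), nullptr}).first;
        return &it->second;
    }

    // Execute `maxBlocks` blocks starting at `entry`.
    void run(int entry, int maxBlocks) {
        Trace* t = lookupOrCompile(entry);
        for (int executed = 1; executed < maxBlocks; ++executed) {
            if (t->linked) { t = t->linked; continue; }  // stay inside the code cache
            Trace* next = lookupOrCompile(t->succAddr);  // exit points back to the VM
            if (enableLinking) t->linked = next;         // patch the exit (trace linking)
            t = next;
        }
    }
};

// Runs a three-block loop (1 -> 2 -> 3 -> 1) and reports the number of VM entries.
int vmEntriesFor(bool linking, int blocks) {
    ToyVM vm({{1, 2}, {2, 3}, {3, 1}}, linking);
    vm.run(1, blocks);
    return vm.vmEntries;
}
```

In this model, executing 300 blocks enters the VM 300 times without linking but only 4 times with it, because every exit is patched after its first use.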

Pin’s Software Architecture
[Figure: one address space containing the Pintool, Pin (Instrumentation APIs; Virtual Machine (VM) with JIT Compiler and Emulation Unit; Code Cache), and the Application, running above the Operating System and Hardware]
• 3 programs (Pin, Pintool, App) in same address space:
  – User-level only
• Instrumentation APIs:
  – Through which Pintool communicates with Pin
• JIT compiler:
  – Dynamically compile and instrument
• Emulation unit:
  – Handle insts that can’t be directly executed (e.g., syscalls)
• Code cache:
  – Store compiled code
=> Coordinated by VM

Pin Internal Details
• Loading of Pin, Pintool, & Application
• An Improved Trace Linking Technique
• Register Re-allocation
• Instrumentation Optimizations
• Multithreading Support

Register Re-allocation
• Instrumented code needs extra registers. E.g.:
  – Virtual registers available to the tool
  – A virtual stack pointer pointing to the instrumentation stack
  – Many more …
• Approaches to get extra registers:
  1. Ad-hoc (e.g., DynamoRIO, Strata, Dyninst)
    – Whenever you need a register, spill one and fill it afterward
  2. Re-allocate all registers during compilation
    a. Local allocation (e.g., Valgrind)
      – Allocate registers independently within each trace
    b. Global allocation (Pin)
      – Allocate registers across traces (can be inter-procedural)

Valgrind’s Register Re-allocation

Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx

Trace 1 (re-allocated):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        mov %eax, SPILLeax
        mov %esi, SPILLebx
        jz t’
    Binding (virtual -> physical): %eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx

Trace 2:
    t’: mov SPILLeax, %eax
        mov SPILLebx, %edi
        add 1, %eax
        sub 2, %edi
    Binding (virtual -> physical): %eax -> %eax, %ebx -> %edi, %ecx -> %ecx, %edx -> %edx

=> Simple but inefficient
• All modified registers are spilled at a trace’s end
• Refill registers at a trace’s beginning

Pin’s Register Re-allocation
Scenario (1): Compiling a new trace at a trace exit

Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx

Trace 1 (re-allocated):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        jz t’

Compile Trace 2 using the binding at Trace 1’s exit:
    Binding (virtual -> physical): %eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx

Trace 2:
    t’: add 1, %eax
        sub 2, %esi

=> No spilling/filling needed across traces

Pin’s Register Re-allocation
Scenario (2): Targeting an already generated trace at a trace exit

Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx

Trace 1 (being compiled):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        mov %esi, SPILLebx
        mov SPILLebx, %edi
        jz t’
    Binding (virtual -> physical): %eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx

Trace 2 (in code cache):
    t’: add 1, %eax
        sub 2, %edi
    Binding (virtual -> physical): %eax -> %eax, %ebx -> %edi, %ecx -> %ecx, %edx -> %edx

=> Minimal spilling/filling code
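
The compensation code in Scenario (2) can be derived mechanically from the two bindings. The sketch below is an assumption about how such reconciliation might be computed, not Pin's actual algorithm: `Binding` and `reconcile` are hypothetical names, and every mismatched register is routed through its spill slot exactly as the slide's example does. All spills are emitted before any fill so that a fill cannot overwrite a physical register that still holds an unsaved value.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Virtual register -> physical register (hypothetical representation).
using Binding = std::map<std::string, std::string>;

// Emits the moves needed at a trace exit so that the target trace's entry
// binding holds. Registers already in the right place generate no code.
std::vector<std::string> reconcile(const Binding& exitBinding,
                                   const Binding& entryBinding) {
    std::vector<std::string> spills, fills;
    for (const auto& kv : exitBinding) {
        auto it = entryBinding.find(kv.first);
        if (it == entryBinding.end() || it->second == kv.second)
            continue;  // same physical register on both sides: nothing to do
        const std::string slot = "SPILL" + kv.first.substr(1);  // "%ebx" -> "SPILLebx"
        spills.push_back("mov " + kv.second + ", " + slot);
        fills.push_back("mov " + slot + ", " + it->second);
    }
    spills.insert(spills.end(), fills.begin(), fills.end());  // spills first, then fills
    return spills;
}
```

With Trace 1's exit binding (%ebx in %esi) and Trace 2's entry binding (%ebx in %edi), `reconcile` yields the slide's two instructions, `mov %esi, SPILLebx` and `mov SPILLebx, %edi`. When the bindings agree, as in Scenario (1), it yields nothing.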

Instrumentation Optimizations
1. Inline instrumentation code into the application
2. Avoid saving/restoring eflags with liveness analysis
3. Schedule inlined instrumentation code

Example: Instruction Counting
Instrument without applying any optimization

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Instrumentation:
    BBL_InsertCall(bbl, IPOINT_BEFORE, AFUNPTR(docount), IARG_UINT32, BBL_NumIns(bbl), IARG_END)

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

bridge():
    pushf
    push %edx
    push %ecx
    push %eax
    movl 0x3, %eax
    call docount
    pop %eax
    pop %ecx
    pop %edx
    popf
    ret

docount():
    add %eax, icount
    ret

=> 33 extra instructions executed altogether

Example: Instruction Counting
Inlining

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

=> 11 extra instructions executed

Example: Instruction Counting
Inlining + eflags liveness analysis

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

=> 7 extra instructions executed

Example: Instruction Counting
Inlining + eflags liveness analysis + scheduling

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Trace:
    cmov %esi, %edi
    add 0x3, icount
    cmp %edi, (%esp)
    jle <target1’>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

=> 2 extra instructions executed
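
The eflags liveness step can be sketched as a forward scan over a basic block. This is a simplified illustration, not Pin's analysis: `FlagUse`, `flagsLiveAtEntry`, and `inlinedCost` are hypothetical names, each instruction is summarized by whether it reads or fully overwrites eflags, and partial flag updates are ignored. The cost function counts only `pushf`, `add`, and `popf`, not the stack-switching moves that appear in the traces.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one instruction's effect on eflags.
struct FlagUse {
    bool reads;   // e.g. cmov, jle, je
    bool writes;  // e.g. add, cmp (assumed to overwrite all flags)
};

// True if the flag values present at block entry may still be observed,
// i.e. some instruction reads eflags before any instruction overwrites them.
// Reaching the end of the block without a write is treated as live.
bool flagsLiveAtEntry(const std::vector<FlagUse>& code) {
    for (const FlagUse& ins : code) {
        if (ins.reads) return true;    // the incoming flags are observed
        if (ins.writes) return false;  // the incoming flags are killed first
    }
    return true;  // conservative default
}

// Instructions needed for an inlined counter update at the top of the block:
// pushf + add + popf when flags are live, otherwise just the add.
int inlinedCost(const std::vector<FlagUse>& code) {
    return flagsLiveAtEntry(code) ? 3 : 1;
}
```

For the slide's first block (`cmov`, `cmp`, `jle`) the leading `cmov` reads the flags, so the save and restore must stay. For the second block (`add`, `cmp`, `je`) the leading `add` overwrites them, so the counter update needs no `pushf`/`popf`, matching the trace shown for inlining plus liveness analysis.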

Pin Instrumentation Performance
[Figure: Runtime overhead of basic-block counting with Pin on IA32 (SPEC2K using reference data sets)]

Comparison among Dynamic Instrumentation Tools
[Figure: Runtime overhead of basic-block counting with three different tools]
• Valgrind is a popular instrumentation tool on Linux
  – Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in binary dynamic optimization
  – Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation

Pin Applications
• Sample tools in the Pin distribution:
  – Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
• Some tools developed and used inside Intel:
  – Opcodemix (analyze code generated by compilers)
  – PinPoints (find representative regions in programs to simulate)
  – A tool for detecting memory bugs
• Some companies are writing their own Pintools:
  – A major database vendor, a major search engine provider
• Some universities using Pin in teaching and research:
  – U. of Colorado, MIT, Harvard, Princeton, U. of Minnesota, Northeastern, Tufts, University of Rochester, …

Conclusions
• Pin
  – A dynamic instrumentation system for building your own program analysis tools
  – Easy to use, robust, transparent, efficient
  – Tool source compatible on IA32, EM64T, Itanium, ARM
  – Works on large applications
    • database, search engine, web browsers, …
  – Available on Linux; Windows version coming soon
• Downloadable from http://rogue.colorado.edu/Pin
  – User manual, many example tools, tutorials
  – 3300 downloads since July 2004

Acknowledgments
• Prof. Dan Connors
  – Hosting Pin website at U. of Colorado
• Intel Bistro Team
  – Providing the Falcon decoder/encoder
  – Suggesting instrumentation scheduling
• Mark Charney
  – Providing the XED decoder/encoder
• Ramesh Peri
  – Implementing part of Itanium instrumentation

Backup

Talk Outline
• A Sample Pintool
• Pin Internal Details
• Experimental Results
• Pin Applications
• Conclusions

Trace Linking
• Trace linking is a very effective optimization
  – Bypass the VM when transferring from one trace to another
  – Slowdown without trace linking can be as much as 100x
• Linking direct branches/calls
  – Straightforward, as targets are unique
• Linking indirect branches/calls & returns
  – More challenging because the target can be different each time
  – Our approach:
    • For all indirect control transfers, use chaining
    • For returns, further optimize with function cloning

Indirect Trace Linking

Original indirect jump:
    jmp [%eax]

Translated jump:
    mov [%eax], T
    jmp target_1’

Chain of predicted targets:
    target_1’: if (T != target_1) jmp target_2’
               …
    target_N’: if (T != target_N) jmp LookupHtab
               …

Slow path:
    LookupHtab: if (hit) jmp translated[T]
                else call Pin

• Chains are built incrementally
  – Most recent target inserted at the chain’s head
• Hash table is local to each indirect jump
=> Improved prediction accuracy over existing schemes
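
The chain-then-hash-table lookup can be modelled for a single indirect jump site as follows. This is a sketch under stated assumptions, not Pin's code: `IndirectSite`, `translated`, and the `maxChain` bound are hypothetical (the slide does not say whether or how chain length is limited; the bound is included only so that the hash table path is exercised), and addresses are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical model of one translated indirect jump.
struct IndirectSite {
    std::size_t maxChain;               // assumed bound on chain length (illustrative only)
    std::list<int> chain;               // predicted targets; the head is compared first
    std::unordered_map<int, int> htab;  // table local to this jump: target -> translation
    int htabHits = 0;
    int slowPathCalls = 0;

    explicit IndirectSite(std::size_t n) : maxChain(n) {}

    // Stand-in for the code-cache address of a compiled target.
    static int translated(int target) { return target + 1000; }

    // Resolves a dynamic target T and returns its translated address.
    int jump(int target) {
        for (int predicted : chain)  // if (T != target_i) jmp target_{i+1}'
            if (predicted == target) return translated(target);
        auto it = htab.find(target);  // LookupHtab
        if (it != htab.end()) { ++htabHits; return it->second; }
        ++slowPathCalls;  // miss: call into the VM to compile the target
        htab[target] = translated(target);
        if (chain.size() < maxChain)
            chain.push_front(target);  // most recent target goes to the chain's head
        return htab[target];
    }
};
```

With `IndirectSite site(2)`, jumping to targets 10, 20, and 30 takes the slow path three times and leaves 20 at the head of the chain. A later jump to 30 misses the chain but hits the per-site hash table, and a later jump to 10 is resolved by the chain alone.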

Return-Address Prediction
• Distinguish different callers to a function by cloning:

Original code:
    A(): call F()
    B(): call F()
    F(): ret

No cloning:
    F’():   pop T
            jmp A’
    A’:     if (T != A) jmp B’
            …
    B’:     if (T != B) jmp Lookuphtab1
            …

Cloning:
    F_A’(): pop T
            jmp A’
    A’:     if (T != A) jmp Lookuphtab1
            …
    F_B’(): pop T
            jmp B’
    B’:     if (T != B) jmp Lookuphtab2
            …

=> Prediction accuracy further improved

Pin Multithreading Support
• For instrumenting multithreaded programs:
  – Pin intercepts all threading-related system calls:
    • Create and start jitting a thread if a clone() is seen
  – Pin provides a “thread id” for pintools to index thread-local storage
  – Pin’s virtual registers are backed up by a per-thread spilling area
• For writing multithreaded pintools:
  – Pin cannot link libpthread into the pintool (to avoid conflicts between two libpthreads setting up signal handlers), so:
    • Pin implements a subset of libpthread itself
    • Pin can also redirect libpthread calls in the pintool to the application’s libpthread

Instrumenting Multithreaded Programs
• Pin instruments multithreaded programs:
  – Spilling area has to be thread local
    • Create a new per-thread spilling area when a thread-create system call (e.g., clone()) is intercepted
• How to access the per-thread spilling area?
  – Steal a physical register to point to the per-thread spilling area
  – x86-specific optimization:
    • Initially assume a single-threaded program
      – Access the spilling area via its absolute address
    • If multiple threads are detected later:
      – Flush the code cache
      – Recompile with a physical register pointing to the per-thread spilling area
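
The x86-specific optimization amounts to a small state machine around the code cache. The sketch below is a toy model with hypothetical names (`ToyCodeCache`, `spillOperand`, `onThreadCreate`) and placeholder operand strings; it is not Pin's implementation, and it records only which addressing form the generated code would use for spill slots.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model: each compiled trace remembers how it addresses spill slots.
struct ToyCodeCache {
    int threads = 1;
    int flushes = 0;
    std::map<int, std::string> traces;  // original address -> spill-slot operand form

    // Operand form used by newly compiled code.
    std::string spillOperand() const {
        return threads == 1 ? "[SPILL_BASE + off]"  // absolute address, no register needed
                            : "[%reg + off]";       // stolen register holds per-thread base
    }

    // Returns the cached trace for `addr`, compiling it on first use.
    const std::string& compile(int addr) {
        auto it = traces.find(addr);
        if (it == traces.end()) it = traces.emplace(addr, spillOperand()).first;
        return it->second;
    }

    // Invoked when a thread-create system call such as clone() is intercepted.
    void onThreadCreate() {
        if (threads == 1) {  // single-thread assumption just broke
            traces.clear();  // flush the code cache
            ++flushes;
        }
        ++threads;
    }
};
```

In this model a trace compiled while the program is single-threaded uses the absolute-address form. The first `onThreadCreate()` flushes the cache once, after which recompiled traces use the register-based form, and further thread creations cause no additional flushes.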

Optimizing Instrumentation Performance
Observations:
  – Slowdown is largely due to executing instrumentation code rather than dynamic compilation
    => Makes sense to spend more time optimizing
  – Focus on optimizing simple instrumentation tools:
    • Performance depends on how fast we can transition between the application and the tool
    • Simple yet commonly used (e.g., basic-block profiling)

Pin Source Code Organization
• Pin source organized into generic, architecture-dependent, OS-dependent modules:

  Architecture            #source files   #source lines
  Generic                 87 (48%)        53595 (47%)
  x86 (32-bit + 64-bit)   34 (19%)        22794 (20%)
  Itanium                 34 (19%)        20474 (18%)
  ARM                     27 (14%)        17933 (15%)
  TOTAL                   182 (100%)      114796 (100%)

=> ~50% code shared among architectures

Pin Instrumentation
Performance of basic-block counting with Pin/IA32

  Average slowdown                          INT     FP
  Without optimization                      10.4x   3.9x
  Inlining                                  7.8x    3.5x
  Inlining + eflags analysis                2.8x    1.5x
  Inlining + eflags analysis + scheduling   2.5x    1.4x

Comparison among Dynamic Instrumentation Tools
[Figure: Performance of basic-block counting with three different tools]
• Valgrind is a popular instrumentation tool on Linux
  – Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in dynamic optimization
  – Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation

[Figure: Pin/IA32 Performance (no instrumentation)]

[Figure: Pin/EM64T Performance (no instrumentation)]

[Figure: Pin 0/IPF Performance (no instrumentation)]