Mixed Mode Execution with Context Threading Mathew Zaleski

Mixed Mode Execution with Context Threading
Mathew Zaleski, Marc Berndl, Angela Demke Brown
University of Toronto
{matz, berndl, demke}@cs.toronto.edu
(CASCON 2005, 11:15 am, Oct 19/2005)

Overview
‣ Introduction
• Background:
  • Interpretation
  • Dynamo & Traces
• Our Approach
• Selecting Regions
• Results and Discussion

VM performance
• Native code performs better than an interpreter.
• Ahead-of-time compilation not always possible.
• High-performance VMs interpret and compile.
  • Hence term mixed-mode execution.
  • Typically method-based.
‣ Perl, Python, PHP, Tcl, JavaScript and many others do not run mixed-mode. Why?

VM complexity
[Figure: complexity scale running from Simple to Complicated. Labels: Switch interpreter, Context Threaded interpreter, "what about here?", optimizing inlined method nests.]
• Much up-front effort needed before method-based JIT works.
  • JIT must be able to compile complete inlined method nests before performance benefit accrues.
‣ We aim to create a more incremental approach to building a mixed-mode system.

Our vision of Incremental VM lifecycle
[Figure: axis "Complexity of Compiled Code Regions". Labels: Switch interpreter, Context Threaded interpreter, Basic Blocks, Traces, Partial methods?, optimized inlined method nests.]
‣ Step up to more ambitious regions as required.

Overview
✓ Introduction
‣ Background:
  • Interpretation
  • Traces
• Our Approach
• Selecting Regions
• Results and Discussion

Where does bytecode come from?

Java Source:

    int f(boolean parm){
        if (parm){
            return 42;
        }else{
            return 0;
        }
    }

Javac compiler

Java Bytecode:

    int f(boolean);
      Code:
       0: iload_1
       1: ifeq 7
       4: bipush 42
       6: ireturn
       7: iconst_0
       8: ireturn

Interpreter
[Figure: structure of an interpreter. Labels: Loaded Program, Load Parms, Internal Representation, Bytecode bodies, Execution Cycle (fetch, execute, dispatch).]

Switched Interpreter

    while(1){
        switch(*vPC++){
        case iload_1:
            ..
            break;
        case ifeq:
            ..
            break;
        //and many more..
        }
    };

‣ Slow. Burdened by switch and loop overhead.
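The switch loop above can be made concrete for the running example f(parm). The following is a minimal sketch, not the authors' code: the opcode numbering, the stack layout, and the names `f_code` and `run_switch` are assumptions made for illustration, and the ifeq operand is an array index rather than a JVM byte offset.

```c
/* Hypothetical opcode numbering for illustration; not the real JVM encoding. */
enum { OP_ILOAD_1, OP_IFEQ, OP_BIPUSH, OP_IRETURN, OP_ICONST_0 };

/* Virtual program for f(parm). Operands follow their opcode, and the
 * ifeq operand is an index into this array (an assumption of the sketch). */
static const int f_code[] = {
    OP_ILOAD_1,        /* 0: push parm                    */
    OP_IFEQ, 6,        /* 1: if top == 0, jump to index 6 */
    OP_BIPUSH, 42,     /* 3: push 42                      */
    OP_IRETURN,        /* 5: return top of stack          */
    OP_ICONST_0,       /* 6: push 0                       */
    OP_IRETURN         /* 7: return top of stack          */
};

int run_switch(const int *code, int parm)
{
    int stack[8];
    int sp = 0;               /* next free stack slot    */
    const int *vPC = code;    /* virtual program counter */

    while (1) {
        switch (*vPC++) {     /* fetch and dispatch */
        case OP_ILOAD_1:
            stack[sp++] = parm;
            break;
        case OP_IFEQ: {
            int target = *vPC++;
            if (stack[--sp] == 0)
                vPC = code + target;
            break;
        }
        case OP_BIPUSH:
            stack[sp++] = *vPC++;
            break;
        case OP_ICONST_0:
            stack[sp++] = 0;
            break;
        case OP_IRETURN:
            return stack[--sp];
        default:
            return -1;        /* unknown opcode */
        }
    }
}
```

Calling `run_switch(f_code, 1)` follows the bipush path and `run_switch(f_code, 0)` follows the iconst_0 path. Every virtual instruction pays for the loop back-edge and the switch's range check and table jump.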

“Threading” Dispatch

Virtual program:

    int f(boolean);
      Code:
       0: iload_1
       1: ifeq 7
       4: bipush 42
       6: ireturn
       7: iconst_0
       8: ireturn

Bodies:

    iload_1:  ..  goto *vPC++;
    ifeq:     if (..) vPC = ..;  goto *vPC++;
    bipush:   ..  goto *vPC++;
    ireturn:  ..  goto *vPC++;
    iconst_0: ..  goto *vPC++;

• Execution of virtual program “threads” through bodies (as in needle & thread).
‣ No switch overhead. Still nasty indirect branch.

Direct Threaded Interpreter

    Virtual Program      DTT (Direct Threading Table)
    …                    
    iload_1              &&iload_1        <- vPC
    ifeq 7               &&ifeq 4
    bipush 42            &&bipush 42
    ireturn              &&ireturn
    iconst_0             &&iconst_0
    ireturn              &&ireturn
    …

C implementation of each body:

    iload_1: ..  goto *vPC++;
    ifeq:    if (..) vPC = ..;  goto *vPC++;
    bipush:  ..  goto *vPC++;

‣ Target of computed goto is data-driven.
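The DTT and the computed gotos can be sketched as one C function. This is an illustration under stated assumptions rather than the interpreter from the talk: it relies on the GNU C labels-as-values extension (`&&label`, `goto *ptr`), so it needs GCC or Clang, and the name `run_direct_threaded`, the stack layout, and the use of DTT indices as branch operands are all hypothetical.

```c
#include <stdint.h>

/* Direct-threaded version of f(parm). Requires GCC or Clang:
 * &&label and goto *ptr are GNU C extensions, not ISO C. */
int run_direct_threaded(int parm)
{
    /* DTT: each slot is either the address of a body or an inline operand. */
    void *dtt[] = {
        &&iload_1,                       /* 0                           */
        &&ifeq, (void *)(intptr_t)6,     /* 1: operand is a DTT index   */
        &&bipush, (void *)(intptr_t)42,  /* 3                           */
        &&ireturn,                       /* 5                           */
        &&iconst_0,                      /* 6                           */
        &&ireturn                        /* 7                           */
    };
    int stack[8];
    int sp = 0;          /* next free stack slot */
    void **vPC = dtt;    /* virtual program counter walks the DTT */

    goto **vPC++;        /* dispatch the first virtual instruction */

iload_1:
    stack[sp++] = parm;
    goto **vPC++;
ifeq: {
        intptr_t target = (intptr_t)*vPC++;
        if (stack[--sp] == 0)
            vPC = dtt + target;
    }
    goto **vPC++;
bipush:
    stack[sp++] = (int)(intptr_t)*vPC++;
    goto **vPC++;
iconst_0:
    stack[sp++] = 0;
    goto **vPC++;
ireturn:
    return stack[--sp];
}
```

Each body ends in its own indirect jump whose target comes from the DTT, which is why the slide calls the target data-driven: the hardware sees one branch per body but many possible destinations.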

Essence of Subroutine Threading

CTT (Context Threading Table):

    call iload_1
    call ifeq
    call bipush
    call ireturn
    call iconst_0
    call ireturn

DTT entries shown on the slide: 4, 42

Bytecode bodies (ret terminated):

    iload_1: ..  ret;
    ifeq:    vPC = ..;  goto *vPC;

• We recently reported (CGO 2005) that on modern hardware (Pentium 4 and PowerPC) dispatching virtual instruction bodies by calling them reduces branch mispredictions significantly.
‣ Package bodies as subroutines and call them.

Context Threading (CT) -- Generating specialized code in CTT

CTT (vPC still refers to the DTT):

    …
    if(eq) goto target
    call bipush
    call …
    …
    target:
    …

• Branch inlined into the CTT.
‣ Specialized bodies can also be generated in CTT!
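A rough C analogue may help picture what the CTT for f(parm) amounts to. The real system emits native call instructions at load time; here a hand-written C function stands in for that generated code, and the `VM` struct, the `body_*` names, and `ctt_f` are all assumptions made for this sketch.

```c
/* Minimal VM state shared by the bodies (hypothetical layout). */
typedef struct {
    int stack[8];
    int sp;      /* next free stack slot   */
    int parm;    /* local variable 1 of f  */
} VM;

/* Bodies packaged as callable routines; the C return plays the role of ret. */
static void body_iload_1(VM *vm)       { vm->stack[vm->sp++] = vm->parm; }
static void body_bipush(VM *vm, int v) { vm->stack[vm->sp++] = v; }
static void body_iconst_0(VM *vm)      { vm->stack[vm->sp++] = 0; }
static int  body_ireturn(VM *vm)       { return vm->stack[--vm->sp]; }

/* Hand-written stand-in for the CTT a loader might emit for f(parm).
 * Straight-line virtual instructions become calls; the virtual branch
 * ifeq is inlined as a native conditional jump. */
int ctt_f(int parm)
{
    VM vm = { {0}, 0, parm };

    body_iload_1(&vm);               /* call iload_1             */
    if (vm.stack[--vm.sp] == 0)      /* inlined ifeq             */
        goto target;
    body_bipush(&vm, 42);            /* call bipush (operand 42) */
    return body_ireturn(&vm);        /* call ireturn             */
target:
    body_iconst_0(&vm);              /* call iconst_0            */
    return body_ireturn(&vm);        /* call ireturn             */
}
```

Because each virtual instruction now has its own call site and the branch is an ordinary conditional jump, the hardware predictors see one native control transfer per virtual control transfer.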

Overview
✓ Introduction
✓ Background
  ✓ Interpretation
  • Dynamo Traces
• Our Approach
  • Why Context Threading?
  • Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

HP Dynamo
• Trace-oriented dynamic optimization system.
• HP PA-8000 computers.
• Counter-intuitive approach:
  • Don’t execute optimized binary; interpret it.
  • Count transits of reverse branches.
  • Trace-generate (next slide).
  • Dispatch traces when encountered.
• Soon, most execution from trace cache.
• Faster than binary!

Trace with if-then-else

    //c => b2
    if (c)
        b1;
    else
        b2;
    b3;

[Figure: control-flow graph of c, b1, b2, b3 beside the corresponding trace, in which c is followed by a trace exit (texit).]
• Trace is path followed by program.
• Conditional branches become trace exits.
• Do not expect trace exits to be taken.

Other Related work
• Ertl & Gregg
• Piumarta & Riccardi
• Vitale & Abdelrahman
• Bala, Duesterwald and Banerjia
• Whaley
• Many JIT authors

Overview
✓ Introduction
✓ Background: Interpretation & traces
‣ Our Approach
  • Strategy
  • Why Context Threading?
  • Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

Our Strategy
• Much optimizing JIT research exists.
  • Almost all method-based.
• We investigate how to mixed-mode execute variously shaped regions of a program.
  • Region selection
  • Code generation
  • Dispatch and execution
‣ We concentrate on how to extend a CT interpreter to detect, translate and execute basic blocks and traces.

Context Threading was easy to program
Three main reasons:
1. Bodies organized as callable routines.
2. The DTT always points to implementation.
3. CTT callsites provide a convenient interposition opportunity.

1. Bodies are callable
Packaging bytecode bodies as lightweight subroutines.

    call iload_1
    specialized code for iadd
    call istore_1

    iload_1: ..  ret;
    istore:  ..  ret;

‣ Easy to intersperse generated code and dispatch.

2. DTT always points to implementation
..of corresponding region of virtual program.

    goto *pc    //branches to iadd

[Figure: DTT beside CTT; pc selects a DTT entry.]

    CTT:
        call iload_1
        specialized code for iadd
        call istore_1

    iload_1: ..  ret;
    istore:  ..  ret;

‣ DTT/CTT correspondence enables soft link to dispatch code or body for a virtual instruction.

3. CTT provides for efficient interposition
An interposer is a generated trampoline.

    call iload_1

    iload_1: ..  ret;

    preworker(){
        //instrument
        //or debug
    }
    postworker(){
        //instrument
        //or debug
    }

‣ Regular C functions called between every dispatch.
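The interposition idea can be sketched in plain C. In the real system the trampoline is generated native code reached from a CTT call site; in this sketch an ordinary function takes its place, and the names `Body`, `interposer`, `run_interposed`, the two toy bodies, and the counters are all hypothetical.

```c
typedef void (*Body)(int *acc);

/* Hypothetical instrumentation state: one counter per worker. */
static int pre_calls, post_calls;

static void preworker(void)  { pre_calls++;  }   /* instrument or debug here */
static void postworker(void) { post_calls++; }   /* instrument or debug here */

/* Two toy bodies standing in for virtual instruction bodies. */
static void body_inc(int *acc)    { *acc += 1; }
static void body_double(int *acc) { *acc *= 2; }

/* The interposer: what a generated trampoline would do around one body. */
static void interposer(Body body, int *acc)
{
    preworker();
    body(acc);
    postworker();
}

/* Run a straight-line sequence of bodies, each wrapped by the interposer. */
int run_interposed(const Body *bodies, int n, int start)
{
    int acc = start;
    for (int i = 0; i < n; i++)
        interposer(bodies[i], &acc);
    return acc;
}
```

For example, running the sequence `body_inc, body_double, body_inc` from a start value of 1 yields 5 and calls each worker three times. The bodies themselves are unchanged, which is the point: profiling is added or removed by redirecting call sites, not by editing bodies.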

CT as basis for light-weight JIT
• Code can be generated:
  • Inline in CTT.
  • As new dynamically generated callable region.
• Interposers support profiling:
  • Discover interesting regions at runtime.
  • Rewrite DTT to “soft link” new code into program.

Overview
✓ Introduction
✓ Background: Interpretation & traces
• Our Approach
  ✓ Why Context Threading?
  ‣ Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

Case Study: Forward Branch

Java Source:

    int f(boolean parm){
        if (parm){
            return 42;
        }else{
            return 0;
        }
    }

Javac compiler

Java Bytecode:

    int f(boolean);
      Code:
       0: iload_1
       1: ifeq 7
       4: bipush 42
       6: ireturn
       7: iconst_0
       8: ireturn

‣ Address of destination needed to load branch.

Loading Forward Branches
[Figure: CTT for f(parm) in which the forward branch is first loaded through an interposer. Fragments recovered from the slide:]

    CTT:
        call iload_1
        call lazy_inter
        call bipush        (operand 42)
        call ireturn
    l0: call iconst_0
        call ireturn

    Interposer:
        call ifeq
        call lazy
        cmp ..
        beq l0
        ..
        jmp *pc

    Bodies:
        iload_1: ..;
        ifeq:    ..  pc = ..;  ret;

    lazy(){
        //rewrite ctt
        //to relative
    }

‣ Runtime -- lazily rewrite code as relative branch.

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
‣ Selecting Regions
  • Basic Blocks
  • Traces
• Results and Discussion

Detecting basic blocks

    DTT: ifeq 4, iconst_0, ireturn

• All basic blocks end with a virtual branch instruction.
• The instruction executed immediately after a virtual branch is the entry to a basic block.

End basic block at virtual branch

    DTT: ifeq 4, iconst_0, ireturn

    ifeq:
        call pre
        call ifeq
        call post
        jmp *pc

    pre(){
        .. end current bb
        .. gen code for bb
    }
    post(){
        curr_bb = 0;
    }

Start basic block following branch

    DTT: ifeq 4, iconst_0, ireturn

    iconst_0:
        call pre
        call iconst_0
        ret

    pre(){
        if (!curr_bb){
            curr_bb = new_bb();
        }
        append_bb();
    }
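The two workers on the basic-block slides can be combined into a small recorder. This is a sketch under stated assumptions: it only measures block lengths instead of generating code, it treats ireturn as a virtual branch, it folds the branch and non-branch variants of pre() into one function, and the `Recorder` type, opcode numbering, and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical opcode numbering for illustration. */
enum { OP_ILOAD_1, OP_IFEQ, OP_BIPUSH, OP_IRETURN, OP_ICONST_0 };

#define MAX_BB 16

/* Hypothetical recorder state. */
typedef struct {
    int  lengths[MAX_BB];   /* instructions in each finished block */
    int  count;             /* number of finished blocks           */
    int  curr_len;          /* length of the block being recorded  */
    bool in_bb;             /* plays the role of curr_bb != 0      */
} Recorder;

static bool is_virtual_branch(int op)
{
    return op == OP_IFEQ || op == OP_IRETURN;
}

/* Called before every dispatched instruction, like pre() on the slides. */
static void pre(Recorder *r, int op)
{
    if (!r->in_bb) {            /* first instruction after a branch */
        r->in_bb = true;
        r->curr_len = 0;
    }
    r->curr_len++;              /* append_bb() */
    if (is_virtual_branch(op) && r->count < MAX_BB)
        r->lengths[r->count++] = r->curr_len;   /* end current bb */
}

/* Called after a virtual branch, like post() on the slides. */
static void post(Recorder *r, int op)
{
    if (is_virtual_branch(op))
        r->in_bb = false;       /* curr_bb = 0 */
}

/* Feed a dynamic instruction stream through the workers. */
int record_blocks(Recorder *r, const int *executed, int n)
{
    r->count = 0;
    r->curr_len = 0;
    r->in_bb = false;
    for (int i = 0; i < n; i++) {
        pre(r, executed[i]);
        post(r, executed[i]);
    }
    return r->count;
}
```

Feeding it the instructions executed by f(false), that is iload_1, ifeq, iconst_0, ireturn, records two blocks of two instructions each. Detection is purely dynamic: blocks are discovered in the order the program executes them, with no static analysis of the bytecode.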

(Preliminary) generated code for a basic block

    DTT: ifeq 4

        call pre
        call bb1
        jmp *pc

    mini-CTT-b1:
        call iconst_0
        call ireturn
        ret

    pre(){
        //profile basic block
    }

‣ Basic block is a run-time superinstruction.

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
• Selecting Regions
  ✓ Basic Blocks
  ‣ Traces
• Results and Discussion

Detecting Traces
• Use Dynamo’s trace detection heuristic.
• Instrument reverse branches until they are hot.
  • In postworker of virtual branch.
• Then trace generate.
  • In preworker of each basic block region.
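The hot-branch test could look roughly like the following. This is a sketch, not the heuristic as implemented in Dynamo or in the talk's system: the threshold value, counting per branch target, the table size, and the names `post_branch` and `recording` are assumptions.

```c
#include <stdbool.h>

#define HOT_THRESHOLD 3     /* hypothetical; real systems tune this value */
#define MAX_PC 64           /* hypothetical size of the virtual program   */

static int  branch_count[MAX_PC];  /* counter per reverse-branch target */
static bool recording;             /* true once trace generation starts */

/* Called from the postworker of a virtual branch at `from` going to `to`.
 * Returns true when the target has just become hot. */
bool post_branch(int from, int to)
{
    if (to > from || to < 0 || to >= MAX_PC)
        return false;               /* only count reverse branches */
    if (++branch_count[to] == HOT_THRESHOLD) {
        recording = true;           /* preworkers now append blocks to a trace */
        return true;
    }
    return false;
}
```

Forward branches are ignored, and a reverse branch reports hot exactly once, on the call that reaches the threshold. After that the basic-block preworkers would take over and append each executed block to the trace being generated.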

Traces

    DTT: bb1, bb2

        call pre
        call bb1
        trace exit
        call bb2
        jmp *pc

    mini-CTT-b1:
        call iconst_0

    mini-CTT-b2:
        call iconst_0
        call ireturn
        ret

    pre(){
        //profile trace
    }

‣ A trace is a run-time super-instruction.

Code Generation
• The code generation as we have described it today is preliminary.
• We are actively working on a JIT that compiles basic blocks and traces into register allocated native code.
Meanwhile, what can we learn from the current system?

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
✓ Selecting Regions
‣ Results and Discussion

Run time performance
• We built our system into two VMs (Pentium 4).
  • SableVM 1.1.8
  • OCaml 3.08
• Region selection overhead is reasonable.

Elapsed time to run whole suite:

    VM        Benchmark Suite   Direct Threaded (sec)   CT-trace (sec)
    SableVM   SpecJvm98         843                     771
    OCaml     shootout          4.04                    4.57
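To put a number on "reasonable", the elapsed times can be turned into relative changes. The helper below is only arithmetic on the slide's figures; it assumes they pair up as SableVM 843 s (direct threaded) against 771 s (CT-trace) and OCaml 4.04 s against 4.57 s, and the function name is hypothetical.

```c
/* Percentage change in elapsed time of CT-trace relative to direct
 * threading. A negative result means CT-trace was faster. */
double percent_change(double direct_threaded_sec, double ct_trace_sec)
{
    return (ct_trace_sec - direct_threaded_sec) / direct_threaded_sec * 100.0;
}
```

Under that pairing, `percent_change(843.0, 771.0)` is about -8.5 (SableVM ran the suite roughly 8.5% faster with CT-trace) and `percent_change(4.04, 4.57)` is about +13.1 (OCaml ran roughly 13% slower), all with the preliminary code generator.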

Progress towards our vision of VM lifecycle
[Figure: axis "Complexity of Compiled Code Regions". Labels: Switch interpreter, Context Threaded interpreter, Basic Blocks ✔, Traces ✔.]
‣ Select, dispatch traces with reasonable overhead.

Discussion
• Our system detects and executes basic blocks and traces.
  • Paper discusses other shapes.
• Preliminary code generator shows:
  • Flexible shapes are doable.
  • Overheads are reasonable.
• How will a better code generator affect performance?
