Mixed Mode Execution with Context Threading Mathew Zaleski

Mathew Zaleski, Marc Berndl, Angela Demke Brown
University of Toronto
{matz, berndl, demke}@cs.toronto.edu
(CASCON 2005, 11:15 am, Oct 19/2005)

Overview
‣ Introduction
• Background:
    • Interpretation
    • Dynamo & Traces
• Our Approach
• Selecting Regions
• Results and Discussion

VM performance
• Native code performs better than an interpreter.
• Ahead-of-time compilation is not always possible.
• High-performance VMs both interpret and compile, hence the term mixed-mode execution.
• Mixed-mode systems are typically method-based.
‣ Perl, Python, PHP, Tcl, JavaScript and many others do not run mixed-mode. Why?

VM complexity
[Figure: complexity scale from Simple to Complicated, showing Switch interpreter, Context Threaded interpreter, "what about here?", and optimizing inlined method nests]
• Much up-front effort is needed before a method-based JIT works.
• The JIT must be able to compile complete inlined method nests before any performance benefit accrues.
‣ We aim to create a more incremental approach to building a mixed-mode system.

Our vision of Incremental VM lifecycle
[Figure: axis "Complexity of Compiled Code Regions", showing Switch interpreter, Context Threaded interpreter, Basic Blocks, Traces, Partial methods?, and optimized inlined method nests]
‣ Step up to more ambitious regions as required.

Overview
✓ Introduction
‣ Background:
    • Interpretation
    • Traces
• Our Approach
• Selecting Regions
• Results and Discussion

Where does bytecode come from?
Java source:
    int f(boolean parm){
        if (parm){
            return 42;
        }else{
            return 0;
        }
    }
Java bytecode (produced by the javac compiler):
    int f(boolean);
    Code:
     0: iload_1
     1: ifeq 7
     4: bipush 42
     6: ireturn
     7: iconst_0
     8: ireturn

Interpreter
[Figure labels: Loaded Program, Load Parms, Internal Representation, Bytecode bodies; Execution Cycle: fetch, dispatch, execute]

Switched Interpreter
    while(1){
        switch(*vPC++){
        case iload_1: .. break;
        case ifeq: .. break;
        //and many more..
        }
    };
‣ Slow: burdened by switch and loop overhead.
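
The switch loop above can be made concrete for the f(boolean) example. This is a minimal sketch, not the authors' interpreter: the opcode numbering, the one-cell operand encoding, and the name run_switch are assumptions made for illustration, and the ifeq target is an index into the array (6) rather than the real bytecode offset (7).

```c
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch (not the JVM's encodings). */
enum { OP_ILOAD_1, OP_IFEQ, OP_BIPUSH, OP_ICONST_0, OP_IRETURN };

/* f(boolean) with every operand stored in one cell. */
static const int32_t f_code[] = {
    OP_ILOAD_1,      /* index 0: push the argument              */
    OP_IFEQ, 6,      /* index 1: if popped value == 0 goto 6    */
    OP_BIPUSH, 42,   /* index 3: push 42                        */
    OP_IRETURN,      /* index 5: return top of stack            */
    OP_ICONST_0,     /* index 6: push 0                         */
    OP_IRETURN       /* index 7: return top of stack            */
};

/* Switch-dispatched interpreter: every virtual instruction pays for the
   loop branch plus the switch's bounds check and indirect jump. */
int32_t run_switch(const int32_t *code, int32_t parm)
{
    int32_t stack[8];
    int sp = 0;
    const int32_t *vPC = code;

    while (1) {
        switch (*vPC++) {
        case OP_ILOAD_1:
            stack[sp++] = parm;
            break;
        case OP_IFEQ: {
            int32_t target = *vPC++;
            if (stack[--sp] == 0)
                vPC = code + target;
            break;
        }
        case OP_BIPUSH:
            stack[sp++] = *vPC++;
            break;
        case OP_ICONST_0:
            stack[sp++] = 0;
            break;
        case OP_IRETURN:
            return stack[--sp];
        default:
            return -1;   /* unknown opcode in this sketch */
        }
    }
}
```

Calling run_switch(f_code, 1) follows the "then" path of the Java source, and run_switch(f_code, 0) follows the "else" path.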

"Threading" Dispatch
Virtual program:
    int f(boolean);
    Code:
     0: iload_1
     1: ifeq 7
     4: bipush 42
     6: ireturn
     7: iconst_0
     8: ireturn
Bodies:
    iload_1:  .. goto *vPC++;
    ifeq:     if (..) vPC = ..; goto *vPC++;
    bipush:   .. goto *vPC++;
    ireturn:  .. goto *vPC++;
    iconst_0: .. goto *vPC++;
• Execution of the virtual program "threads" through the bodies (as in needle & thread).
‣ No switch overhead. Still a nasty indirect branch.

Direct Threaded Interpreter
Virtual program: … iload_1, ifeq 7, bipush 42, ireturn, iconst_0, ireturn …
DTT (Direct Threading Table), walked by vPC:
    &&iload_1
    &&ifeq 4
    &&bipush 42
    &&ireturn
    &&iconst_0
    &&ireturn
C implementation of each body:
    iload_1: .. goto *vPC++;
    ifeq:    if (..) vPC = ..; goto *vPC++;
    bipush:  .. goto *vPC++;
‣ Target of computed goto is data-driven.
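
Direct threading can be sketched with the GNU C "labels as values" extension (supported by GCC and Clang, not by standard C). This is an illustrative sketch under assumptions, not the authors' code: the opcode numbering, the DTT_MAX limit, and the name run_direct_threaded are hypothetical. At load time each opcode is replaced by the address of its body, and the ifeq operand by the address of the target DTT slot.

```c
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch. */
enum { OP_ILOAD_1, OP_IFEQ, OP_BIPUSH, OP_ICONST_0, OP_IRETURN, OP_COUNT };

static const int32_t f_code[] = {
    OP_ILOAD_1, OP_IFEQ, 6, OP_BIPUSH, 42, OP_IRETURN, OP_ICONST_0, OP_IRETURN
};

#define DTT_MAX 16                         /* assumed limit for the sketch */
#define DISPATCH() goto *(void *)*vPC++    /* computed goto (GNU C)        */

int32_t run_direct_threaded(const int32_t *code, int n, int32_t parm)
{
    /* Body addresses, taken with the GNU C && operator. */
    void *labels[OP_COUNT] = {
        [OP_ILOAD_1]  = &&iload_1,
        [OP_IFEQ]     = &&ifeq,
        [OP_BIPUSH]   = &&bipush,
        [OP_ICONST_0] = &&iconst_0,
        [OP_IRETURN]  = &&ireturn
    };
    intptr_t dtt[DTT_MAX];   /* each cell is a body address or an operand */
    int32_t stack[8];
    int sp = 0;
    intptr_t *vPC;

    if (n > DTT_MAX)
        return -1;

    /* Load time: build the DTT from the virtual program. */
    for (int i = 0; i < n; i++) {
        int32_t op = code[i];
        if (op < 0 || op >= OP_COUNT)
            return -1;
        dtt[i] = (intptr_t)labels[op];
        if (op == OP_IFEQ) {               /* operand: address of target slot */
            i++;
            dtt[i] = (intptr_t)&dtt[code[i]];
        } else if (op == OP_BIPUSH) {      /* operand: immediate value */
            i++;
            dtt[i] = code[i];
        }
    }

    vPC = dtt;
    DISPATCH();

iload_1:
    stack[sp++] = parm;
    DISPATCH();
ifeq: {
        intptr_t *target = (intptr_t *)*vPC++;
        if (stack[--sp] == 0)
            vPC = target;
    }
    DISPATCH();
bipush:
    stack[sp++] = (int32_t)*vPC++;
    DISPATCH();
iconst_0:
    stack[sp++] = 0;
    DISPATCH();
ireturn:
    return stack[--sp];
}
```

Each body ends in its own indirect jump whose target comes from the DTT, which is the data-driven branch the slide refers to.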

Essence of Subroutine Threading
[Figure: the DTT (entries shown: 4, 42), the Context Threading Table (CTT), and the ret-terminated bytecode bodies]
CTT:
    call iload_1
    call ifeq
    call bipush
    call ireturn
    call iconst_0
    call ireturn
Bytecode bodies (ret terminated):
    iload_1: .. ret;
    ifeq:    vPC = ..; goto *vPC;
• We recently reported (CGO 2005) that on modern hardware (Pentium 4 and PowerPC) dispatching virtual instruction bodies by calling them reduces branch mispredictions significantly.
‣ Package bodies as subroutines and call them.

Context Threading (CT): generating specialized code in the CTT
[Figure: vPC, the DTT, and the CTT, with the branch inlined into the CTT]
CTT:
    …
    if(eq) goto target
    call bipush
    call …
    …
    target:
‣ Specialized bodies can also be generated in the CTT!

Overview
✓ Introduction
✓ Background
    ✓ Interpretation
    • Dynamo Traces
• Our Approach
    • Why Context Threading?
    • Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

HP Dynamo
• Trace-oriented dynamic optimization system.
• Ran on HP PA-8000 computers.
• Counter-intuitive approach:
    • Don't execute the optimized binary; interpret it.
    • Count transits of reverse branches.
    • Trace-generate (next slide).
    • Dispatch traces when encountered.
• Soon, most execution comes from the trace cache.
• Faster than the binary!

Trace with if-then-else
    // c => b2
    if (c) b1; else b2;
    b3;
[Figure: control-flow graph of c, b1, b2, b3 and the corresponding trace with a trace exit (texit)]
• A trace is the path followed by the program.
• Conditional branches become trace exits.
• Trace exits are not expected to be taken.

Other Related work
• Ertl & Gregg
• Piumarta & Riccardi
• Vitale & Abdelrahman
• Bala, Duesterwald and Banerjia
• Whaley
• Many JIT authors

Overview
✓ Introduction
✓ Background: Interpretation & traces
‣ Our Approach
    • Strategy
    • Why Context Threading?
    • Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

Our Strategy
• Much optimizing JIT research exists.
• Almost all of it is method-based.
• We investigate how to mixed-mode execute variously shaped regions of a program:
    • Region selection
    • Code generation
    • Dispatch and execution
‣ We concentrate on how to extend a CT interpreter to detect, translate and execute basic blocks and traces.

Context Threading was easy to program
Three main reasons:
1. Bodies are organized as callable routines.
2. The DTT always points to the implementation.
3. CTT call sites provide a convenient interposition opportunity.

1. Bodies are callable
Packaging bytecode bodies as lightweight subroutines:
    iload_1: .. ret;
    istore:  .. ret;
CTT:
    call iload_1
    specialized code for iadd
    call istore_1
‣ Easy to intersperse generated code and dispatch.

2. DTT always points to implementation
.. of the corresponding region of the virtual program.
    goto *pc   //branches to iadd
[Figure: the pc points into the DTT, whose slots point into the CTT]
CTT:
    call iload_1
    specialized code for iadd
    call istore_1
Bodies:
    iload_1: .. ret;
    istore:  .. ret;
‣ DTT/CTT correspondence enables a soft link to the dispatch code or body for a virtual instruction.

3. CTT provides for efficient interposition
An interposer is a generated trampoline.
[Figure: a DTT slot leads to an interposer wrapped around call iload_1; body iload_1: .. ret;]
    preworker(){
        //instrument
        //or debug
    }
    postworker(){
        //instrument
        //or debug
    }
‣ Regular C functions called between every dispatch.
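
The idea of interposition can be approximated in portable C. The slides describe generated machine-code trampolines; the sketch below swaps in function pointers and a per-slot flag instead, so it illustrates the control flow only. The types VM and Slot and the functions dispatch and run_slots are hypothetical names introduced for this example.

```c
#include <stdint.h>

/* Hypothetical VM state for this sketch. */
typedef struct VM {
    int32_t stack[8];
    int     sp;
    int32_t local1;       /* the value iload_1 pushes       */
    long    pre_calls;    /* bumped by the preworker        */
    long    post_calls;   /* bumped by the postworker       */
} VM;

typedef void (*Body)(VM *);

/* Bodies packaged as callable routines, as on the slides. */
static void iload_1(VM *vm)  { vm->stack[vm->sp++] = vm->local1; }
static void iconst_0(VM *vm) { vm->stack[vm->sp++] = 0; }

/* One dispatch slot. `interposed` stands in for "this slot currently
   points at a trampoline" in the real, code-generating system. */
typedef struct Slot {
    Body body;
    int  interposed;
} Slot;

/* Regular C functions run around the body: profiling here, but they
   could equally instrument or debug. */
static void preworker(VM *vm)  { vm->pre_calls++; }
static void postworker(VM *vm) { vm->post_calls++; }

static void dispatch(VM *vm, const Slot *slot)
{
    if (slot->interposed) {
        preworker(vm);
        slot->body(vm);
        postworker(vm);
    } else {
        slot->body(vm);
    }
}

/* Runs a straight-line sequence of slots; returns the top of stack,
   or 0 if the stack is empty. */
int32_t run_slots(VM *vm, const Slot *slots, int n)
{
    for (int i = 0; i < n; i++)
        dispatch(vm, &slots[i]);
    return vm->sp > 0 ? vm->stack[vm->sp - 1] : 0;
}
```

Clearing a slot's `interposed` flag removes the workers for that instruction without touching its body, which mirrors how an interposer can be installed or removed per call site.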

CT as basis for light-weight JIT
• Code can be generated:
    • Inline in the CTT.
    • As a new dynamically generated callable region.
• Interposers support profiling:
    • Discover interesting regions at runtime.
    • Rewrite the DTT to "soft link" new code into the program.

Overview
✓ Introduction
✓ Background: Interpretation & traces
• Our Approach
    ✓ Why Context Threading?
    ‣ Case study: Forward Branch.
• Selecting Regions
• Results and Discussion

Case Study: Forward Branch
Java source:
    int f(boolean parm){
        if (parm){
            return 42;
        }else{
            return 0;
        }
    }
Java bytecode (produced by the javac compiler):
    int f(boolean);
    Code:
     0: iload_1
     1: ifeq 7
     4: bipush 42
     6: ireturn
     7: iconst_0
     8: ireturn
‣ The address of the destination is needed to load the branch.

Loading Forward Branches
[Figure: the CTT for f(boolean), with call iload_1, the ifeq slot, bipush 42, ireturn, label l0, iconst_0 and ireturn. Recoverable fragments: an interposer "lazy_inter" containing call ifeq, call .., jmp *pc; inline branch code cmp .. beq l0; "lazy l0"; and the body ifeq: .. pc=.. ret;]
    lazy(){
        //rewrite ctt
        //to relative
    }
‣ At run time, lazily rewrite the code as a relative branch.

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
‣ Selecting Regions
    • Basic Blocks
    • Traces
• Results and Discussion

Detecting basic blocks
[Figure: DTT fragment: ifeq 4, iconst_0, ireturn]
• All basic blocks end with a virtual branch instruction.
• The instruction executed immediately after a virtual branch is the entry to a basic block.

End basic block at virtual branch
[Figure: DTT fragment (ifeq 4, iconst_0, ireturn); the ifeq slot leads to an interposer around the ifeq: body]
Interposer:
    call pre
    call ifeq
    call post
    jmp *pc
Workers:
    pre(){ .. end current bb .. gen code for bb }
    post(){ curr_bb = 0; }

Start basic block following branch
[Figure: DTT fragment (ifeq 4, iconst_0, ireturn); the iconst_0 slot leads to an interposer around the iconst_0: body]
Interposer:
    call pre
    call iconst_0
    ret
Worker:
    pre(){
        if (!curr_bb){
            curr_bb = new_bb();
        }
        append_bb();
    }
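
The pre/post worker logic on the last two slides can be sketched inside a small interpreter. This is a hedged illustration, not the authors' implementation: the Recorder and Block types, the fixed capacity of 8 blocks, and the choice to treat ireturn as a virtual branch are assumptions of the sketch. It records which basic blocks a run of f(boolean) actually executes.

```c
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch. */
enum { OP_ILOAD_1, OP_IFEQ, OP_BIPUSH, OP_ICONST_0, OP_IRETURN };

static const int32_t f_code[] = {
    OP_ILOAD_1, OP_IFEQ, 6, OP_BIPUSH, 42, OP_IRETURN, OP_ICONST_0, OP_IRETURN
};

typedef struct { int start; int count; } Block;   /* entry index, #instructions */

typedef struct {
    Block  blocks[8];
    int    nblocks;
    Block *curr_bb;    /* NULL means "the next instruction starts a block" */
} Recorder;

/* Preworker: runs before every instruction. Opens a block if none is
   open, then appends the instruction to the current block. */
static void pre(Recorder *r, int index)
{
    if (!r->curr_bb) {
        if (r->nblocks == 8)
            return;                       /* sketch: silently stop recording */
        r->curr_bb = &r->blocks[r->nblocks++];
        r->curr_bb->start = index;
        r->curr_bb->count = 0;
    }
    r->curr_bb->count++;
}

/* Postworker of a virtual branch: the current block ends here. */
static void post_branch(Recorder *r) { r->curr_bb = 0; }

int32_t run_recording(const int32_t *code, int32_t parm, Recorder *r)
{
    int32_t stack[8];
    int sp = 0, pc = 0;

    r->nblocks = 0;
    r->curr_bb = 0;
    for (;;) {
        pre(r, pc);
        switch (code[pc]) {
        case OP_ILOAD_1:  stack[sp++] = parm;         pc += 1; break;
        case OP_BIPUSH:   stack[sp++] = code[pc + 1]; pc += 2; break;
        case OP_ICONST_0: stack[sp++] = 0;            pc += 1; break;
        case OP_IFEQ:
            pc = (stack[--sp] == 0) ? code[pc + 1] : pc + 2;
            post_branch(r);
            break;
        case OP_IRETURN:                  /* treated as a branch (assumption) */
            post_branch(r);
            return stack[--sp];
        default:
            return -1;
        }
    }
}
```

With parm set to 1 the recorder sees one block starting at index 0 (iload_1, ifeq) and one starting at index 3 (bipush, ireturn); with parm set to 0 the second block starts at index 6 instead.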

(Preliminary) generated code for a basic block
[Figure: DTT fragment (ifeq 4, then the slot for the block, ireturn) pointing at the generated code]
    pre(){ //profile basic block }
Generated dispatch code:
    call pre
    call bb1
    jmp *pc
mini-CTT-b1:
    call iconst_0
    call ireturn
    ret
‣ A basic block is a run-time superinstruction.

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
• Selecting Regions
    ✓ Basic Blocks
    ‣ Traces
• Results and Discussion

Detecting Traces
• Use Dynamo's trace detection heuristic.
• Instrument reverse branches until they are hot:
    • in the postworker of the virtual branch.
• Then trace-generate:
    • in the preworker of each basic block region.
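
The "count reverse branches until hot" step might look like the sketch below. It is an assumption-laden illustration rather than Dynamo's or the authors' code: the threshold value, the MAX_PC table size, and the name note_branch are all hypothetical. The function would be called from the postworker of a virtual branch with the branch's source and destination indices.

```c
#include <stdbool.h>

#define HOT_THRESHOLD 50    /* hypothetical hotness threshold          */
#define MAX_PC        256   /* hypothetical size of the virtual program */

/* One counter per possible reverse-branch target. */
static unsigned counters[MAX_PC];

/* Returns true exactly once per target: on the transit that makes the
   target hot. The caller would then switch to trace generation, which
   the slides place in the preworker of each basic block region. */
bool note_branch(int from, int to)
{
    if (to < 0 || to >= MAX_PC)
        return false;        /* outside the table: ignore             */
    if (to > from)
        return false;        /* forward branch: not a loop candidate  */
    return ++counters[to] == HOT_THRESHOLD;
}
```

Only reverse branches are counted because they usually close loops, so their targets are good places to start a trace.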

Traces
[Figure: DTT slots for bb1 and bb2 pointing at the generated trace code]
    pre(){ //profile trace }
Generated trace code:
    call pre
    call bb1
    trace exit
    call bb2
    jmp *pc
mini-CTT-b1:
    call iconst_0
mini-CTT-b2:
    call iconst_0
    call ireturn
    ret
‣ A trace is a run-time superinstruction.

Code Generation
• The code generation as we have described it today is preliminary.
• We are actively working on a JIT that compiles basic blocks and traces into register-allocated native code.
• Meanwhile, what can we learn from the current system?

Overview
✓ Introduction
✓ Background: Interpretation & traces
✓ Our Approach
✓ Selecting Regions
‣ Results and Discussion

Run time performance
• We built our system into two VMs (Pentium 4):
    • SableVM 1.1.8
    • OCaml 3.08
• Region selection overhead is reasonable.

Elapsed time to run whole suite:

VM        Benchmark suite   Direct Threaded (sec)   CT-trace (sec)
SableVM   SPECjvm98         843                     771
OCaml     shootout          4.04                    4.57

Progress towards our vision of VM lifecycle
[Figure: axis "Complexity of Compiled Code Regions", showing Switch interpreter, Context Threaded interpreter, Basic Blocks ✔, Traces ✔]
‣ Select and dispatch traces with reasonable overhead.

Discussion
• Our system detects and executes basic blocks and traces.
• The paper discusses other shapes.
• The preliminary code generator shows:
    • Flexible shapes are doable.
    • Overheads are reasonable.
• How will a better code generator affect performance?