Exploring Multi-Threaded Java Application Performance on Multicore Hardware

Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
OOPSLA 2012 presentation, October 24th, 2012

Modern Software & Hardware
  • Managed languages
      - Ubiquitous, but added runtime layer
      - Many service threads interact with application
          - JIT compilation, on-stack replacement, collector
          - Stop the application, possibly critical
          - Share hardware resources
  • Multicore with multiple sockets
      - How do we schedule threads with constrained resources?
          - Scale core frequency for power
          - Use caches of all sockets, or limit communication

Extensive Performance Study
  • Multi-threaded Java application on multicore, multi-socket hardware
  • Large space to explore
      - Number of threads
      - Thread-to-core/socket mapping
      - Pairing or isolating application and JVM threads
      - Pinning
      - Impact of frequency scaling
      - Difference between startup and steady state
How do choices with scheduling and hardware resources affect performance?

Experimental Machine: Nehalem
Scale frequency per socket to 1.596 or 3.059 GHz
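The slide does not say how the per-socket frequency was set. As one hedged illustration, on a Linux machine with the cpufreq `userspace` governor a fixed frequency can be requested by writing a value in kHz to each core's `scaling_setspeed` file. The sketch below only builds the list of writes and changes nothing on the system. The core-to-socket layout (cores 0-3 on socket 0, cores 4-7 on socket 1) and the helper names are assumptions for illustration, not taken from the talk.

```python
from typing import Dict, List, Tuple

# Assumption: cores 0-3 sit on socket 0 and cores 4-7 on socket 1.
SOCKET_CORES: Dict[int, List[int]] = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

# The two frequencies named on the slide, in GHz.
LOW_GHZ = 1.596
HIGH_GHZ = 3.059


def ghz_to_khz(ghz: float) -> int:
    """Convert GHz to kHz, the unit used by cpufreq sysfs files."""
    return round(ghz * 1_000_000)


def setspeed_writes(socket_ghz: Dict[int, float]) -> List[Tuple[str, str]]:
    """Return (sysfs path, value) pairs covering every core of each socket.

    Nothing is written here; the caller decides whether to apply the list.
    """
    writes = []
    for socket, ghz in sorted(socket_ghz.items()):
        for core in SOCKET_CORES[socket]:
            path = f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_setspeed"
            writes.append((path, str(ghz_to_khz(ghz))))
    return writes


if __name__ == "__main__":
    # Example: socket 0 at the high frequency, socket 1 at the low one.
    for path, value in setspeed_writes({0: HIGH_GHZ, 1: LOW_GHZ}):
        print(f"echo {value} > {path}")
```

Keeping the plan separate from the act of writing makes it easy to inspect or log a configuration before applying it with root privileges.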

Gain Insight on Scheduling
  • Application
  • Java Virtual Machine
      - Garbage collector
      - Just-in-time compiler with on-stack replacement
  • Cao et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads' performance per energy
  • We study end-to-end performance

Roadmap
  • Cost of Isolation
  • Frequency Scaling
  • Pairing Threads
[Diagram: Socket 0 and Socket 1]

Experimental Methodology
  • Jikes Research Virtual Machine (Dec 2011)
      - Generational Immix collector
      - 1.5, 2, and 3x minimum heap sizes
  • Multithreaded DaCapo benchmarks 9.12-bach
      - avrora, lusearch (with fix), pmd, sunflow, xalan
      - Also, pseudojbb2005
  • Timed 10 invocations
      - Steady state, measure 15th iteration
      - Startup, measure 1st iteration
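The timing scheme above (10 invocations, with the 1st iteration taken as startup and the 15th as steady state) can be sketched as a small harness. This is an in-process stand-in under stated assumptions: in the actual study each invocation would be a fresh JVM run of a benchmark, whereas here `make_workload` is a hypothetical factory that returns a callable, and all function names are illustrative.

```python
import time
from typing import Callable, Dict, List


def time_iteration(workload: Callable[[], None]) -> float:
    """Run the workload once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start


def run_invocation(make_workload: Callable[[], Callable[[], None]],
                   iterations: int) -> List[float]:
    """One invocation: build a fresh workload and time it back to back."""
    workload = make_workload()
    return [time_iteration(workload) for _ in range(iterations)]


def measure(make_workload: Callable[[], Callable[[], None]],
            invocations: int = 10,
            steady_iteration: int = 15) -> Dict[str, List[float]]:
    """Collect startup (1st iteration) and steady-state (Nth iteration) times."""
    startup, steady = [], []
    for _ in range(invocations):
        times = run_invocation(make_workload, steady_iteration)
        startup.append(times[0])                      # 1st iteration
        steady.append(times[steady_iteration - 1])    # e.g. 15th iteration
    return {"startup": startup, "steady": steady}


if __name__ == "__main__":
    # Toy workload standing in for a benchmark iteration.
    result = measure(lambda: (lambda: sum(range(10_000))))
    print(len(result["startup"]), len(result["steady"]))
```

The harness returns the raw per-invocation times rather than a summary, since the slides do not state how the invocations were aggregated.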

Baseline Setup
  • Pin application & collection threads
[Diagram: application threads and JVM service threads (collection, compilation) on Nehalem cores 0-7, Socket 0 and Socket 1]
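Pinning comes down to giving each thread a set of cores it may run on. The sketch below builds two such plans as plain dictionaries: one that isolates JVM service threads on the second socket, and one that pairs each application thread with a collection thread on the same core, the two placements the deck compares. The socket layout, thread names, and function names are assumptions for illustration; the slides do not give the exact mapping used in the study.

```python
from typing import Dict, List

# Assumption: cores 0-3 sit on socket 0 and cores 4-7 on socket 1.
SOCKET_CORES: Dict[int, List[int]] = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def isolated_plan(app_threads: int,
                  service_names: List[str]) -> Dict[str, List[int]]:
    """Pin application threads one per core on socket 0 and confine
    the named service threads to the cores of socket 1."""
    app_cores = SOCKET_CORES[0]
    if app_threads > len(app_cores):
        raise ValueError("more application threads than cores on socket 0")
    plan = {f"app-{i}": [app_cores[i]] for i in range(app_threads)}
    for name in service_names:
        plan[name] = list(SOCKET_CORES[1])
    return plan


def paired_plan(pairs: int) -> Dict[str, List[int]]:
    """Pin application thread i and collection thread i to the same core,
    filling socket 0 first and then socket 1."""
    all_cores = SOCKET_CORES[0] + SOCKET_CORES[1]
    if pairs > len(all_cores):
        raise ValueError("more thread pairs than cores")
    plan: Dict[str, List[int]] = {}
    for i in range(pairs):
        plan[f"app-{i}"] = [all_cores[i]]
        plan[f"gc-{i}"] = [all_cores[i]]
    return plan


if __name__ == "__main__":
    print(isolated_plan(4, ["gc", "compiler"]))
    print(paired_plan(8))
```

On Linux, an affinity set like these could be applied to a native thread through the scheduler affinity interface (for example `os.sched_setaffinity` in Python); how the JVM threads in the study were pinned is not stated on the slides.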

Boosting Socket Frequency
  • 1.596 to 3.059 GHz: 27-50% improvement in execution time
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Exploring The Cost of Isolation
  • Collection threads
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Isolating Collection Threads
Isolating the collector does not significantly hurt performance

Exploring The Cost of Isolation
  • Compiler thread
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Isolating Compiler Thread at Startup
Isolating the compiler at startup has little impact

Isolating On-Stack-Replace at Startup
Isolating OSR at startup improves performance

Exploring The Cost of Isolation
  • All JVM service threads
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Isolating All JVM Threads
Isolating service threads significantly hurts only one benchmark

Exploring Frequency Scaling
  • Baseline: JVM service threads isolated, all cores at highest frequency
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Exploring Frequency Scaling
  • Lower frequency of JVM service threads
    versus
  • Lower frequency of application threads
[Diagram: two copies of the Nehalem core layout, one per configuration]

Lower Frequency: Collector vs App
Lowering collector frequency affects performance 5x less than lowering application frequency

Lower Freq at Startup: Compiler vs App
Lowering compiler frequency is not detrimental compared to lowering application frequency

Lower Frequency: JVM vs App
Lowering JVM frequency affects performance 5x less than lowering application frequency

Exploring Pairing Threads
  • Pair application and collection threads
[Diagram: Nehalem cores 0-7, Socket 0 and Socket 1]

Pairing App & Collector, 2 Sockets
For all benchmarks but avrora, pairing application and collector threads performs best

Overall Performance Comparison
Either use 1 socket, or isolate compiler thread

Conclusions: Scheduling Insights
  • 1 socket: # application threads = # collection threads
  • 2 sockets:
      - Isolate compilation thread
      - Pair application and collection threads
      - Set # application threads = # cores, fewer collection threads
  • Increasing application frequency is more important than increasing JVM service thread frequency
  • Analyzed Java performance given hardware resources
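The scheduling recommendations above could be captured as a small configuration helper. This is a minimal sketch under assumptions: the `Schedule` type and both function names are hypothetical, "# cores" is read as the total core count across both sockets, and because the slides say only "fewer" collection threads, that number is left as a required parameter rather than guessed. Fields the slides do not address for the one-socket case are set to `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Schedule:
    app_threads: int
    collection_threads: int
    # None means the slides give no recommendation for this setting.
    isolate_compiler: Optional[bool]
    pair_app_and_collector: Optional[bool]


def one_socket(threads: int) -> Schedule:
    """1 socket: use as many collection threads as application threads."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return Schedule(threads, threads, None, None)


def two_sockets(total_cores: int, collection_threads: int) -> Schedule:
    """2 sockets: one application thread per core, fewer collection threads,
    compiler thread isolated, application and collection threads paired."""
    if not 0 < collection_threads < total_cores:
        raise ValueError("need 0 < collection_threads < total_cores")
    return Schedule(total_cores, collection_threads, True, True)


if __name__ == "__main__":
    print(one_socket(4))
    print(two_sockets(8, 4))
```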