Advantages of Processor Virtualization and AMPI
Laxmikant Kale


Advantages of Processor Virtualization and AMPI
Laxmikant Kale, CS 320, Spring 2003
Kale@cs.uiuc.edu, http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
1/9/2022 Virtualization for CS 320: 1

Overview
• Processor Virtualization
  – Motivation
  – Realization in AMPI and Charm++
• Part I: Benefits
  – Better software engineering
  – Message-driven execution
  – Flexible and dynamic mapping to processors
  – Principle of persistence

Motivation
• We need to improve performance and productivity in parallel programming
• Parallel computing/programming is about:
  – Coordination between processes
    • Information exchange
    • Synchronization (knowing when the other process has done something)
  – Resource management
    • Allocating work and data to processors

Coordination
• Processes, each with possibly local data
  – How do they interact with each other?
  – Data exchange and synchronization
• Solutions proposed:
  – Message passing
  – Shared variables and locks
  – Global Arrays / shmem
  – UPC
  – Asynchronous method invocation
  – Specifically shared variables: readonly, accumulators, tables
  – Others: Linda, ...
• Each is probably suitable for different applications and subjective tastes of programmers

Resource Management
• Coordination is one aspect
  – But parallel computing is also about resource management
• Who needs resources:
  – Work units
    • Threads, function calls, method invocations, loop iterations
  – Data units
    • Array segments, cache lines, stack frames, messages, object variables
• What are the resources:
  – Processors, floating-point units, thread units
  – Memories: caches, SRAMs, DRAMs, ...
• Idea:
  – The programmer should not have to manage resources explicitly, even within one program

Processor Virtualization
• Basic idea:
  – Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  – Let the system map these virtual processors to physical processors
• Old idea? G. Fox's book ('86?), DRMS (IBM), Data Parallel C (Michael Quinn), MPVM/UPVM/MIST
• Our approach is "virtualization++":
  – Language and runtime support for virtualization
  – Exploitation of virtualization to the hilt
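The basic idea above can be sketched in a few lines of plain C++ (not Charm++; the names are illustrative): the decomposition size is chosen by the problem, and a trivial round-robin map stands in for the runtime's virtual-processor-to-processor mapping.

```cpp
#include <cassert>
#include <vector>

// Map virtual processor vp to a physical processor. The decomposition
// (numChunks) is chosen independently of, and typically larger than,
// the machine size (numPEs).
int chunkToPE(int vp, int numPEs) { return vp % numPEs; }

// Build the full mapping for numChunks virtual processors.
std::vector<int> buildMap(int numChunks, int numPEs) {
    std::vector<int> map(numChunks);
    for (int vp = 0; vp < numChunks; ++vp)
        map[vp] = chunkToPE(vp, numPEs);
    return map;
}
```

The point of the separation: the same decomposition runs unchanged on 3 or 300 processors; only the map, owned by the system, changes.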

Virtualization: Object-based Parallelization
• The user is only concerned with the interaction between objects (VPs)
• (Figure: the user's view of interacting objects vs. the system implementation mapping them onto processors)

Technical Approach
• Seek an optimal division of labor between "system" and programmer: decomposition done by the programmer, everything else automated
• (Figure: a spectrum from automation to specialization across decomposition, mapping, scheduling, and expression, with HPF, Charm++/AMPI, and MPI placed along it)

Why Virtualization?
• Advertisement:
  – Virtualization is ready and powerful enough to meet the needs of tomorrow's applications and machines
• Specifically:
  – Virtualization and the associated techniques we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and of complex and dynamic applications
• These techniques are embodied in:
  – Charm++
  – AMPI
  – Frameworks (structured grids, unstructured grids, particles)
  – Virtualization of other coordination languages (UPC, GA, ...)

Realizations: Charm++
• Charm++
  – Parallel C++ with data-driven objects (chares)
  – Asynchronous method invocation
    • Prioritized scheduling
  – Object arrays
  – Object groups
  – Information-sharing abstractions: readonly, tables, ...
  – Mature, robust, portable (http://charm.cs.uiuc.edu)
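This is not real Charm++ syntax (a chare would be declared in a .ci interface file and driven by the Charm++ runtime); the plain C++ sketch below only mimics the execution model named above: method invocations become queued messages, and a scheduler delivers them in priority order.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// A queued method invocation; smaller value means higher priority.
struct Invocation {
    int priority;
    std::function<void()> run;
};
struct ByPriority {
    bool operator()(const Invocation& a, const Invocation& b) const {
        return a.priority > b.priority;  // turn max-heap into min-heap
    }
};

// The scheduler always picks the highest-priority pending invocation,
// mimicking Charm++'s prioritized message-driven execution.
class Scheduler {
    std::priority_queue<Invocation, std::vector<Invocation>, ByPriority> q;
public:
    void send(int prio, std::function<void()> method) {
        q.push({prio, std::move(method)});  // "send" = enqueue, non-blocking
    }
    void runAll() {
        while (!q.empty()) {
            Invocation inv = q.top();
            q.pop();
            inv.run();
        }
    }
};
```

The asynchrony is the key point: send() returns immediately, and execution order is decided by the scheduler from priorities, not by the order calls appear in the code.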

Object Arrays
• A collection of data-driven objects
  – With a single global name for the collection
  – Each member addressed by an index
    • [sparse] 1D, 2D, 3D, tree, string, ...
  – Mapping of element objects to processors handled by the system
• (Figures, over three slides: the user's view of elements A[0], A[1], A[2], A[3], ...; the system view, with elements placed on processors; and elements migrating between processors)

Adaptive MPI
• A migration path for legacy MPI codes
  – AMPI = MPI + virtualization
  – Uses Charm++ object arrays and migratable threads
• Existing MPI programs:
  – Minimal modifications needed to convert existing MPI programs
• Bindings for C, C++, and Fortran 90
• We will focus on AMPI, ignoring Charm++ for now
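Because AMPI runs each MPI rank as a user-level thread, the main "minimal modification" when converting is privatizing mutable global variables, since several ranks may now share one address space. A minimal sketch of the usual pattern, with hypothetical names (the full AMPI recipe also covers automated approaches):

```cpp
#include <cassert>

// Before (plain MPI): per-rank state kept in globals, e.g.
//   int iter; double residual;
// After (AMPI-friendly): the globals are gathered into a per-rank
// struct passed explicitly, so each user-level thread owns a copy.
struct RankState {
    int iter = 0;
    double residual = 1.0;
};

// Stand-in for one iteration of per-rank work.
double step(RankState& s) {
    ++s.iter;
    s.residual *= 0.5;
    return s.residual;
}
```

Two RankState objects advance independently, exactly as two MPI ranks would, even if they live in the same process.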

AMPI: 7 MPI "processes"
• Implemented as virtual processors (user-level migratable threads), mapped by the runtime onto the real processors
• (Figures: seven MPI "processes" as virtual processors, distributed over a smaller number of real processors)

Benefits of Virtualization
1. Modularity and better software engineering
2. Message-driven execution
3. Flexible and dynamic mapping to processors
4. Principle of persistence: enables runtime optimizations
   – Automatic dynamic load balancing
   – Communication optimizations
   – Other runtime optimizations

1: Modularization
• Logical units decoupled from the "number of processors"
  – E.g., oct-tree nodes for particle data
  – No artificial restriction on the number of processors
    • Such as a cube of a power of 2
• Modularity:
  – Software engineering: cohesion and coupling
  – MPI's "are on the same processor" is a bad coupling principle
  – Objects liberate you from that:
    • E.g., solid and fluid modules in a rocket simulation

Example: Rocket Simulation
• Large collaboration headed by Prof. M. Heath
  – DOE-supported ASCI center
• Challenge:
  – Multi-component code, with modules from independent researchers
  – MPI was the common base
• AMPI: new wine in an old bottle
  – Easier to convert
  – Can still run the original codes on MPI, unchanged
• Example of modularization:
  – Rocflo: fluids code
  – Rocsolid: structures code
  – Rocface: data transfer at the boundary

Rocket Simulation via Virtual Processors
• (Figure: many Rocflo, Rocface, and Rocsolid virtual processors spread across the machine)

AMPI and Roc*: Communication
• Using separate sets of virtual processors for Rocflo and Rocsolid eliminates unnecessary coupling
• (Figure: Rocflo, Rocface, and Rocsolid on separate sets of virtual processors)

2: Benefits of Message-Driven Execution
• Virtualization leads to message-driven execution, since there are potentially multiple objects on each processor
• (Figure: a per-processor scheduler picking work from a message queue)
• This leads to automatic, adaptive overlap of computation and communication

Adaptive Overlap via Data-driven Objects
• Problem:
  – Processors wait too long at "receive" statements
• Routine communication optimizations in MPI:
  – Move sends up and receives down
  – Use irecvs, but be careful
• With data-driven objects:
  – Adaptive overlap of computation and communication
  – No object or thread holds up the processor
  – No need to guess which message is likely to arrive first

Adaptive Overlap and Modules
• (Figure: SPMD vs. message-driven modules. From A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance," Ph.D. thesis, Apr 1994.)

Handling Random Load Variations via MDE
• MDE encourages asynchrony
  – Asynchronous reductions, for example
  – Only data dependence should force synchronization
• One benefit:
  – Consider an algorithm with N steps
    • Each step has a different load balance: Tij
    • Loose dependence between steps (on neighbors, for example)
    • Sum-of-max (MPI) vs. max-of-sum (MDE)
• OS jitter:
  – Causes random processors to add delays in each step
  – Handled automatically by MDE
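The sum-of-max vs. max-of-sum contrast can be made concrete with a small illustrative computation (plain C++, made-up timings): with a barrier after every step the program pays for the slowest processor of each step, while with loose dependences one processor's delay in a step can overlap with another processor's work.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// T[i][j] is the (made-up) time processor j spends in step i.

// Barrier after each step (MPI style): cost = sum_i max_j T[i][j].
double sumOfMax(const std::vector<std::vector<double>>& T) {
    double total = 0.0;
    for (const auto& step : T)
        total += *std::max_element(step.begin(), step.end());
    return total;
}

// Loose, message-driven dependences (MDE): the cost approaches
// max_j sum_i T[i][j], the busiest processor's total work.
double maxOfSum(const std::vector<std::vector<double>>& T) {
    double best = 0.0;
    for (std::size_t j = 0; j < T[0].size(); ++j) {
        double s = 0.0;
        for (const auto& step : T) s += step[j];
        best = std::max(best, s);
    }
    return best;
}
```

With T = {{1, 2}, {2, 1}}, the barrier version costs 4 while the message-driven bound is 3; jitter that hits a different processor in each step behaves exactly like this.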

Asynchronous Reductions in Jacobi
• (Figure: processor timelines. With a synchronous reduction there is a gap between the compute phase and the reduction; with an asynchronous reduction, the next compute phase fills that gap.)

Virtualization/MDE Leads to Predictability
• Ability to predict:
  – Which data is going to be needed, and which code will execute
  – Based on the ready queue of object method invocations
• So we can:
  – Prefetch data accurately
  – Prefetch code if needed
  – Do out-of-core execution
  – Use caches vs. controllable SRAM
• (Figure: per-processor scheduler S and message queue Q)

3: Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors
  – Vacate a processor used by a parallel program
  – Deal with extraneous loads on shared workstations
  – Adapt to speed differences between processors
    • E.g., a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
  – Checkpointing = migrate to disk!
  – Restart on a different number of processors
• Shrink and expand the set of processors used by an app
  – Shrink from 1000 to 900 procs; later expand to 1200
• Adaptive job scheduling for better system utilization
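The slogan "checkpointing = migrate to disk" works because a migratable object must already know how to serialize its own state; in Charm++ this role is played by the PUP (pack/unpack) framework. A loose sketch of the idea, not the real PUP API:

```cpp
#include <cassert>
#include <sstream>

// An object that can pack its state to any stream can be migrated to
// another processor or checkpointed to disk by the same mechanism:
// only the destination of the bytes differs.
struct Element {
    int index = 0;
    double value = 0.0;
    void pack(std::ostream& os) const { os << index << ' ' << value; }
    void unpack(std::istream& is) { is >> index >> value; }
};
```

Migration sends the packed bytes over the network; checkpointing writes them to a file; restarting on a different number of processors just unpacks the elements into a new mapping.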

Inefficient Utilization Within a Cluster
• (Figure: on a 16-processor system, Job A is allocated 10 processors; Job B needs 8, so it is queued, and the two conflict)
• Current job schedulers can yield low system utilization
  – A competitive problem in the context of Faucets-like systems

Two Adaptive Jobs
• Adaptive jobs can shrink or expand the number of processors they use, at runtime, by migrating virtual processors
• (Figure: on a 16-processor system, Job A (min_pe = 1, max_pe = 10) is allocated; when Job B (min_pe = 8, max_pe = 16) is allocated, A shrinks; when A finishes, B expands)

AQS Features
• AQS: Adaptive Queuing System
• Multithreaded
• Reliable and robust
• Supports most features of standard queuing systems
• Can manage adaptive jobs, currently implemented in Charm++ and MPI
• Handles regular (non-adaptive) jobs

Cluster Utilization
• (Figure: experimental and simulated cluster-utilization results)

Experimental Mean Response Time
• (Figure: experimental mean-response-time results)

4: Principle of Persistence
• Once the application is expressed in terms of interacting objects:
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt and large, but infrequent, changes (e.g., AMR)
    • Slow and small changes (e.g., particle migration)
• Parallel analog of the principle of locality
  – A heuristic that holds for most CSE applications
  – Enables learning / adaptive algorithms
  – Adaptive communication libraries
  – Measurement-based load balancing

Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
  – Many alternative strategies can use the database
    • Centralized vs. distributed
    • Greedy improvements vs. complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)
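One of the simplest strategies such a database supports is a centralized greedy reassignment: sort objects by measured load and repeatedly give the heaviest remaining object to the least-loaded processor. A sketch that ignores communication and dependences:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// load[i] is the measured computation time of object i. Returns the
// processor assigned to each object under the greedy rule.
std::vector<int> greedyAssign(const std::vector<double>& load, int numPEs) {
    // Object indices, heaviest measured load first.
    std::vector<int> idx(load.size());
    for (int i = 0; i < static_cast<int>(idx.size()); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return load[a] > load[b]; });

    // Min-heap of (current load, processor): top() is least loaded.
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

    std::vector<int> assignment(load.size());
    for (int obj : idx) {
        Entry e = pes.top();
        pes.pop();
        assignment[obj] = e.second;
        pes.push({e.first + load[obj], e.second});
    }
    return assignment;
}
```

With loads {4, 3, 2, 1} on 2 processors, both end up with a total load of 5; the strategies on the slide refine exactly this kind of decision with communication and dependence information.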

Load Balancer in Action
• (Figure: automatic load balancing in crack propagation: 1. elements added, 2. load balancer invoked, 3. chunks migrated)

Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  – Communication is from/to objects, not processors
  – Load balancers use this to optimize object placement
  – Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
  – E.g., each-to-all individualized sends
    • Performance depends on many runtime characteristics
    • The library switches between different algorithms
• (V. Krishnan, MS thesis, 1996)

All-to-All on Lemieux for a 76-Byte Message
• (Figure: all-to-all performance results)

Impact on Application Performance
• (Figure: molecular dynamics (NAMD) performance on Lemieux, with the transpose step implemented using different all-to-all algorithms)

"Overhead" of Virtualization
• Isn't there significant overhead of virtualization? No! Not in most cases.
• (Figure: an application run with an increasing degree of virtualization)
• Performance actually improves with virtualization because of better cache performance

How to Decide the Granularity
• How many virtual processors should you use?
  – This (typically) does not depend on the number of physical processors available
• Granularity:
  – Simple definition: amount of computation per message
• Guiding principle:
  – Make (the work for) each virtual processor as small as possible, while making sure it is sufficiently large compared with the scheduling/messaging overhead
• In practice, today:
  – Average computation per message > 100 microseconds is enough
  – 0.5 ms to several ms is typically used

How to Decide the Granularity (contd.)
• Exceptions:
  – Memory overhead
    • Virtualization may lead to a large area of memory devoted to "ghosts"
    • Reduce the number of virtual processors
    • Or "fuse" chunks on individual processors to avoid ghost regions
  – Large messages
    • Modify the rule:
      – Calculate the message overhead
      – Ensure granularity is more than 10 times this overhead
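The modified rule reduces to a one-line check. The numbers below are illustrative, not measurements: given the per-message scheduling and communication overhead, the grain (computation per message) should exceed it by a safety factor, 10x for the large-message case above.

```cpp
#include <cassert>

// Grain-size check: computation per message should exceed the
// per-message overhead by a safety factor. Units are microseconds;
// the default factor of 10 follows the rule on this slide.
bool grainLargeEnough(double computePerMsgUs, double overheadUs,
                      double factor = 10.0) {
    return computePerMsgUs >= factor * overheadUs;
}
```

With a per-message overhead of a few microseconds, this check is consistent with the earlier rule of thumb that more than 100 microseconds of computation per message is enough.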

Benefits of Virtualization: Summary
• Software engineering
  – Number of virtual processors can be controlled independently
  – Separate VPs for modules
• Message-driven execution
  – Adaptive overlap
  – Modularity
  – Predictability: automatic out-of-core execution, cache management
• Principle of persistence
  – Enables runtime optimizations
  – Automatic dynamic load balancing
  – Communication optimizations
  – Other runtime optimizations
• Dynamic mapping
  – Heterogeneous clusters: vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors

More info: http://charm.cs.uiuc.edu