Grid performance, grid benchmarks, grid metrics
Zsolt Németh
MTA SZTAKI Computer and Automation Research Institute
zsnemeth@sztaki.hu
http://www.lpds.sztaki.hu/~zsnemeth
Outline
● What is the grid?
● What is grid performance?
● Are benchmarks useful?
● How can grid metrics be defined?
What is the grid?
Distributed applications ● A set of cooperative processes
Distributed applications
● Processes require resources
(diagram labels: printer, network, CPU, storage, I/O devices, memory, database, libraries)
Distributed applications
● Resources can be found on computational nodes
(diagram labels: network, CPU, printer, storage, database, memory, I/O devices, libraries; mapping)
Distributed applications
Application: cooperative processes
– Process control?
– Security?
– Naming?
– Communication?
– Input/output?
– File access?
Physical layer: computational nodes
Distributed applications
Application: cooperative processes
Virtual machine:
• Process control
• Security
• Naming
• Communication
• Input/output
• File access
Physical layer: computational nodes
Conventional distributed environments and grids
● Distributed resources are virtually unified by a software layer
● A virtual machine is introduced between the application and the physical layer
● Provides a single system image to the application
● Types
  ● “Conventional” (PVM, some implementations of MPI)
  ● Grid (Globus, Legion)
Conventional distributed environments and grids • What is the essential difference?
Conventional distributed environments and grids • Geographical extent?
Conventional distributed environments and grids • Performance?
Conventional distributed environments and grids • Tools and services?
Conventional distributed environments and grids • How is the virtual machine built up? • What does execution mean? • What is the semantics of execution?
Description of grid
● “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources” (The Anatomy of the Grid)
● “single, seamless, computational environment in which cycles, communication and data are shared” (Legion: The Next Step Toward a Nationwide Virtual Computer)
● “wide-area environment that transparently consists of workstations, personal computers, graphic rendering engines, supercomputers and nontraditional devices” (Legion - A View from 50,000 Feet)
● “collection of geographically separated resources connected by a high speed network”, “a software layer which transforms a collection of independent resources into a single, coherent virtual machine” (Metacomputing - What’s in it for me)
Conventional environments
Processes
• Have resource requests
Mapping
• Processes are mapped onto nodes
• Resource assignment is implicit
Virtual machine
• Constructed on a priori information
Set of nodes (node = collection of resources)
• Login access
• Static
Physical level
Grid
Processes
• Have resource requirements
Virtual machine
• Resources are assigned to processes
• Consists of the selected resources
Mapping
• Assign nodes to resources?
Set of resources
• Shared
• Dynamic
Physical layer
Grid: the resource abstraction
Processes
• Have resource needs
Resource abstraction
• Explicit mapping between virtual and physical resources
• Cannot be solved at user/application level
Physical layer
Grid: the user abstraction
Processes
• Belong to a user
• The user of the virtual machine is authorised to use the constituting resources
• Has no login access to the nodes the resources belong to
User abstraction
• The user of the virtual machine is temporarily mapped onto some local accounts
• Cannot be solved at user/application level
Physical layer
• Local, physical users (user accounts)
The grid: abstraction
● Semantically, the grid is nothing but abstraction
● Resource abstraction
  ● Physical resources can be assigned to virtual resource needs (matched by properties)
  ● The grid provides a mapping between virtual and physical resources
● User abstraction
  ● The user of the physical machine may be different from the user of the virtual machine
  ● The grid provides a temporal mapping between virtual and physical users
Conventional distributed environments and grids
(diagram: Smith requests 4 nodes / 4 CPU, memory, storage / 1 CPU; physical accounts: smith@n1.edu, default@foo.com, smith@n2.edu, griduser@mynode.hu)
Grid performance
What is grid performance at all?
● Performance of the ‘grid infrastructure’ or performance of a ‘grid application’?
● Traditionally ‘performance’ is
  ● Speed
  ● Throughput
  ● Bandwidth, etc.
● Using grids
  ● Quantitative reasons
  ● Qualitative reasons – QoS
  ● Economic aspects
Grid performance analysis scenarios
1. Resource brokering: evaluate the performance of a given resource to decide if it is appropriate for a certain job
2. At runtime: check if a resource can maintain an acceptable/required performance
3. At runtime: check if a job can evolve according to checkpoints
4. Find obvious idling/waiting spots
5. Find bad communication patterns
6. Find serious performance skew
7. Post mortem: see if the brokering strategy was correct
8. Etc.
What is grid performance at all? • supercomputer • cluster
What is grid performance at all? • supercomputer • task is done in 20 minutes • cluster • task is done in 12 hours
What is grid performance at all? • supercomputer • task is done in 20 minutes • available tomorrow night • cluster • task is done in 12 hours • available now
What is grid performance at all?
• supercomputer: task is done in 20 minutes, available tomorrow night, costs $200/hour
• cluster: task is done in 12 hours, available now, costs $15/hour
What is grid performance at all?
• Grid is about resource sharing
• What is the benefit of sharing?
  – acceptable for resource owners
  – acceptable for resource users
• Speed, bandwidth, capacity, etc. are just one aspect
• Properness, fairness, effectiveness of the assignment of processes to resources
Grid performance Performance? Virtual layer Physical layer
Grid performance Performance? Virtual layer Physical layer Measurement
Interaction of application and the infrastructure
● Performance = application performance × infrastructure performance
● Signature model (Pablo group)
  ● Application signature
    ● e.g. instructions/FLOP
  ● Scaling factor (capabilities of the resources)
    ● e.g. FLOPs/second
  ● Execution signature
    ● application signature × scaling factor
    ● e.g. instructions/second = instructions/FLOP × FLOPs/second
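The arithmetic of the signature model can be sketched in a few lines; the application signature and the scaling factor below are made-up illustrative figures, not measurements from the Pablo work.

```python
# Sketch of the signature model: execution signature =
# application signature * scaling factor (illustrative numbers only).

def execution_signature(app_signature: float, scaling_factor: float) -> float:
    """Combine an application signature with a resource scaling factor.

    app_signature  -- application property, e.g. instructions per FLOP
    scaling_factor -- resource capability, e.g. FLOPs per second
    Returns the execution signature, e.g. instructions per second.
    """
    return app_signature * scaling_factor

# Hypothetical application: 3.5 instructions issued per floating point op
app_sig = 3.5        # instructions / FLOP
# Hypothetical resource: 2e9 floating point operations per second
scale = 2.0e9        # FLOPs / second

print(execution_signature(app_sig, scale))  # 7e+09 instructions / second
```

The units cancel the same way the slide's example does: (instructions/FLOP) × (FLOPs/second) yields instructions/second.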
Possible performance problems in grids
● All that may occur in a distributed application
● Plus
  ● Effectiveness of resource brokering
  ● Synchronous availability of resources
  ● Resources may change during execution
  ● Various local policies
  ● Shared use of resources
  ● Higher costs of some activities
● The corresponding symptoms must be characterised
Grid performance metrics
● Abstract representation of measurable quantities: M = R1 × R2 × … × Rn
● Usual metrics
  ● Speedup, efficiency
  ● Load, queue length, etc.
● Such strict values are not characteristic in grids
  ● Cannot be interpreted
  ● Cannot be compared
● New metrics
  ● Local metrics and grid metrics
  ● Symbolic description / metrics
Processing monitoring information
● Trace data reduction
  ● Proportional to time t, processes P, metrics dimension n
● Statistical clustering (reducing P)
  ● Similar temporal behaviours are classified
  ● Representative processes are recorded for each class
● Statistical projection pursuit (reducing n)
  ● Reduces the dimension by identifying significant metrics
  ● Questionable whether it works for grids
● Sampling frequency (reducing t)
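The clustering step above can be illustrated with a minimal sketch; the greedy threshold scheme and the sample traces are assumptions for illustration, not the statistical clustering of any particular tool.

```python
# Sketch of trace-data reduction: processes with similar temporal
# behaviour collapse into one class, and only a representative trace
# per class is kept (reducing P). Traces and threshold are made up.

def cluster_traces(traces, threshold=1.0):
    """Greedy clustering: a trace joins the first class whose
    representative lies within `threshold` (Euclidean distance);
    otherwise it starts a new class. Returns the representatives."""
    representatives = []
    for trace in traces:
        for rep in representatives:
            dist = sum((a - b) ** 2 for a, b in zip(trace, rep)) ** 0.5
            if dist <= threshold:
                break                      # same behaviour class
        else:
            representatives.append(trace)  # record a new class
    return representatives

# Four processes, CPU load sampled at three instants
traces = [
    [0.9, 0.8, 0.9],   # busy worker
    [0.9, 0.9, 0.8],   # behaves like the first -> same class
    [0.1, 0.1, 0.2],   # mostly idle
    [0.2, 0.1, 0.1],   # behaves like the third -> same class
]
print(len(cluster_traces(traces, threshold=0.5)))  # 2 classes remain
```

Here four process traces reduce to two representatives, which is the whole point of the P-reduction: record one trace per behaviour class instead of one per process.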
Performance tuning, optimisation
● The execution cannot be reproduced
  ● Post-mortem optimisation is not viable
● On-line steering is necessary, though hard to realise
  ● Sensors and actuators
  ● Application and implementation dependent
  ● E.g. Autopilot, Falcon
● Average behaviour of applications can be improved
● Post-mortem tuning of the infrastructure (if possible)
  ● Brokering decisions
  ● Supporting services
Grid benchmarking
Grid performance, resource performance ● The traditional way: benchmarking ● As suggested by GGF-GBRG
Running benchmarks ● Benchmarks are executed on a virtual machine
Running benchmarks ● ● Benchmarks are executed on a virtual machine The virtual machine may change (composed of different resources) from run to run
Running benchmarks ● ● ● Benchmarks are executed on a virtual machine The virtual machine may change (composed of different resources) from run to run Benchmark result is representative to one certain virtual machine
Running benchmarks ● ● ● Benchmarks are executed on a virtual machine The virtual machine may change (composed of different resources) from run to run Benchmark result is representative to one certain virtual machine What can it show about the entire grid? What can it show about a certain resource?
Grid benchmarking Measurement Performance? Virtual layer Physical layer
Grid metrics
Local metrics
● Load averages, CPU user/system/idle percentages, network bandwidth, cache hit ratio, available memory, page faults, etc.
● Performance is a trajectory in a multi-dimensional space
● Cannot be compared, cannot be interpreted
  ● processes: 55.2, user: 70, system: 0, idle: 30 – underloaded 64-CPU system
  ● processes: 55.2, user: 70, system: 30, idle: 0 – 64-CPU system, serious overheads
  ● processes: 4.1, user: 99, system: 1, idle: 0 – seriously overloaded 1-CPU system
  ● processes: 72.8, user: 99, system: 1, idle: 0 – slightly overloaded 64-CPU system
● Fine details are even more complex to evaluate
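The point that raw figures need local context can be sketched as follows; the thresholds and classification labels are assumptions chosen to mirror the examples above, not part of any standard metric.

```python
# Sketch: the same load average means very different things on
# different systems, so a raw local metric cannot be interpreted
# without context (here: the CPU count). Thresholds are assumptions.

def interpret_load(processes: float, n_cpus: int) -> str:
    """Classify a load average relative to the number of CPUs."""
    ratio = processes / n_cpus
    if ratio < 0.9:
        return "underloaded"
    if ratio < 1.5:
        return "slightly overloaded"
    return "seriously overloaded"

# Identical raw figure, opposite meanings:
print(interpret_load(4.1, 64))   # underloaded
print(interpret_load(4.1, 1))    # seriously overloaded
print(interpret_load(72.8, 64))  # slightly overloaded
```

A load of 4.1 is negligible on a 64-CPU machine and crushing on a single CPU, which is exactly why local metrics cannot be compared across resources without a transformation.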
Local metrics, global (grid) metrics
● Local metrics are transformed into some globally understandable performance figures
● What are the dimensions?
● What is the transformation?
Global metrics
● MIPS, MFLOPS, Gbit/s, etc.
● Comparable, interpretable
● Most users have no idea about the computing power they really require
● These are usually nominal, not actual values
● Too general a characterisation – fine details are hidden
Benchmark metrics
● Benchmarks are for comparing computer systems
● A well selected benchmark set is
  ● sensitive to different factors: CPU-intensive, communication-intensive, I/O-intensive jobs
  ● able to show fine details: cache behaviour, floating point capabilities, etc.
  ● able to show behaviour at different levels: instruction, loop, procedure, application
● These figures can be obtained actively: they require time and resources
Benchmark metrics
● Given a local database with local and benchmark performance records
  ● get the local performance figures
    ● low cost, OS functionality
  ● look up the database for benchmark performance
    ● there may not be a record for the actual local performance
    ● symbolic (fuzzy) interpolation
  ● the actual benchmark figures can be estimated
    ● actual execution of benchmarks is costly, if not impossible
● Estimated benchmark figures give a characterisation of the system in a comparable and interpretable way
● Sounds reasonable… but not enough
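A minimal sketch of the lookup-and-estimate step, assuming a toy database of (load, benchmark score) records and plain linear interpolation in place of the symbolic (fuzzy) interpolation mentioned above; all figures are invented.

```python
# Sketch: estimate a benchmark figure for the current local state from
# past (local metric, benchmark score) records instead of re-running
# the benchmark. Records and the linear scheme are illustrative only.

# Past observations: load average -> benchmark score (made-up figures)
records = [(0.0, 1000.0), (8.0, 600.0), (16.0, 350.0), (32.0, 100.0)]

def estimate_score(load: float) -> float:
    """Linearly interpolate the benchmark score for an unseen load."""
    records_sorted = sorted(records)
    for (x0, y0), (x1, y1) in zip(records_sorted, records_sorted[1:]):
        if x0 <= load <= x1:
            t = (load - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("load outside the recorded range")

# No record exists for load 12.0; estimate instead of benchmarking
print(estimate_score(12.0))  # halfway between 600 and 350 -> 475.0
```

The estimate is cheap (a database lookup plus arithmetic) while still being expressed in comparable, interpretable benchmark units.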
Benchmark metrics
● Benchmarks may show actual execution performance, but that is not enough…
● Real-life experiments: execution time may show no correlation to actual load
  ● start every job and suffer resource starvation, or
  ● wait until resources are available and start specific jobs
● Resource management policy must be taken into consideration
Job startup times
● corona.iif.hu, SUN Ultra Enterprise 10000, 64 CPU
● Sun Grid Engine
● Time between submission and actual start
  ● 1-processor job: within 1 minute
  ● 2-processor job: mostly within 1 minute
  ● 4-processor job: 2-3 hours
  ● 8-processor job: 1-2 days
  ● 9-processor job: 1-2 days
  ● 16-processor job: 2-3 days
  ● 25-processor job: > 4-5 days
● See online: http://www.lpds.sztaki.hu/~zsnemeth/apart/statistics.shtml
Resource performance characterisation
● Execution phase: resource performance can be characterised in the space of benchmark metrics
  ● analyse the relationship between local metrics and benchmark results
  ● find the principal components
● Waiting phase: a stochastic model
  ● find the parameters of the distribution
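The waiting-phase model can be sketched by fitting a distribution to observed queue waits; the exponential model and the sample data are assumptions for illustration, not the distribution actually used in the experiments.

```python
# Sketch of the "waiting phase" stochastic model: fit an exponential
# distribution to observed queue waiting times. The sample data and
# the choice of model are assumptions; the maximum-likelihood rate of
# an exponential is simply 1 / mean wait.
import math

waits = [30.0, 45.0, 60.0, 90.0, 75.0]   # minutes, hypothetical 4-CPU jobs

mean_wait = sum(waits) / len(waits)
rate = 1.0 / mean_wait                    # MLE of the exponential rate

def prob_start_within(t: float) -> float:
    """P(job starts within t minutes) under the fitted model."""
    return 1.0 - math.exp(-rate * t)

print(round(mean_wait, 1))                # 60.0 minutes on average
print(round(prob_start_within(60.0), 3))  # 0.632
```

Publishing just the fitted parameter (here the rate) in an information system is far cheaper than publishing raw wait logs, yet lets any consumer answer "how likely is my job to start within t minutes?".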
Resource performance characterisation
● These parameters ( i, {t1, t2, …, tn} ) can be distributed in an information system
● Interpretable: the stochastic model and the benchmark set give an appropriate framework
● Comparable: figures have the same meaning within this framework
Ongoing work
● Exploring the statistical properties of benchmarks and system parameters
  ● Intensive benchmark experiments
  ● Getting the most out of figures
  ● Principal component analysis: which figures are really meaningful
  ● Testing the stability of statistical data
  ● http://www.lpds.sztaki.hu/~zsnemeth/apart/statistics.shtml
● Exploring how benchmark results can be estimated from past measurements
  ● Database management
  ● Symbolic interpolation
Conclusion
● A semantic definition for grids
  ● the presence of user and resource abstraction
● Grid performance has a more complex meaning
● Resource abstraction requires abstraction in the performance characterisation, too
  ● separation of local (physical) and global (virtual) metrics
  ● benchmarking is not viable, but benchmarks can serve as metrics
● Experiments with resource characterisation