VGrADS Runtime System Architecture and Research
Andrew Chien, Henri Casanova, Rich Wolski, Carl Kesselman, Fran Berman, Dan Reed, Jack Dongarra
Feb 2004 Kickoff Meeting, Rice University

Runtime Challenges
• Simplify Resource Abstraction for PPS
  – Enable Simpler and Better Optimization
• Scale to Larger and More Complex Resource Environment
  – Grid resource pools are large and growing
• Provide Better Information
  – Useful and Current Information for Dynamic Decisions
9/18/2021

What is a Virtual Grid?
[Figure labels: “PPS/Application”, “Virtual Grids”, “The Grid”]
• A Resource Management Oriented Abstraction
  – Provides Simple Resource Performance Model for the PPS/Application
  – Structures Information Collection by the Execution System
    • Accelerate and Improve Decision Making About Resources
    • Reduced Scope of Resource Monitoring and Scheduling
    • Improved Scalability and Quality
  – Enables Proactive and Reactive Resource Monitoring, Acquisition to Improve Properties of Performance, Reliability, Stability, Security, Etc.

How are Virtual Grid Abstractions Defined?
• Top-down (Application => Virtualization)
• Bottom-up (Resource Properties – Individual & Aggregate => Virtualization)
  – Better Aggregate Resource Capabilities
  – Rapid Selection and Binding
• Attributes: Resource Properties, Communication Structure, Aggregates Over These Such as Reliability, Quality of Service (Into Picture)
[Figure 5. TeraGrid topology: sites SDSC, Caltech, NCSA, PSC, and ANL, each labeled 1 Gb/s, 10 Gb/s, 5 ms; TeraGrid backplane labeled 30 Gb/s, 20 ms. Other diagram labels: App, Resources.]

Application-Driven Example
• Dataset and Computations on Data Elements
• EOL: Genomic and Proteomic Databases
  – Annotation Pipeline Employs Myriad Applications and Heterogeneous Workers
  – Computations Operate on Parts of the Dataset and Compute New Elements of New Datasets

Resource-Driven Example
• TeraGrid: Five Clusters, Fast Network
  – Direct Access to Individual Clusters and Parts
  – Virtualized as Clusters (Multiple Choices) and as a Single Cluster
  – One/Part of a Cluster Dedicated as Data Servers
[Figure: TeraGrid topology (sites SDSC, Caltech, NCSA, PSC, ANL; per-site labels 1 Gb/s, 10 Gb/s, 5 ms; TeraGrid backplane 30 Gb/s, 20 ms) shown with three virtualized views: Singleton Clusters, Data Server Clusters, Single Cluster]
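The virtualized views named on this slide can be illustrated with a minimal Python sketch. It is not the VGrADS implementation. The link figures come from the slide; everything else is an assumption made for illustration: the names `Link`, `Cluster`, `path`, `single_cluster_view`, and `data_server_view` are hypothetical, the 1 Gb/s figure is read as the intra-site rate, and a site-to-site path is modeled as the bottleneck bandwidth and summed latency of uplink, backplane, uplink.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Link:
    gbps: float
    latency_ms: float


@dataclass(frozen=True)
class Cluster:
    name: str
    local_gbps: float  # assumption: the slide's 1 Gb/s figure is the intra-site rate
    uplink: Link       # site-to-backplane link (10 Gb/s, 5 ms on the slide)


# Figures taken from the slide's topology diagram.
BACKPLANE = Link(gbps=30.0, latency_ms=20.0)
SITES = {
    name: Cluster(name, local_gbps=1.0, uplink=Link(10.0, 5.0))
    for name in ("SDSC", "Caltech", "NCSA", "PSC", "ANL")
}


def path(a: str, b: str) -> Link:
    """Bottleneck bandwidth and summed latency between two sites (assumed model)."""
    hops = (SITES[a].uplink, BACKPLANE, SITES[b].uplink)
    return Link(min(h.gbps for h in hops), sum(h.latency_ms for h in hops))


def single_cluster_view(names) -> Link:
    """Virtualize several sites as one cluster, described by its worst-case pair."""
    pairs = [path(a, b) for a, b in combinations(names, 2)]
    return Link(min(p.gbps for p in pairs), max(p.latency_ms for p in pairs))


def data_server_view(data_site: str) -> dict:
    """Dedicate one site as data servers; the remaining sites act as workers."""
    return {
        "data_servers": [data_site],
        "workers": [n for n in SITES if n != data_site],
    }


print(path("SDSC", "NCSA"))
print(single_cluster_view(list(SITES)))
print(data_server_view("Caltech"))
```

Describing the single-cluster view by its worst-case pair is one conservative choice; a real abstraction might instead expose a distribution or a per-pair table.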

Example: Uniform Parallel Grid
• N Uniform Performance Nodes; Rich connectivity
• Range of “close approximations”

Heterogeneous Collection
• Oracle Grid: Database and Cluster of Workers
• Abstract Component Machine?

And many more…
• Cluster of clusters
• Big bag of processors
• Big bag of disks
• Humongous bag of processors
• Humongous bag of disks
• …

Virtual Grid Architecture
[Figure: layered architecture diagram; recoverable labels listed top to bottom]
• Application/PPS/Libraries
• VGrid
  – RSpec: (security, reliability, communication, performance, location, QoS, etc.)
  – Compute: (inst set, special operations, libraries, etc.)
  – Comm: (multicast, reduce, P2P, lambdas, etc.)
  – Dynamic RM: (add rsc, release, inquire, swap, overallocate, reserve, etc.)
• Abstractions: Uniform Cluster, Heterogeneous Cluster, …
• Generic or Custom Grid Services (Narrowed Scope): Rsc. Info/Perf Monitor, Rsc Access Selection, Checkpointing & Fault Toler., Schedule/Reschedule, Proactive Rsc Reserve/Bind
• Resource Classes (Classification, Composition, Virtualization): Secure Clusters, x86 Clusters, IA64 Clusters, Desktop Grid, …
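A few of the Dynamic RM operations listed on this slide (add rsc, release, inquire, swap) can be sketched as a small Python class to show how they might fit together. This is an illustration, not the VGrADS interface. The class and method names, the three-field `RSpec`, and the first-fit selection policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    isa: str      # instruction set, e.g. "x86" or "IA64"
    gbps: float   # network bandwidth available to the node


@dataclass(frozen=True)
class RSpec:
    """Hypothetical, much-reduced resource specification."""
    min_nodes: int
    isa: str
    min_gbps: float


class VirtualGrid:
    """Hypothetical virtual grid: a narrowed view of a larger resource pool."""

    def __init__(self, spec: RSpec, pool):
        self.spec = spec
        # Narrow the scope once: only matching nodes are tracked from here on.
        self._free = [n for n in pool
                      if n.isa == spec.isa and n.gbps >= spec.min_gbps]
        self._bound = []
        if len(self._free) < spec.min_nodes:
            raise ValueError("pool cannot satisfy the resource specification")
        for _ in range(spec.min_nodes):
            self.add()

    def add(self) -> Node:
        """Bind one more matching node (first fit)."""
        if not self._free:
            raise RuntimeError("no matching free resource")
        node = self._free.pop(0)
        self._bound.append(node)
        return node

    def release(self, node: Node) -> None:
        """Return a bound node to the free list."""
        self._bound.remove(node)
        self._free.append(node)

    def inquire(self):
        """Names of the nodes currently bound to this virtual grid."""
        return [n.name for n in self._bound]

    def swap(self, node: Node) -> Node:
        """Replace a bound node, e.g. after a failure or performance violation."""
        if node not in self._bound:
            raise ValueError("node is not bound to this virtual grid")
        replacement = self.add()
        self.release(node)
        return replacement


pool = [Node("a", "x86", 10.0), Node("b", "x86", 1.0), Node("c", "IA64", 10.0),
        Node("d", "x86", 10.0), Node("e", "x86", 10.0)]
vg = VirtualGrid(RSpec(min_nodes=2, isa="x86", min_gbps=10.0), pool)
print(vg.inquire())
```

Filtering the pool once at construction is the sketch's stand-in for the "Narrowed Scope" label: later operations only consider resources that already match the specification.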

Virtual Grid Runtime Challenges: Implement the Abstractions
• Custom Abstractions: Intelligence for Each
  – Scheduling, Composition, Proactive techniques
• Resource Characterization and Classification
• Monitoring and Detecting Failures and Performance Violations of VG Abstractions
• Rapid Rescheduling in Response to Failures and Performance Violations
• Scaling and Information and Decision Quality
• Others?

Resource Classes
• Characterization and Organization of Resources
• => Short and Long-term Monitoring and Analysis of Resources
• Many Open Questions
  – What are the Meaningful and Useful Resource Classes?
  – How do We Support Both a Large Scale of Resources and Refined Classification?
  – Is this a Multi-classification?
  – Is this Centralized or Decentralized or Both?
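To make the "multi-classification" question concrete, the following Python sketch lets one resource belong to several classes at once. The class labels are the ones that appear in the architecture diagram (Secure Clusters, x86 Clusters, IA64 Clusters, Desktop Grid). The resource attributes, the rule table `CLASS_RULES`, and the `classify` function are assumptions invented for illustration; the slides do not define any classification rules.

```python
from collections import defaultdict

# Hypothetical membership rules, one predicate per resource class.
CLASS_RULES = {
    "x86 Clusters": lambda r: r["kind"] == "cluster" and r["isa"] == "x86",
    "IA64 Clusters": lambda r: r["kind"] == "cluster" and r["isa"] == "IA64",
    "Secure Clusters": lambda r: r["kind"] == "cluster" and r["secure"],
    "Desktop Grid": lambda r: r["kind"] == "desktop",
}


def classify(resources):
    """Map each class label to the names of the resources that satisfy its rule.

    A resource may appear under several labels (multi-classification).
    """
    classes = defaultdict(list)
    for res in resources:
        for label, rule in CLASS_RULES.items():
            if rule(res):
                classes[label].append(res["name"])
    return dict(classes)


# Made-up sample resources, for illustration only.
resources = [
    {"name": "c1", "kind": "cluster", "isa": "x86", "secure": True},
    {"name": "c2", "kind": "cluster", "isa": "IA64", "secure": False},
    {"name": "d1", "kind": "desktop", "isa": "x86", "secure": False},
]
print(classify(resources))
```

This version is centralized and evaluates every rule against every resource; whether that scales, and whether classification should be decentralized, is exactly what the slide leaves open.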

Virtual Grids and GrADS: Conceptual
[Figure labels: GrADSoft vs. Virtual Grid GrADSoft]
• Broad, General-Purpose Model
vs.
• Narrow / Specialized Model per Abstraction

Virtual Grids and GrADS Architecture
• Small Set of Virtual Grid Abstractions = Performance Models = PPS View
  – Decouples the Optimization Problems
  – BUT, coordination on adaptation still required
• Customized Information Collection (per PPS view)
• Customized Resource Management / Scheduling (per PPS view)
• Customized monitoring (per PPS view)

Initial Steps and Activities
• Take Familiar and Important Application/Workloads and Explore Issues
  – What Type of Virtual Grid Might an Application Specify?
  – How Might We Exploit These Attributes for Better Selection/Scheduling, Etc.?
  – Initial Work on EOL and Speeding Critical Phases
• Take Typical Resource Configurations and Elicit Structure
  – What are Grid Resource Configurations?
  – What are Their Characteristics (Static, Dynamic)?
  – Do These Naturally Fall into a Structured Classification?
  – Can We Reduce the Scope Needed through the Virtual Grid Mechanism?
• Explore Future Grid Information Systems
  – What Information Can be Provided with What Resolution and Accuracy, Scaling?
  – Can Virtual Grid Improve and Organize the Quality of Information for Adaptation?
  – Techniques to Make the Gathering and Distribution of Different Types of Information More Efficient and Scalable

Runtime Deliverables
• September 2004, end Year 1
Virtual Grid:
  – Prototype Resource Virtualization and Abstraction Classes [V1]
  – Virtual Scheduling requirements study [V2]
Performance Provisioning:
  – Initial time-space reasoning for contracts and signatures [PP1]
Grid Economy:
  – Develop rudimentary simulation of VGrADS resource allocation mechanisms. [GE1]
  – Begin the exploration of Tatonnement, Smale's method, and Continuous-Price Double auctions using simulation. [GE2]
Fault Tolerance:
  – Experimental measurement of Grid & cluster reliability [FT1]
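For readers unfamiliar with the tâtonnement process named in deliverable GE2, here is a minimal textbook-style Python sketch for a single resource: the price is nudged in proportion to excess demand until the market clears. It is not the project's simulator. The linear demand curve, the fixed supply of 40 units, the step size, and the function names are all assumptions chosen so that the equilibrium is easy to check by hand (100 - 2p = 40 gives p = 30).

```python
def tatonnement(excess_demand, price, step=0.1, tol=1e-6, max_iters=10_000):
    """Adjust the price in proportion to excess demand until the market clears."""
    for _ in range(max_iters):
        z = excess_demand(price)
        if abs(z) < tol:
            return price
        # Raise the price when demand exceeds supply, lower it otherwise;
        # prices are kept non-negative.
        price = max(price + step * z, 0.0)
    raise RuntimeError("price adjustment did not converge")


SUPPLY = 40.0  # assumed fixed supply of the resource


def excess(price):
    """Assumed linear demand (100 - 2p) minus the fixed supply."""
    return (100.0 - 2.0 * price) - SUPPLY


equilibrium = tatonnement(excess, price=10.0)
print(round(equilibrium, 3))
```

With these assumed numbers each iteration shrinks the distance to the equilibrium price by a constant factor, so the loop converges. Convergence is not guaranteed in general, for example with a step size that is too large.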

Runtime Deliverables (cont.)
• September 2005, end Year 2
Virtual Grid:
  – Prototype Virtual Grid examples defined [V3]
  – Prototype virtual scheduler [V4]
Performance Provisioning:
  – Extended time-space reasoning for contracts and signatures [PP2]
Grid Economy:
  – Determine initial pricing conditions and pricing methods that prevent multiple equilibria. [GE3]
  – Verify stability results using simulation environment. [GE4]
Fault Tolerance:
  – Prototype fault tolerant library [FT2]

And beyond…
• September 2006, end Year 3
Virtual Grid:
  – Novel resource selection and virtual scheduling strategy experiments with application kernels on virtual grid environments [V5]
Performance Provisioning:
  – Limited tunable performance/fault-tolerance capabilities [PP3]
Grid Economy:
  – Begin designing experiments to test pricing techniques using VGrADS framework. [GE5]
  – Continue simulation experiments to evaluate resource allocation efficiency. [GE6]
Fault Tolerance:
  – Consider novel techniques [FT3]
• September 2007, end Year 4
• September 2008, end Year 5

Research Questions I
• How General and Precise a Description Language for Virtual Grid Abstractions do We NEED? Or do Application/PPS WANT?
• VGrid is an SOA, All Can Be Used Electively – No Layering
  – Can We Meaningfully Support Use of the System and Modification at Multiple Levels of Abstraction?
• What are the New Scheduling, Adaptation, Monitoring Capabilities and Opportunities of Virtual Grid? Can We Prove Properties Relative to Global Grid Views?
• What is a Minimal and Expressive Set of Resource Management Services for Virtual Grid Abstractions? Pass Appropriate Information to Allow Lower Level Optimization; Higher Level Control

Research Questions II
• Where Do Ideas of Transparent Fault-tolerance Fit? Embedded, For Example, in Many Reliable Virtual Grid Abstractions?
• Classification: What Are the Meaningful/Useful Resource Classes?
  – How Do We Support Both a Large Scale of Resources and Refined Classification?
  – Is This a Multi-classification? Is This Centralized or Decentralized or Both?
• How Do These Affect the Interfaces to the Program Preparation System?
  – What Interfaces Might be Preserved?
  – Move Towards a Limited Set of Descriptions, Conversion, Basic Resource Selection
  – SOA Based on Java or WS-Resource
  – Potentially a Separate Implementation of Each VG Abstraction; Significant Sharing Expected
• How Do They Affect the Functionality Needed in the Program Preparation System?
  – Incremental Adaptation? Checkpointing? Fault Tolerance?
