e-Science and Grid: The VL-e approach
L. O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group, Department of Computer Science, Universiteit van Amsterdam
bob@science.uva.nl

Content
• Developments in Grid
• Developments in e-Science
  – Objectives
• Virtual Lab for e-Science
  – Research philosophy
• Conclusions

Background: ICT-push developments
• Processing power doubles every 18 months
• Memory size doubles every 12 months
• Network speed doubles every 9 months
• Something has to be done to harness this development
  – Virtualization of ICT resources
    – Internet
    – Web
    – Grid
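
The doubling periods above can be turned into growth factors with a small calculation. The sketch below is only an illustration: the function names are made up for this example, and it assumes the doubling periods stay constant over the whole interval.

```python
# Doubling periods in months, as quoted on the slide.
DOUBLING_MONTHS = {"processing": 18, "memory": 12, "network": 9}


def growth_factor(doubling_months: float, years: float) -> float:
    """Multiplicative growth after `years` for a fixed doubling period."""
    return 2.0 ** (years * 12.0 / doubling_months)


def growth_table(years: float) -> dict:
    """Growth factor per resource after `years` (constant doubling periods assumed)."""
    return {name: growth_factor(months, years)
            for name, months in DOUBLING_MONTHS.items()}


if __name__ == "__main__":
    for name, factor in growth_table(3).items():
        print(f"{name}: x{factor:.0f} after 3 years")
```

Under these assumptions, three years give a factor of 4 for processing power, 8 for memory and 16 for network speed, so network capacity pulls ahead of the other two.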

Internal versus external bandwidth (figure; axis in Mbit/s; series: computer buses, networks)

Web & Grid & Web/Grid Services
• Web services are a paradigm for using and accessing information
  – Web resources are mostly human-centered (understood by humans; computers can read them but cannot understand them)
• Grid is about accessing & sharing computing resources through virtualization
  – Data & information repositories
  – Experimental facilities
• OGSA: service-oriented Grid standard based on Web services

Web & Grid & Semantic Web/Grid
• Semantic Web resources should also be understandable by computers
  – This way, many complex tasks formulated and assigned by humans can be automated and executed by agents working on the Semantic Web
• But both Grid and Web Services focus only on single services
• Semantic Web/Grid should be able to describe a single service as well as the relationships between services, e.g. an aggregate service that uses a number of services, thereby making knowledge explicit

Levels of Grid abstraction
• Knowledge Web/Grid
• Information Web/Grid
• Data Grid
• Computational Grid

Background information: experimental sciences
• There is a tendency to look ever deeper into:
  – Matter, e.g. physics
  – Universe, e.g. astronomy
  – Life, e.g. life sciences
• Therefore experiments become increasingly complex
• Instrumental consequences are increases in detector:
  – Resolution & sensitivity
  – Automation & robotization
• This results, for instance in the life sciences, in:
  – So-called high-throughput methods
  – Omics experimentation
    – genome ===> genomics

New technologies in Life Sciences research
Cell component → Methodology/Technology
• DNA → Genomics
• RNA → Transcriptomics
• Protein → Proteomics
• Metabolites → Metabolomics

Paradigm shift in Life science
• Past experiments were hypothesis driven
  – Evaluate hypothesis
  – Complement existing knowledge
• Present experiments are data driven
  – Discover knowledge from large amounts of data
    – Apply statistical techniques

Background information: experimental sciences
• Experiments become increasingly complex
  – Driven by detector developments
    – Resolution increases
    – Automation & robotization increases
• This results in an increase in the amount and complexity of data

The Application data crisis
• Scientific experiments start to generate lots of data
  – Medical imaging (fMRI): ~1 GByte per measurement (day)
  – Bio-informatics queries: 500 GByte per database
  – Satellite world imagery: ~5 TByte/year
  – Current particle physics: 1 PByte per year
  – LHC physics (2007): 10-30 PByte per year
• Data is often very distributed
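
To get a feeling for these volumes, one can estimate how long it takes to move a year of data over a single network link. The sketch below is a back-of-the-envelope calculation, not a measurement: the 1 Gbit/s link speed, the decimal units (1 PByte = 1e15 bytes), the absence of protocol overhead and the function name are all assumptions made for this example.

```python
SECONDS_PER_DAY = 86_400

# Yearly volumes from the slide, in bytes (decimal units assumed).
YEARLY_VOLUME_BYTES = {
    "satellite world imagery": 5e12,
    "current particle physics": 1e15,
    "LHC physics (low estimate)": 10e15,
    "LHC physics (high estimate)": 30e15,
}


def transfer_days(volume_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move `volume_bytes` over a fully used link."""
    return volume_bytes * 8 / link_bits_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    link = 1e9  # assumed 1 Gbit/s link, protocol overhead ignored
    for name, volume in YEARLY_VOLUME_BYTES.items():
        days = transfer_days(volume, link)
        print(f"{name}: {days:.1f} days to move one year of data")
```

Under these assumptions 1 PByte needs roughly 93 days, and the low LHC estimate of 10 PByte per year needs more than a year on one such link, which is one reason why the data ends up distributed.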

Background information: experimental sciences
• Experiments become increasingly complex
  – Driven by detector developments
    – Resolution increases
    – Automation & robotization increases
• This results in an increase in the amount and complexity of data
• Something has to be done to harness this development
  – Virtualization of experimental resources: e-Science

The what of e-Science
• e-Science is the application domain “Science” of Grid & Web
  – More than only coping with the data explosion
  – A multi-disciplinary activity combining the human expertise & knowledge of:
    – A particular domain scientist
    – An ICT scientist
• e-Science demands a different approach to experimentation because the computer is an integrated part of the experiment
  – The consequence is a radical change in design for experimentation
• e-Science should apply and integrate Web/Grid methods wherever and whenever possible

Grid and Web Services Convergence (Ref: Foster)
• Grid and Web started far apart in applications and technology
  – Grid: GT1, GT2, OGSI
  – Web: HTTP, WSDL, WS-*, WSDL 2, WSDM
• They have been converging, on WSRF
• The definition of the Web Service Resource Framework (WSRF) makes an explicit distinction between the “service” and the stateful entities acted upon by the service, i.e. the resources
• This means that the Grid and Web communities can move forward on a common base
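
The distinction between a service and the stateful resources it acts upon can be illustrated with a toy example. The sketch below is not the WSRF API; it is plain Python with hypothetical names (`JobResource`, `JobService`), and only shows the pattern: the service holds operations, the state lives in separately addressable resources.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobResource:
    """Stateful entity: data that outlives any single request."""
    status: str = "submitted"
    history: list = field(default_factory=list)


class JobService:
    """Service: operations only; state lives in resources looked up by key."""

    def __init__(self, store: dict):
        self._store = store  # resource key -> JobResource

    def create(self) -> str:
        """Create a new resource and return the key that identifies it."""
        key = str(uuid.uuid4())
        self._store[key] = JobResource()
        return key

    def advance(self, key: str, new_status: str) -> None:
        """Change the state of the resource identified by `key`."""
        resource = self._store[key]
        resource.history.append((resource.status, new_status))
        resource.status = new_status

    def status(self, key: str) -> str:
        return self._store[key].status
```

Because the service object itself keeps no job state, two service instances that share the same store can serve the same resource interchangeably, for example `a.create()` followed by `b.advance(key, "running")`.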

e-Science Objectives
• It should enhance the scientific process by:
• Stimulating collaboration by sharing data & information
  – Result is re-use of data & information

The data sharing potential for Cognition
• Collaborative scientific research
  – Information sharing
  – Metadata modeling
• Allows for experiment validation
  – Independent confirmation of results
• Statistical methodologies
  – Access to large collections of data and metadata
• Training
  – Train the next generation using peer-reviewed publications and the associated data

Electron tomography data pipeline (figure; stage labels): Acquisition, Interpretation, Alignment, Reconstruction, Segmentation

Cell Centred Database, NCMIR (San Diego)
• Maryann Martone and Mark Ellisman

e-Science Objectives
• It should enhance the scientific process by:
• Stimulating collaboration by sharing data & information
  – Improve re-use of data & information
• Combining data and information from different modalities
  – Sensor data & information fusion

e-Science Objectives
• It should enhance the scientific process by:
• Stimulating collaboration by sharing data & information
  – Improve re-use of data & information
  – Combining data and information from different modalities
    – Sensor data & information fusion
• Realizing the combination of real-life & (model-based) simulation experiments

Simulated Vascular Reconstruction in a Virtual Operating Theatre
• An example of the combination of real-life & (model-based) simulation experiments
• Patient-specific vascular geometry (from CT)
• Segmentation
• Blood flow simulation (Lattice Boltzmann)
• Pre-operative planning (interaction)
• Suitable for parallelization through functional decomposition
Figure labels: patient’s vascular geometry (CTA); simulated “Fem-Fem” bypass; Grid resources

e-Science Objectives
• It should enhance the scientific process by:
• Stimulating collaboration by sharing data & information
  – Improve re-use of data & information
  – Combining data and information from different modalities
    – Sensor data & information fusion
• Realizing the combination of real-life & (model-based) simulation experiments
• Modeling of dynamic systems

Bird behaviour in relation to weather and landscape
Figure labels: RADAR; calibration and data assimilation; dynamic bird behaviour MODELS; bird distributions; ensembles; predictions and on-line warnings

e-Science Objectives
• It should result in:
• Computer-aided support for rapid prototyping of ideas
  – Stimulate the creativity process
• It should realize that by creating & applying:
  – New ICT methodologies and a computing infrastructure stimulating this
• From this ICT point of view it should support the following application steps:
  – Design
  – Development & realization
  – Execution
  – Analysis & interpretation
• We try to realize e-Science and its applications via the Virtual Lab for e-Science (VL-e) project

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research & the development of related ICT infrastructure
• Generic application support
  – Application cases are drivers for computer & computational science and engineering research

VL-e project (figure)
• Application domains: Medical Diagnosis & Imaging; BioDiversity; BioInformatics; Data Intensive Science/LOFAR; Food Informatics; Dutch Telescience
• VL-e: Application Oriented Services; Management of comm. & computing
• Grid Services: harness multi-domain distributed resources

Two sides of Bioinformatics as an e-Science
• The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledge
• The technological responsibility to manage and integrate huge amounts of heterogeneous data sources from high-throughput experimentation

Role of bioinformatics (figure)
• Cell components and technologies: DNA (Genomics), RNA (Transcriptomics), protein (Proteomics), metabolites (Metabolomics)
• Integrative/System Biology
• Bioinformatics methodology: data generation/validation; data integration/fusion; data usage/user interfacing

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  – Application cases are drivers for computer & computational science and engineering research
  – Problem solving partly generic and partly specific
  – Re-use of components via generic solutions whenever possible

Figure: application stacks on top of the Virtual Laboratory
• Each application stack: Application Specific Part; Potential Generic part; Management of comm. & computing
• Virtual Laboratory: Application Oriented Services
• Grid/Web Services: harness multi-domain distributed resources
• Application pull

Generic e-Science aspects
• Virtual Reality Visualization & user interfaces
• Modeling & Simulation
  – Interactive Problem Solving
• Data & information management
  – Data modeling
  – Dynamic work flow management
• Content (knowledge) management
  – Semantic aspects
  – Meta data modeling
    – Ontologies
• Wrapper technology
• Design for Experimentation

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  – Application cases are drivers for computer & computational science and engineering research
  – Problem solving partly generic and partly specific
  – Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
  – Reproducible & comparable

Issues for a reproducible scientific experiment
• Flow: experiment → acquisition → raw data → processing → processed data → presentation → interpretation
• Acquisition: sensors, amplifiers, imaging devices, …
  – Metadata: parameter settings, calibrations, protocols, …
• Processing: conversion, filtering, analyses, simulation, …
  – Metadata: parameters/settings, algorithms, intermediate results, …
• Presentation: visualization, animation, interactive exploration, …
  – Metadata: software packages, algorithms, …
• Rationalization of the experiment and processes via protocols
• Much of this is lost when an experiment is completed.
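
One way to keep such metadata from being lost is to record it per step in a structured form. The sketch below is only an illustration and not part of VL-e: the `StepRecord` fields and the `fingerprint` helper are hypothetical names, and the record assumes all parameter values can be serialized as JSON.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """Metadata for one step of an experiment (hypothetical layout)."""
    stage: str        # e.g. "acquisition", "processing", "presentation"
    tool: str         # device or software package used
    parameters: dict  # parameter settings, calibrations, ...
    inputs: list = field(default_factory=list)   # identifiers of input data
    outputs: list = field(default_factory=list)  # identifiers of produced data


def fingerprint(record: StepRecord) -> str:
    """Stable hash of a step description, independent of dict key order."""
    payload = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs whose records have the same fingerprint used the same tool, settings and data identifiers, which gives a simple check for "reproducible & comparable"; any change in a parameter changes the fingerprint.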

Scientific experiments & e-Science
• Step 1: designing an experiment
• Step 2: performing the experiment
• Step 3: analyzing the experiment results
(figure label: success)
For complex experiments, the steps:
  – contain complex processes
  – require interdisciplinary expertise
  – need large-scale resources
Hence: Grid & high-level support

Components in a VL-e experiment
• Process-Flow Templates (PFT)
  – Derived from ontologies
  – Graphical representation of data elements and processing steps in an experimental procedure
  – Information to support context-sensitive assistance (semantics)
• Study
  – Descriptions of experimental steps represented as an instance of a PFT with references to experiment topologies
• Experiment Topology
  – Graphical representation of self-contained data processing modules attached to each other in a workflow
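
The relation between the three components can be sketched as a small data model. This is not the VL-e implementation: the class and field names below are hypothetical and only mirror the descriptions above (a study is an instance of a template and refers to topologies).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ProcessFlowTemplate:
    """Hypothetical PFT: the ordered steps of an experimental procedure."""
    name: str
    steps: tuple


@dataclass
class ExperimentTopology:
    """Hypothetical topology: processing modules and the links between them."""
    name: str
    modules: list
    connections: list  # (upstream_module, downstream_module) pairs


@dataclass
class Study:
    """Instance of a PFT: step descriptions plus references to topologies."""
    template: ProcessFlowTemplate
    descriptions: dict = field(default_factory=dict)  # step -> description
    topologies: dict = field(default_factory=dict)    # step -> topology

    def describe(self, step: str, text: str,
                 topology: Optional[ExperimentTopology] = None) -> None:
        """Describe one step; only steps defined by the template are allowed."""
        if step not in self.template.steps:
            raise ValueError(
                f"{step!r} is not a step of template {self.template.name!r}")
        self.descriptions[step] = text
        if topology is not None:
            self.topologies[step] = topology

    def missing_steps(self) -> list:
        """Template steps that have not been described yet, in template order."""
        return [s for s in self.template.steps if s not in self.descriptions]
```

In this sketch the template constrains what a study may contain, which is one simple way to obtain the context-sensitive assistance mentioned above: `missing_steps()` tells the user what is still to be filled in.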

e-Science environment
• Step 1: designing an experiment
• Step 2: performing the experiment
• Step 3: analyzing the experiment results
(figure label: success)
• Scientific Workflow Management Systems Taverna, Kepler, and Triana: model processes for computing tasks
• Process Flow Template (PFT) and PFT instances in VLAMG
• In VL-e: model both computing tasks and human-activity-based processes, and model them from the perspective of an entire lifecycle. It tries to support:
  – Collaboration in different stages
  – Information sharing
  – Reuse of experiments

Definition of experiment protocols
• Workflow definitions: recreate complex experiments into process flows
• Workflow execution: maintain control over the experiment
• Data processing execution: topologies of data processing modules
• Interpretation: visualization of processed results to help intuition
Ontology definitions can help in obtaining a well-structured definition of experiment, data and metadata.
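
Executing a topology of data processing modules amounts to running each module after the modules it depends on. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9 or later); it is not how VL-e or VLAMG execute workflows, and `run_topology` and its argument layout are assumptions made for this example.

```python
from graphlib import TopologicalSorter


def run_topology(modules: dict, connections: list, source_data):
    """Run every module once, in dependency order.

    modules:     name -> callable taking a list of inputs, returning one output
    connections: (upstream, downstream) pairs between module names
    source_data: handed to modules that have no upstream module
    """
    predecessors = {name: [] for name in modules}
    for upstream, downstream in connections:
        predecessors[downstream].append(upstream)

    results = {}
    # static_order() yields each module after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [results[p] for p in predecessors[name]] or [source_data]
        results[name] = modules[name](inputs)
    return results


if __name__ == "__main__":
    modules = {
        "load": lambda inputs: inputs[0],
        "double": lambda inputs: [2 * v for v in inputs[0]],
        "total": lambda inputs: sum(inputs[0]),
    }
    connections = [("load", "double"), ("double", "total")]
    print(run_topology(modules, connections, [1, 2, 3])["total"])
```

Keeping the intermediate results of every module, as `results` does here, is also what makes it possible to inspect and visualize processed data at each stage.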

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  – Application cases are drivers for computer & computational science and engineering research
  – Problem solving partly generic and partly specific
  – Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
  – Reproducible & comparable
• Two research experimentation environments
  – Proof of concept for application experimentation
  – Rapid prototyping for computer & computational science experimentation

The VL-e infrastructure (figure)
• Layers: application-specific services (Application); potential generic services & Virtual Lab. services; Grid & Network Services
• Applications: Telescience; Medical Application; Bio Informatics Applications
• VL-e Proof of Concept Environment: Virtual Laboratory; Grid Middleware; Surfnet
• VL-e Certification Environment: Test & Cert. VL-software; Test & Cert. Grid Middleware; Test & Cert. Compatibility
• VL-e Experimental Environment: Virtual Lab. rapid prototyping (interactive simulation); Additional Grid Services (OGSA services); Network Service (lambda networking)

Infrastructure for Applications
• Applications are a driving force of the PoC
• Experience shows applications value stability
• Foster two-way interaction to make this happen

e-Science environment
• Step 1: designing an experiment
• Step 2: performing the experiment
• Step 3: analyzing the experiment results
(figure label: success)
• Research activity: to be developed in the Rapid Prototyping environment
• Stable developments: to be used in VL-e in the Proof of Concept environment
• Scientific Workflow Management Systems (SWMS): Taverna, Kepler, Triana and VLAM
• PFT

VL-e PoC environment
• Latest certified stable software environment of core grid and VL-e services
• Core infrastructure built around clusters and storage at SARA and NIKHEF (‘production’ quality)
• Controlled extension to other platforms and distributions
• On the user end: install needed servers: user interface systems, storage elements for data disclosure, grid-secured DB access
• Focus on stability and scalability

Hosted services for VL-e
• Key services and resources are offered centrally for all applications in VL-e
• Mass data and number crunching on the large resources at SARA
• Storage for data replication & distribution
• Persistent ‘strategic’ storage on tape
• Resource brokers, resource discovery, user group management

Why such a complex scheme?
• “Software is part of the infrastructure”
• Stability of core software is needed to develop the new scientific applications
• Enable distributed systems management (who runs what version when?)
“The grid is one big error amplifier”
“Computers make mistakes like humans, only much, much faster”

What did we learn
• It is not enough to just transport current applications to Grid or e-Science infrastructures
• To fully exploit the potential of e-Science infrastructures one has to learn what is possible
• Therefore the full lifecycle of an experiment has to be taken into account
  – Workflow management is a first step
  – We add semantic information via Process Flow Templates
• Application innovation, for instance biobanking, should be the aim
• Grids should be transparent for the end-user

Conclusions
• e-Science is a lot more than trying to cope with the data explosion alone
• Implementation of e-Science systems requires further rationalization and standardization of the experimentation process
• e-Science success demands the realization of an environment allowing
  – application-driven experimentation &
  – rapid dissemination of feedback on these new methods
• We try to do that via the development of a Proof of Concept based on Grid

Electron tomography data pipeline development (figure)
• SARA, Amsterdam; 1 Gbit/s link
• TOM (Matlab): acquisition and data storage with the SRB
• Grid computing: 3D reconstruction refinements, currently for EMAN and later also for TOM (Matlab)
• SRB storage: with the CCDB database for retrieval of experimental data sets