e-Science and Cyberinfrastructure
Tony Hey
Corporate Vice President, Technical Computing, Microsoft Corporation
[Title slide graphic labels: Multidisciplinary Research; Life Sciences; Earth Sciences; Social Sciences; Computer and Information Sciences; New Materials, Technologies and Processes]

Licklider’s Vision
“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job”
Larry Roberts – Principal Architect of the ARPANET

Physics and the Web
• Tim Berners-Lee developed the Web at CERN as a tool for exchanging information between the partners in physics collaborations
• The first Web Site in the USA was a link to the SLAC library catalogue
• It was the international particle physics community who first embraced the Web
  - ‘Killer’ application for the Internet
  - Transformed modern world – academia, business and leisure

Beyond the Web?
• Scientists developing collaboration technologies that go far beyond the capabilities of the Web
  - To use remote computing resources
  - To integrate, federate and analyse information from many disparate, distributed data resources
  - To access and control remote experimental equipment
• Capability to access, move, manipulate and mine data is the central requirement of these new collaborative science applications
  - Data held in file or database repositories
  - Data generated by accelerators or telescopes
  - Data gathered from mobile sensor networks

What is e-Science?
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’
John Taylor, Director General of Research Councils UK, Office of Science and Technology

The e-Science Vision
• e-Science is about multidisciplinary science and the technologies to support such distributed, collaborative scientific research
  - Many areas of science are in danger of being overwhelmed by a ‘data deluge’ from new high-throughput devices, sensor networks, satellite surveys …
  - Areas such as bioinformatics, genomics, drug design, engineering, healthcare … require collaboration between different domain experts
  - ‘e-Science’ is a shorthand for a set of technologies to support collaborative networked science

e-Science – Vision and Reality
Vision
• Oceanographic sensors – Project Neptune
  - Joint US-Canadian proposal
Reality
• Chemistry – The Comb-e-Chem Project
  - Annotation, Remote Facilities and e-Publishing

http://www.neptune.washington.edu/

Undersea Sensor Network Connected & Controllable Over the Internet

Data Provenance

Persistent Distributed Storage Visual Programming

Distributed Computation Interoperability & Legacy Support via Web Services

Searching & Visualization Live Documents Reputation & Influence

Reproducible Research

Collaboration

Handwriting

Interactive Data Dynamic Documents

The Comb-e-Chem Project Diffractometer Video Data Stream Automatic Annotation HPC Simulation Data Mining and Analysis Structures Database Combinatorial Chemistry Wet Lab National X-Ray Service Middleware

National Crystallographic Service
• Send sample material to NCS service
• Collaborate in e-Lab experiment and obtain structure
• Search materials database and predict properties using Grid computations
• Download full data on materials of interest

A digital lab book replacement that chemists were able to use, and liked

Monitoring laboratory experiments using a broker delivered over GPRS on a PDA

Crystallographic e-Prints
• Direct Access to Raw Data from scientific papers
• Raw data sets can be very large – stored at UK National Datastore using SRB software

eBank Project
[Diagram labels: Virtual Learning Environment; Undergraduate Students; Digital Library; E-Scientists; Reprints; Peer-Reviewed Journal & Conference Papers; Grid; Technical Reports; Preprints & Metadata; E-Experimentation; Publisher Holdings; Graduate Students; Institutional Archive; Local Web; Certified Experimental Results & Analyses; Data, Metadata & Ontologies]
Entire E-Science Cycle: encompassing experimentation, analysis, publication, research, learning

Support for e-Science
• Cyberinfrastructure and e-Infrastructure
  - In the US, Europe and Asia there is a common vision for the ‘cyberinfrastructure’ required to support the e-Science revolution
  - Set of Middleware Services supported on top of high bandwidth academic research networks
• Similar to vision of the Grid as a set of services that allows scientists – and industry – to routinely set up ‘Virtual Organizations’ for their research – or business
  - Many companies emphasize computing cycle aspect of Grids
  - The ‘Microsoft Grid’ vision is more about data management than about compute clusters

Six Key Elements for a Global Cyberinfrastructure for e-Science
1. High bandwidth Research Networks
2. Internationally agreed AAA Infrastructure
3. Development Centers for Open Standard Grid Middleware
4. Technologies and standards for Data Provenance, Curation and Preservation
5. Open access to Data and Publications via Interoperable Repositories
6. Discovery Services and Collaborative Tools

The Future: Hybrid Networks
• Standard packet routed production network for email, Web access, …
• User-controlled ‘lambda’ connections for e-Science applications requiring high performance end-to-end Quality of Service
• Similar strategies in North America, Europe and Asia
• The GLIF consortium

Towards an International Grid Infrastructure
[Map labels: Starlight (Chicago); SDSC; UK NGS; Leeds; Manchester; Netherlight (Amsterdam); Oxford; RAL; NCSA; US TeraGrid; PSC; UCL; UKLight; SC05]
[Legend: Computation; Steering clients; Network PoP; Service Registry; Local laptops in Seattle and UK]
All sites connected by production network (not all shown)

Research Prototypes to Production Quality Middleware?
• Research projects not funded to do testing, configuration and QA required to produce production quality middleware
• Common rule of thumb is that 10 times more effort is needed to re-engineer ‘proof of concept’ research software to production quality (Brooks – ‘Mythical Man Month’)
• Key issue for realizing a global e-Science Cyberinfrastructure is existence of robust, open standard, interoperable middleware
• OMII-UK, OMII-China, NMI, NAREGI, …

The Web Services ‘Magic Bullet’
[Diagram: Company A (J2EE), Company B (ONE) and Company C (.Net) connected through Web Services]

Open Grid Services
• Development of Web Services
• Require small set of useful ‘Grid Services’ supported by IT industry
  - Execution Management, Data Access, Resource Management, Security, Information Management
• Research community experimenting with higher level services
  - Workflow, Transactions, Data Mining, Knowledge Discovery …
• Exploit Synergy
  - Use Commercial Development Environment and Tooling to build robust Grid Services

Digital Curation?
• In 20 years can guarantee that the operating system and spreadsheet program and the hardware used to store data will not exist
  - Need research ‘curation’ technologies such as workflow, provenance and preservation
  - Need to liaise closely with individual research communities, data archives and libraries
• The UK has set up the ‘Digital Curation Centre’ in Edinburgh with Glasgow, UKOLN and CCLRC
  - Attempt to bring together skills of scientists, computer scientists and librarians

Digital Curation Centre
• Actions needed to maintain and utilise digital data and research results over entire life-cycle
  - For current and future generations of users
• Digital Preservation
  - Long-run technological/legal accessibility and usability
• Data curation in science
  - Maintenance of body of trusted data to represent current state of knowledge
• Research in tools and technologies
  - Integration, annotation, provenance, metadata, security …

Computational Modeling Persistent Distributed Data Workflow, Data Mining & Algorithms Interpretation & Insight Real-world Data

Technical Computing in Microsoft
• Radical Computing
  - Research in potential breakthrough technologies
• Advanced Computing for Science and Engineering
  - Application of new algorithms, tools and technologies to scientific and engineering problems
• High Performance Computing
  - Application of high performance clusters and database technologies to industrial applications

New Science Paradigms
• Thousand years ago: Experimental Science – description of natural phenomena
• Last few hundred years: Theoretical Science – Newton’s Laws, Maxwell’s Equations …
• Last few decades: Computational Science – simulation of complex phenomena
• Today: e-Science or Data-centric Science – unify theory, experiment, and simulation – using data exploration and data mining
  - Data captured by instruments
  - Data generated by simulations
  - Processed by software
  - Scientist analyzes databases/files
(With thanks to Jim Gray)

Advanced Computing for Science and Engineering
TOOLS: Workflow, Collaboration, Visualization, Data Mining
DATA: Acquisition, Storage, Annotation, Provenance, Curation, Preservation
CONTENT: Scholarly Communication, Institutional Repositories

Key Issues for e-Science
• Workflows and Tools
  - The LEAD Project
  - The DiscoveryNet Project
• The Data Chain
  - From Acquisition to Preservation
• Scholarly Communication
  - Open Access to Data and Publications

The LEAD Project Better predictions for Mesoscale weather

The LEAD Vision
[Diagram labels: DYNAMIC OBSERVATIONS; Analysis/Assimilation (Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields); Prediction/Detection (PCs to Teraflop Systems); Product Generation, Display, Dissemination; End Users (NWS, Private Companies, Students); Models and Algorithms Driving Sensors]
The CS challenge: Build a virtual “eScience” laboratory to support experimentation and education leading to this vision.

Composing LEAD Services
Need to construct workflows that are:
• Data Driven
  - The weather input stream defines the nature of the computation
• Persistent and Agile
  - An agent mines a data stream and notices an “interesting” feature. This event may trigger a workflow scenario that has been waiting for months
• Adaptive
  - The weather changes
  - Workflow may have to change on-the-fly
  - Resources
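
The data-driven and persistent properties listed on this slide can be illustrated with a small sketch. It is not LEAD code: the `Workflow` class, the `mine_stream` agent, the step names and the wind-speed threshold are all hypothetical, and the adaptive property (changing a running workflow on-the-fly) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """Hypothetical dormant workflow that waits for a triggering observation."""
    name: str
    trigger: Callable[[dict], bool]      # predicate applied to one observation
    steps: List[str]                     # illustrative step names only
    runs: List[dict] = field(default_factory=list)

    def offer(self, observation: dict) -> bool:
        # Record a run only when the observation satisfies the trigger.
        if self.trigger(observation):
            self.runs.append(observation)
            return True
        return False


def mine_stream(stream, workflows):
    """Agent: scan each observation and start every workflow it triggers."""
    started = []
    for obs in stream:
        for wf in workflows:
            if wf.offer(obs):
                started.append((wf.name, obs["time"]))
    return started


storm_watch = Workflow(
    name="storm-forecast",
    trigger=lambda obs: obs["wind_speed_ms"] > 30.0,  # assumed threshold
    steps=["assimilate", "forecast", "visualize"],
)
stream = [
    {"time": 0, "wind_speed_ms": 8.0},
    {"time": 1, "wind_speed_ms": 34.5},
    {"time": 2, "wind_speed_ms": 12.0},
]
started = mine_stream(stream, [storm_watch])
print(started)
```

Here the workflow stays registered but idle until the second observation arrives, which is the sense in which the input stream, rather than a fixed schedule, decides what gets computed.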

Example LEAD Workflow

and DiscoveryNet
• Constructing a ubiquitous workflow: by scientists
  - Integrate information resources and applications cross-domain
  - Capture the best practice of your scientific research
• Warehousing workflows: for scientists
  - Manage discovery processes within an organisation
  - Construct an enterprise process knowledge bank
• Deployment workflow: to scientists
  - Turn workflows into reusable applications/services
  - Turn every scientist into a solution builder

An Integrative Analysis Example
[Figure. Recoverable panel labels: relational data mining; decision tree model of metabonomic profile; visualizing serial/spectrum data; text mining; visualizing cluster statistics; spectrum data mining; visualization of chemical structure, sequence, pathway, relational and multidimensional data; data clusters]

The Data Deluge
• In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
• Some normalizations:
  - The Bible = 5 Megabytes
  - Annual refereed papers = 1 Terabyte
  - Library of Congress = 20 Terabytes
  - Internet Archive (1996 – 2002) = 100 Terabytes
• In many fields new high throughput devices, sensors and surveys will be producing Petabytes of scientific data
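
To put the normalizations above on a common scale, the short sketch below converts each quoted size to bytes and reports how many copies would fit in one petabyte. Decimal prefixes (1 Terabyte = 10^12 bytes) are an assumption, since the slide does not say which convention it uses.

```python
# Sizes quoted on the slide, converted to bytes (decimal prefixes assumed).
MB, TB, PB = 10**6, 10**12, 10**15

sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}


def copies_per_petabyte(size_bytes: int) -> float:
    """How many items of this size fit in one petabyte."""
    return PB / size_bytes


for name, size in sizes.items():
    print(f"{name}: {copies_per_petabyte(size):,.0f} per petabyte")
```

On these figures a single petabyte corresponds to 50 Libraries of Congress or 10 copies of the 1996 – 2002 Internet Archive.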

The Problem for the e-Scientist
[Diagram: Experiments & Instruments, Other Archives, Literature and Simulations each supply facts; the e-Scientist poses questions and receives answers]
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist & cooperate with others?
• Data Query and Visualization tools
• Support/training
• Performance
  - Execute queries in a minute
  - Batch (big) query scheduling

The e-Science Data Chain
• Data Acquisition
• Data Ingest
• Metadata
• Annotation
• Provenance
• Data Storage
• Curation
• Preservation

Berlin Declaration 2003
• ‘To promote the Internet as a functional instrument for a global scientific knowledge base and for human reflection’
• Defines open access contributions as including:
  - ‘original scientific research results, raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material’

NSF ‘Atkins’ Report on Cyberinfrastructure
• ‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’
• ‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’

Publishing Data & Analysis Is Changing

Roles       Traditional   Emerging
Authors     Scientists    Collaborations
Publishers  Journals      Project web site
Curators    Libraries     Data+Doc Archives
                          Digital Archives
Consumers   Scientists

Data Publishing: The Background
• In some areas – notably biology – databases are replacing (paper) publications as a medium of communication
  - These databases are built and maintained with a great deal of human effort
  - They often do not contain source experimental data, sometimes just annotation/metadata
  - They borrow extensively from, and refer to, other databases
  - You are now judged by your databases as well as your (paper) publications
  - Upwards of 1000 (public databases) in genetics

Data Publishing: The issues
• Data integration
  - Tying together data from various sources
• Annotation
  - Adding comments/observations to existing data
  - Becoming a new form of communication
• Provenance
  - ‘Where did this data come from?’
• Exporting/publishing in agreed formats
  - To other programs as well as people
• Security
  - Specifying/enforcing read/write access to parts of your data
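
Three of the issues on this slide (integration, annotation and provenance) can be seen together in a toy data structure. The sketch below is only an illustration under stated assumptions: the `Dataset` record, the `integrate` and `lineage` helpers and the dataset names are hypothetical and do not correspond to any system named in the talk.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record: a named dataset plus where it came from."""
    name: str
    sources: Tuple["Dataset", ...] = ()   # provenance: direct inputs
    annotations: Tuple[str, ...] = ()     # comments/observations added later

    def annotate(self, note: str) -> "Dataset":
        # Return a new record; the original stays unchanged.
        return replace(self, annotations=self.annotations + (note,))


def integrate(name: str, *sources: Dataset) -> Dataset:
    """Data integration: tie several sources together under a new name."""
    return Dataset(name=name, sources=tuple(sources))


def lineage(ds: Dataset) -> List[str]:
    """Answer 'where did this data come from?' as a depth-first list of names."""
    names: List[str] = []
    for src in ds.sources:
        names.append(src.name)
        names.extend(lineage(src))
    return names


raw_a = Dataset("sequences_lab_a")
raw_b = Dataset("sequences_lab_b")
merged = integrate("merged_sequences", raw_a, raw_b).annotate("checked for duplicates")
print(lineage(merged))
print(merged.annotations)
```

Keeping the records immutable means an annotation never silently alters a dataset that other records already cite as a source, which is one simple way to keep provenance trustworthy.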

Interoperable Repositories?
• Paul Ginsparg’s arXiv at Cornell has demonstrated new model of scientific publishing
  - Electronic version of ‘preprints’ hosted on the Web
• David Lipman of the NIH National Library of Medicine has developed PubMedCentral as repository for NIH funded research papers
  - Microsoft funded development of ‘portable PMC’ now being deployed in UK and other countries
• Stevan Harnad’s ‘self-archiving’ EPrints project in Southampton provides a basis for OAI-compliant ‘Institutional Repositories’
  - Many national initiatives around the world moving towards mandating deposition of ‘full text’ of publicly funded research papers in repositories
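
‘OAI-compliant’ in practice usually means the repository answers OAI-PMH harvesting requests, which are plain HTTP queries. The sketch below only builds such a request URL and does not contact any server; the base URL is a placeholder, and the helper name `list_records_url` is an assumption made for illustration. The `ListRecords` verb, the `metadataPrefix`, `from` and `set` arguments and the `oai_dc` format come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec     # restrict to one repository-defined set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real repository.
url = list_records_url("https://repository.example.org/oai", from_date="2005-01-01")
print(url)
```

Because every compliant repository accepts the same verbs and arguments, one harvester can collect metadata from many institutional archives, which is what makes the repositories interoperable.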

Microsoft Strategy for e-Science
Microsoft intends to work with the scientific and library communities:
  - to define open standard and/or interoperable high-level services, work flows and tools
  - to assist the community in developing open scholarly communication and interoperable repositories

© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

Acknowledgements With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale