Data-Intensive Computing at NSF
Jeannette M. Wing
Assistant Director, Computer and Information Science and Engineering Directorate
Thanks to the NSF team: Dan Atkins, Debbie Crawford, Haym Hirsh, Jim French, Stephen Meacham, …
Corporate Alliance, June 18, 2008

Science Story

How Much Data?
• NOAA has ~1 PB climate data (2007)
• Wayback Machine has ~2 PB (2006)
• CERN’s LHC will generate 15 PB a year (2008)
• HP building Wal-Mart a 4 PB data warehouse (2007)
• Google processes 20 PB a day (2008)
• “all words ever spoken by human beings” ~ 5 EB
• Int’l Data Corp predicts 1.8 ZB of digital data by 2011
640 K ought to be enough for anybody.
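The figures on this slide mix units and time spans, so a small conversion helps compare them. The following is a rough sketch only: it assumes decimal SI prefixes (1 PB = 10^15 bytes), which the slide does not specify, and the function names are made up for illustration.

```python
# Decimal (SI) prefixes are an assumption; the slide does not say whether
# "PB" means 10**15 or 2**50 bytes.
UNIT_BYTES = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}
SECONDS_PER_DAY = 24 * 60 * 60


def to_bytes(value, unit):
    """Convert a (value, unit) pair such as (20, "PB") to bytes."""
    return value * UNIT_BYTES[unit]


def daily_volume_as_gb_per_second(value, unit):
    """Express a per-day data volume as a sustained rate in GB/s."""
    return to_bytes(value, unit) / SECONDS_PER_DAY / 10**9


def ratio(value_a, unit_a, value_b, unit_b):
    """How many times larger quantity A is than quantity B."""
    return to_bytes(value_a, unit_a) / to_bytes(value_b, unit_b)


if __name__ == "__main__":
    # Google's 20 PB a day as a sustained rate.
    print(f"{daily_volume_as_gb_per_second(20, 'PB'):.1f} GB/s")
    # IDC's 1.8 ZB forecast relative to the ~5 EB "all words ever spoken".
    print(f"{ratio(1.8, 'ZB', 5, 'EB'):.0f}x")
```

Under these assumptions, 20 PB a day works out to roughly 231 GB every second, and the 1.8 ZB forecast is about 360 times the "all words ever spoken" estimate.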

Convergence in Trends
• Drowning in data
• Data-driven approach in computer science research
  – graphics, animation, language translation, search, …, computational biology
• Cheap storage
  – Seagate Barracuda 1 TB hard drive for $195
• Growth in huge data centers
• Open Source “MapReduce” programming model

Divide and Conquer
[Diagram: “Work” is partitioned into w1, w2, w3; each piece is handled by a “worker”, producing r1, r2, r3; these are combined into the “Result”. Labels: Partition, Master, Combine.]
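The partition / worker / combine flow in this diagram can be sketched in a few lines of Python. This is only an illustration of the pattern, not Hadoop's or Google's API: the word-count task, the function names (`partition`, `worker`, `combine`, `master`), and the use of local threads in place of cluster machines are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_pieces):
    """Split the list `work` into at most n_pieces slices (w1, w2, ...)."""
    size = max(1, -(-len(work) // n_pieces))  # ceiling division
    return [work[i:i + size] for i in range(0, len(work), size)]


def worker(piece):
    """Process one piece independently: count the words in its lines."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts  # a partial result (r1, r2, ...)


def combine(results):
    """Merge the partial results into the final result."""
    total = Counter()
    for partial in results:
        total.update(partial)
    return total


def master(work, n_workers=3):
    """Partition the work, hand each piece to a worker, combine the results."""
    pieces = partition(work, n_workers)
    # Threads stand in for cluster nodes here (an assumption for the sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(worker, pieces))
    return combine(results)


if __name__ == "__main__":
    lines = ["drowning in data", "data centers", "cheap storage for data"]
    print(master(lines).most_common(1))  # [('data', 3)]
```

The workers never share state, which is what lets the same structure scale from three threads to many machines; only `partition` and `combine` need to see the whole job.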

Data-Intensive Computing: Sample Research Questions
Science
  – What are the fundamental capabilities and limitations of this paradigm?
  – What new programming abstractions (including models, languages, algorithms) can accentuate these fundamental capabilities?
  – What are meaningful metrics of performance and QoS?
Engineering
  – How can we automatically manage the hardware and software of these systems at scale?
  – How can we provide security and privacy for simultaneous mutually untrusted users, for both processing and data?
  – How can we reduce these systems’ power consumption?
Users
  – What (new) applications can best exploit this computing paradigm?

NSF’s Interest in Data-Intensive Computing
• Broad interest, (potentially) long-term
• CISE
  – Cross-directorate: CCF, CNS, IIS
  – Short-term: CluE
    • To provide the broad academic community access to large-scale computing cluster and massive data sets
  – Longer-term: Look for cross-cutting theme in FY09 solicitation
• NSF
  – Potentially cross-foundational, e.g., via Cyber-enabled Discovery and Innovation (CDI); CISE, OCI, MPS, ENG, …
  – Why? Scientists are drowning in data!

CluE: Cluster Exploratory
• Google+IBM cluster software and services
  – Same as Academic Computing Cluster provided for six universities (announced last October)
• Seed program by NSF
  – $5M will fund SGERs and regular awards
  – Solicitation released; July 17 proposal deadline.
  – Jim French (IIS Program Director)
• Hope: CluE will be a wild success and community interest and demand will be high

Google+IBM Cluster
• Cluster
  – 1600+ processors, terabytes of memory, hundreds of terabytes of storage, internal networking
  – External network connection
• Software
  – Linux
  – Hadoop (written by Yahoo!): Open Source version of Google’s MapReduce, Google File System
  – IBM Tivoli: management, monitoring and dynamic resource provisioning of the cluster
• Services
  – Operations and maintenance, including staff, loading data and programs, energy costs

Legal Issues

The Partnership: Roles
• Google and IBM
  – Provide data cluster, user support, scheduling
• NSF
  – Review proposals, identify awardees, funding
• Universities
  – Propose and execute research plans on data cluster

The MOU
• Codify the roles
• Establish restrictions to comply with export law
• Prescribe the need for “usage agreement”
  – Remove NSF from this industry/university process and raise awareness of university sensitivities

The Usage Agreement
• Sets out terms and conditions for use of the hardware/software suite
• Three significant issues
  – Indemnification
    • State universities prevented by constitution or law from signing
    • Private universities will not sign as a matter of policy
  – Export control
    • Barrier to university mission. May prohibit access by some students.
  – Intellectual Property
    • Jury is out on this. Part of one-on-one negotiation.

Indemnification Example
• University and Corporation each agree to defend, indemnify and hold harmless the other respective parties for and against any losses damages or claims for damages arising from the wrongful acts or omissions of their respective officers, employees, students or agents (including, without limitation, University Students and University Personnel) in connection with the exercise of their rights and the performance of their obligations under this Agreement, including but not limited to …
Asymmetric: We agree not to sue each other, but University pays the cost of defending Corporation should it be sued based on something a University person did.

Export Control Example
• Specifically, unless authorized by appropriate government license or regulations, you agree not to export, directly or indirectly, any technology, software or commodities provided by Corporation or their direct product (including software developed by you on the Corporate systems) to any of the following countries or to the nationals of any of the following countries, wherever they may be located: Cuba, Iran, Sudan, Syria, and North Korea.
Explicit Country List discriminates against students from those countries who may be enrolled in University.

Academia-Industry-Government Partnership
• Win-win for all
• New model for NSF
  – CISE is breaking new ground at NSF (in many ways)
• NSF/CISE welcomes
  – Other corporations to participate in Data-Intensive Computing effort and other efforts in the future
  – This and other new models of A-I-G partnerships

Thank you!

Credits
• Copyrighted material used under Fair Use. If you are the copyright holder and believe your material has been used unfairly, or if you have any suggestions, feedback, or support, please contact: jsoleil@nsf.gov
• Except where otherwise indicated, permission is granted to copy, distribute, and/or modify all images in this document under the terms of the GNU Free Documentation license, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation license” (http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License)