Data in the Research Cyberinfrastructure Ecosystem. Amy Friedlander, Deputy Office Director, Office of Advanced Cyberinfrastructure, Directorate for Computer & Information Science & Engineering, National Science Foundation
Outline (May 19, 2017)
Infrastructure
• What do we mean?
About NSF and OAC
• Who are we?
• What do we do?
Data Challenges
• NSF’s data sharing policies
• Repositories and storage
• Investments to support data-intensive research
Infrastructure is a system that:
• Is shareable
• Is ubiquitous and broadly accessible
• Confers economic advantage
Services connecting consumers (applications), providers, and manufacturers
Consumer services and appliances
Large fixed plants that were frequently geographically expansive
Why do we talk about the cyberinfrastructure ecosystem? Is infrastructure built from the top down or from the bottom up?
Two Well-Known U.S. Examples
Lessons learned
• Local and national needs aligned
• Funding models engaged public and private sources
• Interoperability and agreement on standards were essential to knitting together heterogeneous elements in what became national systems, hence the notion of “ecosystem”
• The market did not always choose the “best” technological solution
• Governance at multiple scales was worked out over time
• Infrastructure systems became interdependent, especially to achieve scale
What does this mean for NSF and OAC?
• What constitutes research CI?
• What is the national role? What is the campus role?
• And in the specific case of distributed systems, where are the interfaces and hand-offs? Who does what, where “who” can be one machine or system handing off to another?
NSF Addresses National Priorities through Support of Fundamental Research: Food/Energy/Water, INCLUDES, Understanding the Brain … and thus requires a highly capable, highly interoperable Research Infrastructure
CI is threaded in NSF “Big Ideas”
RESEARCH IDEAS
• Windows on the Universe: The Era of Multimessenger Astrophysics
• Work at the Human-Technology Frontier: Shaping the Future
• Harnessing Data for 21st Century Science and Engineering
• The Quantum Leap: Leading the Next Quantum Revolution
• Understanding the Rules of Life: Predicting Phenotype
• Navigating the New Arctic
PROCESS IDEAS
• Mid-scale Research Infrastructure
• NSF 2050: Seeding Innovation
• Growing Convergent Research at NSF
• NSF-INCLUDES: Enhancing Science and Engineering through Diversity
NSF Facilities are Increasingly CI Driven … and dependent on Shared CI
NSF by the Numbers
Office of Advanced Cyberinfrastructure (OAC) Program Staff
• Office Director: Irene Qualters
• Office Deputy Director: A. Friedlander
• Public Access: [vacant]
• Cooperative Agreements: Alejandro Suarez
• High Performance Computing: R. Eigenmann, E. Walker, R. Chadduck
• Data: A. Walton, R. Chadduck
• Science Advisor, Cross-cutting programs: W. Miller
• Networking/Cybersecurity: K. Thompson, A. Nikolich
• Software, Learning & Workforce Development: S. Prasad, R. Ramnath, V. Chaudhary
Elements of an Overall Cyber Ecosystem
[Figure: CYBERINFRASTRUCTURE ECOSYSTEM. Components: Instruments; Networking & Cybersecurity; Computational Resources; People, organizations, & communities; Software; Data. Activities: THEORIZE, OBSERVE, ANALYZE, HYPOTHESIZE, EXPERIMENT.]
OAC supports Research Cyberinfrastructure to uniquely enable collaboration and discovery frontiers at all scales
Shared resources, capabilities & services across the scientific workflow:
• Cloud Resources & Services
• CI-Enabled Instrumentation
• Data
• Networks, Cybersecurity
• Computing Resources
• Gateways, Hubs, and Services
• Coordination & User support
• Software, Applications, Workflow Systems
Emerging discovery pathways at scale: Architecture view
[Figure: layered architecture. Layers: Discipline-specific Environments; Integrative Services (“Middleware”); “Foundational” CI Resources. Other labels: Measurement; Science Portals; Applications, Frameworks; HPC access, community; Authentication; Data Management; …; Research Facilities; OSG; Workflow Systems; Collaboration Platforms; campus, national, private, and international resources; commercial clouds; NSF-supported resources; National/International Research & Education Networks, Commercial Networks; Discovery.]
High Performance Computing: NSF-supported National Computing Resources Complement Larger Aggregate Investments from Universities and Other Agencies
[Figure: timeline, 2013 to 2026. Systems: Blue Waters/UIUC; Leadership HPC Planning; Stampede/UT Austin; Stampede 2/UT Austin; Yellowstone/NCAR-Wyoming; Cheyenne/NCAR; Comet/UCSD; Wrangler/UT Austin; Bridges/CMU/PSC; Jetstream/Indiana U. Categories: Large-scale computation; Long-tail and high-throughput; Data-Intensive; Cloud.]
Coordination through XSEDE
• Resource Allocation
• Advanced User Support
• Education and Outreach
• Common Software
Networking & Cybersecurity: CISE/OAC Networking Programs
• Fundamental layer that enables scientific discovery at the institutional, regional and global collaborative levels.
• Campus Cyberinfrastructure (CC*). Upgrading and accelerating campus networking (10/100 Gbps). Re-designing campus border to Science DMZs. Innovation, + much more.
• International R&E Network Connections (IRNC). Link U.S. research with peer networks in other world regions. Supports all R&E US data flows (not just NSF-funded).
• Stimulates deployment and operational understanding of emerging network technology, best practices, 100 Gbps connections.
[Figure: CC* Science DMZ]
Networking & Cybersecurity: OAC Cybersecurity
• Cybersecurity Center of Excellence (CCoE), formerly Center for Trustworthy Scientific CI, www.trustedci.org.
• Site reviews, code reviews, architecture reviews. Example engagements: Gemini, US Antarctic Program, LSST, OOI, LIGO, DKIST, NEON, Pegasus, perfSONAR, …
• Open Science Cyber Risk Profile: asset/impact oriented approach for open science (DoE, NIH, NSF). Joint effort of CCoE & ESnet.
• Cybersecurity Innovation for Cyberinfrastructure (CICI). Topics: Secure and Resilient Architecture, Secure Data Provenance, Regional Cybersecurity.
• Secure and Trustworthy Cyberspace (SaTC). Cross-directorate program. OAC funds later stage/applied security projects that can secure scientific CI. Several “Transition to Practice” projects co-funded by Dept. of Homeland Security.
• Annual Large Facilities Cybersecurity Summit. ~120 attendees from NSF-funded science facilities. Next Summit: August 2017.
Multi-Agency workshops on HPC Security, driven by NSCI
Example: Bro Intrusion Prevention/Detection software
Software Infrastructure: OAC Software Investments
• OAC Goal: Catalyze and support unique, innovative software-intensive science ecosystems to advance research
• Flagship: Software Infrastructure for Sustained Innovation (SI2). Elements ($500K/3 yrs), Frameworks ($1M/yr, 3-5 yrs), Institutes ($3-$5M/yr, 5-10 yrs).
• Software “pipeline”:
§ R&D programs (SPX, CDS&E, DMREF, CRISP, Venture, …)
§ Development and deployment (SI2)
§ Outcomes: Sustainability, open source community, institutional support, education, SaaS, IP licensing, …
[Figure labels: Scientific Software R&D; Foundational CI R&D; Community; Workforce]
OAC Learning & Workforce Development (LWD)
LWD Communities of Concern:
• CI Contributors, Cyber-scientists: Develop new CI
• CI Professionals: Deploy & support CI
• CI Users: Exploit CI
New! CyberTraining: Training-based Workforce Development for Advanced Cyberinfrastructure (NSF 17-507)
§ Informal, scalable training models and pilot activities on topics in advanced CI, and computational and data-enabled science & engineering.
§ OAC leads, with MPS, ENG, GEO, EHR/DGE, and CISE/CCF.
§ $300K-$500K over 1-3 years.
§ 3 Tracks: (1) CI Professionals; (2) CI Contributors/Users in domain science and engineering; (3) Undergraduate Computational & Data Science User Literacy.
§ Excellent community response in the inaugural round.
§ Next Deadline: October 2017
Data Infrastructure: OAC Data Infrastructure: Accelerating Science, Building Community
• Data Infrastructure Building Blocks (DIBBs). Funds CI/discipline collaborations, cross-disciplinary infrastructure, built on recognized capabilities, tangible products.
• First PI meeting, Jan 2017, on Results, Challenges, Future Directions, and Gaps to inform future investments.
• CC* collaboration. Example topics: multi-institution, cloud resources, sharing mechanisms.
• EarthCube. Collaboration with NSF GEO. Topics: Building new communities, innovative interoperable solutions that link and integrate resources, new capabilities for data capture, discovery, access, processing and analysis.
• Innovations at the Nexus of Food, Energy and Water Systems (INFEWS). NSF cross-cutting activity.
The data space is complex
[Figure labels: OAC; Disciplines; Institutions; Policies, norms, and codes of conduct; Data policies, etc.]
NSF policies:
• Cover a broad range of outputs:
  • Publications
  • Data
  • Software
• Address intellectual property
• Provide for decentralized implementation
• Implemented through:
  • Proposal and Award Policies and Procedures Guide (PAPPG)
  • Data management plan requirement (2011)
  • Citation to data (2013)
  • NSF Public Access Plan (NSF 15-52)
  • Individual solicitations (SI2)
Source: PAPPG XI.D.4
DIBBs funds nationwide across all disciplines
• DIBBs 2014 Funding Contributions: ACI 65%
• DIBBs 2016 Funding Contributions: ACI 82%
Note: In 2013, there was no co-funding; in 2015, DIBBs was co-located with CC*.
DIBBs awards fall into three broad categories
• Generation, Acquisition, Discovery: 9 awards
• Curation, Storage, Management: 18 awards
• Analysis, Modeling, Visualization: 16 awards
NSF promotes domain-appropriate solutions
USGEO: Hazard Prediction
- Integrated Earth system data
- Coordinated data management
- 13 Gov’t agencies
Example: National Climatic Data Center (NCDC, NOAA)
Social Science: Tracking Health, Wellbeing, Social Indicators, Political Attitudes, Etc.
- 40+ years of surveys
- Over 30,000 research usages
- Cited in 2,000+ textbooks
Example: The General Social Survey (GSS)
Physical Collections: Invasive Species Control
- $120 billion problem annually
- Museum research resources key to control and management
Example: Gulf Coast Research Laboratory Invertebrate Collection
NSF promotes domain-appropriate solutions (cont.)
Antarctic Sciences: Aeronomy, astrophysics, biology, medicine, geology, geophysics, glaciology, ocean and climate systems
- Manage U.S. data collected in Antarctica & Southern Ocean
- Compliant with international Antarctic Treaty & linked to Antarctic Master Directory
Example: U.S. Antarctic Program Data Center (IEDA, LDEO, Columbia U.)
Mapping the Universe: Discovering the Universe at All Scales, Sloan Digital Sky Survey (SDSS)
- 15+ years of data
- Over 5,800 research publications
- Cited 245,000 times
Example: Sloan Digital Sky Survey (SDSS; image credit: Robert Lupton and SDSS)
Natural Hazards Engineering: DesignSafe Cyberinfrastructure for Natural Hazards Engineering Research
- Data Depot: shared data repository of experimental, simulation, and reconnaissance data that supports the full research lifecycle
- Discovery Workspace can be used to analyze the datasets
Example: DesignSafe (U. Texas, TACC, Rice, Florida Tech)
More computational- and data-intensive research in more disciplines
[Figure: XSEDE usage by directorate (log scale), 2005 to 2015. Series: MPS, BIO, GEO, ENG, CISE, SBE.]
And more competition for limited resources: Allocation and usage trends over the past decade
Common needs
§ Distributed infrastructure to support large-scale data systems
§ Access to data at scale (large volume, large variety, streaming…)
§ In situ data processing: implementing data processing pipelines
§ High performance infrastructure for machine learning, including software stacks and suite of packages implementing well-known algorithms
§ Support for reproducibility
§ CI support for the research workflow
Challenges for you and for us
• Sustainability
• Value of data vs. cost of data infrastructure over time
• Reusability of data infrastructure
• Commercial/commodity infrastructure
• Priority/emphasis on Research Dataflows and Workflows
• Interoperability in a “multi-cloud” ecosystem
• Facilities and instruments
• Robust and Reliable Science
• Reproducibility is a small aspect
• Credibility of analysis?
• Incentives and career paths?
And where can we make strategic investments that will leverage local resources to build a national ecosystem to advance science and engineering?