iDigBio Technology, Cloud Computing and Appliances

iDigBio Technology, Cloud Computing and Appliances
Jose Fortes (on behalf of the iDigBio team)
iDigBio Summit, Gainesville
November 29 - December 1, 2011

iDigBio (idigbio.org)
• Goal: making data and images for millions of biological specimens available in electronic format for the biological research community, agencies, students, educators, and public
• Mission: leadership, coordination, and outreach in digitization of collections by implementing resources for communication, use of technology, access to data, research and education.
• A resource: permanent cloud computing infrastructure
    • to link biological data from collections across the USA
    • to use search and analytics tools to mine and reference data
Advanced Computing and Information Systems laboratory

Vision
• Cyberinfrastructure to enable
    • the collaborative creation, integration and management of digitized biocollections,
    • their use in scientific research,
    • their use in education and outreach
• Visible as a collection of persistent Internet-accessible services, data and resources
    • For biocollection “producers”
    • For biocollection “consumers”
    • For biocollection service providers
    • For cyberinfrastructure providers
    • For national/global data aggregators

CI Stakeholders
[Diagram: iDigBio and its stakeholder groups.]
Groups: Domain Data Producers; National/Global Data Aggregators; Infrastructure Providers; Domain Data Consumers; Domain Service Providers
Entities shown: Museums, TCNs, Collectors, Amazon Turk, GBIF, ALA, EOL, BISON, iPlant, Amazon WS, Google, Microsoft Azure, DataONE, Data Conservancy, Researchers, Teachers, Citizens, Government, Georeferencing, Mapping, Imaging services, Data quality, NESCent, OCR, Translation

Stakeholders APIs
[Diagram: the stakeholder diagram from the previous slide, annotated with what is exchanged between iDigBio and each stakeholder group.]
Groups: Domain Data Producers; National/Global Data Aggregators; Infrastructure Providers; Domain Data Consumers; Domain Service Providers
Exchange labels shown: Domain-level data, Domain data, Updates, Notification, Usage track, Query results, BLOBs, Appliances, Customer Requests, Processed data

Building the iDigBio Cloud
• Continuous consultation with stakeholders and other successful cyberinfrastructures
    • Already ongoing: survey, summit, workshops, person-to-person
• Cloud-based strategy
    • Federated scalable object storage and information processing
    • Digitization-oriented virtual appliances
• Focus on identified agreed-upon priorities
• Weekly reviewed software development process
• Leverage experience/skills of PI team
• Reliance on standards, proven solutions and sustainable software reuse whenever possible

Survey results in one slide
• 15-question survey of institutions in 3 funded TCNs
• 30 returned surveys (13+8+9)
• Significant variety in the types and amount of data being collected, storage requirements, and the software tools and services being utilized
• Three areas where the TCNs may readily look to iDigBio for guidance and support:
    • Archival storage
    • Feedback on the data
    • Data intensive transformations
• Results available from idigbio.org
• Survey is “live” for continuous consultation and update

Interface Model for iDigBio and TCNs
[Diagram: "iDigBio + Resources" surrounded by the standards, services and partners it interfaces with.]
Labels shown: UTF-8, SQL, REST, WS, SAML, WS-I, TAPIR, TDWG, Data Collections, Archiving, Learning Modules, Structured Data Services, Non-structured Data Services, Virtual Appliances, TCP, Machines, OCCIWG, Natural History Museums, Federal Collections, iPlant, EOL, NCBI, Applied Innovations, LifeMapper, ALA, Workflow Engines, Storage, RDF, HTTP, Workshop Resources, Wiki, JPEG 2000, Microsoft Azure, XSEDE, Amazon EC2/S3, TCNs, Google Apps, DataONE, Taxonomic Validation, Geographical Mapping, Data Conversion, Collaboration Tools, Networking, X.509, XML, OpenID, XMPP, Microsoft Live, Academic Clouds, ODBC, Google App Engine, NESCent
Caption: Infrastructure Providers, National/Global Data Aggregators, Domain Service Providers, Domain Data Consumers

What is a virtual appliance?
• A pre-configured ready-to-run package of programs (i.e., an “image”) for a particular purpose
    • E.g. a biocollections toolbox
    • Ready-to-go easy-to-install/configure software environment
• The virtual appliance has no hardware; it can be instantiated on hardware
    • Servers at collection providers (e.g. TCNs)
    • User/researcher desktops
    • Cloud resources

Appliance/toolbox example (hypothetical)

Purpose                        Tools
Local collection management    MySQL, Specify
Metadata management            Mercury, Metacat
Image processing               ImageMagick, Hugin, ICE
Value-added                    Geolocate, ABBYY FineReader
Web server, SOA stack          XAMPP (Apache, MySQL, PHP/Pear, Perl)
Cloud storage interfaces       Amazon S3, Google AppEngine/DataStore
Software development           Glassfish, Java, Netbeans
Interoperability modules       DataONE tools, GBIF IPT
Repositories, dissemination    iDigBio, GBIF, Morphbank
Data backup, archival          Amanda, MySQL backup
Social net/communication       SocialVPN, Telecenter, myPlant
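One way such a toolbox could be described in machine-readable form is a simple manifest keyed by purpose. The sketch below is only an illustration: the TOOLBOX dictionary and the purposes_for helper are hypothetical names, not part of any iDigBio software, and the entries are copied from the example slide above.

```python
# Hypothetical manifest for the example toolbox; purposes and tool names
# are taken from the slide, the structure itself is an assumption.
TOOLBOX = {
    "Local collection management": ["MySQL", "Specify"],
    "Metadata management": ["Mercury", "Metacat"],
    "Image processing": ["ImageMagick", "Hugin", "ICE"],
    "Value-added": ["Geolocate", "ABBYY FineReader"],
    "Web server, SOA stack": ["XAMPP"],
    "Cloud storage interfaces": ["Amazon S3", "Google AppEngine/DataStore"],
    "Software development": ["Glassfish", "Java", "Netbeans"],
    "Interoperability modules": ["DataONE tools", "GBIF IPT"],
    "Repositories, dissemination": ["iDigBio", "GBIF", "Morphbank"],
    "Data backup, archival": ["Amanda", "MySQL backup"],
    "Social net/communication": ["SocialVPN", "Telecenter", "myPlant"],
}


def purposes_for(tool):
    """Return every purpose whose tool list mentions `tool` (case-insensitive)."""
    needle = tool.lower()
    return [
        purpose
        for purpose, tools in TOOLBOX.items()
        if any(needle in name.lower() for name in tools)
    ]
```

A manifest like this would let an appliance builder answer questions such as "which parts of the toolbox depend on MySQL?" with purposes_for("mysql").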

Virtual appliance cycle
[Diagram: cycle connecting the Collections Community, a domain expert, iDigBio, and users at TCNs.]
Labels shown: Requirements, standards; Domain expert; iDigBio; download; Collections Community; instantiate; Users at TCNs

Toolbox Workflow Example
[Diagram: appliance workflow between a TCN server, the iDigBio Cloud, global aggregators, cloud providers (Amazon, Azure…) and a domain data consumer.]
Appliance contents shown: Linux, MySQL, Specify, GEOlocate, iDigBio Ingest
Labeled steps:
(1) Download iDigBio appliance
(2) Data entry, improvement (TCN server)
(3) Data ingested into iDigBio
(4a) Data publishing (global aggregators)
(4b) Replication services (cloud providers)
(5) Download analysis appliance
(6) Search
(7) Visualization
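The numbered steps in the workflow slide can be walked through with a toy, in-memory model. This is a minimal sketch under stated assumptions: ToyCloud, improve, and the record fields ("id", "genus") are hypothetical and stand in for the real appliance tools and the iDigBio Cloud, whose actual interfaces the slides do not specify.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCloud:
    """In-memory stand-in for the iDigBio Cloud (hypothetical)."""
    records: dict = field(default_factory=dict)
    published: list = field(default_factory=list)

    def ingest(self, batch):
        """Step 3: store each record under its identifier."""
        for record in batch:
            self.records[record["id"]] = record
        return len(batch)

    def publish(self, aggregator):
        """Step 4a: note which record ids were offered to an aggregator."""
        self.published.append((aggregator, sorted(self.records)))

    def search(self, **criteria):
        """Step 6: return records whose fields equal all given criteria."""
        return [
            r for r in self.records.values()
            if all(r.get(key) == value for key, value in criteria.items())
        ]


def improve(record):
    """Step 2: a trivial clean-up done on the TCN server before ingest."""
    cleaned = dict(record)
    cleaned["genus"] = cleaned["genus"].strip().capitalize()
    return cleaned


# Steps 2, 3, 4a and 6 in order, on two made-up records.
raw = [{"id": "tcn-1", "genus": " quercus "}, {"id": "tcn-2", "genus": "ACER"}]
cloud = ToyCloud()
ingested = cloud.ingest([improve(r) for r in raw])
cloud.publish("GBIF")
hits = cloud.search(genus="Quercus")
```

Steps 1, 4b, 5 and 7 (appliance download, replication, visualization) are left out because they involve infrastructure rather than data handling.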

iDigBio Cloud Internal Architecture
[Diagram: internal components of the iDigBio Cloud and its connections to producers, consumers and aggregators.]
Components shown: iDigBio Collections Management; Compute (Data Intensive Processing); Object store (Specimen-record objects, Specimen-image objects); Media; Database; API/XML
External parties shown: Domain Data Producers; Consumer; National/Global Data Aggregators (GBIF, Morphbank, …)
Flow labels shown: Comment, Updates, Notifications, Publish, Data/Metadata
Initial deployment on UF ACIS resources
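To make the object-store and notification labels concrete, here is a small sketch of a store that holds the two object kinds named on the slide and notifies subscribers when a new object arrives. Everything beyond those two kind names is an assumption made for illustration: the ObjectStore class, its methods, and the choice of SHA-256 content addressing are not described in the slides.

```python
import hashlib

# The two object kinds named on the architecture slide.
KINDS = ("specimen-record", "specimen-image")


class ObjectStore:
    """Toy object store with update notifications (hypothetical design)."""

    def __init__(self):
        self._objects = {}
        self._listeners = []

    def subscribe(self, callback):
        """Register callback(kind, key), called once per newly stored object."""
        self._listeners.append(callback)

    def put(self, kind, payload):
        """Store bytes under their SHA-256 digest and return that key."""
        if kind not in KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        key = hashlib.sha256(payload).hexdigest()
        is_new = key not in self._objects
        self._objects[key] = (kind, payload)
        if is_new:
            for callback in self._listeners:
                callback(kind, key)
        return key

    def get(self, key):
        """Return the (kind, payload) pair stored under key."""
        return self._objects[key]


# Example: a producer is notified about each new object exactly once.
events = []
store = ObjectStore()
store.subscribe(lambda kind, key: events.append(kind))
record_key = store.put("specimen-record", b'{"genus": "Quercus"}')
image_key = store.put("specimen-image", b"\x00\x01fake-image-bytes")
store.put("specimen-record", b'{"genus": "Quercus"}')  # duplicate, no event
```

Content addressing is used here only because it gives stable identifiers without extra bookkeeping; a real deployment could just as well assign identifiers another way.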

Summary
• iDigBio cloud
    • Service-oriented standards-based cyberinfrastructure focused on the ADBC community needs
    • Scalable data management and information processing solutions using standard interfaces
• Toolboxes as appliances
    • Evolving collection of community-selected tools
    • Built-in interfaces for effortless iDigBio integration
    • Embedded best practices in biocollections work
• Feedback and suggestions welcome
    • fortes@ufl.edu
    • “Contacts” at idigbio.org

Acknowledgments
• National Science Foundation
    • Judith Skog and Anne Maglia
• University of Florida
    • Larry Page, Pamela Soltis, Bruce McFadden, Renato Figueiredo, Reed Beaman, Nico Celinese, Andrea Matsunaga, Alex Thompson, Matthew Collins, Jason Grabon, Shari Ellis, Kevin Love, Betty Dunckel, Kate Rachwal, Cathleen Bester …
• Florida State University
    • Greg Riccardi, Austin Mast, Casey McLaughlin, Deb Paul, Guillaume Jimenez, Gil Nelson …

Extra

How iDigBio differs from other aggregators
• Data completeness
    • Support for indexing and search on full record-level data
    • Support for media objects
    • Data will not be constrained to certain geographic areas
    • Distinguish observations and specimens
• Data access
    • Allow feedback from consumers to be propagated to producers in the form of updates, annotation, notification and statistics
• Data provenance
• Data quality
    • Offer data cleaning services (e.g., automate georeference correction)
    • Guarantee that data is persistently provided and identified
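As an illustration of what an automated georeference check might look like, the sketch below validates a coordinate pair and detects one common data-entry error, swapped latitude and longitude. The function name check_georeference and the swap heuristic are assumptions chosen for the example; the slides do not say how iDigBio's data cleaning services work.

```python
def check_georeference(lat, lon):
    """Return (lat, lon, note) after a basic range check.

    Hypothetical heuristic: if the pair is out of range as given but valid
    when swapped, assume latitude and longitude were transposed.
    """
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon, "ok"
    if -90 <= lon <= 90 and -180 <= lat <= 180:
        return lon, lat, "swapped latitude/longitude"
    return None, None, "out of range"


# Example: the second pair has latitude and longitude transposed.
valid = check_georeference(29.65, -82.32)
fixed = check_georeference(-120.5, 45.0)
bad = check_georeference(200.0, 300.0)
```

A check like this can only catch swaps where the original latitude falls outside the range -90 to 90; pairs that are valid either way would need other evidence, such as the stated locality.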