Data Grids, Digital Libraries, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
moore@sdsc.edu
San Diego Supercomputer Center, National Partnership for Advanced Computational Infrastructure

Archive Definition
• Computer science - the archive is the hardware and software infrastructure used to manage data
• Preservation community - the archive is the material that is being preserved

Persistent Archive
• Software system that manages evolution of the hardware and software infrastructure
  – A persistent archive preserves the authenticity and integrity of digital entities while the underlying technology evolves
• Combination of the material that is being preserved and the infrastructure used to preserve the material

Data Grid
• Grid community definition
  – The infrastructure used to manage distributed data as a collection
• Digital library and preservation community definition
  – The distributed data that is being organized and managed as a collection
• A data grid is both the mechanism that supports sharing of data and the collection that is being shared

Data Sharing
• Management of access controls on local resources to share data
  – Put controls on resources
• Creation of a collection that is being shared across distributed resources
  – Put controls on collection
• The SRB data grid does both: it enforces controls on resources and on collections (data and metadata)

Topics
• Data Grids - managing distributed data
  – Distributed data management for a project
• Digital Libraries - publication of data
  – Management of collection hierarchies
• Persistent Archives - preservation of data
  – Management of technology evolution
• Storage Resource Broker example
  – Currently supporting all three (seven) data management environments

Data Management Systems (Supported by Storage Resource Broker)
• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity of resources
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction

Data Management Systems
• Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata for discovery
• Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Provide Context for Data
• Properties of files
  – Provenance - source
  – Descriptive attributes
  – Structure
• Organize properties as metadata in a collection hierarchy
  – Define operations on file properties
  – Manage state information - location, replicas, containers
• Separate context management from content management
  – Maintain consistency of context as operations are done on content

Data Grids
• Software systems that manage distributed data
• Control global name spaces for
  – Resources
  – Users
  – Files
  – Metadata context
• Provide standard operations on each name space
• Provide single sign-on authentication, collection management, latency management, replication, and federation
• Generic distributed data management technology

Managing Distributed Data
Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository: naming conventions provided by storage systems
• Storage location
• User name
• File context (creation date, …)
• Access constraints

Data Grids Provide a Level of Indirection for Each Naming Convention
Data Access Methods (C library, Unix, Web Browser)
Data Collection: data is organized as a collection

Storage Repository                    Data Grid
• Storage location                    • Logical resource name space
• User name                           • Logical user name space
• File name                           • Logical file name space
• File context (creation date, …)     • Logical context (metadata)
• Access constraints                  • Control/consistency constraints

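
The indirection between repository names and logical names can be sketched in Python. This is a minimal illustration only, not the SRB or MCAT interface: the class `LogicalCatalog`, its methods, and every path, resource name, and attribute below are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name (hypothetical)
    physical_path: str   # name assigned by the storage system


@dataclass
class Entry:
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)   # logical context


class LogicalCatalog:
    """Toy catalog that maps logical file names to physical replicas."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        """Record one physical copy under a logical name."""
        entry = self._entries.setdefault(logical_name, Entry())
        entry.replicas.append(Replica(resource, physical_path))
        entry.metadata.update(metadata)

    def resolve(self, logical_name):
        """Return every (resource, physical path) pair for a logical name."""
        return [(r.resource, r.physical_path)
                for r in self._entries[logical_name].replicas]

    def move(self, logical_name, resource, new_path):
        """Change a physical path; the logical name seen by users is unchanged."""
        for r in self._entries[logical_name].replicas:
            if r.resource == resource:
                r.physical_path = new_path


catalog = LogicalCatalog()
catalog.register("/home/project/run1.dat", "sdsc-archive", "/hpss/a/0001", owner="moore")
catalog.register("/home/project/run1.dat", "umd-disk", "/data/x/run1.dat")
catalog.move("/home/project/run1.dat", "umd-disk", "/data/y/run1.dat")
print(catalog.resolve("/home/project/run1.dat"))
```

Applications in this sketch only ever hold the logical name, so a file can be migrated or replicated without changing any reference to it.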
Logical Name Spaces
• Storage resources
  – Logical names for managing collections of resources
• User names (user-name / domain / data grid)
  – Distinguished names for users to manage access controls
• Digital entities (files, blobs, structured data, …)
  – Logical name space for global identifiers for files
• Context - metadata attributes
  – Standard metadata attributes, Dublin Core
  – State information resulting from data grid operations
  – User-defined metadata

Logical Resource Name
• Represents a list of physical resources
• Operations on the logical resource name result in operations on the list of physical resources
  – Load leveling - write to the next physical resource in the list
  – Fault tolerance - write to “k” of “n” physical resources
  – Replication - write to each physical resource
  – Compound resource - write to the disk cache in front of the tape archive
  – Federated resource - write to the controlled resource in another data grid

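
Three of the write policies listed above (load leveling, "k of n" fault tolerance, replication) can be sketched as follows. This is a hedged illustration under stated assumptions, not SRB code: the class `LogicalResource`, the policy strings, and the use of plain dictionaries as stand-ins for physical resources (with `None` marking an unavailable one) are all invented for the example.

```python
import itertools


class LogicalResource:
    """Toy logical resource: one name standing for a list of physical stores."""

    def __init__(self, name, stores, policy, k=1):
        self.name = name
        self.stores = list(stores)   # dict = available store, None = unavailable
        self.policy = policy         # "load_level", "k_of_n", or "replicate"
        self.k = k
        self._round_robin = itertools.cycle(range(len(self.stores)))

    def write(self, key, data):
        """Apply one logical write; return the indices of the stores written."""
        n = len(self.stores)
        if self.policy == "load_level":
            # Write to the next physical resource in the list.
            candidates, needed = [next(self._round_robin)], 1
        elif self.policy == "replicate":
            # Write to each physical resource.
            candidates, needed = list(range(n)), n
        elif self.policy == "k_of_n":
            # Succeed once k of the n physical resources hold the data.
            candidates, needed = list(range(n)), self.k
        else:
            raise ValueError(f"unknown policy: {self.policy}")

        written = []
        for index in candidates:
            store = self.stores[index]
            if store is None:        # skip unavailable resources
                continue
            store[key] = data
            written.append(index)
            if len(written) == needed:
                break
        if len(written) < needed:
            # Note: this sketch does not roll back partial writes.
            raise IOError(f"{self.name}: wrote {len(written)} of {needed} copies")
        return written


fault_tolerant = LogicalResource("ft", [{}, None, {}], policy="k_of_n", k=2)
print(fault_tolerant.write("file-a", b"data"))
```

The caller addresses only the logical name; which physical resources receive the data is decided by the policy attached to that name.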
Storage Repository Virtualization
How does one access data stored on multiple systems?
[Diagram: User Application; Archive, Database, File System]

Storage Repository Virtualization (Standard Operations on Logical Resource Names)
Common set of operations for interacting with every type of storage repository
• Remote operations
  – Unix file system
  – Latency management
  – Procedures
  – Transformations
  – Third party transfer
  – Filtering
  – Queries
• Collective operations
  – Load leveling
  – Fault tolerance
  – Replication
[Diagram: User Application; Archive, Database, File System]

Logical File Name Abstraction
How does one identify files stored on multiple systems?
[Diagram: User Application; Archive at SDSC, Database at U Md, File System at NARA]

Context Abstraction
Common naming convention and set of attributes for describing digital entities
• Logical name space
• Location independent identifier
• Persistent identifier
• Collection owned data
• Access controls
• Audit trails
• Checksums
• Descriptive metadata
• Inter-realm authentication
• Single sign-on system
[Diagram: User Application; Archive at SDSC, Database at U Md, File System at U Texas]

Federated Server Architecture
[Diagram labels: Read Application; Logical Name or Attribute Condition; SRB servers; SRB agents; MCAT; Peer-to-peer Brokering; Server(s) Spawning; Parallel Data Access; Data Access; replicas R1 and R2; numbered steps 1 to 6]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control

SRB Latency Management
[Diagram: Source, Network, Destination]
• Remote proxies, staging
• Data aggregation
• Containers
• Replication
• Streaming
• Server-initiated I/O
• Parallel I/O
• Prefetch
• Caching
• Client-initiated I/O

Latency Management - Bulk Operations
• Bulk register
  – Create a logical name for a file
• Bulk load
  – Create a copy of the file on a data grid storage repository
• Bulk unload
  – Provide containers to hold small files and pointers to each file location
• Bulk delete
  – Mark as deleted in metadata catalog
  – After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …

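
The two-step bulk delete described above (mark as deleted in the catalog, then remove the file after a specified interval) can be sketched as follows. This is a minimal illustration, not the SRB implementation: `MetadataCatalog`, its methods, the `remove_file` callback, and the retention interval are assumptions made for the example.

```python
import time


class MetadataCatalog:
    """Toy catalog with deferred deletion."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.files = {}   # logical name -> {"path": str, "deleted_at": float or None}

    def bulk_register(self, items):
        """Create logical names for many (logical_name, physical_path) pairs at once."""
        for logical_name, path in items:
            self.files[logical_name] = {"path": path, "deleted_at": None}

    def bulk_delete(self, logical_names, now=None):
        """Step 1: only mark the entries as deleted; no file is touched."""
        now = time.time() if now is None else now
        for name in logical_names:
            self.files[name]["deleted_at"] = now

    def purge(self, remove_file, now=None):
        """Step 2: remove files whose deletion mark is older than the interval."""
        now = time.time() if now is None else now
        expired = [name for name, rec in self.files.items()
                   if rec["deleted_at"] is not None
                   and now - rec["deleted_at"] >= self.retention_seconds]
        for name in expired:
            remove_file(self.files[name]["path"])   # caller supplies the actual removal
            del self.files[name]
        return expired


catalog = MetadataCatalog(retention_seconds=60)
catalog.bulk_register([("a", "/p/a"), ("b", "/p/b")])
catalog.bulk_delete(["a"], now=1000)
removed = []
early = catalog.purge(removed.append, now=1030)   # interval not yet elapsed
late = catalog.purge(removed.append, now=1060)    # interval elapsed
```

Separating the catalog update from the physical removal keeps the bulk request cheap and leaves a window in which a deletion could still be reversed.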
Data Grid Federation
• Link multiple independent data grids
  – Coordinate metadata between independent metadata catalogs
• Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
  – Peer-to-peer federations, data access
  – Replication federations, shared resources
  – Hierarchical federations, consistency constraints
• Tune data grid federation by implementing different consistency and access constraints

Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A, Data Grid
Data Collection B, Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Access controls and consistency constraints on cross registration of digital entities

Federation Environments
[Diagram; only the labels are recoverable]
• Data grid classes: Peer-to-Peer Data Grids, Replication Data Grids, Hierarchical Data Grids
• Constraint axes: Replication Constraints, Access Constraints, Consistency Constraints
• Federation types: Free Floating, Occasional Interchange, Replicated Data, Resource Interaction, User and Data Replica, Replicated Catalog, Nomadic, Snow Flake, Master Slave, Deep Archive
• User-ID settings: Partial User-ID Sharing, Complete User-ID Sharing, One Shared User-ID, No User-ID Sharing
• Resource settings: Partial Resource Sharing, Complete Resource Sharing, No Resource Sharing
• Synchronization settings: No Metadata Synch, System Controlled Partial Synch, System Controlled Complete Synch
• Other settings: System Set Access Controls, System Managed Replication, Connection From Any Zone, Hierarchical Zone Organization, Super Administrator Zone Control

Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  – Effort started in 1996 as a DARPA funded project
  – Now supports over 30 national/international projects
• Development team of 12 staff is led by
  – Michael Wan, data management systems
  – Arcot Rajasekar, information management systems

Data Grid Capabilities
• Data manipulation
  – Containers
  – Parallel I/O
  – Firewall interactions
• Resource interactions
  – Fault tolerance
  – Load leveling
  – Replication
• HIPAA security requirements
  – Authentication of all users
  – Access controls on data and metadata
  – Audit trails
  – Data encryption
  – Centralized control
• Application interfaces
  – C library, Shell commands, Java, Perl, Python, WSDL, workflow

Digital Library
• Collection hierarchy for organizing data
  – User-defined metadata
  – Collection level metadata
• Metadata manipulation
  – Schema extension
  – Bulk metadata processing
  – Queries on metadata
  – Access controls on metadata
  – Views on collections
• Digital library APIs
  – DSpace, Fedora, OAI-PMH, web browsers
  – METS metadata XML schema

Persistent Archives
• Authenticity metadata
  – Provenance
  – User logical name space
• Integrity metadata
  – Audit trails, checksums
  – Access controls
• Consistency
  – Context update on all content operations
• Persistency
  – Infrastructure independence
    • Storage repository abstraction
    • Information repository abstraction
    • Access abstraction (standard operations)

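
As one illustration of integrity metadata, a checksum recorded at ingestion can later be compared against each replica. The sketch below is an assumption-laden example, not the SRB mechanism: the slides do not name a hash algorithm (SHA-256 is chosen here for illustration), and the function names and replica labels are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest used as the integrity attribute (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def corrupted_replicas(recorded_checksum: str, replicas: dict) -> list:
    """Return the names of replicas whose content no longer matches the record."""
    return [name for name, data in replicas.items()
            if checksum(data) != recorded_checksum]


original = b"accession record 42"
recorded = checksum(original)          # stored as metadata when the file is ingested

replicas = {
    "site-a": b"accession record 42",
    "site-b": b"accession record 42",
    "site-c": b"accession record 4Z",  # simulated bit rot
}
bad = corrupted_replicas(recorded, replicas)
print(bad)
```

A replica flagged this way could be replaced from any copy that still matches the recorded checksum.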
National Archives Persistent Archive
• NARA (MCAT): Principal copy stored at NARA with complete metadata catalog
• U Md (MCAT): Replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC: Deep Archive at SDSC, no user access, but complete copy

Data Grid Federation: zoneSRB
[Layered architecture diagram; labels listed by layer]
• Application
  – C, C++, Java Libraries
  – Linux I/O
  – Unix Shell
  – DLL / Python, Perl
  – Java, NT Browser
  – Kepler Actors
  – HTTP, DSpace, OpenDAP
  – OAI, WSDL, WSRF
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Virtualization
  – Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  – ORB
  – File Systems: Unix, NT, Mac OSX
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix

Examples of Extensibility
• Storage Repository Driver evolution
  – Initially supported Unix file system
  – Added archival access - UniTree, HPSS
  – Added FTP/HTTP
  – Added database blob access
  – Added database table interface
  – Added Windows file system
  – Added project archives - Dcache, Castor, ADS
  – Added Object Ring Buffer, Datascope
  – Adding GridFTP version 3.3
• Database management evolution
  – Postgres
  – DB2
  – Oracle
  – Informix
  – Sybase
  – mySQL (most difficult port - no locks, no views, limited SQL)

Examples of Extensibility
• The 3 fundamental APIs are C library, shell commands, Java
  – Other access mechanisms are ported on top of these interfaces
• API evolution
  – Initial access through C library, Unix shell command
  – Added iNQ Windows browser (C++ library)
  – Added mySRB Web browser (C library and shell commands)
  – Added Java (Jargon)
  – Added Perl/Python load libraries (shell command)
  – Added WSDL (Java)
  – Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  – Added Kepler actors for dataflow access (Java)
  – Adding GridFTP version 3.3 (C library)

Sites Using the SRB
[Figure]

Grid Interfaces
• GSI, support versions 1, 2, 3, Java
• GridFTP version 3.3 interface to SRB collection
  – Use GSI certificate to identify the user to the SRB
  – Reference files through the SRB logical name space
  – Use SRB access controls for allowed operations
  – Initially support serial transport
  – SRB supports 4 different firewall interaction protocols (client-driven parallel I/O, server-driven parallel I/O, bulk file registration, federated data grid access)
• GridFTP version 3.3 driver for SRB collection
  – Store data at a remote site under the SRB ID
    • Data will be shareable through SRB access controls
  – Store data at a remote site under user GSI certificate
    • Data will not be shareable through SRB access controls

Grid Interfaces
• Replica Location Service Interface
  – Simon Metson <s.metson@bristol.ac.uk>
  – GMCat mimics the LRC interface, enabling the files registered in an MCat to appear on the Giggle framework (RLS)
  – Available from http: //tuber 1. phy. bris. ac. uk: 8080/GMCat. WS 3
  – (also linked from the third party software on the SRB page)
• Storage Resource Manager
  – SRM Version 1, SRB driver created to store data in SRM
  – SRM Version 2, development effort to put SRM interface on top of SRB (Alasdair Earl)
  – SRM Version 3, development effort to put SRM interface on top of SRB (Peter Kunszt)

Conclusion
• Distributed data management systems can be built on generic data grid infrastructure
  – Data grids to support bulk access across remote sites
  – Integration of data grid and digital library capabilities to manage massive data collections
  – Federation of data grids to build international discipline-wide collections

SDSC SRB Team (left to right)
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
  – Erik Vandekieft
  – Reena Mathew
  – Xi (Cynthia) Sheng
  – Allen Ding
  – Grace Lin
  – Qiao Xin
  – Daniel Moore
  – Ethan Chen
  – Jon Weinburg
Supported by about 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)

For More Information
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB.html