Introduction to the EGI DataHub
Lukasz Dutka
EGI-Engage JRA2.1 (WP4.1)
www.egi.eu
EGI-Engage is co-funded by the Horizon 2020 Framework Programme of the European Union under grant number 654142
EGI DataHub Role
• Unified access to reference scientific data of public interest.
• Host experimental or temporary scientific data and enable easy access to it by appropriate scientific applications.
• DataHub itself is not intended to be used simply to host data on EGI on behalf of projects for their exclusive use.
• Distributed platform for managing replicas of publicly available data collections on the EGI infrastructure.
Data here could mean datasets: collections of data or filesets at a level of granularity considered useful to user communities.
10/16/2021
Current Landscape
[Diagram labels: Public Data Repository X (Community Specific Data Discovery); Public Data Repository Y (Community Specific Data Discovery); S3 / AWS, Existing Replica (Public Clouds); Lustre (EGI Resource Centres); S3 / Ceph (EGI Resource Centres); Private Comp. Cloud; NFS (Private Resources)]
The Landscape Changed by DataHub
DataHub Processes / Use Cases
• EGI.eu collects expressions of interest in data collections that should be available in DataHub.
• EGI.eu finds Resource Centres willing to host replicas of the collections.
• EGI.eu and Data Providers negotiate the technical means of replicating data collections to the Resource Centres.
• Data Providers inform users about EGI replicas.
• Data collections are discoverable in AppDB.
• Users “link” collections of interest into their personal data hub space through the Data Providers' JS widget or AppDB.
• Users can access data collections via a POSIX virtual file system or HTTP requests on all EGI resources.
• Users can share their private resources in the same way as public collections.
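The POSIX access path in the list above means that ordinary file I/O is enough to read a linked collection. A minimal sketch, assuming a hypothetical mount point and file layout: the temporary directory below only stands in for a mounted collection, and `first_bytes` and `collection/sample.dat` are illustrative names, not part of any DataHub interface. HTTP access is not shown.

```python
import tempfile
from pathlib import Path


def first_bytes(mount_root: Path, relative_path: str, count: int = 16) -> bytes:
    """Read the first `count` bytes of a file below a (hypothetical) mount point."""
    with open(mount_root / relative_path, "rb") as handle:
        return handle.read(count)


# A temporary directory stands in for the mounted virtual file system.
with tempfile.TemporaryDirectory() as tmp:
    mount_root = Path(tmp)
    (mount_root / "collection").mkdir()
    (mount_root / "collection" / "sample.dat").write_bytes(b"example payload")
    header = first_bytes(mount_root, "collection/sample.dat", 7)

print(header)
```

The application code does not need to know whether the path is local or backed by a remote replica; that is the point of exposing the data through a file system interface.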
Optimizations and Performance
• DataHub does not replicate a data collection from the Data Providers more than once.
• DataHub is a platform to be deployed on many Resource Centres.
• DataHub is horizontally scalable to provide very large capacity and high performance.
• DataHub is designed for high-performance data delivery in the GB/s+ range per node.
• DataHub is designed for high-speed data replication between sites, taking into account links of 10+ Gbps.
• DataHub is designed for lazy replication.
• The DataHub software stack is designed to map existing third-party replicas where available (e.g. AWS replicas used for Amazon users).
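The "replicate once" and "lazy replication" points above can be illustrated with a small sketch. It assumes a block-level cache at a site: a block is pulled from the origin only when it is first read, and later reads are served locally. The class `LazyReplica`, the function `fake_origin`, and the block addressing are hypothetical and are not taken from the DataHub software.

```python
from typing import Callable, Dict, Tuple


class LazyReplica:
    """Hypothetical site-local replica that fetches blocks from the origin on first read only."""

    def __init__(self, fetch_from_origin: Callable[[str, int], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._blocks: Dict[Tuple[str, int], bytes] = {}
        self.origin_fetches = 0  # how many times the origin was contacted

    def read_block(self, path: str, index: int) -> bytes:
        key = (path, index)
        if key not in self._blocks:
            # First access: pull the block from the origin and keep it locally.
            self._blocks[key] = self._fetch(path, index)
            self.origin_fetches += 1
        return self._blocks[key]


def fake_origin(path: str, index: int) -> bytes:
    """Stand-in for a Data Provider; returns synthetic block contents."""
    return f"{path}#{index}".encode()


replica = LazyReplica(fake_origin)
replica.read_block("collection/a.dat", 0)
replica.read_block("collection/a.dat", 0)  # served locally, no second origin fetch
replica.read_block("collection/a.dat", 1)
print(replica.origin_fetches)  # 2
```

Blocks that are never read are never transferred, which is what makes the replication lazy.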
Data Discovery Issues
• DataHub is not meant to replace community-specific data discovery mechanisms.
• DataHub provides a basic interface for access to metadata.
• DataHub provides an interface for building added-value discovery services running on top of the EGI Cloud.
• Ideally, DataHub should be integrated with existing discovery web pages using a JS widget that enables direct “linking” of selected data to make it available on cloud resources.
• Data collections replicated to DataHub should be registered in EGI AppDB, but AppDB is not intended to be a major data discovery portal.
Concept of Data Widgets
DataHub Business Ideas
• DataHub may provide detailed accounting of data usage that can be converted into costing for users.
• Possible revenue-sharing model between data producers and the Resource Centres storing data and providing computational resources.
• Computational power available in the Resource Centres powering DataHub can be exploited as part of the business model.
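The step from usage accounting to costing and revenue sharing can be sketched as simple arithmetic. Everything numeric here is an assumption made for illustration: the price per gigabyte, the producer share, the record format, and the function names are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple

PRICE_PER_GB = Decimal("0.02")    # hypothetical price per gigabyte read
PRODUCER_SHARE = Decimal("0.30")  # hypothetical share paid to the data producer


def cost_per_user(records: Iterable[Tuple[str, int]]) -> Dict[str, Decimal]:
    """Turn (user, gigabytes read) accounting records into a cost per user."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for user, gigabytes in records:
        totals[user] += PRICE_PER_GB * gigabytes
    return dict(totals)


def split_revenue(total: Decimal) -> Tuple[Decimal, Decimal]:
    """Split revenue between the data producer and the Resource Centre."""
    producer = total * PRODUCER_SHARE
    return producer, total - producer


records = [("alice", 500), ("bob", 1000), ("alice", 500)]
costs = cost_per_user(records)
producer_cut, centre_cut = split_revenue(sum(costs.values(), Decimal(0)))
print(costs, producer_cut, centre_cut)
```

`Decimal` is used instead of floats so that monetary amounts do not pick up binary rounding errors.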
High Performance Access
55 Gbit/s ≈ 7 GB/s on a single node
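The figure above follows from dividing by 8 bits per byte, so the 7 GB/s is a rounded value. A short check, assuming decimal units (1 GB = 10^9 bytes); the function name is illustrative:

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a rate in Gbit/s to GB/s (8 bits per byte, decimal prefixes)."""
    return gbit_per_s / 8.0


rate = gbit_to_gbyte_per_s(55)
print(rate)  # 6.875
```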
Thank you for your attention. Questions?
www.egi.eu
This work by Parties of the EGI-Engage Consortium is licensed under a Creative Commons Attribution 4.0 International License.