Realising Virtual Research Environments by Hybrid Data Infrastructures: the D4Science Experience
Andrea Manzi (CERN); Leonardo Candela, Donatella Castelli, Pasquale Pagano (ISTI-CNR)
ISGC 2014, Taipei, 25 March 2014
Outline
• The D4Science Infrastructure
• The Supporting Projects
• The Infrastructure Constituents
• The Virtual Research Environments
• VRE Examples
The D4Science Infrastructure
• Geographically distributed computing infrastructure
• Service allocation, deployment, monitoring, and operation
• Across administrative boundaries
• Across private and commercial providers
• Uniform resource and data access
• Production-level infrastructure deployed and maintained during the D4Science (2007) and D4Science II (2009) projects
Hybrid Data Infrastructure
• An HDI is an IT infrastructure where research resources (HW, SW, data) can be shared and exploited on demand
• built on existing systems, infrastructures and repositories
• supporting an innovative application delivery model
• computing, storage, data and software made available as-a-Service
[Figure: Applications #1 … #N sit on top of the Hybrid Data Infrastructure, which in turn builds on Infrastructures/systems A … Z providing servers, data, services and apps]
Supporting two models of provision
• For end-users
  – A GUI-centric approach focusing on visual interfaces for accessing Data Infrastructure facilities via a Web browser
• For service providers
  – An API-centric approach focusing on a comprehensive set of specifications and methods for accessing HDI facilities programmatically
• Operate a large-scale HDI supporting the Ecosystem Approach to Fishery and Conservation of Marine Living Resources
• Exploit the D4Science infrastructure and its interface with existing grid (EGI) and cloud (MS Azure via the VENUS-C platform) infrastructures and data sources (via SDMX, TAPIR, DiGIR, …)
• Manage the entire data lifecycle for data from any domain: species observations, socio-statistical data, documents and environmental monitoring data
• Support ecological niche modelling, temporal and spatial data harmonization, statistical data analysis with R, and data mining
• Serve statisticians, fishery biologists, marine ecologists, economists, lawyers, enforcement bodies (customs, coast guards) and conservationists
• Exploit the D4Science infrastructure and its interface with existing cloud (MS Azure and COMPSs via the VENUS-C platform) infrastructures and data sources (via SDMX, TAPIR, DiGIR, …)
• Provide open access to existing grid and cloud resources and software platforms across continents
• Operate a large-scale HDI serving Biodiversity Science in Europe and Brazil
• Combine Biodiversity Science and the Open Access Movement
• Integrate regional and global taxonomies
• Support biodiversity scientists who want to build, test and project models of species distribution
Enabling Technology
• The D4Science infrastructure is powered by gCube: https://www.ohloh.net/p/gCube
Enabling Layer
• Software Platform
• Information System
• Resource Management
• Workflow Engine
Enabling Layer: Information System
A scalable and reliable framework
– supporting an extensible notion of resource (HW, data, services)
– open to modular extensions at runtime by arbitrary third parties
• Registration
• Discovery
• Notification
• Monitoring
• Inspection
• Assignment
• Accounting
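The registration, discovery and notification functions listed above can be illustrated with a minimal in-memory sketch. This is not the gCube Information System API: the `Resource` and `Registry` classes, their method names and the property keys are all hypothetical, chosen only to show how an extensible resource notion might be registered, queried and announced to subscribers.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """Hypothetical resource profile: an id, a kind ('HW', 'Data', 'Service', ...)
    and free-form properties, so new kinds need no schema change."""
    resource_id: str
    kind: str
    properties: Dict[str, str] = field(default_factory=dict)


class Registry:
    """Toy registry covering registration, discovery and notification only."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}
        self._subscribers: Dict[str, List[Callable[[Resource], None]]] = defaultdict(list)

    def subscribe(self, kind: str, callback: Callable[[Resource], None]) -> None:
        # Notification: call `callback` whenever a resource of `kind` is registered.
        self._subscribers[kind].append(callback)

    def register(self, resource: Resource) -> None:
        # Registration: store the profile, then notify interested subscribers.
        self._resources[resource.resource_id] = resource
        for callback in self._subscribers[resource.kind]:
            callback(resource)

    def discover(self, kind: str, **props: str) -> List[Resource]:
        # Discovery: match on kind and on every requested property value.
        return [
            r for r in self._resources.values()
            if r.kind == kind and all(r.properties.get(k) == v for k, v in props.items())
        ]
```

Monitoring, inspection, assignment and accounting are left out of the sketch; in a real deployment the registry would also be distributed rather than a single in-process dictionary.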
Enabling Layer: Resource Management
A distributed framework managing a trusted resource network
Dynamic Deployment
• remote deployment of resources across the infrastructure
Resource Lifetime Management
• management of the full lifetime of resources, from creation and publication to discovery, access and consumption
Virtual Research Environment Management
• cost-effective creation, operation and maintenance of Virtual Research Environments
Interoperability, openness and integration at software level
• third-party software can be added to the Data e-Infrastructure at runtime: Web Applications (running in Tomcat); Web Services (running in service containers, e.g. JAX-WS, Axis); Executables (e.g. POJO, shell script, …)
Enabling Layer: Workflow Engine
Based on adaptors for execution on internal or external resources:
– JDLAdaptor: parses a Job Description Language (JDL) definition block and translates the described job or DAG of jobs into an Execution Plan that can be submitted to the ExecutionEngine (gCube) for execution.
– GridAdaptor: constructs an Execution Plan that can contact an EMI UI node, then submit, monitor and retrieve the output of a grid job.
– CondorAdaptor: constructs an Execution Plan that can contact a Condor gateway node, then submit, monitor and retrieve the output of a Condor job.
– HadoopAdaptor: constructs an Execution Plan that can contact a Hadoop UI node, then submit, monitor and retrieve the output of a MapReduce job.
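To make the "DAG of jobs into an Execution Plan" step concrete, here is a minimal sketch that turns a job dependency graph into ordered stages. It does not parse JDL and does not reflect the actual JDLAdaptor or ExecutionEngine interfaces; the function name `build_execution_plan`, the job names and the "list of stages" plan format are assumptions made for illustration. It relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def build_execution_plan(dag: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Turn a job DAG into stages; jobs within one stage have no mutual
    dependencies and could run in parallel.

    `dag` maps each job name to the jobs it depends on (hypothetical format).
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the plan deterministic
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Illustrative job graph (names are made up for this example).
jobs = {
    "fetch": [],
    "clean": ["fetch"],
    "model": ["clean"],
    "plot": ["clean"],
    "report": ["model", "plot"],
}
plan = build_execution_plan(jobs)
```

In this sketch each stage would then be handed to whichever back-end adaptor (grid, Condor, Hadoop) is responsible for submission and monitoring.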
Infrastructure Constituents: Technologies
• The D4Science infrastructure hosts a set of components on top of different technologies to make available a large variety of services for managing, manipulating and processing data and metadata within an autonomously managed infrastructure:
– MS Azure
– EGI
– VENUS-C
– COMPSs
– PMES
– u.store
– openModeller
– MongoDB, Cassandra, Hadoop, GeoNetwork, ElasticSearch
– …
Infrastructure Constituents: Services and Data
• The D4Science infrastructure leverages existing data sources ranging from species data (species names, synonyms, taxonomical classifications, spatial occurrences) to literature and images
– OBIS, http://www.iobis.org/
– MyOcean, http://www.myocean.eu/
– Catalogue of Life, http://www.catalogueoflife.org/
– FishBase, http://www.fishbase.org/
– speciesLink, http://splink.cria.org.br/
– Biodiversity Heritage Library, http://www.biodiversitylibrary.org/
– Bioline International, http://www.bioline.org.br/
– Global Biodiversity Information Facility (GBIF), http://www.gbif.org/
A Virtual Research Environment (VRE) is
• a distributed and dynamically created environment
• where a subset of data, services, computational, and storage resources
• regulated by tailored policies
• is assigned to a subset of users via interfaces
• for a limited timeframe
L. Candela, D. Castelli, P. Pagano (2013) Virtual Research Environments: An Overview and a Research Agenda. Data Science Journal, Vol. 12
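The definition above can be read as a small data structure: a named set of resources, a set of users, and a validity window. The sketch below is only one possible reading; the class name, field names and the single access rule are assumptions, and real tailored policies would be richer than a membership check.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet


@dataclass(frozen=True)
class VirtualResearchEnvironment:
    """Hypothetical model of a VRE: a subset of resources assigned to a
    subset of users for a limited timeframe."""
    name: str
    resources: FrozenSet[str]
    users: FrozenSet[str]
    valid_from: datetime
    valid_until: datetime

    def can_access(self, user: str, resource: str, when: datetime) -> bool:
        # Simplest possible policy: the user and the resource must both belong
        # to the VRE, and the request must fall inside its timeframe.
        return (
            user in self.users
            and resource in self.resources
            and self.valid_from <= when < self.valid_until
        )


# Illustrative values only; none of these names come from the slides.
vre = VirtualResearchEnvironment(
    name="demo-vre",
    resources=frozenset({"species-db", "clustering-service"}),
    users=frozenset({"alice"}),
    valid_from=datetime(2014, 1, 1),
    valid_until=datetime(2014, 7, 1),
)
```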
Virtual Research Environment (VRE)
Cost-effective creation and management
• Definition
• Creation
• Configuration
The Social Extension: Workspace
• A virtual drive to which you can upload, and from which you can download, the files needed by the services and the results they produce
• Files can be organized into folders, which can be shared
• Support for public URIs
• Support for WebDAV
The Social Extension: Messages and Notifications
• Emails in the Cloud
• Customizable Alerts
The Social Extension: News Feed
• Share News
• User-shared News
• Application-shared News
Outline
• VRE Examples
ICIS VRE - Tabular Data Analysis
• Import code lists
• Validate datasets
• Analyse and project
ScalableDataMining VRE - Features
• Clustering presence points (FishBase + OBIS)
• Density-based clustering: DBSCAN
• Other methods are also available: K-Means, X-Means, …
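For readers unfamiliar with density-based clustering, the following is a compact pure-Python DBSCAN sketch. It is not the VRE's implementation: the function name, the parameters `eps` and `min_pts`, the plain Euclidean distance on coordinate pairs and the sample points are all assumptions for illustration (real presence points are longitude/latitude and would call for a geographic distance).

```python
from math import dist
from typing import List, Optional, Sequence, Tuple

NOISE = -1  # label for points that belong to no cluster


def dbscan(points: Sequence[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point (0, 1, ...), or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return [label if label is not None else NOISE for label in labels]


# Made-up coordinates: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
```

Unlike K-Means, this approach does not need the number of clusters in advance and leaves isolated observations unassigned, which suits unevenly sampled occurrence data.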
AquaMaps VRE - Ecological Modeling
• access to external databases
• extensible with predictive algorithms
• exploits several computational back-ends
• uses several storage technologies (RDBMS, Column Store, Blob)
• publishes distributions to geospatial Web services
SpeciesLab VRE - Cross-Mapper
• Detecting and reporting differences between species checklists
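A checklist comparison of this kind can be sketched with plain set operations. The code below is a simplified illustration, not the Cross-Mapper's actual algorithm: it assumes each checklist is a mapping from scientific name to a taxonomic status string, and the function name, report keys and sample entries are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Report = Dict[str, Union[List[str], Dict[str, Tuple[str, str]]]]


def compare_checklists(first: Dict[str, str], second: Dict[str, str]) -> Report:
    """Report names found in only one checklist and names whose status differs."""
    shared = first.keys() & second.keys()
    return {
        "only_in_first": sorted(first.keys() - second.keys()),
        "only_in_second": sorted(second.keys() - first.keys()),
        "status_conflicts": {
            name: (first[name], second[name])
            for name in sorted(shared)
            if first[name] != second[name]
        },
    }


# Made-up checklist contents, for illustration only.
checklist_a = {"Gadus morhua": "accepted", "Thunnus thynnus": "accepted", "Name A": "synonym"}
checklist_b = {"Gadus morhua": "accepted", "Thunnus thynnus": "synonym", "Name B": "accepted"}
report = compare_checklists(checklist_a, checklist_b)
```

A real cross-mapping also has to handle spelling variants, authorship strings and synonym resolution, which exact name matching ignores.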
MarineSearch VRE - Information Retrieval
• Search over several OAI-PMH repositories
• Semantic post-processing
• Entity enrichment
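As background on what querying several OAI-PMH repositories involves, the sketch below builds `ListRecords` harvesting requests for a list of endpoints. The verb and the `metadataPrefix`, `set` and `from` arguments come from the OAI-PMH protocol; the function name and the example base URLs are hypothetical, and no request is actually sent.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for one repository."""
    params: Dict[str, str] = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # restrict harvesting to one set
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoints; real repository base URLs would go here.
repositories: List[str] = [
    "https://repo-one.example.org/oai",
    "https://repo-two.example.org/oai",
]
requests_to_send = [list_records_url(url, from_date="2014-01-01") for url in repositories]
```

The harvested records would then feed the search index; the semantic post-processing and entity enrichment steps are outside the scope of this sketch.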
Summary
The D4Science Infrastructure, implementing the HDI approach:
• enables heterogeneous resource sharing between cross-domain infrastructures
• collects, under a common environment, resources coming from several e-infrastructures
• successfully hosts Virtual Research Environments for members of different user communities
A sustainability plan is under development, targeting future EU funding and/or public-private partnerships
Questions? Thanks for your attention
Landscape: D4Science e-Infrastructure, gCube Framework, gCube Apps
Links: www.d4science.org, i-marine.d4science.org, eubrazilopenbio.d4science.org, www.i-marine.eu, www.eubrazilopenbio.eu