Research Objects for improved sharing and reproducibility
Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology
Oscar Corcho, @ocorcho, http://slideshare.net/ocorcho
Ontology Engineering Group, Universidad Politécnica de Madrid (and the Research Object community group)
My motivation
Some memos from our futuristic scenario
• Don’t publish, release (ack. Carole Goble), reloaded (ack. Paul Groth)
• Don’t just read a paper, but also view it, play with it, and whatever else
• Convert passive papers into active scientific storytellers and alert systems
A few quotes from this week
• Data (and method) sharing
  • Dietrich: The method for investigation is not clearly described
  • Eric: Provide links between articles and datasets (interlinking of scholarly content)
  • William: Methods are normally reduced to a tiny piece of text
• Reproducibility
  • Working group on “the present”: Crisis of replicability is driving increased concern and interest
  • Eric: 70% of science articles are not reproducible
Act 1: Data and method sharing
One of the many origins of “Don’t Publish, Release”
• A day in Granada… (January 2012)
• Let’s put some of the interesting discussions from the Force11 Dagstuhl meeting into practice
One of the origins of “Don’t Publish, Release”
[Figure: lifecycle of a Research Object (RO). Actors: Scientist, Librarian/Curator]
Events on the scientist’s Live RO:
• My supervisor calls me to report my work
• My supervisor calls me again and we decide to publish our RO+paper
• Reviews received and final version published
• A new Ph.D. student continues my work
Objects:
• RO snapshot: identified by a URI; some metadata; some curation; mostly private (for my group)
• RO snapshot: identified by a URI; some metadata; some curation; mostly private (for my group and for paper reviewers)
• Archived RO: identified by a URI; good metadata and curation; mostly public
Relations between objects: <<copy>>, <<copy, filter and curate>>, <<versionOf>>
How do you usually structure your experiment?
• In a set of folders?
  • These could be profiles for how you normally structure your research
• Dropbox? Google Drive? GitHub?
• Overleaf+figshare? Whatever???
Scattered Assets
[Figure labels: GitHub, figshare, SlideShare, arXiv.org, community db]
Research Objects
• A framework to bundle, port and link (scattered) resources and related experiments
• Metadata objects that carry research context
• Units of exchange
• Multi-various products, platforms, resources
• First-class citizens: id, manage, credit, track, profile, focus
http://www.researchobject.org
RO main principles
• Identity: refer to aggregations and their contents
• Aggregation: manifest; describe group & constituents; external ids, local files
• Metadata (description)
  • Interpretation: the objects and how they are linked together
  • Attribution: who, when, where, why?
RO main principles: technologies
• Identity: ORCID, DOIs, Handles, URIs; identity persistence and resolution, names, citation
• Annotation: W3C OADM; annotation first class and stand-off
• Aggregation: OAI-ORE; aggregations, resource maps, proxies; manifest; point of extendability
RO Model Ontology
• Defines core concepts of research objects: identity, aggregation, annotation. Used in the manifest
• http://w3id.org/ro/
Manifest – remote and local on my machine
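A minimal sketch of what such a manifest could look like, assuming the JSON layout used by the RO Bundle specification (keys such as `aggregates`, `annotations` and `createdBy`). The file names, the example.org URL and the ORCID are hypothetical placeholders, and `build_manifest` is an illustrative helper, not part of any RO library.

```python
import json


def build_manifest(local_files, remote_uris, creator_name, creator_orcid):
    """Return a manifest dict that aggregates local files and remote resources."""
    # Local files are referenced by a path relative to the RO root ("/").
    aggregates = [{"uri": "/" + path.lstrip("/")} for path in local_files]
    # Remote resources are referenced by their absolute URI.
    aggregates += [{"uri": uri} for uri in remote_uris]
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "createdBy": {"name": creator_name, "orcid": creator_orcid},
        "aggregates": aggregates,
        "annotations": [],
    }


manifest = build_manifest(
    local_files=["data/results.csv", "scripts/analysis.py"],  # hypothetical
    remote_uris=["http://example.org/dataset/42"],            # hypothetical
    creator_name="A. Scientist",
    creator_orcid="https://orcid.org/0000-0000-0000-0000",
)

# Stand-off annotation: metadata about one aggregated resource, kept in its own file.
manifest["annotations"].append(
    {"about": "/data/results.csv", "content": "annotations/results-provenance.ttl"}
)

manifest_json = json.dumps(manifest, indent=2)
```

The three principles map onto the structure: identity is carried by the URIs and the ORCID, aggregation by `aggregates`, and annotation by `annotations`.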
https://researchobject.github.io/specifications/bundle/
• Export, archive, publish and transfer ROs
• File format for storage and distribution of ROs as a ZIP archive
• Includes an RO’s manifest, annotations and some or all of its aggregated resources
• Basis for more specific file formats
• Backwards compatible: it’s ZIP
• Programmatic access: JSON and JSON-LD manifest, API
https://w3id.org/bundle/
doi:10.5281/zenodo.10440
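Because a bundle is an ordinary ZIP archive, it can be produced with standard tooling. The sketch below uses only the Python standard library and assumes the layout described in the RO Bundle specification as I understand it: an uncompressed `mimetype` entry first, then the manifest at `.ro/manifest.json`. The manifest content and the file name are hypothetical, and `write_bundle` is an illustrative helper.

```python
import io
import json
import zipfile

# Media type assumed from the RO Bundle specification; verify before relying on it.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(target, manifest, files):
    """Write mimetype (stored, first entry), the manifest, then the resources."""
    with zipfile.ZipFile(target, "w") as bundle:
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            bundle.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)


buffer = io.BytesIO()  # an in-memory file; a path on disk works the same way
write_bundle(
    buffer,
    manifest={"id": "/", "aggregates": [{"uri": "/data/results.csv"}]},
    files={"data/results.csv": "x,y\n1,2\n"},  # hypothetical resource
)

# Any ZIP reader can open the result, which is the backwards-compatibility point.
with zipfile.ZipFile(buffer) as bundle:
    names = bundle.namelist()
    stored_mimetype = bundle.read("mimetype").decode("ascii")
```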
Containers
Research Objects: Scopes and Tooling
• http://www.researchobject.org/scopes/
  • Farr Commons: http://www.farrcommons.org/
  • ISA and FAIR-DOM: http://fair-dom.org/
  • SEEK: http://seek4science.org/
  • COMBINE
  • BagIt (soon)
• White-labelled sci-domain-independent software
  • http://rohub.linkeddata.es/
  • http://www.rohub.org/
• http://www.researchobject.org/specifications/
  • Core ontologies and extensions
  • RO managers/APIs/bundling (Ruby, Java, Python)
  • Latex2RO
  • LDP4RO
Publishing may be as easy as…
• Providing the URL of the Research Object to the publisher, with a release tag, to start the review process (if extra review is needed)
Act 2: Reproducibility
Terminology (inspired by [Goble, 2012])
[Figure labels: Preservation, Conservation, Restoration, Reconstruction]
The Research Method in different disciplines
[Figure labels: in vivo/in vitro; in silico; scientific procedure; input data; equipment]
The Research Method in different disciplines
[Figure labels: Experiment; Lab book; Laboratory protocol (recipe); Workflow; Digital log]
Some problems in lab protocols
• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20 °C overnight.
✓ Some of them present insufficient granularity.
✓ The instructions can be imprecise or ambiguous due to the use of natural language.
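The kind of imprecision in the instructions above can be flagged mechanically. This is a rough sketch using hand-written heuristics (a list of vague words, and a duration and temperature check for incubation steps); the word list and rules are assumptions made for illustration and are not the SMART Protocols approach.

```python
import re

# Hypothetical list of words that hide a missing quantity.
VAGUE_TERMS = ("briefly", "gentle", "gently", "overnight")
# A number followed by a time unit, e.g. "5 min".
DURATION = re.compile(r"\b\d+(\.\d+)?\s*(s|sec|min|h|hours?)\b", re.IGNORECASE)
# A number followed by "C" or "°C", e.g. "-20 C".
TEMPERATURE = re.compile(r"-?\d+(\.\d+)?\s*°?\s*C\b")


def review_instruction(text):
    """Return a list of human-readable issues found in one protocol instruction."""
    issues = []
    lowered = text.lower()
    for term in VAGUE_TERMS:
        if re.search(r"\b" + term + r"\b", lowered):
            issues.append(f"vague term: {term}")
    # Incubation steps need both a duration and a temperature to be repeatable.
    if lowered.startswith("incubate"):
        if not DURATION.search(text):
            issues.append("missing duration")
        if not TEMPERATURE.search(text):
            issues.append("missing temperature")
    return issues


for line in [
    "Incubate the centrifuge tubes in a water bath.",
    "Incubate the samples for 5 min with gentle shaking.",
    "Rinse DNA briefly in 1-2 ml of wash.",
    "Incubate at -20 C overnight.",
]:
    print(line, "->", review_instruction(line))
```

Every one of the four example instructions is flagged, which is the point of the slide: none of them could be repeated without asking the author.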
Currently…
How to formalize the information from laboratory protocols as a knowledge base?
[Figure labels: Unstructured information; Ontologies + NLP tools; Semi-structured information]
SMART Protocols - document
✓ Rhetorical and structural components (e.g. introduction, materials, and methods)
✓ Information such as the application of the protocol, advantages and limitations, list of reagents, critical steps
SMART Protocols - wf
Representation of the workflow aspects in protocols
✓ Implicit order in the instructions, following the input-output structure
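The idea that the order of instructions follows from their inputs and outputs can be sketched as a dependency sort. The step names and materials below are hypothetical, and the code uses Python's standard `graphlib` module (Python 3.9 or later) rather than any SMART Protocols tooling.

```python
from graphlib import TopologicalSorter


def order_steps(steps):
    """steps maps a step name to {"inputs": set, "outputs": set}; return an execution order."""
    # Record which step produces each material.
    producers = {}
    for name, ports in steps.items():
        for produced in ports["outputs"]:
            producers[produced] = name
    # A step depends on every step that produces one of its inputs.
    graph = {
        name: {producers[needed] for needed in ports["inputs"] if needed in producers}
        for name, ports in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical DNA extraction steps, deliberately listed out of order.
protocol = {
    "wash": {"inputs": {"pellet"}, "outputs": {"clean DNA"}},
    "lyse": {"inputs": {"sample"}, "outputs": {"lysate"}},
    "precipitate": {"inputs": {"lysate"}, "outputs": {"pellet"}},
}

print(order_steps(protocol))
```

Making inputs and outputs explicit is what allows the order to be recovered; a protocol written only as prose leaves that structure implicit.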
SMART Protocols documentation
• The SMART Protocols ontology is available here:
  • http://vocab.linkeddata.es/SMARTProtocols/
• Giraldo O, García-Castro A, Corcho O. SMART Protocols: SeMAntic RepresenTation for Experimental Protocols. LISC 2014
SMART Protocols in action
[Figure: RDF graph describing a DNA extraction protocol]
Node labels: sp:title of the protocol; sp:author entry; sp:experimental protocol; sp:DNA extraction; sp:advantages; sp:sample; sp:application of the protocol
Edge labels: rdf:type; sp:hasTitle; sp:hasAuthor; owl:subClassOf; ro:partOf
sp = SMART Protocols, ro = Relation Ontology
Vocabularies and methodologies for representing and publishing workflows
• Workflow provenance and workflow plan vocabularies: http://www.opmw.org/ontology/, http://purl.org/net/p-plan
• Methodology for workflow publishing
[Figure labels: WINGS on local laptop, WINGS on shared host, WINGS on web server (each with Core, Portal, Workflow Template, Workflow Instance, PROV export); other workflow environments; Wings workflow generation; OPM/PROV conversion; RDF triple store; Linked Data publication; interactive browsing (Pubby frontend); users; programmatic access (external apps); publication, share, reuse]
Repository of linked workflows: http://www.opmw.org/sparql
Daniel Garijo and Yolanda Gil. 2011. A new approach for publishing workflows: abstractions, standards, and linked data. (WORKS '11). ACM, New York, NY, USA, 47-56.
Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. In Proceedings of the 2nd International Workshop on Linked Science 2012, Boston, 2012.
Definition of workflow abstractions
• Catalog of common independent workflow abstractions (motifs)
  • Data-oriented motifs: what kind of manipulations does the workflow have?
  • Workflow-oriented motifs: how does the workflow perform its operations?
• Analysis of 260 workflows from 10 domains, belonging to 5 different workflow systems
http://purl.org/net/wf-motifs#
Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, Volume 36, July 2014, Pages 338-351.
Finding and evaluating common abstractions
https://github.com/dgarijo/FragFlow
• Graph mining techniques
• Workflow fragment filtering techniques
• Workflow fragment representation and linkage: http://purl.org/net/wf-fd
Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. In The 10th IEEE International Conference on e-Science, Guaruja, 2014.
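To give a feel for what a common fragment means, the sketch below counts which single edges (pairs of consecutive step types) recur across a corpus of workflows. This is a deliberately simplified stand-in: it is not the graph mining used in FragFlow, which looks for larger subgraphs, and the workflows shown are invented.

```python
from collections import Counter


def frequent_edges(workflows, min_support):
    """Return the edges that appear in at least min_support workflows."""
    counts = Counter()
    for edges in workflows:
        counts.update(set(edges))  # count each edge at most once per workflow
    return {edge for edge, seen in counts.items() if seen >= min_support}


# Hypothetical corpus: each workflow is a list of (producer step, consumer step) edges.
workflows = [
    [("align", "filter"), ("filter", "plot")],
    [("align", "filter"), ("filter", "export")],
    [("load", "align"), ("align", "filter")],
]

common = frequent_edges(workflows, min_support=2)
print(common)
```

Here only the align-then-filter edge is shared by at least two workflows, so it would be the candidate for a reusable fragment.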
How to preserve Workflows/Research Objects?
Three main ways/levels:
• Descriptive reproducibility
  • Documentation
• Workflow execution reproducibility
  • Can we run the workflow?
• Workflow results reproducibility
  • Can we get the same results?
Checklists!
• Corcho et al.: Checklist for workflow conservation
  • http://dx.doi.org/10.6084/m9.figshare.1285011
  • 40 different aspects
    • Documentation
    • Goals
    • Results
    • Metadata
• Corcho et al.: Checklist for a workflow conservation plan
  • http://dx.doi.org/10.6084/m9.figshare.1285012
  • Based on the DCC’s data management plan
Some examples
[Figure labels: Levels of reproducibility; Workflow conservation plan]
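The third level, results reproducibility, can be checked in its strictest form by comparing checksums of the original outputs against those of a re-run. The sketch below assumes byte-identical outputs are the criterion, which is often too strict in practice (timestamps in logs, floating-point differences); the file names and contents are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 checksum of one output file's content."""
    return hashlib.sha256(data).hexdigest()


def compare_results(original, rerun):
    """Map each original output name to 'same', 'different', or 'missing'."""
    report = {}
    for name, data in original.items():
        if name not in rerun:
            report[name] = "missing"      # the re-run did not produce this output
        elif digest(data) == digest(rerun[name]):
            report[name] = "same"         # byte-identical result
        else:
            report[name] = "different"    # produced, but content changed
    return report


# Hypothetical outputs of the original execution and of a re-execution.
original = {"mosaic.fits": b"abc", "log.txt": b"run 1", "stats.csv": b"1,2"}
rerun = {"mosaic.fits": b"abc", "log.txt": b"run 2"}

report = compare_results(original, rerun)
print(report)
```

A "missing" entry points to an execution reproducibility problem (the workflow did not run to completion), while "different" points to a results reproducibility problem.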
Reproducibility of Computational Scientific Experiments
[Figure: the former equipment is annotated (semantic annotations), and an equivalent execution environment is reproduced in the cloud]
Workflow systems: Dispel4Py, Pegasus, Makeflow
Workflows: Montage, SoyKB, Epigenomics, Internal Extinction, Seismic Cross Correlation, Blast
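The annotate-then-reproduce idea can be illustrated with plain data structures: the former equipment is described as a set of requirements, and a candidate cloud image is checked against them. The dictionary layout, package names, versions and resource figures are all hypothetical; the actual approach uses semantic annotations rather than Python dicts.

```python
def missing_requirements(annotated, candidate):
    """List the ways a candidate environment falls short of the annotated one."""
    problems = []
    # Software must match the annotated version exactly.
    for package, version in annotated["software"].items():
        found = candidate["software"].get(package)
        if found != version:
            problems.append(f"{package}: need {version}, found {found}")
    # Hardware must offer at least the annotated capacity.
    for resource, minimum in annotated["hardware"].items():
        if candidate["hardware"].get(resource, 0) < minimum:
            problems.append(f"{resource}: need >= {minimum}")
    return problems


# Hypothetical description of the machine the experiment originally ran on.
former_equipment = {
    "software": {"example-tool": "3.3", "python": "2.7"},
    "hardware": {"cpu_cores": 2, "memory_gb": 4},
}
# Hypothetical cloud image proposed as a replacement.
cloud_image = {
    "software": {"example-tool": "3.3", "python": "3.9"},
    "hardware": {"cpu_cores": 4, "memory_gb": 2},
}

problems = missing_requirements(former_equipment, cloud_image)
print(problems)
```

An empty list would mean the candidate is equivalent for the purposes of the annotations; anything else says what to fix before attempting the re-execution.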
Some results
• Pegasus Montage workflow
  • Astronomy workflow
  • Constructs large image mosaics of the sky
  • Montage software distribution
  • 59 binaries
• Target IaaS cloud providers
  • Amazon EC2 & FutureGrid
  • Vagrant
RO available at http://pegasus.isi.edu/publications/reppar
Lessons learned for Anna
• Research Objects as a concept
  • Identity, annotation, aggregation
  • Adapted to the tools/infrastructure for each domain
  • With some tooling available already
• It’s not just data preservation but also methods
  • Lab protocols
  • Computational workflows
• Understand what reproducibility means for you
Acknowledgements
• The Semantic e-Science team at UPM
  • Carlos Badenes
  • Daniel Garijo
  • Olga Giraldo
  • Rafael González-Cabero
  • Idafen Santana
• The Wf4Ever team
  • Carole Goble, José Manuel Gómez Pérez, Raúl Palma, Jun Zhao, Stian Soiland-Reyes, Khalid Belhajjame, José Enrique Ruíz, Marco Roos, Lourdes Verdes-Montenegro, Norman Morrison, Sean Bechhofer, Graham Klyne, Matt Gamble, and many more
• The Research Object community group
  • http://www.researchobject.org/