A High-altitude View of TDWG Standards: Machine Processing, Graphs, and the Vocabulary Development Process

Steve Baskauf
Vanderbilt University Dept. of Biological Sciences
TDWG Vocabulary Maintenance Specification TG
2016-12-05

Inability to integrate heterogeneous data is considered a high-priority problem. Machine-mediated linking is perceived as the solution.
Cited: Deck, Guralnick, Hunze 2013; Page 2013; Soltis 2014; Endresen & Svindseth 2014

Language for talking about data
• Class (table): What kind of thing are we talking about?
• Property (column): How do we describe the thing?
• Instance (row): What is one of the things?
• Value (cell): What is the description of the thing?

events table
id | lat | long | state | date | habitat
078 | 25.679706 | -80.688246 | Florida | 2010-06-03 | Cypress Dome
137 | 25.663749 | -80.713352 | Florida | 2013-05-27 | Cypress Dome
298 | 25.140454 | -80.906958 | Florida | 1992-01-19 | Mangrove Swamp

1. Basic data integration problems
• Merging data about one kind of thing
• Linking data about different kinds of things

First data integration problem: lack of common properties

Provider 1 events table
id | lat | long | state | date | habitat
078 | 25.679706 | -80.688246 | Florida | 2010-06-03 | Cypress Dome
137 | 25.663749 | -80.713352 | Florida | 2013-05-27 | Cypress Dome
298 | 25.140454 | -80.906958 | Florida | 1992-01-19 | Mangrove Swamp

Provider 2 locations table
guid | decimalLatitude | decimalLongitude | stateProvince | habitat
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e | 25.401582 | -80.853153 | Florida | cypress_dome
19e6fb26-d409-467a-8ec8-e94a34e2d2d3 | 25.224878 | -80.505041 | Florida | mangrove_swamp
5a2f7012-477c-43f9-a7fb-b53acf38de8f | 25.129342 | -80.987372 | Florida | mangrove_swamp

We have fixed this problem by developing standards for properties, like Darwin Core

Provider 1 events table
id | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:eventDate | dwc:habitat
078 | 25.679706 | -80.688246 | Florida | 2010-06-03 | Cypress Dome
137 | 25.663749 | -80.713352 | Florida | 2013-05-27 | Cypress Dome
298 | 25.140454 | -80.906958 | Florida | 1992-01-19 | Mangrove Swamp

Provider 2 locations table
guid | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e | 25.401582 | -80.853153 | Florida | cypress_dome
19e6fb26-d409-467a-8ec8-e94a34e2d2d3 | 25.224878 | -80.505041 | Florida | mangrove_swamp
5a2f7012-477c-43f9-a7fb-b53acf38de8f | 25.129342 | -80.987372 | Florida | mangrove_swamp
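The renaming step on this slide can be sketched in Python. The two mapping dictionaries and the helper `to_darwin_core` are illustrative assumptions inferred from the example tables; only the Darwin Core term names come from the slide.

```python
# Hypothetical per-provider mappings from local column names to Darwin Core terms.
PROVIDER1_TO_DWC = {
    "lat": "dwc:decimalLatitude",
    "long": "dwc:decimalLongitude",
    "state": "dwc:stateProvince",
    "date": "dwc:eventDate",
    "habitat": "dwc:habitat",
}
PROVIDER2_TO_DWC = {
    "decimalLatitude": "dwc:decimalLatitude",
    "decimalLongitude": "dwc:decimalLongitude",
    "stateProvince": "dwc:stateProvince",
    "habitat": "dwc:habitat",
}


def to_darwin_core(row, mapping):
    """Rename the keys of one row; unmapped keys (e.g. local ids) are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


row1 = {"id": "078", "lat": "25.679706", "long": "-80.688246",
        "state": "Florida", "date": "2010-06-03", "habitat": "Cypress Dome"}
row2 = {"guid": "f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e",
        "decimalLatitude": "25.401582", "decimalLongitude": "-80.853153",
        "stateProvince": "Florida", "habitat": "cypress_dome"}

dwc_row1 = to_darwin_core(row1, PROVIDER1_TO_DWC)
dwc_row2 = to_darwin_core(row2, PROVIDER2_TO_DWC)

# Properties the two providers now have in common.
shared = sorted(set(dwc_row1) & set(dwc_row2))
```

After renaming, the rows can be merged on the shared property names, although the habitat values themselves still differ in spelling.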

Second data integration problem: inconsistent assignment of properties to classes
???
Provider 1 events table: id, dwc:decimalLatitude, dwc:decimalLongitude, dwc:stateProvince, dwc:eventDate, dwc:habitat
Provider 2 locations table: guid, dwc:decimalLatitude, dwc:decimalLongitude, dwc:stateProvince, dwc:habitat
(Rows as on the previous slide.)

We have fixed this (mostly) by organizing properties under classes in Darwin Core

Provider 1 events table
dwc:eventDate
2010-06-03
2013-05-27
1992-01-19

Provider 1 locations table
id | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
078 | 25.679706 | -80.688246 | Florida | Cypress Dome
137 | 25.663749 | -80.713352 | Florida | Cypress Dome
298 | 25.140454 | -80.906958 | Florida | Mangrove Swamp

Provider 2 locations table
guid | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e | 25.401582 | -80.853153 | Florida | cypress_dome
19e6fb26-d409-467a-8ec8-e94a34e2d2d3 | 25.224878 | -80.505041 | Florida | mangrove_swamp
5a2f7012-477c-43f9-a7fb-b53acf38de8f | 25.129342 | -80.987372 | Florida | mangrove_swamp

Third data integration problem: lack of robust controlled vocabularies. There is progress on this one, but they are often still used inconsistently.

aggregated locations table
dwc:locationID | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
078 | 25.679706 | -80.688246 | Florida | Cypress Dome
137 | 25.663749 | -80.713352 | Florida | Cypress Dome
298 | 25.140454 | -80.906958 | Florida | Mangrove Swamp
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e | 25.401582 | -80.853153 | Florida | cypress_dome
19e6fb26-d409-467a-8ec8-e94a34e2d2d3 | 25.224878 | -80.505041 | Florida | mangrove_swamp
5a2f7012-477c-43f9-a7fb-b53acf38de8f | 25.129342 | -80.987372 | Florida | mangrove_swamp
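One way to reconcile the two habitat spellings is to normalize them to a single controlled form. The sketch below is an assumption about how that could be done, not the method used by any provider; the CamelCase targets are the controlled values that the habitats table later in this deck uses, and the function names are hypothetical.

```python
import re

# Controlled values as they appear in the habitats table of this deck.
CONTROLLED_HABITATS = {"CypressDome", "HardwoodHammock", "MangroveSwamp"}


def normalize_habitat(value):
    """Map a free-text habitat string to CamelCase, e.g. 'cypress_dome' -> 'CypressDome'."""
    words = re.split(r"[\s_]+", value.strip())
    return "".join(word.capitalize() for word in words if word)


def to_controlled(value):
    """Return the controlled value, or None when the string matches no known term."""
    candidate = normalize_habitat(value)
    return candidate if candidate in CONTROLLED_HABITATS else None


raw_values = ["Cypress Dome", "cypress_dome", "Mangrove Swamp", "mangrove_swamp", "salt flat"]
cleaned = [to_controlled(v) for v in raw_values]
```

Returning None for unknown strings, rather than guessing, keeps unmatched values visible for manual review.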

Fourth data integration problem: generating unique persistent identifiers for instances. No consensus on how to fix this one.

aggregated locations table (rows as on the previous slide)
dwc:locationID values: 078, 137, 298, f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e, 19e6fb26-d409-467a-8ec8-e94a34e2d2d3, 5a2f7012-477c-43f9-a7fb-b53acf38de8f

Fifth data integration problem: linking to related resources (solved locally, not solved globally).

aggregated locations table (dwc:habitat is the foreign key)
dwc:locationID | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
078 | 25.679706 | -80.688246 | Florida | CypressDome
137 | 25.663749 | -80.713352 | Florida | CypressDome
298 | 25.140454 | -80.906958 | Florida | MangroveSwamp
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e | 25.401582 | -80.853153 | Florida | CypressDome
19e6fb26-d409-467a-8ec8-e94a34e2d2d3 | 25.224878 | -80.505041 | Florida | MangroveSwamp
5a2f7012-477c-43f9-a7fb-b53acf38de8f | 25.129342 | -80.987372 | Florida | MangroveSwamp

habitats table (controlled_value is the primary key)
controlled_value | name | key_species | ecoregion
CypressDome | cypress dome | Taxodium distichum | Everglades
HardwoodHammock | hardwood hammock | Swietenia mahagoni | Everglades
MangroveSwamp | mangrove swamp | Rhizophora mangle | Everglades
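The local foreign-key link can be sketched as a dictionary lookup. This is a minimal illustration using a subset of the rows above; `join_habitat` is a hypothetical helper. It works only because both tables live in one system that agrees on the key values, which is the sense in which the problem is solved locally but not globally.

```python
# Habitats keyed by their primary key, controlled_value.
HABITATS = {
    "CypressDome": {"name": "cypress dome", "key_species": "Taxodium distichum",
                    "ecoregion": "Everglades"},
    "HardwoodHammock": {"name": "hardwood hammock", "key_species": "Swietenia mahagoni",
                        "ecoregion": "Everglades"},
    "MangroveSwamp": {"name": "mangrove swamp", "key_species": "Rhizophora mangle",
                      "ecoregion": "Everglades"},
}

LOCATIONS = [
    {"dwc:locationID": "078", "dwc:habitat": "CypressDome"},
    {"dwc:locationID": "19e6fb26-d409-467a-8ec8-e94a34e2d2d3", "dwc:habitat": "MangroveSwamp"},
]


def join_habitat(location, habitats=HABITATS):
    """Follow the dwc:habitat foreign key; a dangling key adds no habitat columns."""
    habitat = habitats.get(location["dwc:habitat"], {})
    return {**location, **habitat}


joined = [join_habitat(loc) for loc in LOCATIONS]
```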

2. Graphs as a machine-interpretable data integration solution

Representing a table as a graph
• columns = properties
• rows = instances of the class
• cells = values

dwc:locationID (subject identifier) | dwc:decimalLatitude | dwc:habitat
078 | 25.679706 | CypressDome
137 | 25.663749 | CypressDome
298 | 25.140454 | MangroveSwamp

Graph representation of metadata about one of the location instances (subject --property--> value):
078 --dwc:habitat--> CypressDome
078 --dwc:decimalLatitude--> 25.679706
The subject, 078, is an instance of a class.
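The table-to-graph mapping described on this slide can be sketched as a function that turns each row into (subject, property, value) triples. The function name and the tuple representation are assumptions chosen for illustration.

```python
def row_to_triples(row, subject_column):
    """One triple per non-subject cell: (row's subject id, column name, cell value)."""
    subject = row[subject_column]
    return [(subject, prop, value)
            for prop, value in row.items()
            if prop != subject_column]


table = [
    {"dwc:locationID": "078", "dwc:decimalLatitude": "25.679706", "dwc:habitat": "CypressDome"},
    {"dwc:locationID": "137", "dwc:decimalLatitude": "25.663749", "dwc:habitat": "CypressDome"},
    {"dwc:locationID": "298", "dwc:decimalLatitude": "25.140454", "dwc:habitat": "MangroveSwamp"},
]

# Flatten all rows into a single list of edges.
triples = [t for row in table for t in row_to_triples(row, "dwc:locationID")]
```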

Linking data from two tables as a graph

aggregated locations table (dwc:habitat is the foreign key)
dwc:locationID | dwc:decimalLatitude | dwc:decimalLongitude | dwc:stateProvince | dwc:habitat
078 | 25.679706 | -80.688246 | Florida | CypressDome

habitats table (controlled_value is the primary key)
controlled_value | name | key_species | ecoregion
CypressDome | cypress dome | Taxodium distichum | Everglades

Graph (edges from the aggregated locations table, then from the habitats table):
078 --dwc:decimalLatitude--> 25.679706
078 --dwc:decimalLongitude--> -80.688246
078 --dwc:stateProvince--> Florida
078 --dwc:habitat--> CypressDome
CypressDome --name--> cypress dome
CypressDome --key_species--> Taxodium distichum
CypressDome --ecoregion--> Everglades

Linking data as a graph makes it easy to describe complex relationships

078 --dwc:decimalLatitude--> 25.679706
078 --dwc:decimalLongitude--> -80.688246
078 --dwc:stateProvince--> Florida
078 --dwc:habitat--> CypressDome
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e --dwc:habitat--> CypressDome
CypressDome --name--> cypress dome
CypressDome --ecoregion--> Everglades
CypressDome --key_species--> Taxodium distichum
Taxodium distichum --class--> Pinopsida
Taxodium distichum --reproductiveStructure--> cone
Florida --mainDrink--> orange juice
Florida --hasCapital--> Tallahassee
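A multi-step question such as "to which taxonomic class does the key species of this location's habitat belong?" becomes a path traversal once the data are edges. The sketch below uses a subset of the edges on this slide; `follow` is a hypothetical helper, not part of any graph library.

```python
TRIPLES = [
    ("078", "dwc:habitat", "CypressDome"),
    ("f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e", "dwc:habitat", "CypressDome"),
    ("078", "dwc:stateProvince", "Florida"),
    ("CypressDome", "key_species", "Taxodium distichum"),
    ("Taxodium distichum", "class", "Pinopsida"),
    ("Taxodium distichum", "reproductiveStructure", "cone"),
    ("Florida", "hasCapital", "Tallahassee"),
]


def follow(start, path, triples=TRIPLES):
    """Follow a sequence of properties from a start node; return every node reached."""
    frontier = {start}
    for prop in path:
        frontier = {obj for subj, pred, obj in triples
                    if subj in frontier and pred == prop}
    return frontier


taxon_class = follow("078", ["dwc:habitat", "key_species", "class"])
capital = follow("078", ["dwc:stateProvince", "hasCapital"])
```

In a relational design each hop would be a join across a separate table; here each hop is the same filter over one edge list.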

Creating a graph in Neo4j using Cypher

// Load one Habitat node per row of the habitats table.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/24288409/habitats.csv" AS csvHabitat
CREATE (a:Habitat {
  id: csvHabitat.controlled_value,
  name: csvHabitat.name,
  keySpecies: csvHabitat.key_species,
  ecoregion: csvHabitat.ecoregion
});

// Load one Location node per row of the locations table; coordinates become floats.
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/24288409/locations.csv" AS csvLocations
CREATE (a:Location {
  id: csvLocations.dwc_locationID,
  latitude: toFloat(csvLocations.dwc_decimalLatitude),
  longitude: toFloat(csvLocations.dwc_decimalLongitude),
  state: csvLocations.dwc_stateProvince,
  habitat: csvLocations.dwc_habitat
});

// Link each location to its habitat by matching the foreign key to the primary key.
MATCH (location:Location), (habitat:Habitat)
WHERE location.habitat = habitat.id
CREATE (location)-[:hasHabitat]->(habitat);

But how do you refer to things if they are described by different data providers?
(Same graph as on the previous slide: nodes such as 078, CypressDome, Taxodium distichum, Pinopsida, Florida, and Tallahassee are named only by local strings.)

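Options: authorities, UUIDs, HTTP URIs

The same graph with identifiers substituted for local names:
• Wikidata identifiers: Q1329304 (Pinopsida), Q148950 (Taxodium distichum), Q5200372 (CypressDome), Q812 (Florida), Q37043 (Tallahassee)
• UUID: f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e
• HTTP URI: http://my.org/078

http://my.org/078 --dwc:decimalLatitude--> 25.679706
http://my.org/078 --dwc:decimalLongitude--> -80.688246
http://my.org/078 --dwc:stateProvince--> Q812
http://my.org/078 --dwc:habitat--> Q5200372
f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e --dwc:habitat--> Q5200372
Q5200372 --name--> cypress dome
Q5200372 --ecoregion--> Everglades
Q5200372 --key_species--> Q148950
Q148950 --class--> Q1329304
Q148950 --reproductiveStructure--> cone
Q812 --mainDrink--> orange juice
Q812 --hasCapital--> Q37043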
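Two of these options can be generated by a provider directly. The sketch below mints a random UUID with Python's standard `uuid` module and builds an HTTP URI from a local id; the base `http://my.org/` is the example domain from the slide, and the function names are hypothetical. It says nothing about keeping such identifiers persistent, which is the harder part.

```python
import uuid


def mint_uuid():
    """Return a new random (version 4) UUID as a string; needs no central authority."""
    return str(uuid.uuid4())


def mint_http_uri(local_id, base="http://my.org/"):
    """Prefix a locally unique id with a domain the provider controls."""
    return base + local_id


new_id = mint_uuid()
location_uri = mint_http_uri("078")
```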

3. Linked Data


Linked Data (Berners-Lee, 2006)
• Use URIs for the names of things
• Use HTTP URIs so that people can look up those names
• When someone looks up a URI, provide useful information using the standards
• Include links to other URIs, so that they can discover more things

Examples:
http://www.wikidata.org/entity/Q5200372
wd:Q5200372 (a compact URI or CURIE)
http://data.nhm.ac.uk/object/8e2f81f5-110d-45e8-8e5a-34e8d30a1a49

Source: https://www.w3.org/DesignIssues/LinkedData.html
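The relationship between a CURIE and its full HTTP URI can be sketched as a prefix lookup. The `wd` mapping follows from the two Wikidata examples above; binding `nhm` to the Natural History Museum base URI is an assumption made for illustration, as are the function names.

```python
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "nhm": "http://data.nhm.ac.uk/object/",  # assumed binding, for illustration only
}


def expand_curie(curie, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URI; raise ValueError for unknown prefixes."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + local


def compact_uri(uri, prefixes=PREFIXES):
    """Inverse of expand_curie; URIs with no matching namespace are returned unchanged."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


full = expand_curie("wd:Q5200372")
short = compact_uri("http://data.nhm.ac.uk/object/8e2f81f5-110d-45e8-8e5a-34e8d30a1a49")
```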

The world as Linked Data sees it: everything denoted by a URI or a literal

my:078 dwc:decimalLatitude "25.679706"
my:078 dwc:decimalLongitude "-80.688246"
my:078 dwc:stateProvince wd:Q812
my:078 dwc:habitat wd:Q5200372
nhm:f2f91e0c-aacb-4625-91ee-1a78dcf2ca8e dwc:habitat wd:Q5200372
wd:Q5200372 rdfs:label "cypress dome"
wd:Q5200372 eco:ecoregion "Everglades"
wd:Q5200372 eco:key_species wd:Q148950
wd:Q148950 tax:class wd:Q1329304
wd:Q148950 plant:reproductiveStructure "cone"
wd:Q812 ex:mainDrink "orange juice"
wd:Q812 ex:hasCapital wd:Q37043

The property/value modeling approach
The classic Linked Data approach with "has a" links.

site | habitat | ecosystem | major_ecoregion
078 | freshwater marsh | wetland | terrestrial
9004 | reef face | coral reef | marine

Graph (properties pointing to literal values):
http://my.org/078 ex:hasHabitat "freshwater marsh"
http://my.org/078 ex:hasEcosystem "wetland"
http://my.org/078 ex:hasMajorEcoregion "terrestrial"

The thesaurus modeling approach
Uses SKOS to describe how humans categorize things (broader/narrower).

site | habitat | ecosystem | major_ecoregion
078 | freshwater marsh | wetland | terrestrial
9004 | reef face | coral reef | marine

Concept scheme with preferred labels:
subj:terrestrialHabitat a skos:Concept
subj:terrestrialHabitat skos:prefLabel "terrestrial"@en
subj:wetland a skos:Concept
subj:wetland skos:prefLabel "wetland"@en
subj:wetland skos:broader subj:terrestrialHabitat
subj:FreshwaterMarsh a skos:Concept
subj:FreshwaterMarsh skos:prefLabel "freshwater marsh"@en
subj:FreshwaterMarsh skos:broader subj:wetland
http://my.org/078 dcterms:subject subj:FreshwaterMarsh
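A human or an application can walk the skos:broader links upward from a concept to find its ancestors. The sketch below stores the two broader links from this slide in a dictionary; `broader_chain` is a hypothetical helper. Note that skos:broader itself is not defined as transitive in SKOS, so collecting the whole chain is an application choice rather than an entailment.

```python
# concept -> its directly broader concept, as drawn on the slide
BROADER = {
    "subj:FreshwaterMarsh": "subj:wetland",
    "subj:wetland": "subj:terrestrialHabitat",
}


def broader_chain(concept, broader=BROADER):
    """List the ancestors of a concept, nearest first, stopping if a cycle is found."""
    chain = []
    seen = {concept}
    while concept in broader:
        concept = broader[concept]
        if concept in seen:
            break
        seen.add(concept)
        chain.append(concept)
    return chain


ancestors = broader_chain("subj:FreshwaterMarsh")
```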

The ontology modeling approach
Describes the universe; frequently uses "is a" (subclass) links. Many relationships are entailed but not asserted.

site | habitat | ecosystem | major_ecoregion
078 | freshwater marsh | wetland | terrestrial
9004 | reef face | coral reef | marine

Classes with rdfs:labels:
ont:TerrestrialHabitat a owl:Class
ont:TerrestrialHabitat rdfs:label "terrestrial habitat"
ont:Wetland a owl:Class
ont:Wetland rdfs:label "wetland"
ont:Wetland rdfs:subClassOf ont:TerrestrialHabitat
ont:FreshwaterMarsh a owl:Class
ont:FreshwaterMarsh rdfs:label "freshwater marsh"
ont:FreshwaterMarsh rdfs:subClassOf ont:Wetland
http://my.org/078 rdf:type ont:FreshwaterMarsh

4. The Semantic Web


Semantic Web: Entailments

Asserted relationship:
http://my.org/078 rdf:type ont:FreshwaterMarsh.

Entailed, but unasserted relationships:
http://my.org/078 rdf:type ont:Wetland.
http://my.org/078 rdf:type ont:TerrestrialHabitat.

Class hierarchy in the graph:
ont:FreshwaterMarsh (a owl:Class) rdfs:subClassOf ont:Wetland (a owl:Class) rdfs:subClassOf ont:TerrestrialHabitat (a owl:Class)
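The entailment on this slide follows the RDFS rule that an instance of a class is also an instance of every superclass. A minimal sketch of that single rule, applied repeatedly until nothing new is derived, is shown below; it is not a general reasoner, and the function name is hypothetical.

```python
SUBCLASS_OF = {
    ("ont:FreshwaterMarsh", "ont:Wetland"),
    ("ont:Wetland", "ont:TerrestrialHabitat"),
}
ASSERTED_TYPES = {("http://my.org/078", "ont:FreshwaterMarsh")}


def entail_types(asserted, subclass_of):
    """Add (instance, superclass) for every (instance, class) until a fixed point."""
    inferred = set(asserted)
    changed = True
    while changed:
        changed = False
        for instance, cls in list(inferred):
            for sub, sup in subclass_of:
                if sub == cls and (instance, sup) not in inferred:
                    inferred.add((instance, sup))
                    changed = True
    return inferred


all_types = entail_types(ASSERTED_TYPES, SUBCLASS_OF)
# Relationships that were entailed but never asserted.
entailed_only = all_types - ASSERTED_TYPES
```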

Property/value approach
• Pros: simple, easy to implement
• Cons: uncontrolled, no entailments

Thesaurus approach
• Pros: structured, multilingual
• Cons: directs human users, but few automatic entailments

Ontology approach
• Pros: consistent, many machine-reasoning entailments
• Cons: often complex, many machine-reasoning entailments

The Semantic “Stack” (2006)
Tim Berners-Lee via Steve Bratt, W3C
http://www.w3.org/2006/Talks/1023-sb-W3CTechSemWeb/W3CTechSemWeb.pdf

5. What do people believe in?


Label | Believes in…
(greater capabilities)
Semantic Web | ontologies, OWL, reasoning
Linked Data | RDF, SPARQL, HTTP URIs (example: http://www.wikidata.org/entity/Q5200372)
graph database | JSON, APIs, graphs
informatics | globally unique identifiers, data (example: 8e2f81f5-110d-45e8-8e5a-34e8d30a1a49)
(easier implementation)

Where TDWG has been on the spectrum
(greater capabilities)
Semantic Web
Linked Data: TDWG Technical Architecture (c. 2007)
graph database
informatics: Darwin Core and DwC-A (2009)
(easier implementation)

Label | Believes in… | Examples
(greater capabilities)
Semantic Web | ontologies, OWL, reasoning |
Linked Data | RDF, SPARQL, HTTP URIs | Getty Thesauri, Wikidata, DOI, VIAF, ORCID, LOC, Darwin Core RDF Guide (2015)
graph database | JSON, Microdata, RDFa, APIs, graphs | Schema.org (Google, Microsoft, Yahoo, Yandex)
informatics | globally unique identifiers, data | Darwin Core (pre-2015), NCBI
(easier implementation)

Darwin Core RDF Guide
(Annotations placed on the spectrum from Semantic Web through Linked Data and graph database to informatics.)
• Is silent on the extent of semantics.
• Assumes that properties will be used appropriately with URI or literal objects as expected for particular terms.
• Assumes that subject resources will be identified by dereferenceable HTTP URIs. Does NOT specify how this should be accomplished.
• Creates dwciri: analogs of some existing terms to facilitate linking to other non-literal objects.
• Does NOT establish a standard graph model nor how to organize properties in classes.
• Identifiers (globally unique or not) are literal values of dcterms:identifier.
• Allows existing literal (string) data to be expressed directly as values of existing DwC terms, but promotes linking to URIs.

6. Vocabulary management


2015 Vocabulary Maintenance Specification Task Group (VOCAB)
Deliverables submitted 2016-08-02:
• Standards Documentation Specification
  • replaces incomplete draft of 2007
  • describes human- and machine-readable documents
• Vocabulary Maintenance Specification
  • replaces DwC Namespace Policy and applies to all vocabularies
These are separate but closely interrelated documents.

Key features of the Standards Documentation Specification
• Defines a hierarchy of standards components as abstract resources.
• Describes a versioning model that can be applied to all standards components.
• Specifies the human-readable metadata needed for describing standards components.
• Specifies which properties are used in machine-readable descriptions, but does not prescribe a particular serialization.
• Specifies how controlled vocabularies are described using SKOS.
• Provides a clear status for incomplete standards.

Key features of the Vocabulary Maintenance Specification
• Introduces Vocabulary Maintenance Interest Groups, which are charged specifically with the permanent maintenance of a particular vocabulary.
• Establishes an annual review of proposed vocabulary changes.
• Distinguishes between a basic term layer and enhancements (ontologies, application profiles).
• Describes the process by which terms and supporting documents are changed.
• Describes a development process for vocabulary enhancements that requires user feedback reports.

New specifications
(Annotations placed on the spectrum from Semantic Web through Linked Data and graph database to informatics.)
• Are silent on the extent of semantics.
• Establish a process for the development of graph models and ontologies based on use cases and implementation experience.
• Do NOT establish a standard graph model nor how to organize properties in classes.

7. Where do we go from here?


What is missing?
(Annotations placed on the spectrum from Semantic Web through Linked Data and graph database to informatics.)
• The Darwin Core RDF Guide assumes that subject resources will be identified by dereferenceable HTTP URIs. It does NOT specify how this should be accomplished.
• Missing: consensus on who is responsible for minting and maintaining identifiers.
• Is there a single best practice for maintaining stable HTTP URIs? (stable provider domains, PURL, ARC, DOI model, etc.)

Questions about future integration of machine-processable data
Assuming that the central graph model problem can be solved:
• If we fail to agree on a way to manage HTTP URIs, do we give up on RDF and drop down to a level that enables a graph database but doesn't require URIs?
• How would aggregation and querying of a massive biodiversity data graph be managed?

What are the use cases? How does this technology satisfy them?
http://baskauf.blogspot.com