
Architecture and Standards for Global Biodiversity Informatics
A GBIF and TDWG Perspective
Donald Hobern, GBIF Program Officer for Data Access and Database Interoperability
November 2004

TDWG and GBIF
TDWG – Taxonomic Databases Working Group
• Not-for-profit scientific and educational association
• Affiliated to the International Union of Biological Sciences
• Mission
  • To provide an international forum for biological data projects
  • To develop and promote the use of standards
  • To facilitate data exchange
• Products
  • Standards/guidelines for recording/exchanging data about organisms
  • Promotion of use of these standards
  • Forum for discussion (especially annual meeting)
GBIF – Global Biodiversity Information Facility
• Megascience activity involving 42 countries/economies and 28 international organisations
• Secretariat based in Copenhagen, Denmark
• Mission
  • Free and universal access to the world’s biodiversity data via the Internet
  • Sharing primary biodiversity data for society, science and a sustainable future
• Products
  • Registry of biodiversity data resources
  • Index of biodiversity data
  • Software tools
  • Web portals (http://www.gbif.net) and data services

Primary biodiversity data
Taxonomic Names
• Class: Insecta
• Order: Lepidoptera
• Family: Pyralidae
• Genus: Ostrinia Hübner, 1825
• Species: Ostrinia nubilalis (Hübner, 1796)
• Synonym: Pyralis nubilalis Hübner, 1796
• Vernacular (EN): European Corn-borer
• Vernacular (DE): Maiszünsler
• Vernacular (ES): Piral del maíz
• Vernacular (FR): Pyrale du maïs
Sequence Data
• Locus: AAL35331
• Definition: acyl-CoA Z/E11 desaturase
• 1 mvpyattadg hpekdecfed ...
Taxonomic Descriptions
• Diagnosis: Wingspan 26-30 mm; sexually dimorphic; male: forewings ochreous to dark brown; female: forewings pale yellow; …
Digital Literature and Web Resources
• Pheromones of Ostrinia: http://www.nysaes.cornell.edu/fst/faculty/acree/pheronet/phlist/ostrinia.html
Ecological Interactions
• Foodplant: Zea mays L. 1753 (Family: Gramineae)
Specimens and Observations
• Collection: DGH Lepidoptera
• Record id: DGHEUR_003217
• Country: France
• Coordinates: 03.047˚E 48.730˚N
• Date: 28 June 2003
• Collector: Donald Hobern
Abiotic Data
• Average Rainfall, Location: 48.82°N 2.29°E
• Jan 182.3, Feb 120.6, Mar 158.1, Apr ... 204.9 ...

Standardised structured data
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <record>
    <darwin:DateLastModified>2003-06-08</darwin:DateLastModified>
    <darwin:InstitutionCode>DGH</darwin:InstitutionCode>
    <darwin:CollectionCode>DGH Lepidoptera</darwin:CollectionCode>
    <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
    <darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName>
    <darwin:BasisOfRecord>O</darwin:BasisOfRecord>
    <darwin:Kingdom>Animalia</darwin:Kingdom>
    <darwin:Order>Lepidoptera</darwin:Order>
    <darwin:Family>Gelechiidae</darwin:Family>
    <darwin:Genus>Dichomeris</darwin:Genus>
    <darwin:Species>marginella</darwin:Species>
    <darwin:ScientificNameAuthor>(Fabricius, 1781)</darwin:ScientificNameAuthor>
    <darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy>
    <darwin:Collector>Donald Hobern</darwin:Collector>
    <darwin:YearCollected>2003</darwin:YearCollected>
    <darwin:MonthCollected>06</darwin:MonthCollected>
    <darwin:DayCollected>08</darwin:DayCollected>
    <darwin:ContinentOcean>Europe</darwin:ContinentOcean>
    <darwin:Country>Denmark</darwin:Country>
    <darwin:County>Københavns Amt</darwin:County>
    <darwin:Locality>Merianvej, Hellerup</darwin:Locality>
    <darwin:Longitude>12.538</darwin:Longitude>
    <darwin:Latitude>55.737</darwin:Latitude>
    <darwin:CoordinatePrecision>100</darwin:CoordinatePrecision>
    <darwin:IndividualCount>1</darwin:IndividualCount>
    <darwin:Notes>1 in Skinner trap</darwin:Notes>
  </record>
</response>
Observation record formatted using the Darwin Core
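As a rough illustration of how a client might consume a record like the one above, the following Python sketch reads Darwin Core elements into a dictionary and rebuilds the collection date. It is a minimal sketch, not part of the original slides: the namespace URI is the one declared in the DiGIR response later in this deck, the sample is a shortened copy of the slide's record, and the function names read_records and collection_date are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace URI as declared in the DiGIR response shown later in the deck.
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Shortened copy of the slide's record, with the prefix declared so it parses.
SAMPLE = f"""<response xmlns:darwin="{DARWIN_NS}">
  <record>
    <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
    <darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName>
    <darwin:YearCollected>2003</darwin:YearCollected>
    <darwin:MonthCollected>06</darwin:MonthCollected>
    <darwin:DayCollected>08</darwin:DayCollected>
    <darwin:Longitude>12.538</darwin:Longitude>
    <darwin:Latitude>55.737</darwin:Latitude>
  </record>
</response>"""


def read_records(xml_text):
    """Return one dict per <record>, keyed by Darwin Core element name."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DARWIN_NS + "}"
    records = []
    for record in root.iter("record"):
        fields = {
            child.tag[len(prefix):]: (child.text or "").strip()
            for child in record
            if child.tag.startswith(prefix)
        }
        records.append(fields)
    return records


def collection_date(fields):
    """Combine the three Darwin Core date parts into a single date."""
    return date(
        int(fields["YearCollected"]),
        int(fields["MonthCollected"]),
        int(fields["DayCollected"]),
    )


records = read_records(SAMPLE)
print(records[0]["ScientificName"], collection_date(records[0]))
```

Because every descriptor is a flat, namespaced child of the record, the reader does not need to know the full list of elements in advance.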

TDWG Data Standards
Darwin Core
• Simple XML data model to represent taxon occurrence records (only core attributes)
• Extensions to handle e.g. curation details, microbial data, image data
ABCD Schema – Access to Biological Collection Data
• More complex XML data model to represent collection or observation data
• Detailed document structure including features for different communities
DiGIR – Distributed Generic Information Retrieval
• XML protocol for searching remote data resources
• Suitable for use with a wide range of different data models
BioCASe Protocol
• XML protocol for searching remote data resources with more complex schemas (e.g. ABCD)
• Derived from DiGIR – new unified DiGIR/BioCASe protocol being developed
Taxon Concept Schema
• XML data model currently under development for exchange of nomenclatural/taxonomic data
• First version to be used for implementation in 2005
SDD Schema – Structured Descriptive Data
• XML data model for descriptive data relating to taxa or specimens (highly generalised)
• Suitable for representation of character tables, diagnostic keys, etc.

BioCASe-ABCD
<?xml version='1.0' encoding='UTF-8'?>
<response xmlns='http://www.biocase.org/schemas/protocol/1.3' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:schemaLocation='http://www.biocase.org/schemas/protocol/1.3 http://www.bgbm.org/biodivinf/schema/protocol_1_3.xsd'>
  <header>
    <version software='Python Interpreter'>2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit (Intel)]</version>
    <sendTime>2004-10-10T22:40+02:00</sendTime>
    <source>192.168.1.12</source>
    <destination>132.181.101.155</destination>
    <type>search</type>
  </header>
  <content recordDropped='0' recordCount='1' recordStart='0' totalSearchHits='1'>
    <DataSets xmlns='http://www.tdwg.org/schemas/abcd/1.2'>
      <DataSet>
        <OriginalSource>
          <SourceInstitutionCode>BGBM</SourceInstitutionCode>
          <SourceName>Bridel Herbar</SourceName>
          <SourceLastUpdatedDate>2004-04-29</SourceLastUpdatedDate>
        </OriginalSource>
        <DatasetDerivations>
          <DatasetDerivation>
            <DateSupplied>2004-07-29</DateSupplied>
            <Supplier>
              <Organisation><OrganisationName>Botanic Garden and Botanical Museum Berlin-Dahlem</OrganisationName></Organisation>
              <Person><PersonName>Andrea Hahn</PersonName></Person>
              <TelephoneNumbers><TelephoneNumber><Number>+49 (0)30 838 50286</Number></TelephoneNumber></TelephoneNumbers>
              <URLs><URL>http://www.bgbm.org</URL></URLs>
            </Supplier>
            <Rights>
              <TermsOfUse>The use of the data is allowed only for non-profit scientific use and for non-profit nature conservation purpose.</TermsOfUse>
              <LegalOwner>
                <Organisation>
                  <OrganisationName>Botanic Garden and Botanical Museum Berlin-Dahlem</OrganisationName>
                  <OrganisationCodes><OrganisationCode>BGBM</OrganisationCode></OrganisationCodes>
                </Organisation>
              </LegalOwner>
              <CopyrightDeclaration>No part of this data base may be copied or reproduced without written permission from the legal owner.</CopyrightDeclaration>
              <IPRDeclaration>The Intellectual Property Rights are held by the legal owner or, in case of living persons, by the collector or determinator.</IPRDeclaration>
            </Rights>
            <Statements><Disclaimer>No responsibility is accepted for the accuracy of the information in this data base.</Disclaimer></Statements>
          </DatasetDerivation>
        </DatasetDerivations>
        <Units>
          <Unit>
            <UnitID>Bridel-1-362</UnitID>
            <Identifications>
              <Identification PreferredIdentificationFlag='0'>
                <TaxonIdentified>
                  <HigherTaxa>
                    <HigherTaxon Rank='Family'>Pottiaceae</HigherTaxon>
                    <HigherTaxon Rank='Kingdom'>Plantae</HigherTaxon>
                  </HigherTaxa>
                  <NameAuthorYearString>Leucophanes octoblepharioides Brid. 1827</NameAuthorYearString>
                  <ScientificNameString>Leucophanes octoblepharioides</ScientificNameString>
                  <AuthorString>Brid. 1827</AuthorString>
                </TaxonIdentified>
                <Identifier><IdentifierPersonName><PersonName>Allen, Noris Salazar</PersonName></IdentifierPersonName></Identifier>
                <IdentificationDate><ISODateTimeBegin>1986-07</ISODateTimeBegin></IdentificationDate>
              </Identification>
            </Identifications>
            <Gathering>
              <GatheringSite>
                <ContinentOrOcean>Asia</ContinentOrOcean>
                <Country><CountryName>NP</CountryName></Country>
                <AreaDetail>Nepal</AreaDetail>
              </GatheringSite>
            </Gathering>
          </Unit>
        </Units>
      </DataSet>
    </DataSets>
  </content>
  <diagnostics><diagnostic>OK</diagnostic></diagnostics>
</response>
Slide labels marking regions of the response: PROTOCOL, COLLECTION (INCLUDING METADATA), SPECIMEN
Note that the structure of the record elements is part of the content schema (ABCD), not part of the protocol
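To make the document-based structure concrete, the sketch below walks a cut-down BioCASe-ABCD response and pulls out each unit together with the data set it belongs to. It is an illustrative sketch only: the namespaces and element names are copied from the response above, the sample is heavily shortened, and the function name summarise is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the slide's BioCASe-ABCD response.
NS = {
    "p": "http://www.biocase.org/schemas/protocol/1.3",
    "abcd": "http://www.tdwg.org/schemas/abcd/1.2",
}

# Heavily shortened version of the response, keeping its nesting.
SAMPLE = """<response xmlns='http://www.biocase.org/schemas/protocol/1.3'>
  <content recordCount='1' totalSearchHits='1'>
    <DataSets xmlns='http://www.tdwg.org/schemas/abcd/1.2'>
      <DataSet>
        <OriginalSource>
          <SourceInstitutionCode>BGBM</SourceInstitutionCode>
          <SourceName>Bridel Herbar</SourceName>
        </OriginalSource>
        <Units>
          <Unit>
            <UnitID>Bridel-1-362</UnitID>
            <Identifications>
              <Identification>
                <TaxonIdentified>
                  <ScientificNameString>Leucophanes octoblepharioides</ScientificNameString>
                </TaxonIdentified>
              </Identification>
            </Identifications>
          </Unit>
        </Units>
      </DataSet>
    </DataSets>
  </content>
  <diagnostics><diagnostic>OK</diagnostic></diagnostics>
</response>"""


def summarise(xml_text):
    """Collect hit count and per-unit details from a BioCASe-ABCD response."""
    root = ET.fromstring(xml_text)
    content = root.find("p:content", NS)
    summary = {"total_hits": int(content.get("totalSearchHits")), "units": []}
    # Data-set metadata and units travel together in one structured document.
    for dataset in content.iterfind("abcd:DataSets/abcd:DataSet", NS):
        source = dataset.findtext(
            "abcd:OriginalSource/abcd:SourceName", namespaces=NS
        )
        for unit in dataset.iterfind("abcd:Units/abcd:Unit", NS):
            summary["units"].append({
                "source": source,
                "unit_id": unit.findtext("abcd:UnitID", namespaces=NS),
                "name": unit.findtext(
                    ".//abcd:ScientificNameString", namespaces=NS
                ),
            })
    return summary


print(summarise(SAMPLE))
```

The client has to know the ABCD nesting (DataSet, Units, Unit, Identifications) to find a name, which is the practical consequence of the record structure living in the content schema.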

DiGIR-Darwin Core
<?xml version='1.0' encoding='utf-8' ?>
<response xmlns='http://digir.net/schema/protocol/2003/1.0'>
  <header>
    <version>$Revision: 1.14 $</version>
    <sendTime>2004-10-10T13:48:12-0700</sendTime>
    <source resource="martius_munchen_infocomp">http://digir.bebif.be/main/DiGIR.php</source>
    <destination>132.181.101.155</destination>
    <type>search</type>
  </header>
  <content xmlns:darwin='http://digir.net/schema/conceptual/darwin/2003/1.0' xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
    <record>
      <darwin:DateLastModified>2004-05-12</darwin:DateLastModified>
      <darwin:InstitutionCode>Botanische Staatssammlung München</darwin:InstitutionCode>
      <darwin:CollectionCode>Infocomp</darwin:CollectionCode>
      <darwin:CatalogNumber>010702P1</darwin:CatalogNumber>
      <darwin:ScientificName>Wedelia longifolia Mart. ex Baker</darwin:ScientificName>
      <darwin:BasisOfRecord>Label</darwin:BasisOfRecord>
      <darwin:ScientificNameAuthor>Baker, J. G.</darwin:ScientificNameAuthor>
      <darwin:TypeStatus>Holotypus</darwin:TypeStatus>
      <darwin:CollectorNumber>s. n.</darwin:CollectorNumber>
      <darwin:Collector>Martius, C. F. P. von</darwin:Collector>
      <darwin:ContinentOcean>South America</darwin:ContinentOcean>
      <darwin:Country>Brazil</darwin:Country>
      <darwin:Locality>'... in prov. S. Paulo, inter herbas locis irriguis ad Lorena ...' (op. cit.)</darwin:Locality>
      <darwin:Notes>Tribe: Heliantheae, Reference protologue: Martius, C. F. P. von: Flora Brasiliensis 6(3): 182-183. 1884.,</darwin:Notes>
    </record>
  </content>
  <diagnostics>
    <diagnostic code="STATUS_INTERVAL" severity="info">3600</diagnostic>
    <diagnostic code="STATUS_DATA" severity="info">79, 5, 2</diagnostic>
    <diagnostic code="MATCH_COUNT" severity="info">1</diagnostic>
    <diagnostic code="RECORD_COUNT" severity="info">1</diagnostic>
    <diagnostic code="END_OF_RECORDS" severity="info">true</diagnostic>
  </diagnostics>
</response>
Slide labels marking regions of the response: PROTOCOL, SPECIMEN, PROTOCOL
Note that the structure of the record elements is part of the protocol – the content schema (Darwin Core) only defines the attributes describing the record
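The diagnostics block at the end of the DiGIR response is what a client would most likely inspect first. The sketch below reads it into a dictionary. It is an assumption-laden sketch: the diagnostic codes and namespace come from the response above, but treating END_OF_RECORDS as a paging signal is my reading of the code name rather than something the slides state, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Protocol namespace copied from the slide's DiGIR response.
DIGIR_NS = {"d": "http://digir.net/schema/protocol/2003/1.0"}

# Only the diagnostics part of the response is reproduced here.
SAMPLE = """<response xmlns='http://digir.net/schema/protocol/2003/1.0'>
  <diagnostics>
    <diagnostic code="MATCH_COUNT" severity="info">1</diagnostic>
    <diagnostic code="RECORD_COUNT" severity="info">1</diagnostic>
    <diagnostic code="END_OF_RECORDS" severity="info">true</diagnostic>
  </diagnostics>
</response>"""


def read_diagnostics(xml_text):
    """Map each diagnostic code to its text value."""
    root = ET.fromstring(xml_text)
    return {
        diag.get("code"): (diag.text or "").strip()
        for diag in root.iterfind("d:diagnostics/d:diagnostic", DIGIR_NS)
    }


def more_records_available(diagnostics):
    """Assumed interpretation: anything other than 'true' means keep paging."""
    return diagnostics.get("END_OF_RECORDS", "true").lower() != "true"


diagnostics = read_diagnostics(SAMPLE)
print(diagnostics["MATCH_COUNT"], more_records_available(diagnostics))
```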

BioCASe-ABCD compared to DiGIR-Darwin Core
BioCASe-ABCD model
• Document-based (response document includes metadata and records as a structured package)
• Strengths
  • No problem with modelling complex nested structures and repeating elements
  • Fits perfectly with UBIF proposal – ABCD DataSet elements and ABCD Metadata could readily be standardised with the DataSet/Metadata structures from other TDWG standards such as Structured Descriptive Data (SDD) and Taxon Concept Schema (TCS) – with rather little work. DataSets from all three of these could be combined to form a single document with cross-references between sections.
• Possible weaknesses
  • Not simple for specialist networks to extend the structure with additional elements of their own (requires well-planned open extension points to be designed into the schema), especially if a provider wishes to be part of more than one such specialist network at the same time.
  • (At present) all elements in the ABCD schema are versioned together. Handling an updated version of the schema requires significant additional effort on the part of providers and users. For example, adding new elements to support plant genetic resource data – without changing the elements for museum/herbarium specimens – requires all users to handle a new version of the schema.
DiGIR-Darwin Core model
• Record-based (response returns a set of records which may contain descriptor elements from any schema)
• Strengths
  • Massively flexible and extensible model allowing different networks to use a common protocol and shared core elements alongside their own network-specific extensions.
  • (In integrated protocol version) could return ABCD elements as part of response records. If ABCD is treated as a library of elements, this fits even better.
  • Model maps well to supporting a flexible object-oriented data model for biodiversity informatics.
• Possible weaknesses
  • (In existing version) cannot readily handle complex data structures with nested repeating elements.
  • Records have no intrinsic data type – currently relies on an implicit understanding between user and data provider.

Exchange via web services
[Diagram: heterogeneous databases are exposed through web services. Users send a <request> over the Internet and receive standardised structured data as a <response> containing <record> elements.]

GBIF network of biodiversity data nodes
[Diagram: data nodes connected to the GBIF Network, each publishing resources through DiGIR-DarwinCore, BioCASe-ABCD or Taxon Concept Schema interfaces.]
Museum A
• Specimens: Flowering Plants of Africa
• Specimens: Proteaceae of the World
• Taxon Names: Proteaceae of the World
Observer Network B
• Observations: Birds of Central America
• Observations: Butterflies of Belize
• Checklist: Birds of Belize
Museum C
• Specimens: Mammals of North Europe
• Taxon Names: Mammals of the World
• Further Links: Mammals
University D
• Specimens: Bacteria Cultures
• Taxon Names: Bacteria
• Further Links: Bacteria

Central GBIF registry of data nodes
Data Node           Type of data           Taxon             Region            Records
Museum A            Specimen/Observation   Flowering Plants  Africa             327000
                    Specimen/Observation   Proteaceae        World               23000
                    Taxonomic Names        Proteaceae        World                1500
Observer Network B  Specimen/Observation   Birds             Central America     68500
                    Specimen/Observation   Butterflies       Belize               4200
                    Name List              Birds             Belize                587
Museum C            Specimen/Observation   Mammals           North Europe         1800
                    Taxonomic Names        Mammals           World                8000
                    General Resources      Mammals           World                 600
University D        Specimen/Observation   Bacteria          World                1200
                    Taxonomic Names        Bacteria          World                5000
                    General Resources      Bacteria          World                 400
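One way to picture the registry is as a list of resource entries that a portal can filter before deciding which nodes to query. The sketch below is a hypothetical in-memory model, not GBIF's implementation: the rows are taken from the slide, the 68500 count is assigned to the Birds of Central America resource by inference from the slide layout, and the Resource class and find_resources function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    node: str
    data_type: str
    taxon: str
    region: str
    records: int


# Rows from the registry slide. The 68500 figure is placed on the
# Birds / Central America row by inference, not stated explicitly.
REGISTRY = [
    Resource("Museum A", "Specimen/Observation", "Flowering Plants", "Africa", 327000),
    Resource("Museum A", "Specimen/Observation", "Proteaceae", "World", 23000),
    Resource("Museum A", "Taxonomic Names", "Proteaceae", "World", 1500),
    Resource("Observer Network B", "Specimen/Observation", "Birds", "Central America", 68500),
    Resource("Observer Network B", "Specimen/Observation", "Butterflies", "Belize", 4200),
    Resource("Observer Network B", "Name List", "Birds", "Belize", 587),
    Resource("Museum C", "Specimen/Observation", "Mammals", "North Europe", 1800),
    Resource("Museum C", "Taxonomic Names", "Mammals", "World", 8000),
    Resource("Museum C", "General Resources", "Mammals", "World", 600),
    Resource("University D", "Specimen/Observation", "Bacteria", "World", 1200),
    Resource("University D", "Taxonomic Names", "Bacteria", "World", 5000),
    Resource("University D", "General Resources", "Bacteria", "World", 400),
]


def find_resources(registry, data_type=None, taxon=None, region=None):
    """Return registry entries matching every criterion that is given."""
    return [
        r for r in registry
        if (data_type is None or r.data_type == data_type)
        and (taxon is None or r.taxon == taxon)
        and (region is None or r.region == region)
    ]


for resource in find_resources(REGISTRY, data_type="Specimen/Observation", taxon="Mammals"):
    print(resource.node, resource.region, resource.records)
```

A real registry would also need taxonomic and geographic containment (for example, that Proteaceae are flowering plants), which exact string matching does not capture.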

DiGIR-BioCASe Protocol and Nested Networks
Example user requests
• Get DiGIR-style records each with a set of Darwin Core descriptors and a complete ABCD Unit.
• Get full set of Soy Bean crop descriptors.
• Get standard plant genetic resource Passport data for all crop types.
• Get complete ABCD documents from each BioCASe provider.
• Get Darwin Core records where darwin:ScientificName equals Puma concolor from any provider.
Providers (each serving Taxon Occurrence data)
• BioCASe Provider
• MaNIS Provider
• OBIS Provider
• IPGRI Banana Provider
• IPGRI Soy Bean Provider
Schemas and extensions shown
• Darwin Core, ABCD, Curatorial, Marine, IPGRI Passport, Banana Descriptor, Soy Bean Descriptor
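The nested-network idea can be sketched as a lookup from the descriptor sets a request needs to the providers able to answer it. The mapping of providers to schemas below is an assumption inferred from the labels on the slide (the diagram does not survive in this transcript), and the names PROVIDERS and providers_for are hypothetical.

```python
# Assumed provider-to-schema mapping, inferred from the slide's labels.
PROVIDERS = {
    "BioCASe Provider": {"Darwin Core", "ABCD"},
    "MaNIS Provider": {"Darwin Core", "Curatorial"},
    "OBIS Provider": {"Darwin Core", "Marine"},
    "IPGRI Banana Provider": {"Darwin Core", "IPGRI Passport", "Banana Descriptor"},
    "IPGRI Soy Bean Provider": {"Darwin Core", "IPGRI Passport", "Soy Bean Descriptor"},
}


def providers_for(required_schemas):
    """Return the providers that support every schema a request needs."""
    required = set(required_schemas)
    return sorted(
        name for name, schemas in PROVIDERS.items() if required <= schemas
    )


# "Get Darwin Core records ... from any provider" reaches every provider.
print(providers_for({"Darwin Core"}))
# "Get standard plant genetic resource Passport data for all crop types."
print(providers_for({"IPGRI Passport"}))
# "Get full set of Soy Bean crop descriptors."
print(providers_for({"Soy Bean Descriptor"}))
```

Under this reading, the shared core gives the widest reach and each extension narrows the request to the specialist network that defines it.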

GBIF index to biodiversity data
[Diagram: user requests go to GBIF Biodiversity Data Access, which draws on the Biodiversity Data Index and the Taxonomic Name Service (ECAT), backed by the Catalogue of Life. GBIF Data Nodes supply specimen data and observation data via DiGIR/BioCASe, name lists via Taxon Concept, and links to other data.]

GBIF data index

Central portal to biodiversity data
[Diagram: a user asks the GBIF Portal "Show specimen records for Erinaceus europaeus". The portal queries the data nodes, which return 6, 35, 17 and 0 records, and presents a combined list of 58 records, each showing the provider (Museum A, Observer B, ...) and locality (Paris, Nice, Avignon, Marseille, Norwich, Southampton, ...).]
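The fan-out and merge behind the portal can be sketched in a few lines of Python. This is a toy model under stated assumptions: the per-node counts (6, 35, 17, 0) are from the slide, but which node returns which count is not recoverable, so the assignment below is arbitrary; the providers are stand-in functions rather than real DiGIR or BioCASe calls; and federated_search is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def make_provider(name, count):
    """Build a stand-in provider returning `count` placeholder records."""
    def search(scientific_name):
        return [
            {"provider": name, "scientific_name": scientific_name, "seq": i + 1}
            for i in range(count)
        ]
    return search


# Counts from the slide; the node-to-count assignment is arbitrary.
PROVIDERS = {
    "Museum A": make_provider("Museum A", 35),
    "Observer B": make_provider("Observer B", 17),
    "Museum C": make_provider("Museum C", 6),
    "University D": make_provider("University D", 0),
}


def federated_search(scientific_name, providers):
    """Query every provider in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = {
            name: pool.submit(search, scientific_name)
            for name, search in providers.items()
        }
        per_provider = {name: f.result() for name, f in futures.items()}
    merged = [rec for recs in per_provider.values() for rec in recs]
    counts = {name: len(recs) for name, recs in per_provider.items()}
    return merged, counts


merged, counts = federated_search("Erinaceus europaeus", PROVIDERS)
print(len(merged), counts)
```

A provider with no matching data simply contributes an empty list, so the portal's total is the sum of whatever each node returns.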

GBIF Data Portal


Participant Nodes with tailored information
[Diagram: a user asks the GBIF Portal "Show specimen records for Erinaceus europaeus" and receives 58 GBIF records (Museum A, Observer B, ..., Museum C; localities include Paris, Nice, Avignon, Marseille, Norwich, Southampton, ..., Toulouse). A user of the GBIF France participant node asks "Show specimen records for Erinaceus europaeus from France" or "Show occurrence of Hérisson d’Europe". GBIF France uses Geographic Services to mark each of the 58 GBIF records as inside or outside France and returns the 26 matching records (1 to 6 Museum A, 23 and 29 Observer B, ..., 58 Museum C; localities include Paris, Nice, Avignon, Marseille, Calais, ..., Toulouse).]

Flexible applications
A customs official discovers specimens of a possible pest species of weevil (Curculionidae) on a consignment of agricultural produce at a port of entry. The GBIF Network generates an identification key for pest weevil species so that the official can determine the appropriate response. This application requires access to data from a wide range of sources, including those GBIF participants that are organisations.
[Diagram: the request "Provide key to identify reportable Curculionidae" is sent to GBIF, which combines a list of names of reportable pest species with descriptive data to build the key. Example couplets: 1. Elytra brown / Elytra not brown; 2. Thorax black / Thorax brown; 3. Hind tibia black / Hind tibia brown; 4. Hind femur brown / Hind femur black; ... Each lead points to a further couplet or to "Non-pest".]

Monitoring of data usage
[Diagram: searches made through the GBIF Portal are written to Data Usage Logs, from which Data Usage Reports are produced for each provider.]
Portal searches
• "Show specimen records for Upupa epops": 81 records (Museum A, Paris; Nice; Paris; ...)
• "Show bird specimen records from Nice": 126 records (Museum A: Upupa epops, Apus apus, Athene noctua, ...)
GBIF Usage: Museum A
• 16 August 2003, Search: Upupa epops, 5 records returned
• 18 August 2003, Search: Birds from Nice, 16 records returned
GBIF Usage: Observer B
• 16 August 2003, Search: Upupa epops, 2 records returned
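A per-provider usage report like the ones on this slide could be derived from a simple log of searches. The sketch below is a hypothetical data model: the three log entries are the ones shown on the slide, while the field names and the usage_report function are illustrative.

```python
from collections import defaultdict

# The three usage entries shown on the slide; field names are illustrative.
USAGE_LOG = [
    {"date": "2003-08-16", "search": "Upupa epops", "provider": "Museum A", "returned": 5},
    {"date": "2003-08-16", "search": "Upupa epops", "provider": "Observer B", "returned": 2},
    {"date": "2003-08-18", "search": "Birds from Nice", "provider": "Museum A", "returned": 16},
]


def usage_report(log):
    """Group log entries by provider and total the records returned."""
    report = defaultdict(lambda: {"searches": [], "total_returned": 0})
    for entry in log:
        provider = report[entry["provider"]]
        provider["searches"].append(
            (entry["date"], entry["search"], entry["returned"])
        )
        provider["total_returned"] += entry["returned"]
    return dict(report)


report = usage_report(USAGE_LOG)
for provider, summary in report.items():
    print(f"GBIF Usage: {provider}")
    for day, search, returned in summary["searches"]:
        print(f"  {day} Search: {search} {returned} records returned")
```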

Future activity
Globally unique identifiers
• TDWG-GBIF collaboration to develop models to allow data providers to attach persistent identifiers to their data records
• Allow software to detect multiple instances of the same record
• Allow users to save resolvable references to specimens, collections, taxon concepts, etc.
Schema repository
• Central library of information on data models
• Resource for discovering documentation or mappings between different schemas
• Better support for intelligent software applications
Data validation tools
• Framework for running sets of validation tests against XML data (content values, controlled vocabularies, relating georeference data to named localities, etc.)
• Support different uses (data providers to locate possible problems in data; users to assure themselves of suitability of data; GBIF to provide metadata on data completeness/coherence)
Access to a wide range of taxonomic name data
• Taxonomic/nomenclatural authorities (nomenclators, global species databases, revisions, etc.)
• Lists used by different communities/organisations (red lists, pest species, regional checklists, etc.)
Customised portals
• Organised according to taxon lists used by each user
• Notifications of new data based on user profiles (taxonomic, geographic, etc.)
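The data validation framework is only described at a high level here, so the following is a speculative sketch of what "running sets of validation tests" against records might look like. The field names are Darwin Core elements used earlier in this deck; the individual checks, the choice of required fields, and the run_checks function are all assumptions for illustration.

```python
from datetime import date


def check_required(record, required=("ScientificName", "InstitutionCode", "CatalogNumber")):
    """Report required fields that are missing or empty (assumed list)."""
    return [f"missing {name}" for name in required if not record.get(name)]


def check_coordinates(record):
    """Report coordinates that are non-numeric or outside valid ranges."""
    problems = []
    for field, limit in (("Latitude", 90.0), ("Longitude", 180.0)):
        raw = record.get(field)
        if raw is None:
            continue
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{field} is not a number: {raw!r}")
            continue
        if abs(value) > limit:
            problems.append(f"{field} out of range: {value}")
    return problems


def check_collection_date(record):
    """Report year/month/day parts that do not form a real calendar date."""
    parts = [record.get(k) for k in ("YearCollected", "MonthCollected", "DayCollected")]
    if any(p is None for p in parts):
        return []
    try:
        date(*(int(p) for p in parts))
    except ValueError:
        return ["invalid collection date: " + "-".join(map(str, parts))]
    return []


CHECKS = (check_required, check_coordinates, check_collection_date)


def run_checks(record, checks=CHECKS):
    """Run every check and return the combined list of problems."""
    return [problem for check in checks for problem in check(record)]


good = {
    "ScientificName": "Dichomeris marginella (Fabricius, 1781)",
    "InstitutionCode": "DGH", "CatalogNumber": "DGHEUR_0002976",
    "Latitude": "55.737", "Longitude": "12.538",
    "YearCollected": "2003", "MonthCollected": "06", "DayCollected": "08",
}
bad = dict(good, Latitude="95.2", MonthCollected="13")
print(run_checks(good))
print(run_checks(bad))
```

Keeping each test as an independent function would let providers, users and GBIF run different subsets, which matches the three uses listed on the slide.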

Links
Taxonomic Databases Working Group
• http://www.tdwg.org/ (including access to working groups)
Global Biodiversity Information Facility
• http://www.gbif.org/ (Communications Portal)
• http://www.gbif.net/ (Data Portal)
• http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture (Architecture documents)