Darwin Core Archives
Checklist Extensions, Archive Tools, Checklist Bank
Markus Döring & David Remsen, GBIF 2010

Checklist Scope

Darwin Core
  • Ratified in 2009
    • Significant additions/refinements
    • Ongoing process
  • Set of terms
    • http://rs.tdwg.org/dwc/terms/index.htm
    • Not tied to technology
  • Use Text Guidelines for DwC-A
    • http://rs.tdwg.org/dwc/terms/guides/text/index.htm

Darwin Core Archives
  • Simplicity
    • Complete datasets, compressed
    • Allow for rich dataset metadata
    • Minimal requirement: a single CSV file with a header row
  • Flexible for interoperability
    • 1:many extensions
    • Schema descriptor meta.xml
    • Property mapping to column or global value
  • GNA exchange format
    • Standard extensions
    • Taxonomic core conventions
    • Controlled vocabularies
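
To make "property mapping to column or global value" concrete, here is a minimal Python sketch that reads a small descriptor and separates column-mapped terms from terms with a global default. The descriptor content, the file names (taxa.txt, eml.xml) and the helper name read_core_mapping are illustrative assumptions, not taken from the slides; the element and attribute names follow the Darwin Core text guidelines as far as we know them.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Darwin Core text guidelines for meta.xml.
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Hypothetical descriptor: one core file, two mapped columns, one global value.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field default="ICBN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>
</archive>"""


def read_core_mapping(meta_xml):
    """Return the core file name, id column, column mapping and global defaults."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwca:core", NS)
    columns, defaults = {}, {}
    for field in core.findall("dwca:field", NS):
        # Use the last path segment of the term URI as a short name.
        term = field.get("term").rsplit("/", 1)[-1]
        if field.get("index") is not None:
            columns[term] = int(field.get("index"))
        if field.get("default") is not None:
            defaults[term] = field.get("default")
    return {
        "metadata": root.get("metadata"),
        "file": core.find("dwca:files/dwca:location", NS).text,
        "id_index": int(core.find("dwca:id", NS).get("index")),
        "columns": columns,
        "defaults": defaults,
    }


mapping = read_core_mapping(META_XML)
```

In this sketch nomenclaturalCode has no column at all: it applies to every record as a global value, which is what keeps the data file itself a plain table.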

Best Practices
  • Include dataset metadata file or URL
    • inside <archive metadata="...">
    • GBIF recognises EML files
    • For simplicity, a Dublin Core XML file will do
  • Data file format
    • UTF-8
    • Tab-delimited or CSV files
    • Header row
    • NULL as empty string, not “N” or “NULL”
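
The data file conventions above can be checked with a few lines of Python. This is a sketch under assumptions: the function name read_rows and the sample rows are made up, and the set of suspicious markers adds the MySQL-style \N to the two strings named on the slide.

```python
import csv
import io

# Strings that suggest a NULL was exported as text instead of an empty field.
# "N" and "NULL" come from the slide; "\N" is an added assumption.
SUSPICIOUS_NULLS = {"N", "NULL", "\\N"}


def read_rows(text, delimiter="\t"):
    """Parse delimited text with a header row; empty strings become None."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows, warnings = [], []
    # Data starts on line 2 because line 1 is the header.
    for line_no, values in enumerate(reader, start=2):
        row = {}
        for name, value in zip(header, values):
            if value in SUSPICIOUS_NULLS:
                warnings.append((line_no, name, value))
            row[name] = value if value != "" else None
        rows.append(row)
    return rows, warnings


SAMPLE = (
    "taxonID\tscientificName\tparentNameUsageID\n"
    "1\tPlantae\t\n"
    "2\tRubus\tNULL\n"
)
rows, warnings = read_rows(SAMPLE)
```

The second sample row is kept as delivered but flagged, so a publisher can be told which line and column carry a textual NULL.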

Dwc: Taxon - Identifier
  • Relational data, record ID
    • = taxonID for checklist archives
    • = occurrenceID for occurrence archives
    • The primary key that other ID terms relate to
  • taxonConceptID
    • Asserts that taxa have a shared concept
  • scientificNameID
    • Links out to some optional name identifier, a GUID really
  • Identifiers are plain strings and can be in any format
  • Literal terms, e.g. parentNameUsage
    • All Dwc ID terms have such a literal friend
    • Redundant if ID terms are used
    • To be avoided for relations, e.g. because of homonyms

Dwc: Taxon - Classification
  • Classification only for accepted taxa, not synonyms
  • parentNameUsageID
    • Allows for arbitrary ranks and levels
    • Beware infinite loops
    • Root with parentID=NULL or parentID=recordID
  • Denormalised (prefer the use of parentNameUsageID)
    • Kingdom, Phylum, Class, Order, Family, Genus, Subgenus
    • No explicit records required for higher taxa
  • taxonRank
    • String, but recommended vocabulary: http://rs.gbif.org/vocabulary/gbif/rank.xml
  • Examples: http://code.google.com/p/gbif-ecat/wiki/publishingClassifications
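
The warning about infinite loops and the two root conventions can be sketched as a parent-chain walk. The function name classification_path and the sample IDs are hypothetical; the only assumption taken from the slide is that a root has an empty parent or points to itself.

```python
def classification_path(taxon_id, parent_of):
    """Walk parentNameUsageID links from a taxon up to its root.

    parent_of maps each record ID to its parent ID (None for an empty value).
    A record is a root if its parent is None or equals its own ID.
    """
    path = [taxon_id]
    seen = {taxon_id}
    current = taxon_id
    while True:
        parent = parent_of.get(current)
        if parent is None or parent == current:
            return path  # reached a root
        if parent in seen:
            raise ValueError(f"classification loop at {parent!r}")
        if parent not in parent_of:
            raise LookupError(f"parent {parent!r} is not a record in the core file")
        path.append(parent)
        seen.add(parent)
        current = parent


# Hypothetical records: 1 is a NULL root, 4 is a self-pointing root, 5 and 6 form a loop.
PARENTS = {"1": None, "2": "1", "3": "2", "4": "4", "5": "6", "6": "5"}
path = classification_path("3", PARENTS)
```

Tracking the visited IDs is what turns a silently endless walk into an explicit error that can be reported back to the publisher.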

Dwc: Taxon - Synonyms
  • Synonyms are records in the core file
    • But their classification should be ignored
  • acceptedNameUsageID
    • Synonyms point to the accepted/valid name usage
    • Accepted names have NULL or point to themselves
    • Pro parte synonyms concatenate all accepted IDs with the | symbol
  • taxonomicStatus
    • Accepted, (hetero-/homotypic) synonym, misapplied
    • See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml
  • nameAccordingTo
    • The sec. / sensu part of taxon concepts
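
A short sketch of how a consumer might interpret acceptedNameUsageID under these conventions. The function names and sample records are hypothetical; the rules (empty or self-pointing means accepted, | separates pro parte targets) are the ones listed above.

```python
def accepted_ids(record):
    """Return the accepted usage IDs a record points to.

    An empty acceptedNameUsageID means the record is itself accepted.
    Pro parte synonyms list several accepted IDs separated by "|".
    """
    raw = record.get("acceptedNameUsageID")
    if not raw:
        return [record["taxonID"]]
    return [part.strip() for part in raw.split("|") if part.strip()]


def is_synonym(record):
    """A record is a synonym if it points to anything other than itself."""
    return accepted_ids(record) != [record["taxonID"]]


# Hypothetical records for illustration.
accepted = {"taxonID": "10", "acceptedNameUsageID": None}
self_pointing = {"taxonID": "11", "acceptedNameUsageID": "11"}
pro_parte = {"taxonID": "12", "acceptedNameUsageID": "10|11"}
```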

Dwc: Taxon - Nomenclature
  • scientificName
    • Full name with authorship
    • genus, subgenus, specificEpithet, verbatimTaxonRank, infraspecificEpithet, scientificNameAuthorship
  • namePublishedIn
  • nomenclaturalStatus
  • nomenclaturalCode
    • http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml
  • originalNameUsageID
    • Basionym: pointer to the usage that first established the name

Darwin Core Extensions

Dwc Extensions - Basics
  • One-to-many relation, schema descriptor meta.xml
    • id column required to join extensions
    • rowType specifies the class of records / extension
    • Property mapping to column or global value
  • List of allowed properties with
    • Definition, examples, further link
    • Mandate
    • Vocabulary
    • Basic data types: string, integer, decimal, boolean, dateTime
  • Centrally hosted at http://rs.gbif.org
    • Staging environment
    • Production is manually moderated, but open to community
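
The one-to-many join between core and extension records can be sketched as follows. The function name, the column names (taxonID in the core, coreid in the extension) and the sample rows are assumptions chosen for illustration.

```python
from collections import defaultdict


def iter_star(core_rows, extension_rows, core_id="taxonID", ext_id="coreid"):
    """Yield each core record together with all extension records that join to it."""
    by_core = defaultdict(list)
    for ext in extension_rows:
        by_core[ext[ext_id]].append(ext)
    for core in core_rows:
        # A core record without extension rows yields an empty list.
        yield core, by_core.get(core[core_id], [])


# Hypothetical core and vernacular-name extension rows.
taxa = [
    {"taxonID": "1", "scientificName": "Rubus idaeus L."},
    {"taxonID": "2", "scientificName": "Rubus silvaticus Weihe & Nees"},
]
vernaculars = [
    {"coreid": "1", "vernacularName": "raspberry", "language": "en"},
    {"coreid": "1", "vernacularName": "Himbeere", "language": "de"},
]
star = list(iter_star(taxa, vernaculars))
```

Extension rows carry no identity of their own here; the id column exists only to attach them to a core record.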

Dwc: Taxon Extensions
  • Frozen soon for GNA “Simple Exchange Format”: http://rs.gbif.org/extension/gbif/1.0/
    • Vernaculars
    • Distribution
    • Bibliography
    • Alternative IDs & links: webpage, LSID, DOI, JSON, etc.
  • Candidates for further extensions
    • Species info
    • Images
    • Nomenclatural acts & name relations
    • Concept relations
    • Type specimen

Darwin Core Tools Publishing support

DwC-A Reader: Java library
  • Provides iterators across star schema
  • Dwc terms and GNA extension terms as enumerations

Validator
  • Status: Under Evaluation
  • http://tools.gbif.org/dwca-validator/

Integrated Publishing Toolkit
  • Compose EML metadata
  • Connect to database
  • Upload data
  • Transform to DwC-A
  • Publish via GBIF
  • Status: Stable release, end 2010
  • http://ipt.gbif.org

Guidelines and Best Practices
  • DB admin skills
  • Database export
  • No tools required
  • Successful pilots
    • Ireland
    • NBN UK
    • Norway
    • Avian Knowledge Network
    • IPNI
    • IRMNG
  • Status: Drafts for November campaign (see roadmap)

Authoring Descriptor XML Metafile
  • Status: Ready for Review
  • http://tools.gbif.org/dwca-assistant/

Excel Spreadsheet Templates
  • Status: Ready for Review/Testing

Spreadsheet Processor
  • Status: Ready for Review
  • http://tools.gbif.org/spreadsheet-processor/

Checklist Bank Indexing checklists

GBIF Checklist Bank
  • Rich index to checklists and their content
    • All of Dwc Taxon and the GNA Simple Format extensions: vernacular names, identifiers & links, distribution, references
    • ~35 million name usages, 90 datasets + 8500 derived from the occurrence index
  • Checklists
    • DwC-A created by
      • Publisher
      • Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife)
      • Manual transformation, static
    • No versioning
    • 4 main types: taxonomic, nomenclatural, occurrences, thematic

Name Usages
  • Checklists are made up of name usages
  • A name usage is a plain name string with, optionally:
    • Classification
    • Taxonomic status, e.g. synonym, misapplied name
    • Original name, i.e. basionym
    • According to, i.e. taxon concept
    • Nomenclatural status
    • Original publication

Lexical Grouping
  • Name strings are parsed and grouped
    • Correct & incorrect spellings
    • Homonyms in several groups
  • Semiautomatic process largely based on canonical name, year and higher classification
  • Allows for
    • Fuzzy matching
    • Checklist crosswalk
  • Example groups (from the slide diagram)
    • Rubus: silvaticus / sylvaticus / silvaticum / silvaticus Weihe & Nees
    • Vertebrata [animal subphylum]: Vertebrate / Vertebrata Cuvier, 1812
    • Vertebrata [algae genus]: Vertebrata Gray / Vertebrata S. F. Gray, 1821
    • Gerardia paupercula var. borealis (Pennell) Deam: paupercula (Gray) Britt. var. borealis (Pennell) / paupercula (A. Gray) Britton var. borealis (Pennel / paupercula borealis (Pennell) Deam
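
As a rough illustration of grouping spelling variants, the sketch below derives a crude key from the genus and a normalised epithet. It is not the Checklist Bank algorithm: the normalisation rules (y to i, dropping -us/-um/-a endings) and the function names are assumptions, and it ignores year and higher classification, so homonyms such as the two Vertebrata would wrongly share a group. The full Rubus names assume the genus applies to every variant shown on the slide.

```python
import re
from collections import defaultdict

# Latin gender endings stripped before comparison (an illustrative heuristic).
GENDER_ENDINGS = ("us", "um", "a")


def group_key(name):
    """Build a crude lexical key: lower-cased genus plus normalised epithet."""
    tokens = name.split()
    genus = tokens[0].lower()
    # Treat the first all-lowercase word after the genus as the specific epithet;
    # capitalised words, "&" and years are assumed to belong to the authorship.
    epithet = next((t for t in tokens[1:] if re.fullmatch(r"[a-z]+", t)), "")
    epithet = epithet.replace("y", "i")
    for ending in GENDER_ENDINGS:
        if epithet.endswith(ending):
            epithet = epithet[: -len(ending)]
            break
    return (genus, epithet)


def lexical_groups(names):
    """Group name strings that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[group_key(name)].append(name)
    return dict(groups)


names = [
    "Rubus silvaticus",
    "Rubus sylvaticus",
    "Rubus silvaticum",
    "Rubus silvaticus Weihe & Nees",
    "Rubus idaeus L.",
]
groups = lexical_groups(names)
```

A real implementation would need a proper name parser and the extra signals named on the slide to keep homonyms apart.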

Nomenclatural Grouping: homotypic names
  • Original name relation
  • Homotypic synonyms
  • Not yet available

Checklist Bank Portal
  • Preliminary until new GBIF portal complete
  • Browse & search
  • Statistics
  • Links to source pages
  • Flickr images

Checklist Bank Webservices
  • Common API to all resources
  • RESTful JSON services
    • Search names, usages, checklists
    • Navigate classification
  • http://ecat-dev.gbif.org/api/clb

Importing Darwin Core
  • Highly relational data
  • Challenges faced
    • Syntactically damaged sources
      • Wrong mappings, charsets, non-escaped line breaks or field delimiters
    • Data quality
      • Broken referential integrity
      • Non-names, e.g. “Unallocated Family”
      • No standard vocabularies for ranks, status, etc.
    • Name strings have several publishing options
      • scientificName, authorship, genus + epithets + rank
    • Classification has several publishing options
      • Normalised (parentUsage / parentUsageID) or flat via Linnean ranks
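
Broken referential integrity is one of the challenges that is easy to detect before import. The sketch below lists every ID reference that does not resolve to a record in the core file; the function name, the choice of reference fields and the sample rows are assumptions for illustration.

```python
def broken_references(rows, id_field="taxonID",
                      ref_fields=("parentNameUsageID", "acceptedNameUsageID")):
    """Return (record ID, field, missing ID) for every dangling reference."""
    known = {row[id_field] for row in rows}
    problems = []
    for row in rows:
        for field in ref_fields:
            value = row.get(field)
            if not value:
                continue  # empty means "no reference", which is allowed
            # Pro parte synonyms may carry several IDs separated by "|".
            for ref in value.split("|"):
                if ref not in known:
                    problems.append((row[id_field], field, ref))
    return problems


# Hypothetical core rows: record 3 points to a parent that does not exist.
rows = [
    {"taxonID": "1", "parentNameUsageID": None, "acceptedNameUsageID": None},
    {"taxonID": "2", "parentNameUsageID": "1", "acceptedNameUsageID": None},
    {"taxonID": "3", "parentNameUsageID": "99", "acceptedNameUsageID": "1|2"},
]
problems = broken_references(rows)
```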

GBIF Nub
  • Synthetic “union taxonomy”, checklist #1
  • Lexical group = nub name usage
  • Classification based on prioritized checklists
    • Align to 8 CoL kingdoms
    • Fixed accepted ranks:
      • Linnean + subfamily, subgenus, section, subspecies, variety, form
      • Other ranks become “Intermediate rank”
  • Synonyms
    • Homotypic synonyms only
  • Work in progress!

Personal Name Lists
  • User accounts with personal name lists
    • Add classifications, status, distribution, vernaculars, etc. from one or more indexed checklists
  • Also on the fly via webservices
    • Name string + kingdom/nom. code
    • But only for already indexed name strings
  • In development …