Developing Standards for Data Quality Tests and Assertions

- Slides: 21
Developing Standards for Data Quality Tests and Assertions using a Fitness for Use Framework
Paul J. Morris; Museum of Comparative Zoology
Lee Belbin (Convener, BDQ TG 2); The Atlas of Living Australia
Abby Benson; US Geological Survey; OBIS
Arthur Chapman (Convener, BDQ IG); Australian Biodiversity Information Services
Shelley James; iDigBio; RBG Sydney
Allan Koch Veiga (Convener, BDQ TG 1); University of São Paulo
Miles Nicholls (Convener, BDQ TG 3); CSIRO
Pieter Provoost; UNESCO
Antonio Mauro Saraiva (Convener, BDQ IG); University of São Paulo
Dmitry Schigel; GBIF
Alexander M. Thompson; iDigBio
Dave Watson; OBIS
John Wieczorek; Museum of Comparative Zoology; VertNet
Paula Zermoglio; University of Buenos Aires
TDWG Data Quality Interest Group
Conveners: Arthur Chapman, Antonio Mauro Saraiva
Task Group 1 – BDQ Framework. Convener: Allan Koch Veiga
Task Group 2 – Tests and Assertions (Standard Set of Tests). Convener: Lee Belbin
Task Group 3 – User Stories/Use Cases. Convener: Miles Nicholls
Proposed Task Group 4 – Vocabularies
TG 3: User Stories/Use Cases; TG 1: Framework; DQ Management; Collections; Databases of Record; Data Custodians; TG 2: Standard Tests; TG 4 (proposed): Vocabularies; Implementations: GBIF, iDigBio, ALA, VertNet, OBIS, Kurator, etc.
Data Quality & Fitness for Use
Use: What is the purpose or use for which data must have quality?
Data: What kind of data are relevant and must have quality in the context of the Use?
Fitness: What constitutes “fitness” for the relevant Data in the context of the Use?
Amendments: 12/15/75 → 1975-12-15
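An amendment proposes a standardized value for a verbatim one. The following is only a minimal sketch of that idea, assuming the input is written month/day/two-digit-year and that the century is 1900; the slide shows only the before and after values, so both assumptions and the function name `propose_iso_date` are hypothetical.

```python
from datetime import date
from typing import Optional

def propose_iso_date(verbatim: str, century: int = 1900) -> Optional[str]:
    """Propose an ISO 8601 date for an M/D/YY string, or None if it cannot be read that way."""
    parts = verbatim.strip().split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    month, day, year = (int(p) for p in parts)
    if year < 100:
        year += century  # assumption: two-digit years belong to the given century
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # e.g. month 15 or day 0: no amendment is proposed
        return None

print(propose_iso_date("12/15/75"))  # 1975-12-15
```

Returning None rather than guessing keeps the amendment from overwriting values that do not fit the assumed pattern.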
Fitness: Data and Use
occurrenceId: urn:uuid:a3……
eventDate: 1970
Use: Phenology. eventDate must have a resolution of a day or better. Not fit for this use.
Use: Change in range over 100 year timescale. eventDate must have a resolution of a year or better. Fit for this use.
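The same value can be fit for one use and unfit for another. A small sketch of that comparison, assuming eventDate is a plain ISO 8601 year, year-month, or full date (ranges and times are ignored here); the names `event_date_resolution` and `fit_for_use` are illustrative, not part of any standard.

```python
import re
from typing import Optional

# Larger rank means finer resolution.
RESOLUTION_RANK = {"year": 1, "month": 2, "day": 3}

def event_date_resolution(event_date: str) -> Optional[str]:
    """Return 'year', 'month' or 'day' for a simple ISO 8601 date, else None."""
    if re.fullmatch(r"\d{4}", event_date):
        return "year"
    if re.fullmatch(r"\d{4}-\d{2}", event_date):
        return "month"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", event_date):
        return "day"
    return None

def fit_for_use(event_date: str, required: str) -> bool:
    """True when the date is at least as fine as the resolution the use requires."""
    resolution = event_date_resolution(event_date)
    if resolution is None:
        return False
    return RESOLUTION_RANK[resolution] >= RESOLUTION_RANK[required]

print(fit_for_use("1970", "day"))   # phenology: False
print(fit_for_use("1970", "year"))  # range change over 100 years: True
```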
Framework/TG Process
User Story (GBIF, TG 3)
Framework (TG 1): Use Case; DQ Profile; DQ Solutions; DQ Assurance (filtering); DQ Report; DQ Control (improvement)
Tests (TG 2): GBIF, OBIS, iDigBio, ALA, VertNet, Kurator
Controlled Vocabularies (TG 4)
Reporting on Data Quality: Validation Report
Needs
Criterion: Date collected must be in ISO format.
Solutions
Information element: dwc:eventDate
Dimension: Conformance
Resource Type: Single Record
Specification: dwc:eventDate parses as an ISO Date
Mechanism: event_date_qc v1.0.3
Report
Result: Compliant
Status: Asserted
Detail: “1884-10” is a valid ISO date.
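The report fields above map naturally onto a small record type. The sketch below is an illustration only: the `ValidationReport` class, the function name, the "Not compliant" result string, and the mechanism label are assumptions, and the regular expression is a simplified stand-in for real ISO 8601 parsing (it accepts only YYYY, YYYY-MM and YYYY-MM-DD and does not check days against the month).

```python
import re
from dataclasses import dataclass

@dataclass
class ValidationReport:
    criterion: str
    information_element: str
    dimension: str
    resource_type: str
    specification: str
    mechanism: str
    result: str
    status: str
    detail: str

# Simplified pattern: YYYY, YYYY-MM or YYYY-MM-DD (no day-of-month check against the calendar).
ISO_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")

def validate_event_date_format(value: str) -> ValidationReport:
    """Build a validation report for the criterion 'date collected must be in ISO format'."""
    ok = ISO_DATE.fullmatch(value) is not None
    return ValidationReport(
        criterion="Date collected must be in ISO format.",
        information_element="dwc:eventDate",
        dimension="Conformance",
        resource_type="Single Record",
        specification="dwc:eventDate parses as an ISO Date",
        mechanism="illustrative sketch (hypothetical, not event_date_qc)",
        result="Compliant" if ok else "Not compliant",  # second label is an assumption
        status="Asserted",
        detail=f'"{value}" is a valid ISO date.' if ok else f'"{value}" is not a valid ISO date.',
    )

report = validate_event_date_format("1884-10")
print(report.result, "-", report.detail)
```

Keeping the need (criterion), the solution (specification, mechanism) and the outcome (result, status, detail) in one record lets reports from different tools be compared field by field.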
TG 3: User Stories/Use Cases; TG 1: Framework; DQ Management; Collections; Databases of Record; TG 2: Standard Tests; TG 4 (proposed): Vocabularies; Implementations: GBIF, iDigBio, ALA, VertNet, OBIS, Kurator, etc.
Why Controlled Vocabularies?
What we find in the data:
reproductiveCondition: 40,838 distinct values
lifeStage: 33,402 distinct values
country: 80,408 distinct values (expected = 250)
GBIF distinct values: https://tinyurl.com/zhnnyy4
Courtesy of Tim Robertson
Controlled Vocabularies
Who would benefit from having Controlled Vocabularies?
Data producers (e.g., collectors) could capture data using pick lists and could impart valuable information more efficiently.
Data custodians (e.g., museum collections) could manage, provide and use data more efficiently.
Controlled Vocabularies
Data aggregators: data quality assessment infrastructure for data filtering.
Data users: more effective filtering and data discovery.
Tools can have differing opinions (Tools come and go)
MCZ data in GBIF. Flagged as: Latitude inverted.
FP-Akka disagrees.
Interoperability problem.
DQ Task Group 2: Set of 112 Tests (Amendments, Validations, Measures)
A Test: is dwc:day in range 1-31?
IDs: # 12a; GUID 48aa7d66-36d1-4662-a503-df170f11b03f
Variable: DAY_INVALID
Description (warning/error): The value given for event day is less than 1 or greater than 31
Description (test - PASS): The value given for event day is between 1 and 31
Specification (Technical): day is < 1 or > 31
Record Resolution: SingleRecord
Term Resolution: SingleTerm
Data Dependency: Internal
Output Type: Validation
Example: day=32
Darwin Core Class: Event
Darwin Core Terms: day
DQ Dimension: Conformance
Severity: Warning
Source: ALA
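The technical specification for this test fits in a few lines. A minimal sketch, assuming dwc:day arrives as a string and that a value that is not an integer also raises the flag (the slide covers only numeric values outside 1 to 31); the function name `check_day` is hypothetical.

```python
from typing import Optional

def check_day(day: Optional[str]) -> Optional[str]:
    """Return the flag 'DAY_INVALID' when dwc:day is outside 1-31, else None."""
    try:
        value = int(day)
    except (TypeError, ValueError):
        return "DAY_INVALID"  # assumption: empty or non-integer values also fail
    if value < 1 or value > 31:
        return "DAY_INVALID"
    return None

print(check_day("32"))  # DAY_INVALID (the example from the test description)
print(check_day("15"))  # None
```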
An example data record
occurrenceID: urn:uuid:205f3f79-512a-44a8-be35-76eef7b89c5d
verbatimEventDate: 0/10/1973
day: 0
month: 10
year: 1973
eventDate: 1973-10
Some tests
Record: day: 0; month: 10; year: 1973; eventDate: 1973-10
Day In Range = NOT_COMPLIANT
Month In Range = COMPLIANT
EventDate Correctly Formatted = COMPLIANT
EventDate precision calendar year or better = COMPLIANT
Day Consistent With Month/Year = DATA_PREREQUISITES_NOT_MET
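The third result state matters: a test that cannot be evaluated is reported as such rather than as a failure. A sketch of how Day Consistent With Month/Year might produce it, assuming the prerequisites are an integer day in 1 to 31, an integer month in 1 to 12, and a year that Python's `calendar` module can handle; this is an illustration, not the Kurator implementation.

```python
import calendar

def day_consistent_with_month_year(day, month, year) -> str:
    """Check that the day exists in the given month and year, if the inputs allow the test."""
    try:
        d, m, y = int(day), int(month), int(year)
    except (TypeError, ValueError):
        return "DATA_PREREQUISITES_NOT_MET"
    # Out-of-range parts are reported by their own tests; here they only block this one.
    if not (1 <= d <= 31 and 1 <= m <= 12 and 1 <= y <= 9999):
        return "DATA_PREREQUISITES_NOT_MET"
    days_in_month = calendar.monthrange(y, m)[1]
    return "COMPLIANT" if d <= days_in_month else "NOT_COMPLIANT"

print(day_consistent_with_month_year("0", "10", "1973"))   # DATA_PREREQUISITES_NOT_MET
print(day_consistent_with_month_year("31", "4", "1973"))   # NOT_COMPLIANT (April has 30 days)
print(day_consistent_with_month_year("29", "2", "1972"))   # COMPLIANT (leap year)
```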
More Details
Criterion in Context: Day In Range, Single Record
Information Element: dwc:day
DataResource: Single Record: urn:uuid:205f3f79-512a-44a8-be35-76eef7b89c5d
Result: NOT_COMPLIANT
Details: Provided value for day '0' is not an integer in the range 1 to 31.
Mechanism: Kurator: DwCEventDQ v1.0.3
Kurator Visualization
Month In Range: COMPLIANT. Provided value for month '10' is an integer in the range 1 to 12.
EventDate precision Julian year or better: COMPLIANT. Provided value for eventDate '1973-10-01' has a duration less than or equal to one Julian year of 365.25 days.
EventDate precision calendar year or better: COMPLIANT. Provided value for eventDate '1973-10-01' does not contain a leap day and has a duration less than or equal to one calendar year of 365 days.
Day Consistent With Month/Year: DATA_PREREQUISITES_NOT_MET. Provided value for day 0 is outside the range 1-31, unable to test.
Day In Range: NOT_COMPLIANT. Provided value for day '0' is not an integer in the range 1 to 31.
Placopecten magellanicus Gmelin, 1791 → FP-Akka → Placopecten magellanicus (Gmelin, 1791)
WAS: Gmelin, 1791; CHANGED TO: (Gmelin, 1791)
Found accepted name Placopecten magellanicus. Source: Catalog of Life. Authorship: Differs only in Parentheses. Authorship Similarity: 0.833
Data? Process? DB Software? Goals? [TG 3] Defect Source? [TG 4] DQ Software? [TG 2]
TG 3: User Stories/Use Cases; TG 1: Framework; DQ Management; Collections; Databases of Record; TG 2: Standard Tests; TG 4 (proposed): Vocabularies; Implementations: GBIF, iDigBio, ALA, VertNet, OBIS, Kurator, etc.