Spatial Data Cleaning
Species Occurrence Data
Arthur D. Chapman
June 2012

Methods for Validating Georeferences
• Internal Database Checks
  – Logical inconsistencies within the database
  – Checking one field against another
    • Text location vs geocode or District/State
• External Database Checks
  – Checking one database against another
    • Gazetteers
    • DEM
    • Collectors
• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers
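
As a rough illustration of the internal checks listed above, the following Python sketch tests a single record for logical inconsistencies and checks one field against another. The `Record` fields, the `STATES_BY_COUNTRY` lookup and the problem labels are all hypothetical; they are not taken from any of the tools named in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """A hypothetical occurrence record with only the fields needed here."""
    record_id: str
    latitude: Optional[float]
    longitude: Optional[float]
    country: str
    state: str


# Hypothetical lookup table; a real one would come from a gazetteer.
STATES_BY_COUNTRY = {
    "Brazil": {"Bahia", "Acre"},
    "Bolivia": {"La Paz", "Beni"},
}


def internal_checks(rec: Record) -> List[str]:
    """Return a list of problems found using only the record's own fields."""
    problems = []
    if rec.latitude is None or rec.longitude is None:
        problems.append("missing coordinate")
    else:
        if not -90.0 <= rec.latitude <= 90.0:
            problems.append("latitude out of range")
        if not -180.0 <= rec.longitude <= 180.0:
            problems.append("longitude out of range")
        if rec.latitude == 0.0 and rec.longitude == 0.0:
            problems.append("coordinates are exactly 0,0")
    # Field-against-field check: does the state belong to the stated country?
    known_states = STATES_BY_COUNTRY.get(rec.country)
    if known_states is not None and rec.state not in known_states:
        problems.append("state not in country")
    return problems
```

A record that passes returns an empty list; anything else is a candidate for manual review rather than a confirmed error.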

Error is inescapable and it should be recognised as a fundamental dimension of data.
Chrisman 1991
[Photo: Bolax gummifera, Argentina]

Geographic Outliers - GIS
Country, State, named district, etc.
Gazetteer of Brazilian localities

How do we find the suspect records?
Some errors are easy to find! But what does this say about the others?
Canis lupus locations, extracted from GBIF 2006. Data from FMNH, KU, PSM, UAM, MSB, Humboldt Univ.

Geographic Outliers - GIS
Collectors – location vs date
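
One way a collector "location vs date" check could work is to sort a collector's records by date and flag any record that implies implausibly fast travel from the previous one. The sketch below is an assumption about how such a check might be written, not the method of any particular tool; the function names and the 200 km/day default are illustrative only.

```python
import math
from datetime import date
from typing import List, Tuple

# (collection date, latitude, longitude) for one collector
Stop = Tuple[date, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def suspect_moves(stops: List[Stop], max_km_per_day: float = 200.0) -> List[int]:
    """Return indices (in date order) of records reached faster than the limit."""
    ordered = sorted(stops, key=lambda s: s[0])
    flagged = []
    for i in range(1, len(ordered)):
        d0, lat0, lon0 = ordered[i - 1]
        d1, lat1, lon1 = ordered[i]
        days = max((d1 - d0).days, 1)  # treat same-day records as one day apart
        if haversine_km(lat0, lon0, lat1, lon1) / days > max_km_per_day:
            flagged.append(i)
    return flagged
```

A single misplaced record usually flags two legs, the one into it and the one out of it, so the flags show where to look rather than which record is wrong.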

Environmental Outliers
• Cumulative Frequency Curves

Using Climate to Identify Outliers
Reverse Jack-knife
Acacia orites - 19 records, 9 temperature parameters
NB. Because the value of ‘C’ relates to its nearest point, successive values may be very small, so we ensure that if ‘x[i]’ is an outlier, then all points beyond are outliers too (even if they are clustered).
[Photo: Acacia dealbata, Australia]
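
The "all points beyond are outliers too" rule can be sketched as a pass over the sorted values that spreads a flag outward into each tail. This is a minimal sketch under stated assumptions: the per-point flags are taken as given (the slide does not define how ‘C’ is computed), "beyond" is read as "further from the middle of the sorted values", and the function name is hypothetical.

```python
from typing import List


def propagate_outliers(sorted_values: List[float], flagged: List[bool]) -> List[bool]:
    """Extend outlier flags outward so every point beyond a flagged one is flagged.

    sorted_values must be in ascending order; flagged[i] is the initial
    outlier decision for sorted_values[i].
    """
    n = len(sorted_values)
    if n != len(flagged):
        raise ValueError("sorted_values and flagged must have the same length")
    result = list(flagged)
    mid = n // 2  # assumed split between the lower and upper tails

    # Upper tail: walk upward; once one point is flagged, all larger ones are too.
    seen = False
    for i in range(mid, n):
        seen = seen or flagged[i]
        result[i] = seen

    # Lower tail: walk downward; once one point is flagged, all smaller ones are too.
    seen = False
    for i in range(mid - 1, -1, -1):
        seen = seen or flagged[i]
        result[i] = seen

    return result
```

With this rule a tight cluster of distant records is flagged as a whole, even though the gap between neighbours inside the cluster is small.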

Concept of “Outlierness”
$T = (0.95\sqrt{n} + 0.2) \times (\mathrm{Range}/50)$, where ‘n’ is the number of records
“Outlierness” is the degree to which a record is an outlier
$\mathrm{Outlierness} = c[i]/T$ (>1 when c[i] exceeds T, <1 otherwise)
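
The threshold and the outlierness ratio can be written out directly from the two formulas on this slide. The sketch below assumes the ‘c[i]’ values have already been computed elsewhere, since the slide does not define them; the function names are hypothetical.

```python
import math
from typing import List


def threshold(n: int, value_range: float) -> float:
    """T = (0.95 * sqrt(n) + 0.2) * (Range / 50), as given on the slide."""
    return (0.95 * math.sqrt(n) + 0.2) * (value_range / 50.0)


def outlierness(c_values: List[float], n: int, value_range: float) -> List[float]:
    """Return c[i] / T for each supplied c[i]; values above 1 exceed the threshold."""
    t = threshold(n, value_range)
    if t == 0:
        raise ValueError("range is zero, so the threshold is undefined")
    return [c / t for c in c_values]
```

For example, with 16 records and a range of 50 the threshold works out to about 4, so a c[i] of 8 has an outlierness of about 2 and a c[i] of 2 has an outlierness of about 0.5.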

FloraMap
• CIAT (Colombia)
• PCA
• Cluster Analysis
• $US 100
• Modelling
• 10-minute grids
[Photo: Nothofagus antarctica, Argentina]

Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.

Cluster Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Cluster Analysis. B. Principal Components Analysis. C. Mapped specimen. D. Climate profile. E. Specimen record.

Diva-GIS
• Free
• Simple GIS
• Modelling (BIOCLIM/Domain)
• Data Cleaning Tools
[Photo: Brown Algae, Argentina]

Diva-GIS – Coordinate Check
Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. The input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file.

Points outside Polygon – Diva-GIS
Results from Diva-GIS showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the linking between the results dialogue box and the mapped record.

Mismatched Provinces – Diva-GIS
Results from Diva-GIS showing point records that do not match set relationships between the specimen point file and the polygon of Bolivian provinces. The highlighted record is one where the geocoding on the specimen record causes it to fall in the wrong province.
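
The two checks shown in the Diva-GIS results (points outside all polygons, and points whose stated province does not match the polygon they fall in) can be approximated with a standard ray-casting point-in-polygon test. This is a minimal sketch, not Diva-GIS's implementation: it assumes simple polygons given as lists of (longitude, latitude) vertices, ignores holes and map projections, and uses hypothetical function names and result strings.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Polygon = List[Point]        # vertices in order, not repeated at the end


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray casting: count polygon edges crossed by a ray running east from pt."""
    x, y = pt
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def check_record(pt: Point, stated: str, provinces: Dict[str, Polygon]) -> str:
    """Compare the province stated on a record with the polygon it falls in."""
    found: Optional[str] = None
    for name, poly in provinces.items():
        if point_in_polygon(pt, poly):
            found = name
            break
    if found is None:
        return "outside all polygons"
    if found != stated:
        return f"mismatch: falls in {found}"
    return "ok"
```

Points lying exactly on a shared boundary are assigned arbitrarily by this test, so records close to a border deserve a manual look before being treated as errors.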

Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.

Bioclimatic Envelope – Diva-GIS
Results from Diva-GIS showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut off is set at 95. Red points on the envelope correspond with red points on the map, green points in the envelope correspond with yellow points on the map.
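
The idea behind the cumulative frequency curve and the percentile envelope can be sketched as follows: for each climate variable, find the lower and upper percentile limits across all records of a species, then flag any record that falls outside the limits on at least one variable. This is an illustrative sketch only, not the BIOCLIM or Diva-GIS code; the function names, the linear-interpolation percentile and the 2.5/97.5 defaults are assumptions.

```python
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile by linear interpolation between sorted values, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def envelope_outliers(
    records: List[Dict[str, float]],
    lower_q: float = 2.5,
    upper_q: float = 97.5,
) -> List[int]:
    """Return indices of records outside the percentile limits on any variable."""
    flagged = set()
    for var in records[0]:
        column = [rec[var] for rec in records]
        low = percentile(column, lower_q)
        high = percentile(column, upper_q)
        for i, value in enumerate(column):
            if value < low or value > high:
                flagged.add(i)
    return sorted(flagged)
```

Because the limits are percentiles of the same data, the most extreme records in each tail are typically flagged even when nothing is wrong, so the output is a list of records to inspect on the map rather than a list of errors.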

Reverse Jack-knife – Diva-GIS
• Stuff from Diva-GIS

ANUCLIM
• $AUD 1000 (with data files)
• Modelling (BIOCLIM / ESOCLIM)
• Cumulative Frequency Curves
• Parameter Extremes

Cumulative Frequency - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).

Parameter Extremes - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).

spOutlier - CRIA

CRIA Data Cleaning
http://splink.cria.org.br/dc

ALA Data Cleaning
The Atlas of Living Australia is using Reverse Jack-knifing to identify suspect records.

GBIF and Outlierness Values
No longer operating

Errors in data
In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use.
Chrisman, 1991
[Photo: Mizodendrum sp., Argentina]

Reference
Chapman, A. D. (2005). Principles and Methods of Data Cleaning – Primary Species Occurrence Data. Report for the Global Biodiversity Information Facility 2005. 75 pp. Copenhagen: GBIF. http://www.gbif.org/orc/?doc_id=1262