Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output

Presenter: Deborah Paul, Florida State University, Integrated Digitized Biocollections (iDigBio)
Biodiversity Information Standards (TDWG) 2014 Conference
Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden
Oct 27th, 2014

Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman

iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Trend: Minimal Data Capture
• “filed as” name
• higher geography
• barcode
• image
• all sheets in folder get the same initial data
• only the barcode differs
Biological collection data capture: a rapid approach using curatorial data
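The folder-level capture listed above can be sketched in a few lines of Python. This is only an illustration: the field names, the sample taxon and barcodes, and the image naming scheme are all assumptions made up for the example, not taken from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class SkeletonRecord:
    """Minimal record created at imaging time (field names are assumptions)."""
    filed_as_name: str
    higher_geography: str
    barcode: str
    image: str


def records_for_folder(filed_as_name, higher_geography, barcodes):
    """Give every sheet in a folder the same initial data; only the barcode differs."""
    return [
        SkeletonRecord(
            filed_as_name=filed_as_name,
            higher_geography=higher_geography,
            barcode=barcode,
            image=f"{barcode}.jpg",  # hypothetical image naming convention
        )
        for barcode in barcodes
    ]


# Made-up folder contents, for illustration only.
folder_records = records_for_folder(
    "Salix arctica", "North America", ["E00000001", "E00000002", "E00000003"]
)
for record in folder_records:
    print(record)
```

Everything except the barcode (and the image name derived from it) is typed once per folder, which is where the speed-up comes from.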

Raw OCR output, warts and all, can be used to:
• enter records faster
• use the database entry ditto feature
• find duplicates quickly
• find the labels with lots of handwriting
• create your own record sets to transcribe by:
  – collector
  – country or county
  – your Great Aunt Penelope
  – taxon
  – language
• create cogent sets to speed up validation and database updates
• make transcribers’ / validators’ jobs easier and more fun!
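As one illustration of "find duplicates quickly", raw OCR text from different sheets can be compared pairwise. The sketch below uses Python's standard `difflib.SequenceMatcher`. The barcodes, the 0.85 threshold, and the third sample label are assumptions invented for the example, and this is not presented as the method used by any project named in these slides.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse everything except letters and digits to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def likely_duplicates(ocr_by_barcode, threshold=0.85):
    """Return (barcode_a, barcode_b, similarity) for pairs at or above the threshold."""
    normalized = {barcode: normalize(text) for barcode, text in ocr_by_barcode.items()}
    pairs = []
    for a, b in combinations(sorted(normalized), 2):
        ratio = SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


# Hypothetical OCR output; B002 imitates typical OCR confusion of "l" and "1".
ocr_by_barcode = {
    "B001": "Arctic Coast west of Mackenzie River delta. Collector, A. E. Porsild July 23-25, 1934",
    "B002": "Arctic Coast west of Mackenzie River de1ta. Co11ector, A. E. Porsild Ju1y 23-25, 1934",
    "B003": "Plants of Florida. Leon County. Collector, J. Smith May 2, 1961",
}
duplicates = likely_duplicates(ocr_by_barcode)
print(duplicates)
```

Pairwise comparison grows quadratically with the number of labels, so for thousands of sheets it would make sense to compare only within a group first (for example, labels sharing a collector name).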

Got Text? Got Handwriting?

Label OCR

No. . . 2 L 31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W. . . Collector, A. E. Porsild July 23-25, 1934

Next, imagine output from 1000s of labels or notebooks or text files!

Seeing the dark data… Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013.

• It’s surprising what can be used to help filter specimens – the black art of search terms!
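A minimal sketch of filtering by search terms: each term is a regular expression written to tolerate common OCR confusions (such as "l", "i" and "1"), and every matching label is dropped into a named record set. The set names, the patterns, the barcodes, and the second and third sample labels are assumptions for illustration.

```python
import re
from collections import defaultdict


def build_record_sets(ocr_by_barcode, search_terms):
    """Map each set name to the barcodes whose OCR text matches its pattern."""
    compiled = {
        name: re.compile(pattern, re.IGNORECASE)
        for name, pattern in search_terms.items()
    }
    record_sets = defaultdict(list)
    for barcode, text in sorted(ocr_by_barcode.items()):
        for name, pattern in compiled.items():
            if pattern.search(text):
                record_sets[name].append(barcode)
    return dict(record_sets)


# Hypothetical search terms; character classes absorb l/i/1 OCR confusion.
search_terms = {
    "collector: Porsild": r"pors[il1]{2}d",
    "region: Mackenzie": r"mackenz[il1]e",
}
ocr_by_barcode = {
    "B001": "Arctic Coast west of Mackenzie River delta. Collector, A. E. Porsild",
    "B002": "Herschel Island. Co11ector, A. E. Pors1ld Ju1y 1934",
    "B003": "Plants of Florida. Leon County.",
}
record_sets = build_record_sets(ocr_by_barcode, search_terms)
print(record_sets)
```

Each resulting set can then be handed to one transcriber, so that the same collector or locality is typed once and reused.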

Some work from the iDigBio CITScribe Hackathon
Overall Word Cloud Workflow
Workflow diagram components: Images, OCR Engine, Crowdsourcing (BVP), OCR Output, DwC Parsed Output, Web Service (Jason Davies), Index (Solr), OCR confidence (n-gram), Histogram (Google Charts, Facet Explorer), Word Cloud, Cluster (carrot2)
• Google Charts: http://developers.google.com/chart/interactive/docs/gallery
• N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
• Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
• Jason Davies WC: http://www.jasondavies.com/wordcloud/
• Apache Solr: http://lucene.apache.org/solr/
• carrot2: http://project.carrot2.org/
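To give a feel for the "OCR confidence (n-gram)" step, here is a deliberately simple sketch: it scores a piece of OCR text by the fraction of its character trigrams that also occur in a trusted reference text. This is an assumption-laden stand-in, not the algorithm in the OCR-Error-Estimation repository linked above. The function names and the tiny reference text are invented for the example, and a realistic reference corpus would be far larger.

```python
import re


def char_ngrams(text, n=3):
    """Return the character n-grams of the lowercased, whitespace-collapsed text."""
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def ngram_confidence(ocr_text, known_ngrams, n=3):
    """Fraction of the text's n-grams found in the reference set (0.0 if too short)."""
    grams = char_ngrams(ocr_text, n)
    if not grams:
        return 0.0
    return sum(gram in known_ngrams for gram in grams) / len(grams)


# Tiny hypothetical reference text standing in for a corpus of clean label text.
reference_text = (
    "National Herbarium of Canada. Arctic Coast west of Mackenzie River delta. "
    "Collector, A. E. Porsild. July 23-25, 1934."
)
known = set(char_ngrams(reference_text))

clean_score = ngram_confidence("Collector, A. E. Porsild", known)
noisy_score = ngram_confidence("Co1lect0r, A. F. Pors!ld", known)
print(f"clean: {clean_score:.2f}  noisy: {noisy_score:.2f}")
```

Low scores flag labels where the OCR probably struggled (handwriting, damage), which is one way to route them to human transcribers first.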

Word Clouds with… N-gram Scoring, Faceting, Solr + Carrot2
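A word cloud is driven by word weights, and the simplest weight is a frequency count across all OCR output. The sketch below computes those counts with the standard library. The stopword list and the minimum word length are assumptions, and rendering the cloud itself (for example with the Jason Davies tool) is outside its scope.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real one would be longer.
STOPWORDS = {"and", "the", "for", "with"}


def word_weights(ocr_texts, min_length=3):
    """Count word occurrences across OCR texts, skipping short words and stopwords."""
    counts = Counter()
    for text in ocr_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length and word not in STOPWORDS:
                counts[word] += 1
    return counts


weights = word_weights([
    "National Herbarium of Canada. Collector, A. E. Porsild",
    "Arctic Coast west of Mackenzie River delta. Collector, A. E. Porsild",
])
print(weights.most_common(3))
```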

Imagine: integration with current software
Use for initial sort or validation

Managing your crowdsourcing data behind the scenes – OCR too!

Work on Automated Parsing Algorithms
aOCR group finishing up a study comparing parsing algorithm strategies against a known standard to better define what’s possible at the moment for automated parsing of OCR output to standard Darwin Core terms.
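To make "parsing OCR output to Darwin Core terms" concrete, here is a minimal rule-based sketch applied to the National Herbarium of Canada label text shown earlier in this deck. It is not one of the strategies compared in the aOCR study. The regular expressions are assumptions tuned to this single label and would need broadening for real collections. The longitude range is deliberately left unparsed.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Darwin Core term -> pattern with one capture group (patterns are illustrative).
PATTERNS = {
    "recordedBy": re.compile(r"Collector,\s*((?:[A-Z]\.\s*)+[A-Z][a-z]+)"),
    "verbatimEventDate": re.compile(
        rf"((?:{MONTHS})\s+\d{{1,2}}(?:\s*-\s*\d{{1,2}})?,\s*\d{{4}})"
    ),
    "verbatimLatitude": re.compile(r"(\d{1,2}°\s*\d{1,2}[’']\s*[NS])"),
    "verbatimLocality": re.compile(r"Hab\. and Loc\.\s*,\s*(.+?)\s*,\s*\d{1,2}°"),
}


def parse_label(ocr_text):
    """Return a dict of Darwin Core terms for every pattern that matches."""
    parsed = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            parsed[term] = match.group(1)
    return parsed


ocr_text = (
    "No. . . 2 L 31. National Herbarium of Canada FLORA OF’T TERRITORIES. "
    "Hab. and Loc., Arctic Coast west of Mackenzie River delta: "
    "Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W. . . "
    "Collector, A. E. Porsild July 23-25, 1934"
)
parsed = parse_label(ocr_text)
for term, value in parsed.items():
    print(f"{term}: {value}")
```

Terms that fail to match are simply absent from the result, so a downstream validator can see at a glance which fields still need a human.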

http://tinyurl.com/LichenRecords

Inside the 1899 Harriman Expedition

Workflow Modules and Sample Digitization Workflows with OCR integrated
• The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow.
• Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, …
• Yours?

OCR use, Voice Recognition, User Interface Optimization, Image Analysis, …
Got Text? Handwriting?
• aOCR WG and Synthesys3
• user-interface interest group
• exemplar ML and NLP workflows
• combining OCR with voice recognition software in Symbiota (Macroalgal TCN)
• automated image analysis
• combining touch-screen technology into the digitization workflow (ScioTR) ($6.99)

Tack så mycket! (Thank you very much!)
Work presented here made possible by many … and especially:
• Andrea Matsunaga, Researcher, iDigBio
• Miao Chen, Indiana University, Data to Insight Center
• Jason Best, Botanical Research Institute of Texas
• Sylvia Orli, IT Head, Smithsonian Botany Department
• William Ulate, Technical Director, BHL
• Reed Beaman, Informatics Specialist, iDigBio
• Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE)
• Stephen Gottschalk, New York Botanical Garden (NYBG)
• iDigBio Augmenting Optical Character Recognition WG
• MaCC TCN
• SALIX2
• Smithsonian

Find out more at iDigBio
• facebook.com/iDigBio
• twitter.com/iDigBio
• www.idigbio.org
• vimeo.com/idigbio.org/rss-feed.xml
• webcal://www.idigbio.org/events-calendar/export.ics