Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output
- Slides: 20
Data Discovery and Doer Happiness: Uses for Optical Character Recognition (OCR) Output.
Presenter: Deborah Paul, Florida State University, Integrated Digitized Biocollections (iDigBio), at the Biodiversity Information Standards (TDWG) 2014 Conference, Elmia Congress Centre, Rydberg Hall, Jönköping, Sweden, Oct 27th, 2014.
Authors: Deborah Paul, Andrea Matsunaga, Miao Chen, Jason Best, Sylvia Orli, William Ulate, Reed Beaman.
iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Trend: Minimal Data Capture
• “filed as” name
• higher geography
• barcode
• image
• all sheets in a folder get the same initial data
• only the barcode differs
Biological collection data capture: a rapid approach using curatorial data 2
Raw OCR output, warts and all, can be used to:
• enter records faster
• use the database entry ditto feature
• find duplicates quickly
• find the labels with lots of handwriting
• create your own record sets to transcribe by:
  – collector
  – country or county
  – your Great Aunt Penelope
  – taxon
  – language
• create cogent sets to speed up validation and database updates
• make transcribers’ / validators’ jobs easier and more fun! 3
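Three of the uses listed above (building record sets by search term, finding handwriting-heavy labels, and finding duplicates) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the barcodes, the sample OCR strings, the function names, and the thresholds (`min_words=4`, `threshold=0.9`) are all invented here and are not taken from the presentation or from any iDigBio tool.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records: barcode -> raw OCR text (invented for illustration).
ocr = {
    "B001": "National Herbarium of Canada Collector, A. E. Porsild July 23-25, 1934",
    "B002": "Nationa1 Herbarium of Canada Collector, A. E. Porsild July 23-25, 1934",
    "B003": "Herb. 18 . ~ ;",
}

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def sets_by_term(records, terms):
    """Group barcodes under each search term found in their OCR text."""
    groups = defaultdict(list)
    for barcode, text in records.items():
        norm = normalize(text)
        for term in terms:
            if normalize(term) in norm:
                groups[term].append(barcode)
    return dict(groups)

def likely_handwritten(records, min_words=4):
    """Flag labels where OCR found few word-like tokens (assumed heuristic)."""
    flagged = []
    for barcode, text in records.items():
        words = re.findall(r"[A-Za-z]{3,}", text)
        if len(words) < min_words:
            flagged.append(barcode)
    return flagged

def probable_duplicates(records, threshold=0.9):
    """Return barcode pairs whose normalized OCR text is nearly identical."""
    items = sorted(records.items())
    pairs = []
    for i, (a, text_a) in enumerate(items):
        for b, text_b in items[i + 1:]:
            ratio = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs
```

Calling `sets_by_term(ocr, ["Porsild"])` gathers the sheets for one collector into a single transcription set, while `likely_handwritten(ocr)` and `probable_duplicates(ocr)` give candidate lists for a person to review. The pairwise comparison is quadratic, so a real collection would likely need an index first.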
Got Text? Got Handwriting? 4
Label OCR: No. . . 2 L 31. National Herbarium of Canada FLORA OF’T TERRITORIES. Hab. and Loc., Arctic Coast west of Mackenzie River delta: Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W. . . Collector, A. E. Porsild July 23-25, 1934
Next, imagine output from 1000s of labels or notebooks or text files! 5
Seeing the dark data… Robyn E Drinkwater, Robert Cubey, Elspeth Haston at TDWG 2013. 6
• It’s surprising what can be used to help filter specimens – the black art of search terms! 7
Some work from the iDigBio CITSCribe Hackathon: Overall Word Cloud Workflow
Components shown in the workflow diagram: Images; OCR Engine; Crowdsourcing (BVP); OCR Output; DwC Parsed Output; Web Service; Index (Solr); OCR confidence (n-gram); Histogram (Google Charts, Facet Explorer); Word Cloud (Jason Davies); Cluster (Carrot2)
Google Charts: http://developers.google.com/chart/interactive/docs/gallery
N-gram: http://github.com/idigbio-citsci-hackathon/OCR-Error-Estimation
Facet explorer: http://github.com/idigbio-citsci-hackathon/facet-explorer
Jason Davies WC: http://www.jasondavies.com/wordcloud/
Apache Solr: http://lucene.apache.org/solr/
Carrot2: http://project.carrot2.org/ 8
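The "OCR confidence (n-gram)" step in the workflow can be illustrated with a small Python sketch. It is not the code from the OCR-Error-Estimation repository; it assumes one simple approach, scoring a line of OCR text by the share of its character trigrams that also occur in a trusted word list. The function names and the tiny reference list are hypothetical.

```python
from collections import Counter

def trigrams(word):
    """Character trigrams of a word, padded with start (^) and end ($) markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_model(reference_words):
    """Count the trigrams seen in a trusted vocabulary (assumed to be clean text)."""
    model = Counter()
    for word in reference_words:
        model.update(trigrams(word))
    return model

def line_confidence(text, model):
    """Fraction of the text's trigrams that the model has seen (0.0 to 1.0)."""
    grams = [g for token in text.split() for g in trigrams(token)]
    if not grams:
        return 0.0
    known = sum(1 for g in grams if g in model)
    return known / len(grams)

# Hypothetical reference vocabulary; a real one would be far larger.
model = build_model(["national", "herbarium", "canada", "collector"])
clean_score = line_confidence("National Herbarium", model)
noisy_score = line_confidence("Nxt!onq1 H3rb@r!um", model)
```

Under these assumptions a cleanly recognized line scores near 1.0 and a badly garbled one scores lower, which is enough to sort labels into a histogram or to route low-scoring labels to human transcribers.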
Word Clouds with… N-gram Scoring, Faceting, Solr + Carrot2 9
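A word cloud is driven by term counts, and faceting splits those counts by a field such as country. The Python sketch below shows that idea only; it does not use Solr, Carrot2, or the Jason Davies library, and the record layout (`country`, `ocr` keys), the stopword list, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Small, hypothetical stopword list; tune it to the label language.
STOPWORDS = {"of", "the", "and", "no"}

def word_counts(text):
    """Count lowercase alphabetic words of two or more letters, minus stopwords."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def faceted_counts(records, facet):
    """Build one word-count table per value of the chosen facet field."""
    by_facet = defaultdict(Counter)
    for rec in records:
        by_facet[rec[facet]].update(word_counts(rec["ocr"]))
    return dict(by_facet)

# Invented sample records.
records = [
    {"country": "Canada", "ocr": "Flora of the Territories"},
    {"country": "Canada", "ocr": "Flora of Canada"},
    {"country": "Sweden", "ocr": "Flora Suecica"},
]
clouds = faceted_counts(records, "country")
```

Each counter in `clouds` can be handed to a word cloud renderer as word and weight pairs.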
Imagine: integration with current software. Use for initial sort or validation. 10
11
Managing your crowdsourcing data behind the scenes – OCR too! 12
Work on Automated Parsing Algorithms: the aOCR group is finishing up a study that compares parsing algorithm strategies against a known standard, to better define what is currently possible for automated parsing of OCR output to standard Darwin Core terms. 13
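To make "parsing OCR output to Darwin Core terms" concrete, here is a minimal rule-based sketch in Python. It is not one of the strategies compared in the aOCR study; it assumes simple regular expressions tuned to the Porsild label shown earlier, with the extraction spacing tidied. The Darwin Core term names (`recordedBy`, `verbatimEventDate`, `verbatimLatitude`) are real; the patterns and the `parse_label` function are hypothetical.

```python
import re

# OCR text of the sample label, with spacing tidied for the example.
LABEL = (
    "Hab. and Loc., Arctic Coast west of Mackenzie River delta: "
    "Between King Pt. and Kay Pt., 69° 12’ N., and 138° to 138° 30’ W. "
    "Collector, A. E. Porsild July 23-25, 1934"
)

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns: each maps one Darwin Core term to one capture group.
PATTERNS = {
    # Initials followed by a capitalized surname after the word "Collector".
    "recordedBy": re.compile(r"Collector,?\s+((?:[A-Z]\.\s*)+[A-Z][a-z]+)"),
    # Month, day or day range, comma, four-digit year.
    "verbatimEventDate": re.compile(
        rf"((?:{MONTHS})\s+\d{{1,2}}(?:\s*-\s*\d{{1,2}})?,\s*\d{{4}})"
    ),
    # Degrees, minutes, hemisphere; straight or curly minute mark.
    "verbatimLatitude": re.compile(r"(\d{1,2}°\s*\d{1,2}['’]\s*[NS])"),
}

def parse_label(text):
    """Return only the Darwin Core fields whose pattern matched the text."""
    parsed = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            parsed[term] = match.group(1)
    return parsed

parsed = parse_label(LABEL)
```

Fields that do not match are simply left out, so a person can fill them in later. The longitude range on this label ("138° to 138° 30’ W.") is deliberately not parsed, since ranges need a rule of their own.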
http://tinyurl.com/LichenRecords 14
Inside the 1899 Harriman Expedition 15
Inside the 1899 Harriman Expedition 16
Workflow Modules and Sample Digitization Workflows with OCR integrated
• The iDigBio DROID and aOCR groups produced a step-by-step series of tasks for implementing OCR in a digitization workflow.
• Project-specific workflows are available from RBGE, NYBG, SALIX2, ASU Herbarium, ScioTR, TTD-TCN, …
• Yours? 17
OCR use, Voice Recognition, User Interface Optimization, Image Analysis, …
Got Text? Handwriting?
• aOCR WG and Synthesys3
• user-interface interest group
• exemplar ML and NLP workflows
• combining OCR with voice recognition software in Symbiota (Macroalgal TCN)
• automated image analysis
• combining touch-screen technology into the digitization workflow (ScioTR) ($6.99) 18
Tack så mycket! (Thank you very much!)
Work presented here made possible by many, and especially…
§ Andrea Matsunaga, Researcher, iDigBio
§ Miao Chen, Indiana University, Data to Insight Center
§ Jason Best, Botanical Research Institute of Texas
§ Sylvia Orli, IT Head, Smithsonian Botany Department
§ William Ulate, Technical Director, BHL
§ Reed Beaman, Informatics Specialist, iDigBio
§ Elspeth Haston, et al., Royal Botanic Garden Edinburgh (RBGE)
§ Stephen Gottschalk, New York Botanical Garden (NYBG)
§ iDigBio Augmenting Optical Character Recognition WG
§ MaCC TCN
§ SALIX2
§ Smithsonian 19
Find out more at iDigBio: facebook.com/iDigBio · twitter.com/iDigBio · www.idigbio.org · vimeo.com/idigbio.org/rss-feed.xml · webcal://www.idigbio.org/events-calendar/export.ics