Document Image Analysis (CSE 717): An Introduction
Document Image Analysis
§ DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer
§ DIA is a subfield of digital image processing
§ Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA
§ Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA
§ Sources: scanners, printers, fax machines, and handwriting
§ Incidental text: license plates, billboards, and subtitles in photos and video
§ WWW?
§ DIA’s grand goal is to take us to the land of the paperless office
Document Image Analysis
  Textual Processing
    Optical Character Recognition: text
    Page Layout Analysis: skew, blocks, paragraphs
  Graphical Processing
    Line Processing: lines, curves, corners
    Region and Symbol Processing: filled regions
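As one illustration of the page layout analysis branch, the sketch below finds candidate text lines with a horizontal projection profile. This is a common textbook technique, not necessarily the one the course uses. It assumes the page is already binarized (1 = ink, 0 = paper) and deskewed; the function names and the `min_ink` parameter are hypothetical.

```python
def row_profile(binary):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in binary]


def text_line_bands(binary, min_ink=1):
    """Return inclusive (first_row, last_row) pairs, one per run of
    consecutive rows whose ink count is at least min_ink."""
    bands = []
    start = None  # first row of the band currently being tracked, if any
    for r, ink in enumerate(row_profile(binary)):
        if ink >= min_ink and start is None:
            start = r                      # a new band begins here
        elif ink < min_ink and start is not None:
            bands.append((start, r - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(binary) - 1))
    return bands
```

Rows with no ink act as separators between bands. On a skewed page the gaps between lines blur in the profile, which is one reason skew is handled before block and paragraph analysis.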
Document Image Analysis Processing

Text
  Preprocessing: representation, noise removal, binarization, skew, script ID, font ID
  Glyph Recognition: connected components, strokes, punctuation, words
  Text Recognition: word segmentation, text line reconstruction, table analysis, linguistics
  Page Layout Analysis: text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression
  Information Retrieval: document classification, indexing, search, security, authentication, privacy

Graphics
  Preprocessing: representation, noise removal, binarization, thinning, vectorization
  Primitive Recognition: straight lines, curve segments, junctions, nodes, loops, characters
  Structure Recognition: text fields, legends, labels, dimensions, graphics symbols
  Interpretation: component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression
  Database, CAD: validation, search, update

Data levels: Pixels, Primitives, Structures, Documents, Corpus
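Two of the early steps named above, binarization and connected components, can be sketched in a few lines. This is a minimal illustration, not the course's implementation. It assumes an 8-bit grayscale page stored as a list of rows with dark ink on light paper, picks Otsu's global threshold as one possible binarization method, and uses 8-connectivity. All function names are hypothetical.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0 (paper)."""
    return [[1 if v <= threshold else 0 for v in row] for row in pixels]


def connected_components(binary):
    """Find 8-connected ink regions; return their bounding boxes as
    inclusive (top, left, bottom, right) tuples in scan order."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

The bounding boxes are the kind of primitive that later stages group into characters, words, and text lines. A global threshold is the simplest choice; unevenly lit or degraded pages usually need a locally adaptive method instead.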
Postal Examples
[Figure: annotated mail piece. Labels: Meter Mark; Sender’s Address; Endorsement; Digital Post Mark; "In Case of Undeliverable as Addressed Return to Sender"; Linear Code; Delivery Address]
Forms
Unconstrained Text
Graphics Documents
References
§ Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
§ Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
§ International Conference on Document Analysis and Recognition proceedings
§ International Workshop on Document Analysis Systems proceedings
§ Symposium on Document Image Understanding Technology
• OCR Features and Systems – Script ID, Devanagari OCR, Tamil OCR, machine-printed (MP) versus handwritten (HW)
• Handwriting Recognition – Postal applications, Arabic documents
• Classifiers and Learning – Multi-classifier systems
• Layout Analysis – Skew correction, geometric methods, text/graphics separation, logical labeling
• Tables and Forms – Detecting tables in HTML documents, use of graph grammars, semantics
• Document Engineering – Processing of historical documents (palm leaf manuscripts)
• Camera-Based DIA – Locating and reading barcodes
• New Applications – CAPTCHA