Document Image Analysis, CSE 717: An Introduction

Document Image Analysis

• DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer.
• DIA is a subfield of digital image processing.
• Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA.
• Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA.
• Source: scanners, printers, fax machines, hand!
• Incidental text: license plates, billboards, and subtitles in photos and video.
• WWW?
• DIA's grand goal is to take us to the land of the paperless office.

Document Image Analysis

• Textual Processing
    • Optical Character Recognition: text
    • Page Layout Analysis: skew, blocks, paragraphs
• Graphical Processing
    • Line Processing: lines, curves, corners
    • Region and Symbol Processing: filled regions

Document Image Analysis

Processing at each level, for text and for graphics:

• Pixels
    • Text, Preprocessing: representation, noise removal, binarization, skew, script id, font id
    • Graphics, Preprocessing: representation, noise removal, binarization, thinning, vectorization
• Primitives
    • Text, Glyph Recognition: connected components, strokes, punctuations, words
    • Graphics, Primitive Recognition: straight lines, curve segments, junctions, nodes, loops, characters
• Structures
    • Text, Text Recognition: word segmentation, text line reconstruction, table analysis, linguistics
    • Graphics, Structure Recognition: text fields, legends, labels, dimensions, graphics symbols
• Documents
    • Text, Page Layout Analysis: text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression
    • Graphics, Interpretation: component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression
• Corpus
    • Text, Information Retrieval: document classification, indexing, search, security, authentication, privacy
    • Graphics, Database, CAD: validation, search, update
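
Two of the text-side steps named above, binarization at the pixel level and connected components at the primitive level, can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the method used in the course: Otsu's threshold is one common binarization choice and is swapped in here as an assumption, and the function names (otsu_threshold, binarize, connected_components) and the toy image are hypothetical.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    Otsu's method is an assumed choice here, not taken from the slides.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 = ink (dark), 0 = background."""
    t = otsu_threshold([v for row in image for v in row])
    return [[1 if v <= t else 0 for v in row] for row in image]


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink blobs."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill over one blob
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 5x7 grayscale "page": two dark marks on a light background (hypothetical data).
image = [
    [250, 250, 250, 250, 250, 250, 250],
    [250,  10,  20, 250, 250, 250, 250],
    [250,  15, 250, 250, 250,  30, 250],
    [250, 250, 250, 250, 250,  25, 250],
    [250, 250, 250, 250, 250, 250, 250],
]
boxes = connected_components(binarize(image))
print(boxes)
```

The bounding boxes are the kind of primitive that later stages (glyph recognition, word segmentation, layout analysis) would consume. A real system would more likely rely on an image library than on pure Python loops.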

Postal Examples

Labeled regions of a mail piece (figure): Meter Mark; Sender's Address; Endorsement; Digital Post Mark; In Case of Undeliverable as Addressed Return to Sender; Linear Code; Delivery Address

Forms

Unconstrained Text

Graphics Documents

References

• Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
• Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
• International Conference on Document Analysis and Recognition proceedings
• International Workshop on Document Analysis Systems proceedings
• Symposium on Document Image Understanding Technology

• OCR Features and Systems: script ID, Devanagari OCR, Tamil OCR, machine-printed (MP) versus handwritten (HW)
• Handwriting Recognition: postal applications, Arabic documents
• Classifiers and Learning: multi-classifier systems
• Layout Analysis: skew correction, geometric methods, text/graphics separation, logical labeling
• Tables and Forms: detecting tables in HTML documents, use of graph grammars, semantics
• Document Engineering: processing of historical documents (palm leaf manuscripts)
• Camera-Based DIA: locating and reading barcodes
• New Applications: CAPTCHA