Document Images and E-Discovery
David Doermann, UMIACS
May 4, 2009

Goals of This Lecture
• To help you understand:
  – Why you may want to acquire document images
  – How you acquire them
  – What you get when you do
  – What you can do with them
  – Why it is not as easy as you may think to organize them
• Discuss some of the issues for those doing E-Discovery.

Assumptions
• We have some need to access a set of “documents”
  – Archiving?
  – Litigation?
  – FOIA request?
• We often have (massive) heterogeneous collections
  – Different languages
  – Different layouts
  – Different sources
• We are lucky if the metadata is consistent and uniform.

Why acquire Document Images?
• Paperless Solution
  – Efficient transfer
  – Organization
  – Convenience
• Access to a variety of content
  – Universal reader
  – Email, attachments, spreadsheets
  – Don’t need the original applications
• Prevent Change?
  – Easier to certify?

How do we acquire?
• Scanning?
  – High speed, automated, multiform
  – Books, etc.
• Digital Copiers
  – Corporate Memory
• Application Output
  – Print to Image
  – Mass Conversion
• Cameras?
• Cell Phones?
  – Qipit, scanR, Hotcard
All have implications for use

Where do we find them?
• Internet
• Email Attachments
• Online Proceedings
• Electronic Fax
• Mass Digitization
• Repositories…

What can you do with them?
• Can we access them?
  – Search
  – Browse
  – “Read”
• Index and retrieve them? In their basic form, not really!
• We can
  – View
  – Print
  – Not much else
Why?

What is an image?
• Pixel representation of intensity map
• No explicit “content”, only relations
• Image analysis
  – Attempts to mimic human visual behavior
  – Draw conclusions, hypothesize and verify
Image databases
• Use primitive image analysis to represent content
• Transform semantic queries into “image features”: color, shape, texture, … spatial relations
[Figure: diagram labeled DOCUMENT, DATABASE, IMAGE, with 16 sample pixel intensity values: 10 27 33 29 27 34 33 54 54 47 89 60 25 35 43 9]

Document Images
• A collection of dots called “pixels”
  – Arranged in a grid and called a “bitmap”
• Pixels often binary-valued (black, white)
  – But grayscale or color is sometimes needed
• 300 dots per inch (dpi) gives the best results
  – But images are quite large (about 1 MB per page)
  – Faxes are normally 100-200 dpi
• Usually stored in TIFF or PDF format
Yet we want to be able to process them like text files!
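The "1 MB per page" figure can be checked with simple arithmetic. The sketch below assumes a US letter page (8.5 × 11 inches) and no compression; the page size and the function name `raw_page_bytes` are illustrative assumptions, not taken from the slides.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int = 1) -> int:
    """Uncompressed size in bytes of a scanned page (assumed page geometry)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Binary (1 bit per pixel) versus 8-bit grayscale, both at 300 dpi.
letter_binary = raw_page_bytes(8.5, 11, 300)
letter_gray = raw_page_bytes(8.5, 11, 300, bits_per_pixel=8)
print(letter_binary, letter_gray)
```

Under these assumptions a binary page comes to roughly 1 MB, while the same page in grayscale is eight times larger, which is one reason binary scanning is common.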

Document Image Database
• Collection of scanned images
• Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation, …

Other “Documents”


Indexing Page Images (Traditional Conversion)
[Figure: flow diagram with labels, in order: Document, Scanner, Page Image, Page Decomposition, Structure Representation, Text Regions, Optical Character Recognition, Character or Shape Codes]

Document Image Analysis
• General Flow:
  – Obtain Image - Digitize
  – Preprocessing
  – Feature Extraction
  – Classification
• General Tasks
  – Logical and Physical Page Structure Analysis
  – Zone Classification
  – Language ID
  – Zone Specific Processing
    • Recognition
    • Vectorization

Document Analysis
• What you need to do before you can treat images as “e-documents”…
  – Document Image Analysis
    • Page decomposition
    • Optical character recognition
  – Traditional Indexing with Conversion
    • Confusion matrix
    • Shape codes
  – Doing things Without Conversion
    • Duplicate Detection, Classification, Summarization, Abstracting
    • Keyword spotting, etc.

Why is document analysis difficult?
• 2D array of “values”
• Represents a symbolic language
• Many variations in symbols (e.g., the letter “A” in different fonts and cases)
• 3-4 times larger than normal digital images
• And this is just machine-printed Latin text!

Page Analysis (assuming we are looking for text)
• Skew correction
  – Based on finding the primary orientation of lines
• Image and text region detection
  – Based on texture and dominant orientation
• Structural classification
  – Infer logical structure from physical layout
• Text region classification
  – Title, author, letterhead, signature block, etc.
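One common way to find the primary orientation of text lines is a projection-profile search: try a range of candidate angles and keep the one that piles the ink into the fewest rows. The sketch below is a minimal illustration of that idea and not necessarily the method used in the lecture; the function names, the angle range, and the synthetic page are all assumptions.

```python
import math
from collections import Counter


def profile_score(pixels, angle_deg):
    """Sharpness of the row histogram after undoing a skew of angle_deg."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Row coordinate of each ink pixel after rotating by -angle_deg.
    rows = Counter(round(y * cos_t - x * sin_t) for x, y in pixels)
    # Aligned text lines concentrate ink in few rows, so squared counts peak.
    return sum(count * count for count in rows.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest row histogram."""
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda angle: profile_score(pixels, angle))


# Synthetic page: three text lines, 200 pixels long, skewed by 2 degrees.
slope = math.tan(math.radians(2.0))
page = [(x, round(y0 + x * slope)) for y0 in (0, 20, 40) for x in range(200)]
estimated = estimate_skew(page)
print(estimated)
```

The search is brute force and only as fine as `step`; real systems typically refine the estimate and work on downsampled images to keep it fast.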

Page Layer Segmentation
• Document image generation model
  – A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.

Page Segmentation
• Typically based on Spatial Proximity
  – White space
  – Margins
  – Differences in Content Type
• Can be very sensitive to noise
• Distinguish between
  – Top down – We know what should be there
  – Bottom up – We know what is there locally
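White-space-based segmentation can be illustrated with a one-dimensional cut: scan the rows of a binary bitmap and split wherever enough consecutive rows are blank. This is a simplified sketch under assumed conventions (1 = ink, 0 = background, hypothetical `min_gap` parameter), not a full page segmenter.

```python
def split_on_whitespace(bitmap, min_gap=2):
    """Return (first_row, last_row) bands of ink separated by >= min_gap blank rows."""
    bands = []
    start = None      # first inked row of the current band
    last_ink = None   # most recent inked row seen
    for r, row in enumerate(bitmap):
        if any(row):
            if start is None:
                start = r
            elif r - last_ink - 1 >= min_gap:
                # The blank run is wide enough: close the band and open a new one.
                bands.append((start, last_ink))
                start = r
            last_ink = r
    if start is not None:
        bands.append((start, last_ink))
    return bands


bitmap = [
    [1, 1, 0], [0, 1, 0],              # rows 0-1: ink
    [0, 0, 0],                         # row 2: narrow gap
    [1, 0, 0],                         # row 3: ink
    [0, 0, 0], [0, 0, 0], [0, 0, 0],   # rows 4-6: wide gap
    [0, 1, 1], [1, 1, 1],              # rows 7-8: ink
]
print(split_on_whitespace(bitmap, min_gap=2))
```

A single speck of noise inside a blank run counts as ink here, which shortens the gap and can merge two blocks. That is the noise sensitivity the slide warns about.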

Image Detection


Text Region Detection


More Complex Example
[Figure: regions labeled printed text, handwriting, and noise; panels before and after MRF-based postprocessing]

Application to Page Segmentation
[Figure: panels before and after enhancement]

Language Identification
• Language-independent skew detection
  – Accommodate horizontal and vertical writing
• Script class recognition
  – Asian scripts have blocky characters
  – Connected scripts can’t be segmented easily
• Language identification
  – Shape statistics work well for western languages
  – Competing classifiers work for Asian languages
What about handwriting?

Optical Character Recognition
• Pattern-matching approach
  – Standard approach in commercial systems
  – Segment individual characters
  – Recognize using a neural network classifier
• Hidden Markov model approach
  – Experimental approach developed at BBN
  – Segment into sub-character slices
  – Limited lookahead to find best character choice
  – Useful for connected scripts (e.g., Arabic)

OCR Accuracy Problems
• Character segmentation errors
  – In English, segmentation often changes “m” to “rn”
• Character confusion
  – Characters with similar shapes often confounded
• OCR on copies is much worse than on originals
  – Pixel bloom, character splitting, binding bend
• Uncommon fonts can cause problems
  – If not used to train a neural network
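One way to cope with known confusions such as “m” read as “rn” is to expand a search term into the forms it might take in OCR output. The sketch below shows that idea only; it is a hypothetical helper, not a technique attributed to the lecture, and it applies a single substitution at a time.

```python
def confusion_variants(term, confusions):
    """Return term plus every string obtained by applying one confusion once."""
    variants = {term}
    for src, dst in confusions:
        start = term.find(src)
        while start != -1:
            # Replace only this occurrence of src with its confused form.
            variants.add(term[:start] + dst + term[start + len(src):])
            start = term.find(src, start + 1)
    return variants


# Assumed confusion list: the single "m" -> "rn" pair mentioned on the slide.
print(sorted(confusion_variants("modem", [("m", "rn")])))
```

Note that one variant of “modem” is the real word “modern”, which shows why confusion handling can introduce new errors as well as fix old ones.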

Improving OCR Accuracy
• Image preprocessing
  – Mathematical morphology for bloom and splitting
  – Particularly important for degraded images
• “Voting” between several OCR engines helps
  – Individual systems depend on specific training data
• Linguistic analysis can correct some errors
  – Use confusion statistics, word lists, syntax, …
  – But more harmful errors might be introduced
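Voting between engines can be sketched as a per-character majority vote. The example assumes the engine outputs have already been aligned to equal length, which is a strong simplification: real outputs differ in length (for instance “m” versus “rn”) and need an alignment step first. The function name `vote` is illustrative.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position over pre-aligned OCR outputs."""
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("outputs must be non-empty and aligned to equal length")
    winners = []
    for chars in zip(*outputs):
        # most_common(1) returns the most frequent character at this position;
        # ties go to the character seen first, i.e. the earliest engine.
        winners.append(Counter(chars).most_common(1)[0][0])
    return "".join(winners)


# Three hypothetical engine outputs, each with a different single error.
print(vote(["document", "docunent", "dccument"]))
```

Each engine is wrong at a different position, so the vote recovers the intended word; if the engines share training data and make the same mistakes, voting helps much less.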

OCR Speed
• Neural networks take about 10 seconds a page
  – Hidden Markov models are slower
• Voting can improve accuracy
  – But at a substantial speed penalty
• Easy to speed things up with several machines
  – For example, by batch processing, using desktop computers at night
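The batch-processing point is easy to quantify. The sketch below uses the 10 seconds per page quoted above; the 12-hour night, the collection size, and the machine count are made-up inputs for illustration.

```python
import math


def nights_needed(pages, machines, seconds_per_page=10.0, hours_per_night=12.0):
    """Nights required to OCR a collection on idle desktops (assumed inputs)."""
    pages_per_machine_night = int(hours_per_night * 3600 // seconds_per_page)
    return math.ceil(pages / (pages_per_machine_night * machines))


# One million pages on 20 desktops, each running 12 hours per night.
print(nights_needed(1_000_000, machines=20))
```

Because pages are independent, throughput scales almost linearly with the number of machines, so doubling the desktops roughly halves the elapsed time.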

Problem: Logical Page Analysis (Reading Order)
• Can be hard to guess in some cases
  – Newspaper columns, figure captions, appendices, …
• Sometimes there are explicit guides
  – “Continued on page 4” (but page 4 may be big!)
• Structural cues can help
  – Column 1 might continue to column 2
• Content analysis is also useful
  – Word co-occurrence statistics, syntax analysis

Retrieval of OCR’d Text
• Requires robust ways of indexing
• Statistical methods with large documents work best
• Key Evaluations
  – Success for high-quality OCR (Croft et al., 1994; Taghva, 1994)
  – Limited success for poor-quality OCR (1996 TREC, UNLV)
  – Clustering successful for > 85% accuracy (Tsuda et al., 1995)

N-Grams
• Powerful, inexpensive statistical method for characterizing populations
• Approach
  – Split up the document into overlapping n-character sequences
  – Use traditional indexing representations to perform analysis
  – “DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
• Advantages
  – Statistically robust to small numbers of errors
  – Rapid indexing and retrieval
  – Works from 70%-85% character accuracy where traditional IR fails
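The “DOCUMENT” example can be reproduced in a few lines, and comparing n-gram sets shows why the method tolerates OCR errors. The Jaccard overlap used here is one simple similarity measure chosen for illustration; the slides do not say which measure was used.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two strings (0.0 to 1.0)."""
    set_a, set_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(char_ngrams("DOCUMENT"))
print(ngram_overlap("DOCUMENT", "DOCUNENT"))
```

A single misrecognized character makes the two words differ as whole terms, so exact word matching scores zero, yet they still share the trigrams DOC, OCU, and ENT.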

Matching with OCR Errors
• Above 80% character accuracy, use words
  – With linguistic correction
• Between 75% and 80%, use n-grams
  – With n somewhat shorter than usual
  – And perhaps with character confusion statistics
• Below 75%, use word-length shape codes
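These thresholds amount to a small decision rule. The sketch below encodes them directly; how the exact boundary values (0.75 and 0.80) are assigned is an assumption, since the slide does not say, and the returned labels are illustrative names.

```python
def matching_strategy(char_accuracy: float) -> str:
    """Pick an index representation from estimated OCR character accuracy."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be a fraction between 0 and 1")
    if char_accuracy > 0.80:
        return "words"        # with linguistic correction
    if char_accuracy >= 0.75:
        return "ngrams"       # shorter n, perhaps with confusion statistics
    return "shape_codes"      # word-length shape codes


for accuracy in (0.95, 0.78, 0.60):
    print(accuracy, matching_strategy(accuracy))
```

In practice the accuracy itself has to be estimated, for example from a hand-checked sample of pages, before a rule like this can be applied.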

Processing Images of Text
• Characteristics
  – Does not require expensive OCR/Conversion
  – Applicable to filtering applications
  – May be more robust to noise
• Possible Disadvantages
  – Application domain may be very limited
  – Processing time may be an issue if indexing is otherwise required

Keyword Spotting
Techniques:
  – Word Shape/HMM - (Chen et al., 1995)
  – Word Image Matching - (Trenkle and Vogt, 1993; Hull et al.)
  – Character Stroke Features - (DeCurtins and Chen, 1995)
  – Shape Coding - (Tanaka and Torii; Spitz, 1995; Kia, 1996)
Applications:
  – Filing System (Spitz - SPAM, 1996)
  – Numerous IR
  – Processing handwritten documents
Formal Evaluation:
  – Scribble vs. OCR (DeCurtins, SDIUT 1997)

Shape Coding
• Approach
  – Use of generic character descriptors
  – Make use of the power of language to resolve ambiguity
  – Map characters based on shape features, including ascenders, descenders, punctuation, and characters with holes
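A toy version of shape coding maps each character to a coarse class. The three classes and the letter sets below are a simplified, hypothetical scheme for illustration; published shape-code alphabets are richer (for example, they also distinguish dots and holes).

```python
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above x-height
DESCENDERS = set("gjpqy")    # lowercase letters that drop below the baseline


def shape_code(word):
    """Map a word to a string of coarse shape classes (assumed scheme)."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            codes.append("A")   # tall character
        elif ch in DESCENDERS:
            codes.append("g")   # character with a descender
        elif ch.islower():
            codes.append("x")   # x-height character
        else:
            codes.append(ch)    # punctuation and spaces pass through
    return "".join(codes)


for word in ("dog", "hay", "cat", "Dog."):
    print(word, shape_code(word))
```

Here “dog” and “hay” collapse to the same code, which is the ambiguity that a word list or language model is expected to resolve.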

Additional Applications
• Handwritten Archival Manuscripts
  – (Manmatha, 1997)
• Page Classification
  – (DeCurtins and Chen, 1995)
• Matching Handwritten Records
  – (Ganzberger et al., 1994)
• Headline Extraction
• Document Image Compression (UMD, 1996-1998)

Some UMD Research • Multilingual OCR • Evaluation • Duplicate detection • ….


Detection
• Stamps, Logos, Signatures
• These content regions benefit from a detection-based approach
• Saliency measures adapt to interclass variation
• Logos – location, density, symmetry, size
• Signatures – flow, oscillations
• Standard classifiers – SVM, Fisher, Decision Trees

Shape matching
[Figure: panels (a) and (b). Illustration of signature matching using shape contexts and local-neighborhood-graph]

Image content categorization
• Distinguishing between text and non-text documents
  – Pure images
  – Images with text
  – Document images
• We constructed a 4,500-image database by crawling Web images from the Google Image search engine using a wide variety of text keywords

Page Segmentation


Clutter Detection and Removal Clutter as one single connected component


Is Indexing Enough?
• Many applications benefit from image-based indexing
  – Less discriminatory features
  – Features may therefore be easier to compute
  – More robust to noise
  – Often computationally more efficient
• Many classical IR techniques have application for DIR
• Structure as well as content is important for indexing
• Preservation of structure is essential for in-depth understanding

• What else is useful?
  – Document Metadata?
  – Logos? Signatures?
• Where is research heading?
  – Cameras to capture Documents?
• What massive collections are out there?
  – Tobacco Litigation Documents
    • 49 million page images
  – Google Books
  – Other Digital Libraries

What next for E-Discovery? (questions to ask)
• Now you want to use the images…
• What metadata is required?
• How was the collection created?

Observations
• Structure is a great indicator of content
  – Locating Letters, Financial Forms, etc.
• Be careful of what you assume about the collection
  – Handwriting present?
  – Noise, Scanning resolution?
  – Is there implicit information in the layout?

Litigation Specific Issues
• Volume Scanning Separated from Metadata
• Document Determination
• Multiple Edits/Prints of the same document (Duplicate Determination)
  – Much harder for images
  – May have unique Bates numbers
• Cost of scanning or adding manual metadata?
• When is an image sufficient?
  – Probably not for handwriting analysis

Summary
• Nothing magic about getting access to images
• Metadata is typically required, and automation can be made more difficult by quality…
• No substitute for “eyes on” the image… and most systems are set up with this in mind

E-Discovery Issues?
