Scanned Documents INST 734 Module 10 Doug Oard

Agenda
• Document image retrieval
• Representation
➢ Retrieval

Thanks to David Doermann for most of these slides

Character N-Grams
• Approach
  – Split up document into overlapping n-character sequences
    • May or may not break on space between words
  – “DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
• Somewhat language-neutral
  – Use average word length to get a stemming-like effect
• Statistically robust to small numbers of errors
  – Good choice between 70%-85% character accuracy
    • Above 85%, use words, dictionary-based correction, and stemming
    • Below 70%, use shape codes
  – Modeling character confusion can add a bit more robustness
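The n-gram split on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `char_ngrams`, the `cross_spaces` switch, the uppercasing, and the choice to keep words shorter than n whole are assumptions, not part of the slides.

```python
def char_ngrams(text, n=3, cross_spaces=False):
    """Return the overlapping n-character sequences found in text.

    cross_spaces=False breaks on spaces, so no n-gram spans two words.
    cross_spaces=True slides the window over the whole string instead.
    """
    text = text.upper()  # assumption: case is folded before indexing
    if cross_spaces:
        units = [" ".join(text.split())]  # normalize runs of whitespace
    else:
        units = text.split()
    grams = []
    for unit in units:
        if len(unit) < n:
            grams.append(unit)  # assumption: short words are kept whole
        else:
            grams.extend(unit[i:i + n] for i in range(len(unit) - n + 1))
    return grams


clean = char_ngrams("DOCUMENT")
noisy = char_ngrams("DOCUNENT")  # one simulated OCR error: M read as N
shared = set(clean) & set(noisy)
print(clean)
print(sorted(shared))
```

With one wrong character out of eight, three of the six trigrams (DOC, OCU, ENT) still match, whereas a whole-word match would fail outright. That is the sense in which n-grams are robust to small numbers of errors.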

Shape Codes
• Group all characters that have similar shapes
  – {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0}
  – {a, c, e, n, o, r, s, u, v, x, z}
  – {b, d, h, k, }
  – {f, t}
  – {g, p, q, y}
  – {i, j, l, 1}
  – {m, w}
• Faster than character recognition
  – Seconds per page, and very accurate
• Preserves recall, but with lower precision
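A minimal sketch of the shape-code mapping, using the seven groups listed on this slide. The one-letter labels in `CLASS_LABELS` and the function name `shape_code` are hypothetical, chosen only for readability. The sketch takes its input as text, whereas a real system would assign the class from the character image without recognizing the character.

```python
# The seven shape classes from the slide, one string per class.
SHAPE_CLASSES = [
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "acenorsuvxz",
    "bdhk",
    "ft",
    "gpqy",
    "ijl1",
    "mw",
]
CLASS_LABELS = "Axbfgim"  # hypothetical labels, one per class above

# Map every listed character to the label of its class.
SHAPE_OF = {
    ch: label
    for label, members in zip(CLASS_LABELS, SHAPE_CLASSES)
    for ch in members
}


def shape_code(word):
    """Replace each character by its class label; unlisted characters pass through."""
    return "".join(SHAPE_OF.get(ch, ch) for ch in word)


print(shape_code("Document"))  # the intended word
print(shape_code("Bocumenf"))  # a badly misrecognized version of it
print(shape_code("cat"), shape_code("oat"), shape_code("eat"))
```

The first two calls produce the same code, so a query for the word still matches the degraded text and recall is preserved. The last line shows the cost: "cat", "oat", and "eat" also collapse to one code, so unrelated words match and precision drops.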

Retrieval Using Image Features
• Advantages
  – Efficiency: process only what you need
    • Particularly important for filtering tasks
  – Effectiveness: Limit effect of cascaded errors
• Applications
  – Handwriting
    • Character segmentation is hard
  – Keyword spotting
    • Capitalization of names is easily seen

Evaluation
• The simplest approach: Model-based evaluation
  – Apply character confusion statistics to an existing collection
• A bit better: Print-then-scan evaluation
  – Scanning is slow, but availability is no problem
• Best: Scan-only evaluation
  – Requires relevance judgments on document images
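Model-based evaluation can be sketched as a function that corrupts clean text according to a character confusion table, so that an existing test collection with relevance judgments can stand in for scanned documents. The table `CONFUSIONS` below is invented for illustration; real statistics would be estimated from the output of an actual OCR system. The function name `degrade` is also hypothetical.

```python
import random

# Hypothetical confusion statistics: for each true character, the possible
# OCR outputs and their probabilities. These numbers are made up.
CONFUSIONS = {
    "m": [("m", 0.90), ("rn", 0.10)],
    "l": [("l", 0.85), ("1", 0.10), ("i", 0.05)],
    "e": [("e", 0.92), ("c", 0.08)],
}


def degrade(text, confusions, rng):
    """Simulate OCR output by sampling a replacement for each character."""
    out = []
    for ch in text:
        options = confusions.get(ch)
        if options is None:
            out.append(ch)  # no statistics for this character: keep it
        else:
            outputs, weights = zip(*options)
            out.append(rng.choices(outputs, weights=weights, k=1)[0])
    return "".join(out)


rng = random.Random(734)  # fixed seed so a run can be repeated
print(degrade("model-based evaluation of retrieval", CONFUSIONS, rng))
```

Running every document in a collection through such a function gives a degraded copy to index, and the original relevance judgments still apply because the documents themselves are unchanged. The sketch only models substitutions of one character at a time; errors that merge or drop characters would need a richer model.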

Key Points
• Converting images to text is a good start
  – But there is more to think about
• Image processing yields useful features
• Structure and content are both important
  – Always true for any collection
  – Especially clear with document images