Document Image Analysis
Lecture 20: Intro to Layout
Richard J. Fateman, University of California – Berkeley
Henry S. Baird, Xerox Palo Alto Research Center
UC Berkeley CS 294-9, Fall 2000

Page layout analysis
• Structural (Physical, Geometric) Layout Analysis
• Functional (Syntactic, Logical) Layout Analysis
• Read-order determination

Structural
• Isolation of columns, paragraphs, lines, words, tables, figures. Maybe letters.
• Without some layout analysis, much of the previous work would be impossible!
• Without layout analysis, what is the sequence of words in a multi-column format?

Functional
• Typically domain dependent
• May require merging or splitting of syntactic components
• Encoding into ODA (Open Document Architecture) or SGML (a DTD describes components like section, title, ...)

Functional Components
• First page of a technical article may have
  – Title
  – Author
  – Abstract
  – Body/column 1
  – Body/column 2
  – Footnotes
  – Pagination
  – Journal name/volume/date…
• Business letter might have
  – Sender
  – Date
  – Logo
  – Recipient
  – Body
  – Signature

Finding structural blocks

Common Approaches
• Top-down analysis
  – Horizontal and vertical profiles
  – Recursive: columns, paragraphs/lines/words
  – As illustrated earlier
• Bottom-up analysis
  – Use adjacency based on
    • Pixels / morphology of dilation (millions)
    • RLE / merge lines (thousands)
    • Connected components (hundreds)
• Look at the background (shape-directed covers)
• Also, human hints.
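The top-down "profiles" idea can be sketched in a few lines. This is only an illustration, assuming a binary image stored as a list of rows with 1 for ink; the names `vertical_profile`, `split_on_gaps`, and the `min_gap` parameter are hypothetical and do not come from the lecture.

```python
def vertical_profile(image):
    """Count ink pixels in each column of a binary image (1 = ink)."""
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile, min_gap):
    """Return (start, end) ranges of ink separated by runs of at least
    `min_gap` empty positions. `end` is exclusive.
    Assumption: a wide enough empty run is a column (or line) separator."""
    blocks = []
    start = None   # start of the current ink block, if any
    gap = 0        # length of the current run of empty positions
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the block at the first empty position of the gap.
                blocks.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        blocks.append((start, len(profile) - gap))
    return blocks


page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
columns = split_on_gaps(vertical_profile(page), min_gap=2)
```

Applying the same split to the horizontal profile of each column (rows instead of columns) gives the recursive columns, then lines, then words decomposition.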

Standard images…the Scanned Input

Smear character boxes

Smear words to get lines

Smear lines to get paragraphs

Issues:
• Sensitivity to noise. Solutions:
  – Clean up via kfill or similar filtering, ruthlessly
  – Divide the page (artificially) and keep the noise from affecting the document globally
• Slanted lines. Solution(s):
  – Deskew (since it is not too hard(?))
  – Use nearest neighbors “docstrum”
• Concave regions (text flow around a box). Solution(?): look at background
• Variation in font, spacing can throw off analysis
  – Allow for local analysis

Interactive semi-automatic zoning (RJF)

Zoom in

Scroll around

View individual pixels

Semi…
Turn up the noise filter until we start to kill some of the punctuation. How? As we turn up the threshold, the number of connected components drops, reaches a stable plateau once the noise is gone, and then drops again as we remove punctuation, the dots above the “i”, etc.
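The plateau can be seen by counting components at several thresholds. The sketch below assumes the "noise filter" simply drops connected components with fewer than a threshold number of pixels; the slide does not say what the filter actually is, and `component_sizes` and `counts_by_threshold` are hypothetical names.

```python
def component_sizes(image):
    """Pixel counts of the 8-connected ink components of a binary image."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # Flood-fill one component with an explicit stack.
                seen[r][c] = True
                stack, size = [(r, c)], 0
                while stack:
                    y, x = stack.pop()
                    size += 1
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                sizes.append(size)
    return sizes


def counts_by_threshold(sizes, thresholds):
    """For each threshold t, count the components with at least t pixels.
    Assumption: the filter removes every component smaller than t."""
    return [sum(1 for s in sizes if s >= t) for t in thresholds]


scan = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
]
sizes = component_sizes(scan)
counts = counts_by_threshold(sizes, [1, 2, 3, 5])
```

On a real page, a run of thresholds over which the count stays flat would be the plateau; the knob is set just before the count starts to fall again.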

auto…
Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600. Character boxes have been merged into word boxes.
Turn the horizontal smear knob further until the number of components drops from about 600 to about 100. Word boxes have become line boxes.

matic…
Tweak the vertical smear knob. Lines become paragraphs. (Turn further, and paragraphs become columns.)
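One common way to implement such a smear knob is run-length smearing: fill any short background gap between two ink pixels. The sketch below assumes that technique and a binary image as a list of rows; the lecture does not state which smearing method the tool uses, and `smear_row`, `smear_horizontal`, `smear_vertical`, and `max_gap` are hypothetical names.

```python
def smear_row(row, max_gap):
    """Fill background runs of length <= max_gap lying between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, v in enumerate(row):
        if v:
            if last_ink is not None and 0 < i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def smear_horizontal(image, max_gap):
    """The horizontal knob: smear every row independently."""
    return [smear_row(row, max_gap) for row in image]


def smear_vertical(image, max_gap):
    """The vertical knob: smear every column by transposing, smearing rows,
    and transposing back."""
    columns = [smear_row(list(col), max_gap) for col in zip(*image)]
    return [list(r) for r in zip(*columns)]


def count_runs(row):
    """Number of separate ink runs in one row."""
    return sum(1 for i, v in enumerate(row) if v and (i == 0 or not row[i - 1]))


# Four "characters" with gaps of 1, 3, and 1 pixels between them.
line = [1, 0, 1, 0, 0, 0, 1, 0, 1]
words = smear_row(line, max_gap=1)   # character boxes merge into word boxes
whole = smear_row(line, max_gap=3)   # word boxes merge into one line box
```

Watching `count_runs` (or a connected-component count on the full image) as `max_gap` grows reproduces the sudden drops described on the slides.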

Specify read order

Interactive functional tagging: mark subject/author/etc?
Here we attempt automatic identification of math…
Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)

Docstrum / L. O’Gorman
5 nearest neighbors (O’Gorman 93)

Example of “spectrum”
Each point represents the distance and angle from a connected component (cc) to one of its nearest neighbors. Finding the neighbors is N^2, but not so bad.
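A rough sketch of how such a plot could be computed, assuming each connected component is reduced to its centroid and neighbors are found by brute force (the N^2 cost mentioned above). The function name `docstrum` and the choice to fold angles into [-90, 90) are assumptions for illustration, not details taken from O'Gorman's paper.

```python
import math


def docstrum(centroids, k=5):
    """Return one (distance, angle_in_degrees) point for each of the k nearest
    neighbors of every centroid. Brute force: O(N^2) distance computations."""
    points = []
    for i, (x0, y0) in enumerate(centroids):
        neighbors = sorted(
            (math.hypot(x1 - x0, y1 - y0), j)
            for j, (x1, y1) in enumerate(centroids)
            if j != i
        )[:k]
        for dist, j in neighbors:
            x1, y1 = centroids[j]
            angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
            # Assumption: direction along the line does not matter,
            # so fold the angle into [-90, 90).
            if angle >= 90:
                angle -= 180
            elif angle < -90:
                angle += 180
            points.append((dist, angle))
    return points


# Two components on a baseline slanted at 45 degrees.
pairs = docstrum([(0, 0), (10, 10)], k=1)
```

A histogram of the angles would then peak at the text skew, and a histogram of the distances at the character and line spacings.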

Statistics for skew and spacing

Extract Lines, group to paragraphs
• Statistically close enough horizontally to be words, then lines
• Statistically close enough and parallel enough and the same length as… group two lines into the same text block.
• (arguably saving time by not deskewing; dealing with non-constant skew)
Example follows…

Sections with different skew
6 business cards, nearest-neighbor vectors

Extracted text lines, blocks
Useful? General?

Without layout analysis
• Reading across columns
• Misplacing captions
• Misplacing footnotes
• Misunderstanding page numbers (which should be REMOVED in the reformatting process)
• Need extraction of biblio data: title, author, abstract, keywords
• Nearly every subsequent step is compromised by lack of context.

A Diversion: Separating Math from Text
• Why separate math from text?
• Types of mathematics encountered
• Previous work
• Two approaches
  – post-processing commercial OCR
  – character-based (details!)
• Errors and their correction
• Ambiguities

Why separate math/text/images/…
• OCR programs do not work for math
  – [image] becomes, in TextBridge, [image]
• Designation as a “picture” is only a partial solution

Mathematics on a Page
Inline math is harder to pick out because it may look like italic text.

Previous Work
• Isolation by hand (most math parser papers)
• Texture/statistics based heuristics
  – useful for display math “paragraphs”
  – not useful for in-line math
• Character based pseudo-parsing (but without font information or true parsing feedback)
• Incomplete

Proposal: Post-Processing of OCR
• Start with commercial best-effort recognition
• Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file)
• Accept recognition of text zones with high recognition certainty. (Lines with no errors surrounded by lines with no errors are considered solved.)
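The "solved line" rule in the last bullet is easy to state in code. The sketch assumes the OCR output has already been reduced to one error count per text line; how XDOC actually reports uncertainty is not described here, so `error_counts` is a hypothetical input, and treating the first and last lines as having a single neighbor is also an assumption.

```python
def solved_lines(error_counts):
    """Mark line i as solved when it and its immediate neighbors have no errors.
    Assumption: lines at the top and bottom of the page have only one neighbor."""
    n = len(error_counts)
    solved = []
    for i in range(n):
        window = error_counts[max(0, i - 1):min(n, i + 2)]
        solved.append(all(e == 0 for e in window))
    return solved


# Line 3 has two suspect characters, so lines 2 to 4 stay unsolved.
flags = solved_lines([0, 0, 0, 2, 0, 0])
```

Everything not flagged as solved, including clean lines next to a suspect one, remains a candidate mathematics zone.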

Separate uncertain areas
• Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines)
• Isolate characters, identify fonts, etc.
• Play out heuristic rules for separating text and math zones.
• Consider eradicating math and re-submitting text; separately recognizing math and reinserting in XDOC

Alternatively, Starting from our own naïve OCR
• Connected component recognition
• Separate characters by initial classification
• Repeatedly re-examine via rules
• Determine text zones, remove math / feed remainder to commercial OCR
  – How best to blank-out math? XXX
• Most likely human interaction remains

Two bags: Math vs Text
• Initially Math
  – + - = / Greek, scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points
• Initially Text
  – Roman letters, junk
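The first-pass sort into two bags might look like the following sketch. It assumes each recognized token arrives as a string plus italic/bold flags from the recognizer; `initial_bag` and the exact symbol set are illustrative guesses, and the "scientific symbols" category is left out because the slide does not enumerate it.

```python
MATH_SYMBOLS = set("+-=/()[].,")        # operators, brackets, dots, commas
MATH_WORDS = {"sin", "cos", "tan"}      # function names from the slide
DIGITS = set("0123456789")


def is_greek(ch):
    """True for characters in the Unicode Greek and Coptic block."""
    return "\u0370" <= ch <= "\u03ff"


def initial_bag(token, italic=False, bold=False):
    """First-pass guess: return "math" or "text" for one recognized token.
    Deliberately over-eager about math; later passes correct it."""
    if italic or bold:
        return "math"
    if token in MATH_WORDS:
        return "math"
    if token and all(
        ch in DIGITS or ch in MATH_SYMBOLS or is_greek(ch) for ch in token
    ):
        return "math"
    return "text"   # Roman letters and junk


bags = [initial_bag(t) for t in ["3.14", "sin", "the", "+", "\u03b1"]]
```

Note that a sentence-ending period or a hyphen lands in the math bag under this rule, which is exactly the "too much Math" problem the second pass has to correct.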

Sample Text Bag

Sample Math Bag

Second Pass
• Correct for too much Math
• Grow “clumps” (expand bounding boxes, BBs) to categorize
  – 3.14159 vs “end of sentence.”
  – (comment) vs f(x)
  – hyphen-words vs x^2 - y^2
  – horizontal lines generally isolated
  – 1 or is it l “ell” or I “eye”
• “bags” or “zones” of geometric-relation boxes containing either words or potential math
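Growing clumps by expanding bounding boxes can be sketched as a repeated merge. Boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the single `margin` parameter is a hypothetical simplification; the lecture does not give the actual growth rule.

```python
def expand(box, margin):
    """Grow a bounding box by `margin` on every side."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a, b):
    """True when two boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def grow_clumps(boxes, margin):
    """Merge boxes until no expanded box reaches another one."""
    clumps = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(clumps)):
            for j in range(i + 1, len(clumps)):
                if overlaps(expand(clumps[i], margin), clumps[j]):
                    clumps[i] = union(clumps[i], clumps[j])
                    del clumps[j]
                    merged = True
                    break
            if merged:
                break   # restart the scan: the merged clump may reach others
    return clumps


# Two neighboring glyphs and one far away.
glyphs = [(0, 0, 4, 10), (6, 0, 10, 10), (30, 0, 34, 10)]
clumps = grow_clumps(glyphs, margin=2)
```

Once a period sits in the same clump as digits on both sides it can be read as a decimal point, while a period clumped only with a preceding word is more plausibly the end of a sentence.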

Importance of Context
Here are 12 L’s and a 1

Third Pass
• Too much is in the text bag now
  – blur the math to allow for embedded Roman text like “sin” or “l”
• Re-clump the mathematics to see if new bridges have been formed
• Some italics in the math bag may really be
  – English words in theorems
  – emphasized text

On Ambiguity and Correctness
• Can we find the math in ad - bc by ad hoc methods?
• If we are unable to disambiguate English words, why should we be able to disambiguate mathematics?
• Abuse of mathematical notation is widespread: can we insist that new papers either have a non-ambiguous notation or an underlying electronic non-ambiguous notation?

Conclusions
• We can make a first cut at separating math from text
• If we wish to “enliven” math publication with semantic underpinnings, this may help in their production
• Incorporation of AI rule-based transformations as well as hand correction is likely to be important