Document image analysis for metadata extraction

  • Slides: 52
Document image analysis for metadata extraction
U.S. National Library of Medicine
George R. Thoma, Ph.D., Chief, CEB, Lister Hill Center

National Library of Medicine
• World’s largest medical library
• U.S. govt. agency, part of NIH
• Collects all significant material in biomedicine and health care
• Database producer (MEDLINE, GenBank, …)
• Research centers
• Extramural grants

NLM Mission
Develop and provide biomedical information to:
• The clinical and research communities (e.g., MEDLINE)
• Public health and public safety agencies (e.g., HSDB)
• The lay public (e.g., MEDLINEplus)
Develop and provide tools for biomedical research (e.g., WebMIRS, x-ray atlas, genomic data analysis)
Develop and provide tools for informatics research (e.g., UMLS, vocabulary tools, knowledge representation, medical ontologies)
Conduct in-house R&D
Sponsor extramural research (Telemedicine, Visible Human Project, Next Generation Internet, Medical Informatics…)
Provide fellowships for faculty, students
Two important missions:
1. Create citations to the biomedical journal literature for MEDLINE®
2. Preservation

R&D: why and how
Aim: to introduce appropriate technologies
• To support NLM’s services and functions
• To create and disseminate information for biomedical communities: research, clinical and informatics
• To provide information for the lay public
How:
• Identifying suitable domains
• Designing/developing prototype systems
• Using these as testbeds to address key questions
• Implementing/deploying operational systems

Preservation of Digital Materials
• Technical obsolescence of storage media and supporting hardware and software
• Ever-increasing volume of endangered digital materials
• Critical component: metadata for future access and migration to newer formats
• Avoid labor cost of manual metadata entry

Candidates for Digital Preservation (NLM collections)
• Profiles in Science
  – Archival collections of leaders in biomedical research and public health
  – TIFF, PDF, HTML, audio, video files
• PubMed Central
  – Digital archive of life sciences journals
  – XML, PDF, TIFF
  – Contains about 170 journal titles

Goal: System for Preservation of Electronic Resources (SPER)
• Automated metadata extraction
  – Technical metadata from file header
  – Descriptive metadata (heuristic rules and machine learning techniques)
  – Minimum human interaction
• Conform to standards (DC, NISO, METS)
• Intelligent file migration
  – Lossy or lossless migration
  – When to migrate
[Diagram: metadata and files are ingested through SPER GUIs; metadata extraction, migration, and search operate over storage; queries return results.]

Our Problem
• Extracting descriptive metadata (e.g., article title, authors, affiliation, page numbers, journal name, publication date, publisher, databank accession numbers, grant numbers, etc., PLUS abstract)
• Example: Grubb RL. Hemodynamic factors in the prognosis of symptomatic carotid occlusion. JAMA. 1998. 280(12):1055-60.

In other words…
Grubb RL. Hemodynamic factors in the prognosis of symptomatic carotid occlusion. JAMA. 1998. 280(12):1055-60.

Automated Metadata Extraction Methods
• TIFF → OCR → segment → label physical zones (using DIAU techniques)
• Use heuristic rules related to layout (geometric) and context (key words)
  – Currently in production (MARS* for citation generation from journal articles)
• Use learned rules or models
  – Experiments
*Medical Article Records System: automatic extraction of article title, author names, affiliations, and abstract from scanned journals, to populate MEDLINE.

Why learned rules or models? Diverse layout styles
• Style differs in different journals
• Style varies in different issues of a journal
• Manual rule or model creation is expensive
• Automated rule or model learning from previous results
• Use style-related features

Significant Features (examples)
• Geometric
  – Absolute location and size of zones (x1, y1, x2, y2)
  – Relative location of zones (top, bottom, left of, right of)
  – Page margin and gap between zones
• Contextual
  – Font size (12 pt, 20 pt)
  – Font attribute (bold, italic)
  – Key words (University, city, department …)
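To make the feature list concrete, here is a minimal Python sketch of geometric and contextual zone features. The `Zone` fields, helper names, and key-word list are illustrative assumptions, not the actual MARS/SPER data structures.

```python
from dataclasses import dataclass

# Illustrative sketch only: field and helper names are assumptions,
# not the actual MARS/SPER schema.
@dataclass
class Zone:
    x1: float  # geometric: absolute location and size (x1, y1, x2, y2)
    y1: float
    x2: float
    y2: float
    font_size: float  # contextual: e.g., 12 pt, 20 pt
    bold: bool        # contextual: font attribute
    italic: bool
    text: str

KEY_WORDS = {"university", "city", "department"}  # contextual key words

def relative_position(a: Zone, b: Zone) -> str:
    """Coarse relative location of zone a with respect to zone b."""
    if a.y2 <= b.y1:
        return "top"
    if a.y1 >= b.y2:
        return "bottom"
    if a.x2 <= b.x1:
        return "left of"
    if a.x1 >= b.x2:
        return "right of"
    return "overlapping"

def vertical_gap(a: Zone, b: Zone) -> float:
    """Gap between two vertically separated zones (0 if they overlap)."""
    return max(0.0, max(a.y1, b.y1) - min(a.y2, b.y2))

def has_key_word(z: Zone) -> bool:
    return any(w in z.text.lower() for w in KEY_WORDS)
```

A labeling rule can then test such features, e.g. "a large-font bold zone above a zone containing 'Department' is likely the title".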

MARS DCMS
[Workflow diagram: Scanner, CheckIn, OCR, Autozone, Autolabel, Autoformat, MARS Database, ConfidenceEdit, PatternMatch (lexicons/rules), EditDiff, Edit, Reconcile, Upload to MEDLINE indexing; journal flow; Admin.]

Image: original bitmap

Image processed by Autozone: original bitmap → zoned

Features for Autozone
For each text line:
• Median character height and width
• Average character height
• Maximum character height
• Average height of lower-case characters (without ascenders or descenders)
• Average character confidence value
• Number of alphanumeric characters
• Aspect ratio of line (height/width)
• % italics, bold, upper case, digits
• Approximate location on page
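As a rough illustration, several of these per-text-line features can be computed from OCR character boxes as below. The `Character` fields and function names are assumptions for the sketch, not the actual MARS OCR output format, and only a subset of the listed features is shown.

```python
from statistics import median

# Hypothetical OCR character record; fields are assumptions, not the
# actual MARS data structure.
class Character:
    def __init__(self, ch, width, height, confidence):
        self.ch, self.width, self.height, self.confidence = ch, width, height, confidence

ASCENDERS, DESCENDERS = "bdfhklt", "gjpqy"

def line_features(chars, line_w, line_h):
    """Compute a subset of the Autozone-style features for one text line."""
    heights = [c.height for c in chars]
    lower_plain = [c.height for c in chars
                   if c.ch.islower() and c.ch not in ASCENDERS and c.ch not in DESCENDERS]
    return {
        "median_char_height": median(heights),
        "median_char_width": median(c.width for c in chars),
        "avg_char_height": sum(heights) / len(heights),
        "max_char_height": max(heights),
        "avg_lowercase_height": (sum(lower_plain) / len(lower_plain)) if lower_plain else 0.0,
        "avg_confidence": sum(c.confidence for c in chars) / len(chars),
        "num_alnum": sum(c.ch.isalnum() for c in chars),
        "aspect_ratio": line_h / line_w,  # height/width
        "pct_upper": 100.0 * sum(c.ch.isupper() for c in chars) / len(chars),
        "pct_digit": 100.0 * sum(c.ch.isdigit() for c in chars) / len(chars),
    }
```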

Image processed by Autozone and Autolabel: original bitmap → zoned → labeled

Autoreformat: original bitmap → zoned → labeled → text syntax reformatted (e.g., John A. Smith)

Lexical analysis to overcome OCR errors: original bitmap → zoned → labeled → text syntax reformatted → lexical analysis

Scan workstation in MARS system


Scan workstation operation


Edit workstation: colors identify fields labeled automatically


Edit workstation


Edit workstation: high-confidence characters (%)

Reconcile workstation in MARS – main screen


Pattern matching to correct words for Reconcile operator


Reconcile workstation GUI (closeup): bitmapped image; incorrect word; operator click selects correct word from pattern matching

Automated Metadata Extraction Based on Learning Methods
• Automatically learn layout rules or models from previous (similar) TIFF documents
• Use the learned rules or models to segment and label TIFF document pages of similar layout styles

Two Machine Learning Methods
• Learn labeling rules from dynamically generated features based on string-matching techniques (DFGS)
  – Exploits the MARS system; DFGS is now in the MARS production system
  – Three types of features to infer rules
  – Provides an unstructured and partial description of a document page
  – Good for arbitrary layouts but sensitive to variations in absolute zone locations
  – Requires that the physical segmentation (“zoning”) is done accurately
• Learn a 2-D layout model with logical labels based on a Bayesian approach
  – Provides a structured, either partial or full, 2-D description of a document page
  – Physical segmentation and logical labeling are performed simultaneously using the models
  – Not sensitive to document noise or variations in absolute zone locations
  – Uses background
  – Sensitive to document skew
[Diagram: example 2-D layout model for the title field, with states for margins (lm, rm, tm, bm), gaps (g1–g6), title (ti), abstract (ab), author (au), and affiliation (af) along the X and Y directions.]

Learning Labeling Rules: Dynamic Feature Generation System (DFGS)
[Diagram: simplified MARS pipeline (scanned journals → OCR → zoning and labeling → syntax reformatting → text verification → Reconcile → upload to MEDLINE®), with DFGS components: ZoneCzar 1/2, ZoneMatch 1/2, individual feature generation, feature combination and matching score, candidate combined feature sets, journal-specific information, ZMControl.]
Mao S, Kim J, Thoma GR. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Proc. 1st International Workshop on Document Image Analysis for Libraries, pages 225-232, Palo Alto, CA, January 2004.

Learn Document 2-D Layout Models Based on a Bayesian Approach
• Represent 2-D layouts by a set of attributed hidden semi-Markov models (HSMMs)
• A Bayesian method for learning 2-D layout models from segmented and labeled, but unstructured, training data
• Simultaneous physical segmentation and logical labeling using learned layout models
• Character bounding boxes as basic image units [Liang et al., 1996; Ha et al., 1995]

Attributed Hidden Semi-Markov Models
(attributed HSMM) = (A, B, C, π, ρ)
• A: state transition probability matrix that defines a Markov model
• B: state observation probability matrix that defines the “hidden” part
• C: state duration probability matrix that defines the “semi” part
• π: initial state probability distribution vector
• ρ: direction attribute (x or y)
[Diagram: example six-state model with transition probabilities, ρ = X.]
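A minimal container for the attributed HSMM tuple (A, B, C, π, ρ) defined above might look as follows; the matrix contents here are toy values for a two-state example, not learned NLM parameters.

```python
from dataclasses import dataclass
from typing import List

# Sketch of the attributed HSMM parameter tuple; values below are toy
# numbers, not learned model parameters.
@dataclass
class AttributedHSMM:
    A: List[List[float]]  # state transition probabilities (Markov model)
    B: List[List[float]]  # state observation probabilities ("hidden" part)
    C: List[List[float]]  # state duration probabilities ("semi" part)
    pi: List[float]       # initial state probability distribution
    rho: str              # direction attribute: "x" or "y"

    def n_states(self) -> int:
        return len(self.A)

# Two-state toy model partitioning a region along the x direction
model = AttributedHSMM(
    A=[[0.0, 1.0], [1.0, 0.0]],
    B=[[0.7, 0.3], [0.2, 0.8]],
    C=[[0.5, 0.5], [0.9, 0.1]],
    pi=[1.0, 0.0],
    rho="x",
)
```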

Mapping Attributed HSMMs to 2-D Document Layout
• States: document regions such as text regions, page margins, gaps between text regions
• State transitions: boundaries and order of document regions
• State observations: features of document regions
• State durations: sizes of document regions
[Diagram: example with ρ = Y.]

States
• Key states: title, author, affiliation, abstract
• Marginal states: header, footer, section text; text from neighboring page; noise streak
• Combinatorial states: can be partitioned at another dimension
• Margin and gap states

State Observations and Durations
• State observations (contextual features)
  – Number of characters
  – Majority font size
  – Majority key word
  – Majority attribute (bold, italics)
• State durations (geometric features)
  – The size of zones, page margins, and gaps between zones (width and height)

2-D Layout Model: a Set of Attributed Hidden Semi-Markov Models
[Diagram: two example layout models composed of margin states (lm, rm, tm, bm), gap states (g1–g7), and field states (ti, au, af, ab, ad), partitioned along the X and Y directions.]

Bayesian Learning Method
• Start with an initial model M0
• Let X be the observation sample associated with M0
• Merge the states of M0 until we find a model M such that P(M0|X) < P(M1|X) < P(M2|X) < … < P(M|X)

Model Merging Constraints
• Do not allow loops, since the order of zones is important
• Do not allow a text state to be merged with a gap or margin state
• Two states to be merged should be spatially close
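The three constraints above can be sketched as a simple predicate over candidate state pairs. The state-kind labels, transition representation, and distance threshold here are illustrative assumptions, not the actual model internals.

```python
# Sketch of the merging constraints; state kinds, the transition set,
# and the distance threshold are illustrative assumptions.
TEXT, GAP, MARGIN = "text", "gap", "margin"

def creates_loop(transitions, a, b):
    """Merging a and b would create a self-loop if either transitions to the other."""
    return (a, b) in transitions or (b, a) in transitions

def can_merge(kind_a, kind_b, transitions, a, b, dist, max_dist=50.0):
    if creates_loop(transitions, a, b):          # loops forbidden: zone order matters
        return False
    kinds = {kind_a, kind_b}
    if TEXT in kinds and kinds & {GAP, MARGIN}:  # no text/gap or text/margin merges
        return False
    return dist <= max_dist                      # merged states must be spatially close
```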

The Recursive Learning Algorithm
1. Start from a set of training pages; let i = 0 and the 2-D model M = Φ.
2. Learn 1-D models m at the i-th level; let M = M ∪ m.
3. Use M in a recursive duration Viterbi algorithm to segment the training pages.
4. Find segmented regions that can be further split; exit if none exists.
5. i = i + 1; go back to step 2.
[Diagram: example 1-D models m1 (X direction: lm, P, rm) and m2 (Y direction: tm, g1, ti, g2, C, g3, B, bm).]

Priors
• Break a model into three components
• Dirichlet distribution for multinomial prior ([Stolcke and Omohundro, 1994] proposed priors for HMMs)
• Multinomial and geometric distributions

Likelihood
Approximate the likelihood in Bayesian learning by the Viterbi path.
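In symbols, this approximation replaces the sum over all state paths Q with the single best (Viterbi) path Q*:

```latex
P(X \mid M) \;=\; \sum_{Q} P(X, Q \mid M) \;\approx\; \max_{Q} P(X, Q \mid M) \;=\; P(X, Q^{*} \mid M)
```

This makes the likelihood cheap to evaluate during model merging, since only the most probable segmentation of each training page is scored.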

Global Weights
• Adjust the contributions of prior and likelihood to the posterior probability
• Control when model generalization should stop

Comparison of Four Labeling Methods
Test data:
• 198 title text lines
• 181 author text lines
• 600 affiliation text lines
• 2079 abstract text lines
Heuristic rules and DFGS:
1. Assume zoning is done
2. Use font size, font attribute, and key words as features
HMM- and HSMM-based methods:
1. Simultaneous zoning and labeling
2. Only use character count as a feature (3 others later)
[Table: zoning and labeling accuracy (%) on a test set of 69 pages.]

Future Work
• For TIFF images, extend the feature set to font size, font attributes, key words
• Map the layout model to other document formats, e.g., HTML, PDF
• Use the text line (rather than the zone) as the basic state unit

George R. Thoma, Ph.D.
Chief, Communications Engineering Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine
8600 Rockville Pike, Bethesda, MD 20894 USA
thoma@nlm.nih.gov
301 496 4496
archive.nlm.nih.gov

Publications
1. Bayesian Learning of 2-D Document Layout Models for Automated Preservation Metadata Extraction. Song Mao and George R. Thoma. Submitted to the 4th IASTED International Conference on Visualization, Imaging, and Image Processing.
2. Style-Independent Labeling: Design and Performance Evaluation. Song Mao, Jong Woo Kim, and G. R. Thoma. SPIE Conference on Document Recognition and Retrieval, pages 14-22, San Jose, CA, January 2004.
3. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Song Mao, Jong Woo Kim, and G. R. Thoma. 1st International Workshop on Document Image Analysis for Libraries, pages 225-232, Palo Alto, CA, January 2004.
4. Stochastic Attributed K-D Tree Modeling of Technical Paper Title Pages. Song Mao, Azriel Rosenfeld, and Tapas Kanungo. IEEE International Conference on Image Processing, pages 533-536, Barcelona, Spain, September 2003.
5. Stochastic Language Model for Style-Directed Physical Layout Analysis of Documents. Tapas Kanungo and Song Mao. IEEE Transactions on Image Processing, vol. 12, no. 5, pages 583-596, May 2003.

References
6. Best-First Model Merging for Hidden Markov Model Induction. A. Stolcke and S. M. Omohundro. Technical Report TR-94-003, ICSI, Berkeley, CA, 1994.
7. Document Layout Structure Extraction Using Bounding Boxes of Different Entities. J. Liang, J. Ha, and R. M. Haralick. 3rd IEEE Workshop on Applications of Computer Vision (WACV ’96), December 1996.
8. Document Page Decomposition Using Bounding Boxes of Connected Components of Black Pixels. J. Ha, R. M. Haralick, and I. T. Phillips. Document Recognition II, SPIE Proceedings, vol. 2422, pages 140-151, February 1995.
9. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, and J. H. Friedman. Springer Series in Statistics, 2001.

Examples [images]: HSMM-based, DFGS-based, HMM-based, and heuristic-rule-based results

Bayesian Learning
The goal is to find the model that maximizes the posterior P(M|X). We need to know the explicit forms of the prior P(M) and the likelihood P(X|M), obtained from the training set.
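This is the usual maximum-a-posteriori model selection; since P(X) does not depend on the model M, it drops out of the maximization:

```latex
M^{*} \;=\; \arg\max_{M} P(M \mid X) \;=\; \arg\max_{M} \frac{P(M)\, P(X \mid M)}{P(X)} \;=\; \arg\max_{M} P(M)\, P(X \mid M)
```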

Layout Model Learning Results (training set: 19 journal title pages)

Model component   Number of initial states   Number of final states
1                 95                         5
2                 189                        17
3                 205                        32

Model Merging Algorithm
• Best-first merging with look-ahead [Stolcke and Omohundro, 1994]
• Algorithm steps: let M0 be the empty model, let i = 0, and loop:
  1. Get 5 new samples X and incorporate them into Mi
  2. Find the best merge that maximizes P(Mi|X)
  3. Let Mi+1 be the new model
  4. If P(Mi+1|X) < P(Mi|X), perform look-ahead: if the probability does not improve after merging some more states, break from the loop; else let Mi+1 be the merged model
  5. Let i = i + 1
• If data is exhausted, break from the loop and return Mi as the induced model
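The core greedy loop (without the sample-incorporation and look-ahead steps) can be sketched as follows. The model here is just a set of state labels, and `toy_posterior` is a stand-in for P(M|X) that rewards merging down to a floor; everything is illustrative, not the actual learner.

```python
# Schematic sketch of best-first state merging; the toy posterior stands in
# for P(M|X) and look-ahead is omitted.
from itertools import combinations

def toy_posterior(states):
    """Stand-in for P(M|X): best at exactly 3 states, worse on either side."""
    return -abs(len(states) - 3)

def merge(states, a, b):
    merged = set(states) - {a, b}
    merged.add(a + "+" + b)
    return frozenset(merged)

def best_first_merge(states, posterior):
    model = frozenset(states)
    while len(model) > 1:
        candidates = [merge(model, a, b) for a, b in combinations(sorted(model), 2)]
        best = max(candidates, key=posterior)
        if posterior(best) < posterior(model):  # no improving merge: stop
            break
        model = best
    return model

final = best_first_merge({"s1", "s2", "s3", "s4", "s5"}, toy_posterior)
```

With this toy posterior the loop merges the five initial states down to three and then stops, mirroring how merging halts once the posterior would decrease.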

The Recognition Algorithm
• A duration Viterbi algorithm [Rabiner et al., 1985]
• Recursively apply it to a document page using a set of learned attributed hidden semi-Markov models [Mao et al., 2003]