From Tessellations to Table Interpretation
Ramana C. Jandhyala, DocLab, RPI
Introduction
• Novel aspects of our work
  – Focus on computer-constructed web tables
  – Using commercial software
  – Describing tables using XY trees
  – Extracting the relationship of headers to content cells
• Formalizes the 200-table experiment conducted by Raghav. The tables were imported from 10 websites into Excel and manually edited into a form that can be processed algorithmically.
• Average editing time: 104 sec.
• Average table size: 587 cells.
• Augmentations not considered!
Rectangular Tessellations
• Rectangular Tiling/Discrete Rectangular Tessellation
  – Partition of an isothetic rectangle into rectangles
  – Geometry uniquely defined by locations and types of junction points
  – The number of tilings $N_{all}(m)$ increases exponentially with table size.
• XY Tessellations
  – Special case of rectangular tessellations
  – Obtained by successive horizontal and vertical cuts
  – The number of XY tilings $N_{xy}(m)$ becomes a vanishing fraction of all tilings (Klarner, Magliveras), i.e. $\lim_{m \to \infty} N_{xy}(m) / N_{all}(m) = 0$
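To make the distinction between general and XY tessellations concrete, here is a minimal Python sketch that tests whether a tiling can be produced by successive edge-to-edge cuts. The representation (tiles as integer `(x0, y0, x1, y1)` tuples) and the function name `is_xy_tiling` are assumptions made for illustration, and the input is assumed to be a valid tiling of a rectangle.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x0 < x1 and y0 < y1


def is_xy_tiling(rects: List[Rect]) -> bool:
    """Return True if the tiling can be built by recursive full-length cuts."""
    if len(rects) <= 1:
        return True
    for axis in (0, 1):  # 0: try vertical cut lines, 1: try horizontal ones
        lo = min(r[axis] for r in rects)
        # Candidate cut positions are tile edges strictly inside the region.
        for c in sorted({r[axis] for r in rects} - {lo}):
            if any(r[axis] < c < r[axis + 2] for r in rects):
                continue  # some tile straddles this line, so it is not a cut
            first = [r for r in rects if r[axis + 2] <= c]
            second = [r for r in rects if r[axis] >= c]
            # Any through-cut may be taken: both halves must themselves be XY.
            return is_xy_tiling(first) and is_xy_tiling(second)
    return False  # no full-length cut exists in either direction


grid_2x2 = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
pinwheel = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]
print(is_xy_tiling(grid_2x2), is_xy_tiling(pinwheel))
```

The five-tile pinwheel has no full-length cut in either direction, which is why it fails the test while the 2×2 grid passes.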
Taxonomy of web tables
• All tables have a stub, row headings, column headings and data cells.
• Some common layouts – admissible tessellations
Taxonomy of web tables (contd.)
• Human-understandable tables: $N_{T,S,xy}(m)$, mathematically indefinable and of unknown number
• Convert them to a smaller set of admissible tables: $N_{A,S,xy}(m)$
• Layout-equivalent tables are enough for algorithmic analysis.
Taxonomy of web tables (contd.)
• Number of different layout-equivalent admissible candidates: $N_{L,S,xy}(m)$
• For now, $N_{L,S,xy}(m) < N_{A,S,xy}(m)$
• Context-free grammars characterize entire families of layout-equivalent tables
Logical Structure of Tables
• XY trees only capture physical layout
• To understand a table, we need to analyse its logical structure, i.e. the relationship between header cells and content cells [Wang].
• Wang notation consists of category trees (headings) and delta cells (content).
  – Number of category trees: the dimensionality of the table
  – The Cartesian product of the category trees leads to the delta cells.
  – Size of table: the product of the number of rows and columns of delta cells
Logical Structure of Tables (contd.)
• Well-formed tables: labeled table candidates for which Wang Notation exists
• Most tables are not well-formed, but are easily convertible into a well-formed format using virtual headers.
• Analyzing logical structure is not sufficient for table understanding!
• Our project: a front end for creating narrow-domain ontologies by combining information from web tables
• Our work is based on the following inequalities:
  $N_{L,S,xy}(m) < N_{A,S,xy}(m) < N_{T,S,xy}(m) \ll N_{xy}(m) \ll N_{all}(m)$
• Examples of each class are shown in the next slide.
Tessellations to XY trees
• Horizontally and vertically ordered lists of junction points are not sufficient for reconstructing the XY tree!
• They do not capture the adjacency topology.
• Need coordinates and junction types (NE corner, T-junction, crossing, etc.)
Table to XY tree – EX2XY
• Applicable to any tessellation for which an XY tree exists.
• Input – Excel table
• Output – XY tree (parenthesized notation)
• Algorithm:
  – CutV(R) – cuts a rectangle R vertically and returns the leftmost sub-rectangle.
  – CutH(R) – cuts R horizontally and returns the topmost sub-rectangle.
  – Both are used in a pair of procedures P1 and P2, which call each other recursively.
  – P1 cuts the given rectangle vertically and submits the first sub-rectangle to P2 for horizontal cuts. Similarly with P2.
  – Main procedure calls P1 for vertical cuts, and P2 for horizontal cuts.
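The alternating vertical/horizontal recursion described above can be sketched in Python as follows. This is not the EX2XY code itself: the grid representation (merged cells repeat one unique label in every grid square), the names `ex2xy`, `_cuts` and `_build`, and the use of `{ }` for vertical cuts and `[ ]` for horizontal cuts are all assumptions for illustration. Unlike CutV/CutH, which peel off one sub-rectangle at a time, the sketch finds all cuts of one direction at once.

```python
from typing import List

Grid = List[List[str]]  # assumed: each merged cell repeats one unique label


def _cuts(g: Grid, r0: int, r1: int, c0: int, c1: int, vertical: bool) -> List[int]:
    """Positions where a full-length cut crosses no merged cell."""
    if vertical:
        return [c for c in range(c0 + 1, c1)
                if all(g[r][c - 1] != g[r][c] for r in range(r0, r1))]
    return [r for r in range(r0 + 1, r1)
            if all(g[r - 1][c] != g[r][c] for c in range(c0, c1))]


def _build(g: Grid, r0: int, r1: int, c0: int, c1: int,
           vertical: bool, stuck: bool = False) -> str:
    cells = {g[r][c] for r in range(r0, r1) for c in range(c0, c1)}
    if len(cells) == 1:
        return g[r0][c0]  # a single (possibly merged) cell is a leaf
    cuts = _cuts(g, r0, r1, c0, c1, vertical)
    if not cuts:
        if stuck:  # neither direction can cut: no XY tree exists
            raise ValueError("tessellation has no XY tree")
        return _build(g, r0, r1, c0, c1, not vertical, stuck=True)
    if vertical:  # role of P1: cut vertically, hand the pieces to P2
        b = [c0] + cuts + [c1]
        parts = [_build(g, r0, r1, a, z, False) for a, z in zip(b, b[1:])]
        return "{" + " ".join(parts) + "}"
    b = [r0] + cuts + [r1]  # role of P2: cut horizontally, hand the pieces to P1
    parts = [_build(g, a, z, c0, c1, True) for a, z in zip(b, b[1:])]
    return "[" + " ".join(parts) + "]"


def ex2xy(grid: Grid) -> str:
    """Parenthesized XY tree of a grid, starting with vertical cuts."""
    return _build(grid, 0, len(grid), 0, len(grid[0]), True)


print(ex2xy([["a", "b", "b"],
             ["a", "c", "d"]]))  # prints {a [b {c d}]}
```

A pinwheel arrangement of merged cells raises `ValueError`, matching the restriction to tessellations for which an XY tree exists.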
Example – Original HTML table
Example (contd. ) – After import into Excel
Example – After Editing
A snippet of the output (both parenthetical and XML outputs)
Parenthetical version of the output
  ( [ {
    ::15,2
    ::16,2
    Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)::17,2:30,2
  }
  {
    ::15,3
    ::16,3
    Canada::17,3
    Newfoundland Labrador::18,3
    Prince Edward Island::19,3
    Nova Scotia::20,3
    New Brunswick::21,3
    Quebec::22,3
    Ontario::23,3
    Manitoba::24,3
    Saskatchewan::25,3
    Alberta::26,3
    British Columbia::27,3
    Yukon::28,3
    Northwest Territories::29,3
    Nunavut::30,3
  }
  {
    Year::15,4:15,8
    [
      2004::16,4
      2005::16,5
      2006::16,6
      2007::16,7
      2008::16,8
    ]
  ...
XML version of the output
  ...
  <block id='1.1.2.1' range='17,2:30,2'>
    <content>Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)</content>
  </block>
  <block id='1.1.2.2' range='17,3:30,3'>
    <content></content>
  </block>
  <block id='1.2.2.1' range='16,4:16,4'>
    <content>2004</content>
  </block>
  <block id='1.2.2.2' range='16,5:16,5'>
    <content>2005</content>
  </block>
  <block id='1.2.2.3' range='16,6:16,6'>
    <content>2006</content>
  </block>
  <block id='1.2.2.4' range='16,7:16,7'>
    <content>2007</content>
  </block>
  ...
Grammar for tables
• The grammar uses nested parenthetical notation (P-notation).
• P-notation has a 1:1 correspondence with general trees.
• For the above table, the XY tree sentence (neglecting the textual labels) is:
  Sxy = {c [c c] c [c {c [c c]}]}
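The correspondence between P-notation and general trees can be illustrated with a small Python sketch that parses a sentence into a nested tree and writes it back out. The tree representation (a leaf is a string, an inner node is a `(bracket, children)` pair) and the function names are assumptions for illustration only.

```python
import re
from typing import List, Tuple, Union

Node = Union[str, Tuple[str, list]]  # leaf label, or (opening bracket, children)
CLOSERS = {"{": "}", "[": "]"}


def parse_p_notation(text: str) -> Node:
    """Parse a P-notation sentence into a nested tree."""
    tokens: List[str] = re.findall(r"[{}\[\]]|[^\s{}\[\]]+", text)
    if not tokens:
        raise ValueError("empty sentence")
    pos = 0

    def parse() -> Node:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in CLOSERS:  # an opening bracket starts an inner node
            children = []
            while pos < len(tokens) and tokens[pos] != CLOSERS[tok]:
                children.append(parse())
            if pos == len(tokens):
                raise ValueError("unbalanced " + tok)
            pos += 1  # consume the matching closing bracket
            return (tok, children)
        if tok in CLOSERS.values():
            raise ValueError("unexpected " + tok)
        return tok  # anything else is a leaf label

    tree = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root")
    return tree


def to_p_notation(node: Node) -> str:
    """Inverse of parse_p_notation (up to whitespace)."""
    if isinstance(node, str):
        return node
    opener, children = node
    return opener + " ".join(to_p_notation(c) for c in children) + CLOSERS[opener]


def count_leaves(node: Node) -> int:
    return 1 if isinstance(node, str) else sum(count_leaves(c) for c in node[1])


sxy = "{c [c c] c [c {c [c c]}]}"
tree = parse_p_notation(sxy)
print(to_p_notation(tree) == sxy, count_leaves(tree))  # prints True 8
```

The round trip reproduces the sentence exactly, and the eight leaves are the eight `c` cells of Sxy.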
Grammar
• Grammar for parsing the column headers of all such layout-equivalent tessellations:
  – S := A (Rule 1)
  – A := {B} (Rule 2)
  – B := c [X] B | c [X] (Rules 3 and 4)
  – X := c X | A | c (Rules 5, 6, 7 and 8)
• where
  – S is the start symbol
  – A is the nonterminal that generates all admissible strings for column headers
  – B generates >=1 instances of categories in the form c[X]. Each c becomes a root category and X generates its subcategory tree
  – X generates strings of size >=1 with arbitrary occurrences of c and A.
• The derivation for the previous example using an LALR parser is shown on the next slide
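As a rough illustration of what the grammar accepts, the following Python sketch is a hand-written recursive-descent recognizer, swapped in here for the LALR parser mentioned on the slide. It follows the alternatives exactly as printed (S := A, A := {B}, B := c [X] B | c [X], X := c X | A | c), so a nested A can only appear at the end of an X string. The function name `accepts` is an assumption.

```python
def accepts(sentence: str) -> bool:
    """Return True if the sentence is derivable from S under the printed rules."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok: str) -> None:
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}")
        pos += 1

    def parse_a() -> None:  # A := { B }
        expect("{")
        parse_b()
        expect("}")

    def parse_b() -> None:  # B := c [ X ] B | c [ X ]
        expect("c")
        expect("[")
        parse_x()
        expect("]")
        if peek() == "c":  # another root category follows
            parse_b()

    def parse_x() -> None:  # X := c X | A | c
        if peek() == "c":
            expect("c")
            if peek() in ("c", "{"):  # rule c X, otherwise the single c
                parse_x()
        else:
            parse_a()  # rule A (fails if the next token is not '{')

    try:
        parse_a()  # S := A
    except SyntaxError:
        return False
    return pos == len(tokens)


print(accepts("{c [c c] c [c {c [c c]}]}"))  # prints True
print(accepts("{c c}"))                      # prints False
```

Acceptance only says the bracket structure is admissible. It says nothing about whether the header labels themselves are consistent.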
• The example demonstrates both the power and the limitation of grammars.
• A grammar can recognize broad classes.
• But grammars cannot check that headings are proper labels for well-formed tables.
• If a table is accepted by the grammar, additional geometric alignment and lexical checks are needed to verify Wang notation.
XY tree to Wang Notation
• XY2WANG converts an XY tree generated from a restricted family of admissible tables to Wang Notation.
• Example:
• Uses an indented table-of-contents format as a data structure.
XY2WANG
• Input – XY trees with arbitrary number of categories and arbitrary nesting.
• Output – XML version of Wang Notation
• For a table T = (C, d),
  – Category notation: $C = \{(A, \{(A_1, \phi), (A_2, \phi)\}), (B, \{(B_1, \phi), (B_2, \phi), (B_3, \phi)\})\}$
  – Delta mappings: $\delta(\{A.A_1, B.B_1\}) = d_{11}$, $\delta(\{A.A_1, B.B_2\}) = d_{12}$, …
XY2WANG: Algorithm
• Algorithm:
  – First locate the 4 principal regions: stub, row headers, column headers and content cells.
  – Extract Wang labeled domains under the assumption that each spanning cell is the header of the smaller cells either to its right (row headers) or below it (column headers).
  – Compute the Cartesian product of category paths and match each key to the content of a delta cell.
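The last step, pairing the Cartesian product of category paths with delta cells, can be sketched in Python for the two-category example with categories A and B. The nested-dict representation of category trees, the row-major ordering of content cells, and the names `leaf_paths` and `delta_mapping` are assumptions for illustration, not the XY2WANG data structures.

```python
from itertools import product
from typing import Dict, List


def leaf_paths(name: str, subtree: dict) -> List[str]:
    """Dotted paths from a category root to each of its leaves."""
    if not subtree:
        return [name]
    return [f"{name}.{path}"
            for child, sub in subtree.items()
            for path in leaf_paths(child, sub)]


def delta_mapping(categories: Dict[str, dict], cells: List[str]) -> Dict[frozenset, str]:
    """Match every combination of category paths to one content cell.

    `cells` is assumed to be listed in the same order as the Cartesian
    product of the category paths (row-major for a two-category table).
    """
    paths = [leaf_paths(name, tree) for name, tree in categories.items()]
    keys = list(product(*paths))
    if len(keys) != len(cells):
        raise ValueError("cell count does not match the category product")
    return {frozenset(key): value for key, value in zip(keys, cells)}


C = {"A": {"A1": {}, "A2": {}},
     "B": {"B1": {}, "B2": {}, "B3": {}}}
d = ["d11", "d12", "d13", "d21", "d22", "d23"]

delta = delta_mapping(C, d)
print(len(C), len(delta))                   # dimensionality 2, size 6
print(delta[frozenset({"A.A1", "B.B2"})])   # prints d12
```

A mismatch between the number of content cells and the size of the product is one simple signal that a table is not well-formed.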
XY2WANG: Table-of-contents data structure
• Example of a table and its corresponding table-of-contents data structure is shown
• XY2WANG also handles more complex scenarios like:
  – Higher Wang dimensionality
  – Deeper nesting of headers
  – Repetitive headers
  – Detection of not well-formed tables
• These are included in the following pseudocode
Conclusion
• Hierarchical structure of categories and flat structure of data cells are recovered from XY trees.
• Geometric and topological equivalence classes on tessellations and their XY trees are defined.
• Commonly encountered tables are examples of such classes.
• These tables are identified by parsing XY trees with a grammar.
• Assuming the header labels are consistent, Wang category notation is extracted.
Future work
• Account for aggregates, a major component of web tables.
• Need to integrate other augmentations (footnotes, units, captions, etc.)
• Expand on the grammar: current version accounts only for column headers.
• Automate the conversion from imported web tables to standard formats.
• Semantic interpretation of groups of conceptually overlapping tables based on precise representation of layout-invariant syntax.
Current Work
• Converting web tables to standard formats for ease of processing.
  – Internal conventions: A’, A’’, hybrids
• Learning from XY trees using tree edit distance
  – Learning from existing manipulations.
  – Ex: The user modifies table T1 to a standard format T1’. The steps are all recorded. Now use this information to predict the standard format of a new table T2.
Current work (contd.)
• Relation of tree-edit distance to pre-order and post-order string edit distance
  – Some interesting results and conjectures, but still half-baked!
  – (Result) Pre- and post-order traversals are enough for reconstructing a general tree.
  – (Conjecture) For 2 XY trees, distances between corresponding pre- and post-order strings are equal, but not for general trees!
  – (Conjecture) For 2 XY trees, tree-edit distance is equal to pre/post-order distances
  – Are tables with the same content, but different layouts, collinear (in terms of string/tree edit distance)?
• Developing software to calculate tree edit distances, which should clear up many things. (Any suggestions?)
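A small Python harness like the one below could be used to experiment with the string side of these conjectures. It computes pre-order and post-order label strings of a tree and the ordinary Levenshtein distance between them. It does not compute tree edit distance, and the tree representation (`(label, children)` pairs) and function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Tree = Tuple[str, list]  # (label, list of child trees)


def preorder(t: Tree) -> List[str]:
    label, kids = t
    return [label] + [x for k in kids for x in preorder(k)]


def postorder(t: Tree) -> List[str]:
    label, kids = t
    return [x for k in kids for x in postorder(k)] + [label]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Two small general trees: a root with two children, and a three-node chain.
t1: Tree = ("a", [("b", []), ("c", [])])
t2: Tree = ("a", [("b", [("c", [])])])

print(edit_distance(preorder(t1), preorder(t2)))    # prints 0
print(edit_distance(postorder(t1), postorder(t2)))  # prints 2
```

For this pair of general trees the pre-order strings coincide while the post-order strings differ, so the two string distances are not equal. Whether they always agree for XY trees is exactly what remains to be tested.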