Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
David W. Embley, Cui Tao, Stephen W. Liddle
Brigham Young University, BYU Data Extraction Group
Funded by NSF
ER 2002
Information Exchange
[Diagram: Source and Target. Leverage this (Information Extraction) … to do this (Schema Matching).]
Information Extraction
Extracting Pertinent Information from Documents
A Conceptual-Modeling Solution
[Diagram: conceptual model for car ads. Car has Year, Make, Model, Mileage, and Price (Car 0..1, other side 1..*); Car has Feature (Car 0..*, Feature 1..*); PhoneNr is for Car (PhoneNr 1..*, Car 0..*); PhoneNr has Extension (PhoneNr 0..1, Extension 1..*).]
Car-Ads Ontology
  Car [->object];
  Car [0..1] has Year [1..*];
  Car [0..1] has Make [1..*];
  Car [0..1] has Model [1..*];
  Car [0..1] has Mileage [1..*];
  Car [0..*] has Feature [1..*];
  Car [0..1] has Price [1..*];
  PhoneNr [1..*] is for Car [0..*];
  PhoneNr [0..1] has Extension [1..*];
  Year matches [4]
    constant { extract "\d{2}";
               context "([^\$\d]|^)[4-9]\d[^\d]";
               substitute "^" -> "19"; },
    …
  …
  End;
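The Year rule in the ontology above can be read as three steps: find a two-digit number in the right context, extract it, and prefix it with "19". The following is a minimal Python sketch of that reading, not the authors' implementation. It assumes the backslashes in the slide's patterns were lost in extraction (so "d" stands for `\d`), and it adds a capturing group around `[4-9]\d` so the extracted digits can be read off the context match. The function name `extract_years` is hypothetical.

```python
import re

# Context pattern from the slide's Year data frame, with backslashes assumed
# restored. The second capturing group is an addition of this sketch: it marks
# the two digits that the "extract" rule (\d{2}) would pull out.
CONTEXT = re.compile(r"([^\$\d]|^)([4-9]\d)[^\d]")


def extract_years(text: str) -> list[str]:
    """Return four-digit years for two-digit years found in a valid context."""
    years = []
    for match in CONTEXT.finditer(text):
        two_digits = match.group(2)
        # The "substitute" rule: "^" -> "19" prefixes the extracted digits.
        years.append("19" + two_digits)
    return years


if __name__ == "__main__":
    print(extract_years("'89 Subaru SW $1900"))
    print(extract_years("$45 off"))
```

The leading `[^\$\d]` alternative is what keeps dollar amounts such as "$45" from being read as years, and `[4-9]` restricts matches to 1940 through 1999.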
Recognition and Extraction
Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081
Car-Feature table (Car column: 0001, 0002, 0002, 0002, 0003; Feature column: Auto AC Black 4 door tinted windows Auto pb ps cruise am/fm cassette stereo a/c Auto jade green gold)
Schema Matching for HTML Tables with Unknown Structure
Table-Schema Matching (Basic Idea)
• Many Tables on the Web
• Ontology-Based Extraction
  – Works well for unstructured or semistructured data
  – What about structured data – tables?
• Method
  – Form attribute-value pairs
  – Do extraction
  – Infer mappings from extraction patterns
Problem: Different Schemas
Target Database Schema
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas
  – {Run #, Yr, Make, Model, Tran, Color, Dr}
  – {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  – {Vehicle, Distance, Price, Mileage}
  – {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Problem: Attribute is Value
Problem: Attribute-Value is Value
Problem: Value is not Value
Problem: Implied Values
Problem: Missing Attributes
Problem: Compound Attributes
Problem: Factored Values
Problem: Split Values
Problem: Merged Values
Problem: Values not of Interest
Problem: Information Behind Links
  – Single-Column Table (formatted as list)
  – Table extending over several pages
Solution
• Form attribute-value pairs (adjust if necessary)
• Do extraction
• Infer mappings from extraction patterns
Solution: Remove Internal Factoring
[Table image showing factored values such as ACURA and Legend.]
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
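The effect of the two unnest operations can be illustrated with a small Python sketch. It assumes the discovered nesting Make, (Model, (Year, …)*)* has already been represented as nested lists and dictionaries; the keys `models` and `cars`, the function name `unnest`, and the Year values are hypothetical choices for illustration, not taken from the slides.

```python
def unnest(nested_table):
    """Flatten Make, (Model, (car attributes)*)* into one flat row per car."""
    rows = []
    for make_group in nested_table:
        for model_group in make_group["models"]:
            for car in model_group["cars"]:
                # Copy the factored Make and Model values into every row.
                row = {"Make": make_group["Make"], "Model": model_group["Model"]}
                row.update(car)
                rows.append(row)
    return rows


if __name__ == "__main__":
    # Illustrative data only: the Year values are made up.
    nested = [
        {
            "Make": "ACURA",
            "models": [
                {"Model": "Legend", "cars": [{"Year": "1992"}, {"Year": "1993"}]},
            ],
        },
    ]
    for row in unnest(nested):
        print(row)
```

After flattening, every row carries its own Make and Model, so each row can be treated as one car.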
Solution: Replace Boolean Values
β^{Auto}_{Yes,""} β^{Air Cond.}_{Yes,""} β^{AM/FM}_{Yes,""} β^{CD}_{Yes,""} Table
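One way to read the β operator is as a column rewrite: a "Yes" under a column becomes the column's own attribute name, and a blank stays blank. The Python sketch below follows that reading under stated assumptions: rows are dictionaries, the true marker is the string "Yes", and the false marker is the empty string. The function name `replace_boolean` is hypothetical.

```python
def replace_boolean(rows, attribute, true_value="Yes", false_value=""):
    """Replace a true marker in one column with the attribute name itself."""
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        value = row.get(attribute, false_value)
        if value == true_value:
            new_row[attribute] = attribute
        elif value == false_value:
            new_row[attribute] = ""
        result.append(new_row)
    return result


if __name__ == "__main__":
    table = [{"Auto": "Yes", "CD": ""}]
    for column in ("Auto", "CD"):
        table = replace_boolean(table, column)
    print(table)
```

Rewriting the value this way lets a recognizer for Feature see the string "Auto" instead of an uninformative "Yes".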
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
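Forming the pairs amounts to zipping the column headers with each row's cells. A minimal Python sketch, assuming the table has already been flattened and its Boolean columns rewritten; the function name `form_pairs` is hypothetical.

```python
def form_pairs(headers, row):
    """Pair each column header with the corresponding cell of one row."""
    if len(headers) != len(row):
        raise ValueError("row and header lengths differ")
    return list(zip(headers, row))


if __name__ == "__main__":
    headers = ["Make", "Model", "Year", "Colour", "Price",
               "Auto", "Air Cond.", "AM/FM", "CD"]
    # Row values as shown on the slide; the CD cell is blank.
    row = ["Honda", "Civic EX", "1995", "White", "$6300",
           "Auto", "Air Cond.", "AM/FM", ""]
    for attribute, value in form_pairs(headers, row):
        print(f"<{attribute}, {value}>")
```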
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
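Comparing this slide with the unadjusted pairs suggests two adjustments: a pair whose value merely repeats its attribute collapses to the attribute alone, and a pair with an empty value is dropped. The Python sketch below encodes that reading; it is an interpretation of the slide, and the function name `adjust_pairs` and the one-element tuple representation are hypothetical.

```python
def adjust_pairs(pairs):
    """Collapse <A, A> to <A> and drop pairs whose value is empty."""
    adjusted = []
    for attribute, value in pairs:
        if value == "":
            continue  # e.g. <CD, > carries no information
        if value == attribute:
            adjusted.append((attribute,))  # e.g. <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attribute, value))
    return adjusted


if __name__ == "__main__":
    pairs = [("Price", "$6300"), ("Auto", "Auto"), ("CD", "")]
    print(adjust_pairs(pairs))
```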
Solution: Do Extraction
Solution: Infer Mappings
Each row is a car.
  π_{Make} μ_{(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
  π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
  π_{Year} Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
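The note about OIDs can be made concrete with a short Python sketch. It assumes each mapping yields a set of (OID, value) pairs for one target attribute; the OID string "car-1" and the function name `join_on_oid` are hypothetical, and the attribute values reuse the Honda Civic EX example from the attribute-value slides.

```python
def join_on_oid(attribute_sets):
    """Combine per-attribute sets of (oid, value) pairs into one record per OID."""
    records = {}
    for attribute, pairs in attribute_sets.items():
        for oid, value in pairs:
            records.setdefault(oid, {})[attribute] = value
    return records


if __name__ == "__main__":
    mapped = {
        "Make": {("car-1", "Honda")},
        "Model": {("car-1", "Civic EX")},
        "Year": {("car-1", "1995")},
        "Price": {("car-1", "$6300")},
    }
    print(join_on_oid(mapped))
```

Because every value carries the OID of the row it came from, no value matching is needed to reassemble a record.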
Solution: Do Extraction
  π_{Model} μ_{(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*} Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Solution: Do Extraction
  π_{Price} Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Solution: Do Extraction
  ρ_{Colour←Feature} π_{Colour} Table
  ∪ ρ_{Auto←Feature} π_{Auto} β^{Auto}_{Yes,""} Table
  ∪ ρ_{Air Cond.←Feature} π_{Air Cond.} β^{Air Cond.}_{Yes,""} Table
  ∪ ρ_{AM/FM←Feature} π_{AM/FM} β^{AM/FM}_{Yes,""} Table
  ∪ ρ_{CD←Feature} π_{CD} β^{CD}_{Yes,""} Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
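The Feature mapping is a union over several source columns, each projected and renamed to Feature, with Boolean columns rewritten first. A minimal Python sketch of that union follows. It assumes rows arrive as (OID, dictionary) pairs, that "Yes" marks a true Boolean cell, and that blank cells contribute nothing; the function name `feature_mapping` and the OID "car-1" are hypothetical.

```python
def feature_mapping(rows, plain_columns, boolean_columns, true_value="Yes"):
    """Union the listed columns into a single set of (oid, Feature) pairs."""
    features = set()
    for oid, row in rows:
        # Plain columns are projected as they are (e.g. Colour).
        for column in plain_columns:
            value = row.get(column, "")
            if value:
                features.add((oid, value))
        # Boolean columns contribute their attribute name when true.
        for column in boolean_columns:
            if row.get(column, "") == true_value:
                features.add((oid, column))
    return features


if __name__ == "__main__":
    rows = [("car-1", {"Colour": "White", "Auto": "Yes", "Air Cond.": "Yes",
                       "AM/FM": "Yes", "CD": ""})]
    result = feature_mapping(rows, ["Colour"], ["Auto", "Air Cond.", "AM/FM", "CD"])
    print(sorted(result))
```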
Experiment
• Tables from 60 sites
• 10 “training” tables
• 50 test tables
• 357 mappings (from all 60 sites)
  – 172 direct mappings (same attribute and meaning)
  – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results
• 10 “training” tables
  – 100% of the 57 mappings (no false mappings)
  – 94.6% of the values in linked pages (5.4% false declarations)
• 50 test tables
  – 94.7% of the 300 mappings (no false mappings)
  – On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
• 16 missed mappings
  – 4 partial (not all unions included)
  – 6 non-U.S. car-ads (unrecognized makes and models)
  – 2 U.S. unrecognized makes and models
  – 3 prices (missing $ or found MSRP instead)
  – 1 mileage (mileages less than 1,000)
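The test-table figures can be checked with a few lines of Python. The sketch uses the standard definitions of recall and precision and only the counts stated on the slide (300 expected mappings, 16 missed, no false mappings); the function names are hypothetical.

```python
def recall(correct_found, total_expected):
    """Fraction of expected mappings that were found."""
    return correct_found / total_expected


def precision(correct_found, total_found):
    """Fraction of found mappings that are correct."""
    return correct_found / total_found


if __name__ == "__main__":
    total_expected = 300                      # mappings in the 50 test tables
    missed = 16
    false_mappings = 0                        # "no false mappings"
    correct_found = total_expected - missed   # 284
    total_found = correct_found + false_mappings
    print(f"recall    = {100 * recall(correct_found, total_expected):.1f}%")
    print(f"precision = {100 * precision(correct_found, total_found):.1f}%")
```

With 284 of 300 mappings found, recall comes to 94.7%, and with no false mappings precision is 100%, matching the reported figures. The 57 training and 300 test mappings also sum to the 357 mappings reported for all 60 sites.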
Conclusions
• Summary
  – Transformed schema-matching problem to extraction
  – Inferred semantic mappings
  – Discovered source-to-target mapping rules
• Evidence of Success
  – Tables (mappings): 95% (Recall); 100% (Precision)
  – Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
• Future Work
  – Discover and exploit structure in linked text
  – Broaden table understanding
  – Integrate with current extraction tools
www.deg.byu.edu