MTab: Matching Tabular Data to Knowledge Graph using Probability Models
Phuc Nguyen, Natthawut Kertkeidkachorn, Ryutaro Ichise, Hideaki Takeda
Semantic Web Challenge, 30 October 2019

Matching tables to DBpedia
Challenge website: http://www.cs.ox.ac.uk/isg/challenges/sem-tab

MTab: Assumptions
1. DBpedia is complete and correct
2. Tables are vertical relational tables
3. Tables are independent of each other
4. All cells in a column have the same
   - Entity type
   - Data type
5. The table header is in the first row of the table

MTab: Key Ideas
§ MTab combines a voting algorithm with probability models.
§ It tackles two major problems:
  - Entity lookup: a query may return no result.
  - Literal matching: cell values may not exactly match the values in the KG.

MTab framework

Step 1: Pre-Processing
1. Text decoding: use “fixes text for you” (ftfy) [1] to correct noisy textual data
2. Language prediction: use the pre-trained fastText model (Facebook) [2]
3. Data type prediction: use Duckling (Facebook) [3]
4. Entity type prediction: use spaCy [4]
5. Entity lookup:
   - Query: each cell, or neighbor cells in the same row
   - Target: 1) DBpedia Lookup, 2) DBpedia endpoint, 3) Wikipedia, 4) Wikidata
   - Parameters: language, limit of ranking results
[1] Speer, R.: ftfy. Zenodo (2019), version 5.5
[2] Joulin et al.: Bag of tricks for efficient text classification. In: EACL 2017, pp. 427–431. ACL (April 2017)
[3] Duckling: https://github.com/facebook/duckling
[4] Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017)
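To make the text decoding step concrete, here is a minimal sketch that repairs one common kind of noise: UTF-8 text that was wrongly decoded as cp1252. This is only a stand-in for illustration; ftfy handles far more cases, and the function name `repair_mojibake` is hypothetical, not part of MTab or ftfy.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252.

    Hypothetical helper for illustration only; ftfy covers many more cases.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not this kind of mojibake; leave it unchanged.
        return text


if __name__ == "__main__":
    for cell in ["MÃ¼nchen", "München", "Tokyo"]:
        print(repr(cell), "->", repr(repair_mojibake(cell)))
```

Cells that are already clean pass through unchanged, so the function can be applied to every cell before lookup.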

Step 2 (Cell): Estimate Entity Candidates
Estimate entity candidates based on the lookup results:
• Aggregate the scores from the lookup results
• Normalize the aggregated scores
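The aggregation and normalization above can be sketched as follows. The slides do not say how a single lookup result is scored, so the reciprocal-rank score used here is an assumption, as are the function name and the service names in the example.

```python
from collections import defaultdict


def aggregate_lookup_scores(results_by_service, limit=10):
    """Combine ranked lookup results into normalized entity scores.

    results_by_service maps a service name to a list of entity URIs,
    best match first. Reciprocal-rank scoring is an assumption.
    """
    scores = defaultdict(float)
    for ranked in results_by_service.values():
        for rank, entity in enumerate(ranked[:limit]):
            scores[entity] += 1.0 / (rank + 1)
    total = sum(scores.values())
    if total == 0:
        return {}  # no service returned anything for this cell
    # Normalize so the scores of all candidates sum to 1.
    return {entity: score / total for entity, score in scores.items()}


if __name__ == "__main__":
    results = {
        "dbpedia_lookup": ["dbr:Tokyo", "dbr:Tokyo_Tower"],
        "wikidata": ["dbr:Tokyo"],
    }
    print(aggregate_lookup_scores(results))
```

An entity returned by several services accumulates score from each of them, which is one simple way to let the services vote.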

Step 3 (Column): Estimate Type Candidates
Use the data types predicted by Duckling and spaCy to categorize columns into two types:
- Numerical columns
- Textual columns
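A minimal sketch of the column categorization, assuming a majority vote over per-cell data types. The real pipeline takes the cell types from Duckling and spaCy; here a plain `float()` parse stands in for them, and both function names are hypothetical.

```python
def is_numeric(cell: str) -> bool:
    """Stand-in for the Duckling/spaCy data type prediction (assumption)."""
    try:
        float(cell.replace(",", ""))  # tolerate thousands separators
        return True
    except ValueError:
        return False


def categorize_column(cells):
    """Label a column 'numerical' if most non-empty cells parse as numbers."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return "textual"
    numeric = sum(is_numeric(c) for c in non_empty)
    return "numerical" if numeric * 2 > len(non_empty) else "textual"


if __name__ == "__main__":
    print(categorize_column(["1,200", "35.5", "n/a"]))
    print(categorize_column(["Tokyo", "Paris", "3"]))
```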

Step 3 (Column): Estimate Type Candidates: Numerical Columns
• Use EmbNum [1] to find relevant relations (the result is a ranking of relevant relations).
• Infer the domain of each relation to find the corresponding types.
[1] Nguyen et al.: EmbNum: Semantic labeling for numerical values with deep metric learning. JIST 2018
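The step from a relation ranking to type candidates can be sketched as below. The ranking would come from EmbNum; here it is simply given as input. The reciprocal-rank weighting, the function name, and the `ex:` relations and classes are all illustrative assumptions.

```python
def types_from_relations(ranked_relations, relation_domain):
    """Turn a ranking of relations into a ranking of type candidates.

    ranked_relations: relations, most relevant first.
    relation_domain: maps a relation to the class in its domain.
    Reciprocal-rank weighting is an assumption.
    """
    scores = {}
    for rank, relation in enumerate(ranked_relations, start=1):
        domain = relation_domain.get(relation)
        if domain is None:
            continue  # relation without a declared domain
        scores[domain] = scores.get(domain, 0.0) + 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    ranking = ["ex:population", "ex:elevation", "ex:area"]
    domains = {
        "ex:population": "ex:City",
        "ex:elevation": "ex:Mountain",
        "ex:area": "ex:City",
    }
    print(types_from_relations(ranking, domains))
```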

Step 3 (Column): Estimate Type Candidates: Textual Columns
1. Type candidate signals from numerical columns
2. Aggregated signals from the entity lookup types of all column cells
3. Aggregated signals from the spaCy entity types of all column cells
4. The normalized Levenshtein distance between the table header and DBpedia classes
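The fourth signal can be illustrated with a small, self-contained sketch. It computes the Levenshtein distance with the standard dynamic program and converts it to a similarity as 1 minus the length-normalized distance; that conversion, the lowercasing, and the function names are assumptions rather than details from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / max length, case-insensitive (assumption)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_classes(header, class_labels):
    """Rank class labels by similarity to the column header."""
    scored = [(label, normalized_similarity(header, label))
              for label in class_labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank_classes("Citys", ["City", "Country", "Person"]))
```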

Step 4 (Relation, Column-Column): Estimate Relation Candidates
§ Estimate the relation scores of two cells in the same row
§ Aggregate these scores over all rows
Column pair types: Entity-Entity columns relation, Entity-Literal columns relation
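The two bullets above can be sketched for the entity-entity case with a toy set of triples. Each row votes for every predicate that links its two cells, and the votes are averaged over the rows. Exact matching of the object is a simplification made here; the function name and the example triples are illustrative.

```python
from collections import Counter


def estimate_relations(rows, triples):
    """Score candidate relations between two columns.

    rows: list of (subject_entity, object_value) pairs, one per table row.
    triples: iterable of (subject, predicate, object) from a toy KG.
    Returns predicate -> fraction of rows supporting it.
    """
    predicates_by_pair = {}
    for s, p, o in triples:
        predicates_by_pair.setdefault((s, o), set()).add(p)

    votes = Counter()
    for s, o in rows:
        # Row-level estimate: each matching predicate gets one vote.
        for p in predicates_by_pair.get((s, o), ()):
            votes[p] += 1

    if not rows:
        return {}
    # Table-level aggregate: average the row votes.
    return {p: count / len(rows) for p, count in votes.items()}


if __name__ == "__main__":
    kg = [
        ("dbr:Tokyo", "dbo:country", "dbr:Japan"),
        ("dbr:Paris", "dbo:country", "dbr:France"),
        ("dbr:Paris", "dbo:location", "dbr:France"),
    ]
    table_rows = [("dbr:Tokyo", "dbr:Japan"), ("dbr:Paris", "dbr:France")]
    print(estimate_relations(table_rows, kg))
```

For entity-literal column pairs, the exact comparison would have to be replaced by a similarity between the cell value and the literal, since cell values often do not match the KG exactly.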

Step 5: Re-estimate Entity Candidates
1. Entity candidates given lookup results (Step 2)
2. Entity candidates given the types of entities (Step 3)
3. Entity candidates given cell values and entity labels
   - Normalized Levenshtein distance
   - Heuristic abbreviation rules
4. Entity candidates given other cell values in the same row (Step 4)
The candidate with the highest estimated score is the output for the CEA task.

Steps 6, 7: Re-estimate Type and Relation Candidates
Re-estimate type and relation candidates with majority voting based on the CEA results.
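The majority voting over CEA results can be sketched for the type case as follows. Each annotated cell votes for the types of its entity, and the type with the most votes wins. The alphabetical tie-break, the function name, and the example data are assumptions; handling of the class hierarchy is left out.

```python
from collections import Counter


def vote_column_type(cell_entities, entity_types):
    """Pick a column type by majority vote over annotated cells.

    cell_entities: the CEA output for one column (None = no annotation).
    entity_types: maps an entity to its types.
    """
    votes = Counter()
    for entity in cell_entities:
        if entity is None:
            continue  # unannotated cells do not vote
        votes.update(set(entity_types.get(entity, ())))
    if not votes:
        return None
    # max() keeps the first maximum, so sorting gives an alphabetical tie-break.
    return max(sorted(votes), key=votes.__getitem__)


if __name__ == "__main__":
    cea = ["dbr:Tokyo", "dbr:Paris", None, "dbr:Japan"]
    types = {
        "dbr:Tokyo": ["dbo:City"],
        "dbr:Paris": ["dbo:City"],
        "dbr:Japan": ["dbo:Country"],
    }
    print(vote_column_type(cea, types))
```

The same voting scheme would apply to relations, with each row voting for the predicates that connect its two annotated cells.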

Results (Primary Score)
SemTab    CEA (F1)   CTA (AH)   CPA (F1)
Round 1   1.000      (F1)       0.987
Round 2   0.911      1.414      0.881
Round 3   0.970      1.956      0.844
Round 4   0.983      2.012      0.832

Summary
• Novelty
  • MTab is built on top of multiple lookup services.
  • MTab adopts many new signals (literal) from table elements.
• Limitations
  • Accuracy relies strongly on the lookup results.
  • MTab is computation-intensive.
  • MTab is built on specific assumptions.
• Future work
  • Improve efficiency: match only some parts of tables
  • Improve effectiveness: relax the MTab assumptions