Automatically Generating Government Linked Data from Tables
Varish Mulwad (@varish)
University of Maryland, Baltimore County
November 5, 2011
Dr. Tim Finin, Dr. Anupam Joshi
What?
State | FIPS | County | FIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oil seeds, dry beans, and dry peas (farms) | 5
Arizona | …. | Navajo | …. | …. | …. | ….
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | …. | 19
Annotations on the table:
• Class for a column header: http://dbpedia.org/class/AdministrativeRegion
• Relation between columns: dbpedia-owl:state
• Link for a table cell: Arizona → http://dbpedia.org/resource/Arizona
• Map literals as values of properties
Introduction Related Work Baseline Results Joint Inference Conclusion
Contribution
Input table:
State | FIPS | County | FIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oil seeds, dry beans, and dry peas (farms) | 5
Arizona | …. | Navajo | …. | …. | …. | ….
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | …. | 19
Generated output:
    @prefix dbpedia: <http://dbpedia.org/resource/> .
    @prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
    @prefix dbpprop: <http://dbpedia.org/property/> .
    @prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#> .
    "State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
    [ a dgtwc:DataEntry ;
      dbpedia-owl:state dbpedia:Alabama ;
      dbpedia:FIPS_county_code 000 ;
      dbpedia:Federal_Information_Processing_Standard_state_code 001 ;
      dbpedia-owl:ethnicGroup "Farm with women principal operators"@en ;
      dbpedia-owl:number 6444 ] .
All this in a completely automated way!!
Why?
Tables are everywhere!! … yet …
The web: 154 million high-quality relational tables [1]
Evidence-based medicine
The idea behind evidence-based medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables.
[Figure: number of clinical trials published in 2008 vs. number of meta-analyses published in 2008]
However, the rate at which meta-analyses are published remains very low, which hampers effective health care treatment.
Figure source: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010
> 400,000 raw and geospatial datasets; roughly < 1% in RDF
Current Systems
– Require users to have knowledge of the Semantic Web
– Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud
– Produce RDF data that in some cases is as useless as raw data
– Mostly focus on relational data where a schema is available
– For web tables, rely on 'semantically poor knowledge bases'
Dataset 1425
    <rdf:Description rdf:about="#entry1">
      <value>6444</value>
      <label>Number of Farms</label>
      <group>Farms with women principal operators</group>
      <county_fips>000</county_fips>
      <state_fips>01</state_fips>
      <state>Alabama</state>
      <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
    </rdf:Description>
How?
Building a table interpretation framework
• Preliminary work / baseline system
• Analysis and evaluation of the baseline
• "Domain Independent" framework grounded in graphical models and probabilistic reasoning
The System's Brain (Knowledge Base)
• Yago
• Wikitology¹: a hybrid knowledge base where structured data meets unstructured data
Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.
¹ Wikitology was created as part of Zareen Syed's Ph.D. dissertation.
The Baseline System
T2LD Framework
Predict class for columns → Link the table cells → Identify and discover relations
Predicting Class Labels for a Column
Column header (class to predict): State
Cells (instances): Alabama, Arizona, Arkansas, California
Candidate entities for "Alabama": 1. Alabama 2. Alabama_(band) 3. Alabama_(people)
Classes from the Alabama candidates: {dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, …}
Classes from the Arizona candidates: {dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, …}
Classes from the Arkansas and California candidates: {…}
Combined candidate classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film, ...
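The class-prediction step on this slide can be illustrated with a simple vote: every cell contributes the classes of its candidate entities, and classes are ranked by how many cells support them. This is only a minimal sketch. The type sets are abridged from the slide, and the plain vote count (the function name `rank_column_classes` is hypothetical) is an assumption, not necessarily the scoring that T2LD uses.

```python
from collections import Counter

def rank_column_classes(cell_types):
    """Rank candidate classes for a column by the number of cells supporting them.

    cell_types maps each cell string to the set of classes attached to
    any of its candidate entities. (Vote counting is an assumed scoring.)
    """
    votes = Counter()
    for types in cell_types.values():
        votes.update(set(types))  # one vote per cell per class
    # Sort by descending vote count, then alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Type sets abridged from the slide (Arkansas and California are elided there).
cell_types = {
    "Alabama": {"dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                "yago:StatesOfTheUnitedStates", "dbpedia-owl:Band",
                "yago:NativeAmericanTribes"},
    "Arizona": {"dbpedia-owl:Place", "yago:StatesOfTheUnitedStates",
                "dbpedia-owl:Film"},
}
ranked = rank_column_classes(cell_types)
```

With these two cells, dbpedia-owl:Place and yago:StatesOfTheUnitedStates tie at two votes each, so a plain vote cannot by itself prefer the more specific class over the more general one.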
Linking table cells to entities
Input for the cell "Macon": Macon + County + Alabama + 1 + 87 + Farms with Black or African American operators + ... + dbpedia-owl:AdministrativeRegion
Candidate entities: 1. Macon County, Alabama 2. Macon County, Illinois
Classifier 1 – SVM Rank (ranks the set of entities)
Classifier 2 – SVM (computes confidence)
Outcome: link to the top-ranked entity, or don't link
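The two-classifier decision on this slide can be sketched as a rank-then-verify step. In the sketch below the two trained models are replaced by plain callables, and the names `link_cell`, `toy_rank`, `toy_confidence` and the 0.5 threshold are all hypothetical stand-ins chosen for illustration, not the actual T2LD models or settings.

```python
from typing import Callable, Optional, Sequence

def link_cell(candidates: Sequence[str],
              rank_score: Callable[[str], float],
              confidence: Callable[[str], float],
              threshold: float = 0.5) -> Optional[str]:
    """Return the top-ranked entity, or None when it should not be linked."""
    if not candidates:
        return None
    # Stand-in for classifier 1 (SVM Rank): pick the best-scoring candidate.
    top = max(candidates, key=rank_score)
    # Stand-in for classifier 2 (SVM): accept or reject the top candidate.
    return top if confidence(top) >= threshold else None

# Toy scorers (hypothetical): count words shared with the rest of the row.
ROW_CONTEXT = {"Macon", "County", "Alabama"}

def toy_rank(entity: str) -> float:
    return float(len(ROW_CONTEXT & set(entity.replace(",", "").split())))

def toy_confidence(entity: str) -> float:
    return toy_rank(entity) / len(ROW_CONTEXT)

candidates = ["Macon County, Alabama", "Macon County, Illinois"]
linked = link_cell(candidates, toy_rank, toy_confidence)
```

Keeping ranking and acceptance separate means a cell with no good candidate stays unlinked instead of being forced onto the least bad entity.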
Identify Relations
State | Candidate relations | County
Alabama | Rel 'A' | Macon
Arizona | Rel 'A', 'C' | Navajo
Arkansas | Rel 'A', 'B', 'C' | Union
California | Rel 'A', 'B' | Humboldt
Relation selected between the State and County columns: Rel 'A'
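One way to read the relation table on this slide is as a vote across rows: each row proposes the relations that hold between its two linked entities, and the relation supported by the most rows is chosen for the column pair. The sketch below assumes exactly that majority vote; the function name `pick_column_relation` and the alphabetical tie-break are illustrative choices, not taken from the talk.

```python
from collections import Counter

def pick_column_relation(row_relations):
    """Pick the relation proposed by the largest number of rows.

    row_relations is a list with one set of candidate relations per row.
    Ties are broken alphabetically (an assumed, arbitrary rule).
    """
    votes = Counter(rel for rels in row_relations for rel in set(rels))
    if not votes:
        return None
    best = max(votes.values())
    return sorted(rel for rel, count in votes.items() if count == best)[0]

# Candidate relations per row, as listed on the slide.
rows = [{"A"}, {"A"}, {"A", "C"}, {"A", "B", "C"}, {"A", "B"}]
chosen = pick_column_relation(rows)
```

Here Rel 'A' is proposed by all five rows, while 'B' and 'C' are each proposed by two, so 'A' is selected.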
Generating a linked RDF representation
    @prefix dbpedia: <http://dbpedia.org/resource/> .
    @prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
    @prefix dbpprop: <http://dbpedia.org/property/> .
    @prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#> .
    "State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
    [ a dgtwc:DataEntry ;
      dbpedia-owl:state dbpedia:Alabama ;
      dbpedia:FIPS_county_code 000 ;
      dbpedia:Federal_Information_Processing_Standard_state_code 001 ;
      dbpedia-owl:ethnicGroup "Farm with women principal operators"@en ;
      dbpedia-owl:number 6444 ] .
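Once the columns have predicates and the cells have entities or literals, serializing a row is mostly string assembly. The sketch below shows one possible way to emit a blank-node entry of the kind shown on this slide. The helper `to_turtle_entry` is hypothetical, it expects predicates and objects already written as prefixed names or literals, and it does not emit the `@prefix` declarations.

```python
def to_turtle_entry(entry_type, pairs):
    """Serialize one table row as a Turtle blank node.

    entry_type: prefixed class name, e.g. "dgtwc:DataEntry".
    pairs: list of (predicate, object) strings that are already valid
           Turtle terms (prefixed names, numbers, or quoted literals).
    """
    lines = [f"[ a {entry_type}"]
    for predicate, obj in pairs:
        lines.append(f"  {predicate} {obj}")
    # Predicate-object pairs are separated by ';' and the statement ends with '.'.
    return " ;\n".join(lines) + " ] ."

entry = to_turtle_entry(
    "dgtwc:DataEntry",
    [
        ("dbpedia-owl:state", "dbpedia:Alabama"),
        ("dbpedia-owl:number", "6444"),
    ],
)
```

In practice an RDF library would handle escaping and prefix management; the string version is kept here only to make the mapping from row to triples visible.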
Evaluation of the baseline system
Dataset summary
Number of tables | 15
Total number of rows | 199
Total number of columns | 56 (52)*
Total number of entities | 639 (611)*
* The number in brackets excludes columns that contained numbers.
Evaluation #1 (MAP)
• Compared the system's ranked list of labels against a human-ranked list of labels
• Metric: Average Precision (a.p.) [Mean Average Precision gives a mean over a set of queries]
• Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation #1 (MAP)
[Figure: average precision (y-axis) for each column # (x-axis)]
Example:
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
MAP = 0.411
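For readers unfamiliar with the metric, the sketch below computes average precision for one column and the mean over several columns. It assumes the common binary-relevance definition, in which a system label counts as relevant when it appears anywhere in the evaluator's list, and it normalizes by the size of the evaluator's list. The talk does not spell out the exact variant used, so this may differ from how the reported 0.411 was computed.

```python
def average_precision(system_ranked, relevant):
    """Average precision of a ranked list against a set of relevant labels.

    Assumes binary relevance and normalizes by the number of relevant labels.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(s, r) for s, r in runs) / len(runs)

# The example lists from the slide.
system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system, evaluator)  # (1/2 + 2/3) / 3 = 7/18
```

In this example the system's top label, Person, is not in the evaluator's list, which is why the score is low even though two of the three labels match.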
Accuracy for Entity Linking
[Figure: % of correct and incorrect instances linked, by category]
Category | Correct | Incorrect
Person | 83.05% | 16.95%
Place | 80.43% | 19.57%
Organization | 61.90% | 38.10%
Other | 29.22% | 70.78%
Overall accuracy: 66.12%
Lessons Learnt
T2LD Framework: Predict class for columns → Link the table cells → Identify and discover relations
• Sequential system: errors percolate from one phase to the next
• The current system favors general classes over specific ones (MAP score = 0.411)
• Largely, a system driven by "heuristics"
• Although we consider evidence, we don't do the assignment jointly
A "Domain Independent" Framework
Domain knowledge: Linked Data Cloud / Medical Domain / Open Govt. Domain
[Diagram: Query KB → KB (m, n, o, …; x, y, z, …) → Probabilistic Graphical Model / Joint Inference Model (a, b, c, …) → Linked Data]
Joint Inference over evidence in a table
✓ Probabilistic Graphical Models
Parameterized graphical model
[Diagram legend]
• Variable node: column header (C1, C2, C3)
• Row value (R11 … R33)
• Factor node: function that captures the affinity between the column headers and row values
• Factor node: captures interaction between row values
• Factor node: captures interaction between column headers
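To make the idea of joint assignment concrete, the sketch below scores every combination of column classes with a product of factor values and keeps the best one. Everything in it is an illustrative assumption: the candidate classes, the numeric potentials, and the simplification that all of a column's row values are folded into a single per-column affinity. Exhaustive enumeration also stands in for the inference procedure, which the talk does not specify at this point.

```python
from itertools import product

# Candidate classes per column header (hypothetical toy domain).
column_classes = {
    "C1": ["State", "Band"],
    "C2": ["County", "City"],
}
# Factor: affinity between a column's class and the evidence from its cells.
cell_affinity = {
    ("C1", "State"): 0.8, ("C1", "Band"): 0.3,
    ("C2", "County"): 0.6, ("C2", "City"): 0.5,
}
# Factor: interaction between the classes of the two column headers.
header_affinity = {
    ("State", "County"): 0.9, ("State", "City"): 0.6,
    ("Band", "County"): 0.1, ("Band", "City"): 0.2,
}

def score(assignment):
    """Unnormalized product of all factor values for one joint assignment."""
    s = 1.0
    for col, cls in assignment.items():
        s *= cell_affinity[(col, cls)]
    s *= header_affinity[(assignment["C1"], assignment["C2"])]
    return s

def map_assignment():
    """Return the highest-scoring joint assignment by brute-force enumeration."""
    cols = list(column_classes)
    combos = product(*(column_classes[c] for c in cols))
    best = max(combos, key=lambda combo: score(dict(zip(cols, combo))))
    return dict(zip(cols, best))

best = map_assignment()
```

Because the header factor rewards compatible pairs, a strong signal in one column can pull a weaker column toward a consistent class, which a purely sequential pipeline cannot do.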
Challenges
Challenges - Literals
Population | Age
690,000 | 75
345,000 | 65
510,020 | 50
120,000 | 25
Population / Profit? Age / Percentage?
Use evidence from the rest of the table to decide
Challenges - Metadata
More Challenges!
• Sampling and interpretation: Dataset 1425 has > 400,000 rows!
• Human in the loop
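One possible way to approach the sampling challenge is to interpret a fixed-size random sample of rows instead of the whole table. The sketch below uses reservoir sampling, which draws a uniform sample in one pass without knowing the row count in advance. The technique is suggested here only as an illustration and is not something the talk commits to; the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw a uniform random sample of up to k rows from an iterable in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample

# Stand-in for a large table: row indices instead of real rows.
sampled = reservoir_sample(range(400_000), k=100, seed=0)
```

Whether a uniform sample is enough depends on the table: rare values in a column may need stratified or targeted sampling instead.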
Conclusion
• Presented a framework for inferring the semantics of tables and generating Linked Data
• Evaluation of the baseline system shows that tackling the problem is feasible
• Work is in progress on a framework grounded in graphical models and probabilistic reasoning
• Working on the challenges posed by tables from domains such as medicine and open government data
References
1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1): 538–549.
2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3): 123–131.
3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.
4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.
5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).
Thank You! Questions?
varish1@cs.umbc.edu
@varish
http://ebiq.org/h/Varish/Mulwad
finin@cs.umbc.edu
joshi@cs.umbc.edu