Text Based Similarity Metrics and Delta for Semantic Web Graphs
Krishnamurthy Viswanathan and Tim Finin, University of Maryland, Baltimore County

Motivation
Text similarity is very useful in information retrieval for near-duplicate and similarity detection.

Problem
• Given a collection of SW graphs as RDF documents, identify pairs of graphs that are similar
• Generate a delta for pairs of graphs identified as having a versioning relationship

Contributions
• Defined text-based similarity metrics characterizing relations between SW graphs
• Evaluated these metrics for three specific cases of similarity
Ø Case 1: Same classes and properties used but differ only in literal content
Ø Case 2: Differ only in base-URI
Ø Case 3: Different versions of the same SW graph
Ø In addition, when Case 3 is detected, generate a delta between the two versions

Approach
Input: corpus of SWDs
• Convert to n-triples format
• Convert to canonical form
• Create reduced forms
• Compute text-based similarity metrics
• Identify pairs of similar documents (classification: similarity metrics computed for each candidate pair)
• Identify ontology versions
• Generate delta between versions

SW Graph Canonicalization
Original graph:
<person:John> <a:livesIn> _:x.
_:x <a:IsPartOf> "USA".
<person:John> <a:likes> "cheese".
_:x <a:hasCapital> _:y.

Blank nodes replaced by "~", statements sorted, original identifiers kept in a trailing comment:
"~" <a:hasCapital> "~". # _:x _:y
"~" <a:IsPartOf> "USA". # _:x
<person:John> <a:likes> "cheese".
<person:John> <a:livesIn> "~". # _:x

BNode Table
Old bnode identifier    New bnode identifier
_:y                     _:g1
_:x                     _:g2

Canonical form:
_:g2 <a:hasCapital> _:g1.
_:g2 <a:IsPartOf> "USA".
<person:John> <a:likes> "cheese".
<person:John> <a:livesIn> _:g2.
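The canonicalization example above (mask blank nodes, sort the statements, then relabel the blank nodes) can be sketched roughly as follows. This is only an illustration of the idea, not the authors' implementation: the function name `canonicalize`, the regular expression for blank-node labels, the plain string sort, and the first-seen numbering rule are all assumptions. The poster's own example numbers _:y before _:x, so its actual ordering rule evidently differs from the one used here.

```python
import re

# Assumption: blank-node labels look like _:name and never occur inside literals.
BNODE = re.compile(r"_:[A-Za-z0-9]+")

def canonicalize(statements):
    """Return the statements in a label-independent order with blank nodes
    renamed _:g1, _:g2, ... (hypothetical helper, first-seen numbering)."""
    stmts = [s.strip() for s in statements if s.strip()]
    # Sort on the text with every blank node masked as "~", so the order
    # does not depend on the original blank-node labels.
    stmts.sort(key=lambda s: BNODE.sub('"~"', s))
    # BNode table: old label -> new label, numbered in order of first appearance
    # in the sorted statements.
    table = {}
    for s in stmts:
        for old in BNODE.findall(s):
            if old not in table:
                table[old] = "_:g%d" % (len(table) + 1)
    # Substitute the new labels back into the sorted statements.
    return [BNODE.sub(lambda m: table[m.group(0)], s) for s in stmts]

graph_a = [
    '<person:John> <a:livesIn> _:x .',
    '_:x <a:IsPartOf> "USA" .',
    '<person:John> <a:likes> "cheese" .',
    '_:x <a:hasCapital> _:y .',
]
# Same graph, different statement order and blank-node labels.
graph_b = [
    '_:b1 <a:hasCapital> _:b2 .',
    '<person:John> <a:likes> "cheese" .',
    '_:b1 <a:IsPartOf> "USA" .',
    '<person:John> <a:livesIn> _:b1 .',
]
same = canonicalize(graph_a) == canonicalize(graph_b)
```

Statements whose masked text is identical keep their input order in this sketch, so graphs with such ties may still canonicalize differently. That is in line with the poster's description of the method as empirical.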
Canonicalization properties
• Assigns uniform identifiers to blank nodes
• Provides a deterministic order to statements
• Empirical method that works for most examples

Four reduced forms
• Only literals from the original n-triple file
• All non-literal content from the original n-triple file
• Base-URI of every node replaced by “”
• Literals and base-URIs replaced by “”

Classification
• Case 1, similarity in classes and properties: Naïve Bayes classifier
• Case 2, difference only in base-URI: Naïve Bayes/SVM classifier
• Case 3, versioning relationship: SVM classifier

Generating Deltas
• Subtractive delta: Version 1 except Version 2
• Additive delta: Version 2 except Version 1
[Diagram labels: Delta, Version 2]

Evaluation
• Three datasets of 400+ semantic web documents for training and testing
• 17 combinations of similarity metrics tested: Jaccard, Containment, Cosine similarity, Hamming distance between Simhash fingerprints

Type of Similarity                    True Positives   False Positives   Precision   Recall
Similarity in classes & properties    0.986            0.014             0.987       0.986
Difference only in base URI           0.988            0.012             0.988 (single value in source)
Versioning Relationship               0.909            0.091             0.913       0.909
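The four metrics named in the evaluation, and the subtractive and additive deltas, can be sketched along the following lines. This is a minimal sketch under stated assumptions, not the authors' code: the word 3-gram shingling, the MD5-based 64-bit Simhash, and every function name here are hypothetical choices made for illustration. The poster does not say how the text was tokenized or which hash was used.

```python
import hashlib
import math
from collections import Counter

def shingles(text, n=3):
    """Set of word n-grams (n=3 is an arbitrary, assumed choice)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two shingle sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def containment(a, b):
    """Fraction of A's shingles that also occur in B."""
    return len(a & b) / len(a) if a else 1.0

def cosine(text_a, text_b):
    """Cosine similarity of the two term-frequency vectors."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def simhash(tokens):
    """64-bit Simhash fingerprint; MD5 is an assumed stand-in hash."""
    weights = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if weights[i] > 0)

def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")

def delta(version1, version2):
    """(subtractive, additive): statements only in version 1, only in version 2.
    Assumes both versions are already in canonical form."""
    s1, s2 = set(version1), set(version2)
    return sorted(s1 - s2), sorted(s2 - s1)
```

Each metric would be computed on a candidate pair of documents (or their reduced forms) and the resulting values passed to a classifier as features. The delta is only meaningful after canonicalization, since otherwise differing blank-node labels would make identical statements look changed.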