Ontology Alignment
Problem Statement
- Given N ontologies (O1, …, On)
  - In a particular domain
  - With different levels of coverage
- Goal
  - Evaluate the commonality of entities
  - Rank entities
Challenges & Solutions
- Ontology alignment
  - Largest Common Subgraph (LCS)
  - Vector Space Model (TF/IDF)
- Accuracy of entities in aligned concepts
  - Ranking entities
LCS Algorithm for Multiple Ontologies
- Find the LCS for two ontologies
- Align the LCS with the other ontologies
Largest Common Subgraph (LCS) Algorithm between Two Ontologies
S1: Semantic Similarity
- Node Similarity (NS)
  - Background knowledge (i.e., WordNet/Wikipedia)
- Structural Similarity (SS)
  - Neighbor similarity
  - Properties similarity
- Instance-based Similarity (IS)
S2: Total Similarity = NS + SS + IS
Data Structure for LCS Algorithm
[Figure: two ontology graphs (C1-C7 and C'1-C'6) with corresponding entities]
Similarity measure for corresponding entities: Node Similarity + Structural Similarity
C1: (C1, C'1, .95), (C1, C'6, .77), (C1, C'3, .71), (C1, C'4, .65), (C1, C'5, .54), (C1, C'2, .34)
C2: (C2, C'3, .85), (C2, C'2, .67), (C2, C'1, .51), (C2, C'4, .45), (C2, C'5, .24), (C2, C'6, .14)
C3: (C3, C'4, .90), (C3, C'1, .67), (C3, C'3, .51), (C3, C'2, .45), (C3, C'5, .34), (C3, C'6, .24)
C4: (C4, C'2, .95), (C4, C'1, .65), (C4, C'3, .51), (C4, C'4, .45), (C4, C'5, .23), (C4, C'6, .14)
C5: (C5, C'4, .80), (C5, C'1, .67), (C5, C'3, .65), (C5, C'2, .35), (C5, C'5, .34), (C5, C'6, .24)
C6: (C6, C'1, .20), (C6, C'4, .15), (C6, C'3, .12), (C6, C'2, .12), (C6, C'5, .09), (C6, C'6, .08)
C7: (C7, C'4, .31), (C7, C'1, .25), (C7, C'3, .23), (C7, C'2, .15), (C7, C'5, .14), (C7, C'6, .12)
Node Similarity: Instance-based
Representing types using N-grams*
- Node Similarity (name match)
  - Find common N-grams (N = 2) for corresponding columns
[Table: sample instances from column A.Str.Name (e.g., LOCUSTGROVE DR, RANGE DR, LOUISE LN, CR 45/MANE T CT) and column B.Street (e.g., LOCUST GROVE, TRAIL, LOUISE), with additional CB columns FENAME, Status, Laddress, Raddress]
N-gram types from A.Str.Name = {LO, OC, CU, ST, …}
N-gram types from B.Street = {TR, RA, R4, 5/, …}
*Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani Thuraisingham, and Shashi Shekhar, "Content Based Ontology Matching for GIS Datasets," ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), pages 407-410, Irvine, California, USA, November 2008.
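A minimal sketch of the 2-gram step above. The Jaccard overlap used for scoring here is an assumption standing in for the paper's name-match measure, which may weight n-gram types differently:

```python
def ngrams(value, n=2):
    """Character n-gram types of a string (spaces removed)."""
    v = value.replace(" ", "")
    return {v[i:i + n] for i in range(len(v) - n + 1)}

def ngram_overlap(col_a, col_b, n=2):
    """Jaccard overlap of the n-gram types drawn from two instance columns.
    Hypothetical scoring, not the paper's exact formula."""
    grams_a = set().union(*(ngrams(v, n) for v in col_a))
    grams_b = set().union(*(ngrams(v, n) for v in col_b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Instances drawn from the slide's example columns
score = ngram_overlap(["LOCUSTGROVE DR", "RANGE DR", "LOUISE LN"],
                      ["LOCUST GROVE", "TRAIL", "LOUISE"])
```

Columns whose instances share many character 2-grams (e.g., LOCUSTGROVE vs. LOCUST GROVE) score high even when the strings are not identical.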
Node Similarity: Instance-based
Visualizing entropy and conditional entropy
H(C) = −Σ p_i log p_i, for all x ∈ C1 ∪ C2
H(C|T) = H(C, T) − H(T), for all x ∈ C1 ∪ C2 and t ∈ T
Node Similarity: Faults of This Method
- Semantically similar columns are not guaranteed to have a high similarity score
A ∈ O1 (cty.Name, country): Dallas/USA, Houston/USA, Kingston/Jamaica, Halifax/Canada
B ∈ O2 (City, Country): Shanghai/China, Beijing/China, Tokyo/Japan, Mexico City/Mexico, New Delhi/India, Kuala Lumpur/Malaysia
2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}
Node Similarity: Instance-based
K-medoid + NGD instance similarity
Step 1: Extract distinct keywords from the compared columns
C1 ∈ O1 (road.Name, City): Johnson Rd./Plano, School Dr./Richardson, 15th St./Collin, Parker Rd./Collin
C2 ∈ O2 (Road, County): Custer Pwy/Collin, Zeppelin St./Lakehurst
Keywords extracted from columns = {Johnson, Rd., School, 15th, …}
Step 2: Group the distinct keywords into semantic clusters, e.g., {"Johnson", "School", "Dr.", …}, {"Rd.", "Dr.", "St.", "Pwy", …}
Step 3: Calculate Similarity = H(C|T) / H(C) over C1 ∪ C2
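Step 3 can be sketched as below, computing H(C|T) as H(C, T) − H(T) over paired (column, cluster) observations. The toy keyword/cluster assignments are invented for illustration; a ratio near 1 means the semantic clusters mix keywords from both columns, i.e., the columns are similar:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(cols, clusters):
    """H(C|T) = H(C, T) - H(T) over paired observations."""
    return entropy(list(zip(cols, clusters))) - entropy(clusters)

def column_similarity(cols, clusters):
    """Slide's measure: Similarity = H(C|T) / H(C)."""
    return conditional_entropy(cols, clusters) / entropy(cols)

# Each keyword is tagged with its source column and its semantic cluster
cols     = [1, 1, 1, 2, 2, 2]
clusters = ["road", "road", "road", "road", "road", "place"]
s = column_similarity(cols, clusters)
```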
Node Similarity: Instance-based
Problems with K-medoid + NGD*
Two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) may have a very low computed NGD value and thus be mistaken for being similar: similarity = .797
[Table: road.Name/City vs. Road/County instances, e.g., Johnson Rd./Plano, Custer Pwy/Cooke, School Dr./Richardson, 15th St./Collin, Zeppelin St./Lakehurst, Parker Rd./Collin, Alma Dr./Richardson, Alma Dr./Collin, Preston Rd./Addison, Campbell Rd./Denton, Dallas Pkwy/Dallas, Harry Hines Blvd./Dallas]
*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, "Semantic Schema Matching Without Shared Instances," to appear in Third IEEE International Conference on Semantic Computing, Berkeley, CA, USA, September 14-16, 2009.
Node Similarity: Instance-based
Using geographic type information*
We use a gazetteer to determine the geographic type of an instance.
[Figure: instances from O1 and O2 mapped to shared geotypes]
*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, "Geographically-Typed Semantic Schema Matching," to appear in ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009.
Node Similarity: Instance-based Results of Geographic Matching Over 2 Separate Road Network Data Sources
Structural Similarity
- Structural Similarity Measurement
I. Neighbor Similarity
[Figure: neighboring concepts of corresponding nodes in the two ontologies]
Structural Similarity Measurement
II. Properties Similarity
[Figure: two ontologies with labeled edges (isA, subClass, hasFlavor, hasColor, hasFood, hasTopping, hasDrink)]
RTC1 = [3 isA, 2 subClass, 1 hasFlavor, 1 hasColor, 0 hasFood, 1 hasTopping]
RTC2 = [1 isA, 1 subClass, 2 hasFlavor, 0 hasColor, 1 hasFood]
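The relation-type (RT) vectors above can be compared directly, for example with cosine similarity. Using cosine here is an assumption; the slides do not state which vector comparison the method applies:

```python
from math import sqrt

# Relation types ordered as in the slide's RT vectors
RELATION_TYPES = ["isA", "subClass", "hasFlavor", "hasColor", "hasFood", "hasTopping"]

def cosine(u, v):
    """Cosine similarity of two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

rtc1 = [3, 2, 1, 1, 0, 1]  # counts for C1, from the slide
rtc2 = [1, 1, 2, 0, 1, 0]  # counts for C2, padded with 0 for hasTopping
sim = cosine(rtc1, rtc2)
```

Concepts surrounded by similar distributions of edge labels get a high structural score even when their names differ.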
Similarity Results of Pairwise Ontology Matching (I3CON Benchmark)
[Charts: F-measure (0-1) of LCS Similarity vs. Karlsruhe Similarity vs. ATL on the Animals, Sports, Hotels, Computer Networks, Pets (no instance), Pets, and Russia datasets; left: matching using Name Similarity + RTS; right: matching using Name Similarity + (RTS and Neighbor)]
Ontology Matching Vector Space Model (VSM)
- Define the VSM for each entity
  - Collection of words in the label, edge types, comment, and neighbors
[Figure: two ontologies with labeled edges (isA, subClass, hasFlavor, hasColor, hasFood, hasTopping, hasDrink)]
VSM(C1) = [1 C1, 1 C2, 1 C3, 1 C5, 1 C6, 1 isA, 2 subClass, 1 hasFlavor]
VSM(C'1) = [1 C'3, 1 C'4, 1 C'5, 1 isA, 2 hasFlavor]
Ontology Matching Vector Space Model (VSM)
- Update the VSM with word scores using TF/IDF
- Calculate the cosine similarity for corresponding entities: Cos(VSM(C1), VSM(C2))
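A minimal sketch of the TF/IDF weighting and cosine step, treating each entity's word collection as a document. The smoothed IDF (log(n/df) + 1) is an assumption, chosen so that words appearing in every entity still carry nonzero weight:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """docs: one token list per entity (label, edge types, comment, neighbors).
    Returns a TF-IDF weighted {word: score} dict per entity."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return [{w: tf * (log(n / df[w]) + 1.0) for w, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse word-score vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy entity word collections in the spirit of the VSM slide
docs = [["C1", "C2", "isA", "subClass", "hasFlavor"],
        ["C3", "C4", "isA", "hasFlavor", "hasFlavor"],
        ["C5", "hasColor"]]
vecs = tfidf_vectors(docs)
sim01 = cosine(vecs[0], vecs[1])  # share isA, hasFlavor
sim02 = cosine(vecs[0], vecs[2])  # share nothing
```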
Aligned Concepts
- Aggregate different ontologies
- Example
Aligned Concepts
- Statistical model
Aligned Concepts
- Calculate the probability of appearance of each entity in GO
- Use maximum likelihood estimation
- Calculate … and …
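The maximum likelihood estimate for an entity's probability of appearance is its appearance count divided by the total. The entity names below are invented for illustration:

```python
from collections import Counter

def entity_probabilities(appearances):
    """MLE estimate: p(e) = count(e) / total appearances across the N ontologies."""
    counts = Counter(appearances)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

# Hypothetical appearances of aligned entities across ontologies
appearances = ["Hotel", "Hotel", "Hotel", "Room", "Room", "Pool"]
probs = entity_probabilities(appearances)
# Entities can then be ranked by estimated probability
ranked = sorted(probs, key=probs.get, reverse=True)
```

Entities that appear in more of the aligned ontologies get higher probability and rank first.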
Reification
Reification can be considered metadata about RDF/OWL statements. Ontology alignment approaches rely on probabilistic measures to find matches between concepts in different ontologies. In OWL 2, reification data can be attached to the alignment information to record the 'match factor' between two concepts. Advanced analytic algorithms can benefit from reification when establishing the relevance of search results.
OWL 2
OWL 2 is an extension to OWL. Some of the new features in OWL 2 are as follows:
- Syntactic sugar (e.g., disjoint union of classes)
- Property chains
- Richer datatypes and data ranges
- Qualified cardinality restrictions
- New constructs that increase expressivity
- Simple metamodeling capabilities
- Extended annotation capabilities
The following link lists all the new features in OWL 2:
http://www.w3.org/TR/2009/REC-owl2-new-features-20091027/
Ontology Extraction from Text Documents
Problem Statement
- Our solution for ontology construction from documents
  - Use a hierarchical clustering algorithm to build a hierarchy of documents
    - Hierarchical Agglomerative Clustering (HAC)
    - Modified Self-Organizing Tree (MSOT)
    - Hierarchical Growing Self-Organizing Tree (HGSOT)
  - Assign a concept to each node in the hierarchy
    - Using WordNet
Concept Assignment
- Concept assignment to documents
  - LVQ1: the topic vector (t) is built by training with the training documents.
  - Clusters in LVQ are predefined. Each topic cluster is represented by a node in the output map, and LVQ uses prelabeled data for training.
  - Only the best-match node's vector (the winning vector) is updated, rather than its neighbors. The vector updating rule uses the following equations:
    m_c(t+1) = m_c(t) + α(t)[x(t) − m_c(t)], if data x and best-match node c belong to the same class;
    m_c(t+1) = m_c(t) − α(t)[x(t) − m_c(t)], if data x and best-match node c belong to different classes.
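The LVQ1 update rule described above can be sketched as follows; m_c is the winning prototype and alpha the learning rate α(t):

```python
def lvq1_update(winner, x, alpha, same_class):
    """LVQ1 rule: move the winning prototype toward x when the classes
    match, and away from x otherwise. Only the best-match node is updated."""
    sign = 1.0 if same_class else -1.0
    return [w + sign * alpha * (xi - w) for w, xi in zip(winner, x)]

w = [0.0, 0.0]
w_same = lvq1_update(w, [1.0, 1.0], 0.1, True)    # moves toward x
w_diff = lvq1_update(w, [1.0, 1.0], 0.1, False)   # moves away from x
```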
Concept Assignment
- Concept sense disambiguation
  - One keyword may be associated with more than one concept in WordNet.
  - For example, the keyword "gold" has four senses in WordNet and the keyword "copper" has five.
  - For disambiguation of concepts, we apply the same technique (i.e., the cosine similarity measure) used in topic tracking.
  - To construct a vector for each sense, we use the short description that appears in WordNet.
Concept Assignment
- Concept assignment for leaf nodes
  - If a majority of the documents have the same concept, we assign that concept to the leaf.
  - If there is no majority, we assign a generic concept of all the concepts from WordNet to the leaf.
- Concept assignment for non-leaf nodes
  - If a majority of the children have the same concept, we assign that concept to the internal node.
  - If there is no majority, we assign a generic concept of all the concepts from WordNet to the internal node.
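The majority-vote rule above can be sketched as follows. The fallback string "entity" is a placeholder for the generic WordNet concept covering all the children's concepts, which the slides derive from WordNet itself:

```python
from collections import Counter

def assign_concept(child_concepts, generic_concept="entity"):
    """Assign a node's concept by majority vote over its children's
    (or documents') concepts; fall back to a generic covering concept
    when no strict majority exists."""
    concept, n = Counter(child_concepts).most_common(1)[0]
    return concept if n > len(child_concepts) / 2 else generic_concept

leaf = assign_concept(["gold", "gold", "copper"])  # majority concept wins
node = assign_concept(["gold", "copper"])          # no majority: generic fallback
```

The same function serves both leaf nodes (voting over documents) and internal nodes (voting over children).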