WP 2: Learning Web-service Domain Ontologies
Miha Grčar, Jožef Stefan Institute
http://www.tao-project.eu
Funded by: European Commission – 6th Framework
Project Reference: IST-2004-026460
Outline of the Presentation
• The goal of WP 2
• Introduction to application mining
• Creating a document network
• Transforming a document network into feature vectors
• LATINO: Link-analysis and text-mining toolbox
• OntoGen: a system for semi-automatic data-driven ontology construction
• WP 2 and the Dassault case study
• Conclusions and future work
9/26/2020
Learning Web-service Ontologies
• The goal is to facilitate the acquisition of domain ontologies from legacy applications by:
  1. Identifying data sources that contain knowledge to be transitioned into an ontology
  2. Employing data mining techniques to aid the domain expert in building the ontology
Application Mining
[Diagram: case-specific “adapters” map each kind of application to an intermediate data representation, from which the ontology is built. The OL part works for all cases.]
• Case 1: Regular Web service
• Case 2: C++/Java source code
• Case 3: Database
• Case 4: …
Application Mining
• Intermediate data representation
  - Structured data = networks (link analysis)
  - Unstructured data = textual documents (text mining)
• Document network = a set of interlinked documents; each link has a type and a weight
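A minimal sketch of the document-network definition above, assuming a simple in-memory representation. The class name `DocumentNetwork`, its methods, and the link-type labels (`extends`, `comment_reference`) are hypothetical and are not taken from LATINO.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentNetwork:
    """Documents (id -> text) plus typed, weighted, directed links."""
    documents: dict = field(default_factory=dict)
    # links[(source, target, link_type)] -> accumulated weight
    links: dict = field(default_factory=lambda: defaultdict(float))

    def add_document(self, doc_id, text=""):
        # Keep any text already stored if the document is re-added empty.
        self.documents[doc_id] = text or self.documents.get(doc_id, "")

    def add_link(self, source, target, link_type, weight=1.0):
        # Make sure both endpoints exist, then accumulate the weight.
        for doc_id in (source, target):
            self.documents.setdefault(doc_id, "")
        self.links[(source, target, link_type)] += weight

    def neighbors(self, doc_id, link_type=None):
        """Return {target: weight} for the outgoing links of doc_id."""
        result = defaultdict(float)
        for (src, dst, kind), weight in self.links.items():
            if src == doc_id and (link_type is None or kind == link_type):
                result[dst] += weight
        return dict(result)


net = DocumentNetwork()
net.add_document("DocumentFormat", "The format of Documents.")
net.add_link("DocumentFormat", "AbstractLanguageResource", "extends")
# Two references to the same class accumulate into one link of weight 2.
net.add_link("DocumentFormat", "MimeType", "comment_reference")
net.add_link("DocumentFormat", "MimeType", "comment_reference")
```

Keying links by (source, target, type) lets repeated references add up to a weight, while links of different types between the same two documents stay separate.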
GATE Case Study
• Software library for natural language processing (NLP)
• ~600 Java classes
  - Language resources = data
  - Processing resources = algorithms
  - Graphical user interfaces = GUI
• Developed at University of Sheffield
• Freely available at http://gate.ac.uk/download/
Data Sources
• Structured
  - Code samples
  - Web service usage logs
  - Source code
  - Reference manual (function declarations)
  - WSDL …
• Unstructured
  - Web pages
  - User’s manual
  - Tutorials, lectures, forums, newsgroups, etc.
  - Reference manual (textual descriptions)
  - Source code comments …
A Typical Java Class

    /** The format of Documents. Subclasses of DocumentFormat know about
     * particular MIME types and how to unpack the information in any
     * markup or formatting they contain into GATE annotations. Each MIME
     * type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
     * RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
     * with a static index residing here when they are constructed. Static
     * getDocumentFormat methods can then be used to get the appropriate
     * format class for a particular document.
     */
    public abstract class DocumentFormat
        extends AbstractLanguageResource implements LanguageResource{

      /** The MIME type of this format. */
      private MimeType mimeType = null;

      /**
       * Find a DocumentFormat implementation that deals with a particular
       * MIME type, given that type.
       * @param aGateDocument this document will receive as a feature
       * the associated Mime Type. The name of the feature is
       * MimeType and its value is in the format type/subtype
       * @param mimeType the mime type that is given as input
       */
      static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                     MimeType mimeType){
      } // getDocumentFormat(aGateDocument, MimeType)
    } // class DocumentFormat

Callouts on the slide: class comment, comment references, class name, super-class (base class), implemented interface, a field (field comment, field type, field name), a method (method comment, return type, method name).
Creating a Document Network
[Figure: the file DocumentFormat.class and the corresponding document node DocumentFormat.]
Creating a Document Network
[Figure: the document node DocumentFormat, created from DocumentFormat.class, linked to the nodes LanguageResource, AbstractLanguageResource, MimeType, Document, XmlDocumentFormat, RtfDocumentFormat and MpegDocumentFormat; one link is labeled 2.]
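As a rough illustration of how typed links could be read off a class like the one shown earlier, the sketch below uses regular expressions on a shortened version of the DocumentFormat source. This is an assumption for demonstration only: the slides do not say how LATINO parses source code, a real tool would more likely use a proper Java parser, and the function name `extract_links` and the link-type labels are hypothetical.

```python
import re
from collections import Counter

# Shortened excerpt of the class from the "A Typical Java Class" slide.
JAVA_SOURCE = '''
/** The format of Documents. Each MIME type has its own subclass of
 * DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat. */
public abstract class DocumentFormat
    extends AbstractLanguageResource implements LanguageResource {
  /** The MIME type of this format. */
  private MimeType mimeType = null;
}
'''

KNOWN = {"DocumentFormat", "AbstractLanguageResource", "LanguageResource",
         "MimeType", "XmlDocumentFormat", "RtfDocumentFormat"}


def extract_links(source, known_classes):
    """Return a Counter {(class_name, target, link_type): weight}."""
    links = Counter()
    comments = re.findall(r"/\*\*(.*?)\*/", source, flags=re.S)
    # Remove comments so that words inside them are not mistaken for code.
    code = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)

    header = re.search(
        r"\bclass\s+(\w+)"
        r"(?:\s+extends\s+(\w+))?"
        r"(?:\s+implements\s+([\w\s,]+?))?\s*\{",
        code)
    if header is None:
        return links
    name, base, interfaces = header.groups()

    if base:
        links[(name, base, "extends")] += 1
    for iface in re.findall(r"\w+", interfaces or ""):
        links[(name, iface, "implements")] += 1

    # Field declarations such as "private MimeType mimeType = null;".
    field_pattern = r"\b(?:private|protected|public)\s+(\w+)\s+\w+\s*[=;]"
    for field_type in re.findall(field_pattern, code):
        if field_type in known_classes:
            links[(name, field_type, "field_type")] += 1

    # Class names mentioned in Javadoc comments; repeats raise the weight.
    for comment in comments:
        for token in re.findall(r"\b[A-Z]\w*", comment):
            if token in known_classes and token != name:
                links[(name, token, "comment_reference")] += 1
    return links


links = extract_links(JAVA_SOURCE, KNOWN)
```

Each key of `links` is one typed link and its count is the link weight, so a class mentioned twice in the comments would yield a comment-reference link of weight 2.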
GATE Comment Reference Network
[Figure: see next slide.]
GATE Comment Reference Network
[Figure: the comment reference network of GATE.]
Transforming Networks into Feature Vectors
[Figure: an example network with 12 nodes (0 to 11) and the corresponding 12 × 12 matrix of feature vectors, one row per node; the non-empty entries are 1, 0.5 and 0.25.]
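The matrix on the slide contains only the values 1, 0.5 and 0.25, which is consistent with giving a node weight 1 for itself, 0.5 for direct neighbours and 0.25 for neighbours two hops away. The sketch below implements that reading; the rule, the parameter names `max_hops` and `damping`, and the four-node toy graph are assumptions, not the network or the exact method from the slide.

```python
from collections import deque


def structure_vectors(adjacency, max_hops=2, damping=0.5):
    """Map each node to a vector with damping**distance per reachable node."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    vectors = {}
    for start in nodes:
        row = [0.0] * len(nodes)
        row[index[start]] = 1.0          # distance 0: the node itself
        seen = {start}
        queue = deque([(start, 0)])
        while queue:                     # breadth-first search up to max_hops
            node, dist = queue.popleft()
            if dist == max_hops:
                continue
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    row[index[neighbor]] = damping ** (dist + 1)
                    queue.append((neighbor, dist + 1))
        vectors[start] = row
    return vectors


# Hypothetical undirected toy graph: 2 - 0 - 1 - 3
graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
vectors = structure_vectors(graph)
```

In this toy graph, node 3 is three hops from node 2, so the corresponding entry stays empty (0), just as most cells of the matrix on the slide are empty.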
Combining Feature Vectors
[Diagram: for each document (e.g. DocumentFormat), the structure feature vector and the content feature vector are joined into one combined feature vector.]
• Content feature vector: stop-words, stemming, n-grams, TF-IDF
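One plausible way to combine the two vectors is a weighted concatenation of the normalized structure vector and a TF-IDF content vector. The sketch below assumes exactly that; the weighting scheme, the tiny stop-word list, and all function names are hypothetical, and stemming and n-grams are left out for brevity.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "this", "and", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split into words, drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return {doc_id: TF-IDF vector} over a shared, sorted vocabulary."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    vocab = sorted({w for tokens in tokenized.values() for w in tokens})
    doc_freq = Counter(w for tokens in tokenized.values() for w in set(tokens))
    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        vectors[doc_id] = [term_freq[w] * math.log(n_docs / doc_freq[w])
                           for w in vocab]
    return vectors


def normalize(vector):
    """Scale to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def combine(structure, content, structure_weight=0.5):
    """Concatenate the weighted, normalized structure and content vectors."""
    s = [structure_weight * x for x in normalize(structure)]
    c = [(1.0 - structure_weight) * x for x in normalize(content)]
    return s + c


docs = {
    "DocumentFormat": "The format of documents",
    "MimeType": "The MIME type of this format",
}
content = tfidf_vectors(docs)
structure = {"DocumentFormat": [1.0, 0.5], "MimeType": [0.5, 1.0]}  # hypothetical
combined = combine(structure["DocumentFormat"], content["DocumentFormat"])
```

Normalizing both parts before weighting keeps `structure_weight` meaningful regardless of vector length, which is one way to expose the kind of weight that would have to be set by the user.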
LATINO & OntoGen Demo
• LATINO: Link analysis and text mining toolbox
  - Software being developed in the course of TAO WP 2
  - Data preprocessing, machine learning, and data visualization capabilities
• OntoGen
  - A system for data-driven semi-automatic ontology construction
  - SEKT technology (http://sekt-project.org)
  - Freely available at http://ontogen.ijs.si
LATINO & OntoGen Demo
[Diagram: GATE source code → LATINO → feature vectors → OntoGen → ontology.]
OntoGen Demo
Dassault Case Study: Inclusion Dependencies
• Inclusion dependencies (IDs) express subset relationships between database tables and are thus important indicators of redundancy
• Discovery of IDs is important in the context of information integration
• Dassault Case Study
  - Problem: Dassault databases contain IDs, which should be taken into account when transitioning databases to ontologies
  - LATINO/OntoGen can help detect IDs
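To make the subset relationship concrete, the sketch below checks every pair of columns for an exact inclusion dependency. It is a brute-force illustration under the assumption that column values fit in memory; the function name and the table and column names are hypothetical and are not taken from the Dassault data.

```python
def inclusion_dependencies(columns):
    """Return sorted (dependent, referenced) pairs where every value of the
    dependent column also occurs in the referenced column."""
    value_sets = {name: set(values) for name, values in columns.items()}
    found = []
    for a, values_a in value_sets.items():
        for b, values_b in value_sets.items():
            # Skip empty columns: the empty set is a subset of everything.
            if a != b and values_a and values_a <= values_b:
                found.append((a, b))
    return sorted(found)


# Hypothetical columns, keyed as table.column
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["Ann", "Bob", "Cid", "Dee"],
}
dependencies = inclusion_dependencies(columns)
```

An exact check like this fails as soon as a single value is missing or misspelled, which is one reason to rank candidate pairs by similarity instead.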
Dassault Case Study: Inclusion Dependencies
• Dataset
  - The content of database tables in XML format
  - Ignore non-textual and empty table columns
• LATINO setting
  - Instances: columns (i.e. fields) in tables
  - Documents: concatenated values
  - Relations between instances:
    * Cosine similarity between documents
    * Similarity between sets of values
      · Jaccard: |A ∩ B| / |A ∪ B|
      · Alt.: |A ∩ B| / min{|A|, |B|}
    * Edit distance (normalized) between column names
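The relations listed above can be sketched in a few lines each. The versions below are minimal and rest on assumptions: documents are tokenized on whitespace, and the edit distance is normalized by the length of the longer string, since the slide says only "normalized". All function names and sample values are hypothetical.

```python
import math
from collections import Counter


def cosine(doc_a, doc_b):
    """Bag-of-words cosine similarity between two whitespace-tokenized texts."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def overlap(a, b):
    """The alternative measure: |A ∩ B| / min{|A|, |B|}"""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def normalized_edit_distance(s, t):
    """Levenshtein distance divided by the length of the longer string."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    longest = max(len(s), len(t))
    return prev[-1] / longest if longest else 0.0


small = ["A1", "A2"]                 # hypothetical column values
large = ["A1", "A2", "A3", "A4"]
j = jaccard(small, large)
o = overlap(small, large)
d = normalized_edit_distance("vendor_code", "Vendor Code")
```

The alternative measure reaches 1 whenever one value set is contained in the other, regardless of the size difference, whereas Jaccard shrinks as the larger set grows. That property makes it a natural fit for spotting inclusion dependencies.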
Dassault Case Study: Inclusion Dependencies
Dassault Case Study: Inclusion Dependencies
Candidates according to bag-of-words cosine similarity:

1.00  AC_Periodicity.PER_Aircraft : moop_aircraft
1.00  AC_Periodicity.PER_Aircraft : mopa_kav
1.00  AC_Periodicity.PER_Aircraft : movi_kav
1.00  AC_Periodicity.PER_Aircraft : AC_Zonal_ac
1.00  AC_Tools.ATO_nato_vendor_code : task_miscellaneous.MIS_nato_vendor_code
...
0.99  AC_Tools.ATO_nato_vendor_code : task_ingredients_consumable.ING_nato_vendor_code
0.99  task_ingredients_consumable.ING_nato_vendor_code : task_tools.TOO_Nato_vendor_code
0.99  task_ingredients_consumable.ING_nato_vendor_code : task_miscellaneous.MIS_nato_vendor_code
0.99  Task_Id.TID_task_owner : task_ingredients_consumable.ING_nato_vendor_code
0.98  AC_Zonal_ac : LRU_SRU_Description.LS_Aircraft
...
0.50  task_periodicity.PER_periodicity_usage_parameter 2 : task_usage_parameter.USP_Libelle
0.50  task_periodicity.PER_threshold_usage_parameter : task_usage_parameter.USP_Code
0.49  Task_Id.TID_usage parameter : task_periodicity.PER_threshold_usage_parameter 2
0.48  mope_kpe : task_periodicity.PER_threshold_tol_usage_param
0.48  mope_kpe : task_periodicity.PER_periodicity_usage_param
...
0.21  AC_Vendor_Code.Vendor Code : task_LRU_Ata_Code.LRU_Fab
0.21  AC_Zonal_title : moid_lida
0.21  task_compagny_owner.COO_Cage_Code : task_LRU_Ata_Code.LRU_Fab
0.21  Task_Id.TID_Task_Writer : task_miscellaneous.MIS_part_number
0.20  Task_Id.TID_Scheduled-task : task_spare.SPR_part_number

Callouts on the slide: task_ingredients_consumable.ING_nato_vendor_code, Task_Id.TID_task_owner
Conclusions and Future Work
• Plans for LATINO
  - (Recognized?) open-source architecture for text mining and link analysis
  - Build a user community, put up a Web site, training, promotion …
  - Applications!
    * … in case studies
    * … in other EU projects
    * … outside the context of EU projects
    * … competing in data mining contests
• Future work
  - Implementation of a visualization tool similar to DocumentAtlas (required for setting the weights and exploring the semantic space)
  - Evaluation!
    * Can we solve the problems introduced by the case studies better with the LATINO methodology than with a standard text mining approach?
  - Continue the development of LATINO