Dept Computer Science Korea Univ XML clustering methods


XML clustering methods
Sohn Jong-Soo (mis026@korea.ac.kr)
Intelligent Information System Lab., Dept. Computer Science, Korea Univ.
2007. 11. 06


0. Index
  • Introduction
  • XML and XML schema
  • Relational vs. XML
  • Paper overview
  • My works


1. Introduction
  • XML
    ■ It has become a standard for information exchange and retrieval
    ■ With the continuous growth of XML data
      ▪ The ability to manage massive collections of XML data and to discover knowledge from them becomes essential for web-based information systems
  • Clustering methods
    ■ Applied to database objects, text data, multimedia data
    ■ XML data is different
      ▪ Semi-structured
      ▪ Hierarchical


2. XML and XML schema
  • XML
    ■ XML document
    ■ XML schema
      ▪ Can be obtained separately without scanning the whole document
    ■ Style sheet
      ▪ XSL, CSS
  [Figure: XML = content (XML file) + structure (XML schema, DTD) + style (XSL, CSS). Diagram labels: XML-13, XML-2, XML-3, XML-4, XSLT (DOM, SAX), XML-1234, XML-24]


2. XML and XML schema
  • XML documents have elements and attributes
    ■ Elements (indicated by begin and end tags)
      ▪ can be nested but cannot interleave with each other
      ▪ can have an arbitrary number of sub-elements
      ▪ can have free text as values

  <chap title="Introduction To XML">        (begin tag of element "chap", with attribute "title")
    some free text
    <sect title="What is XML?"> … </sect>
    <sect title="Elements"> … </sect>
    <sect title="Why XML?"> … </sect>
    … possibly more free text
  </chap>                                    (end tag)

  Elements with the same name can be nested.
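The element and attribute structure described above can be explored with a short parsing sketch. It uses Python's standard xml.etree.ElementTree module. The document text and the outline() helper are assumptions made for illustration, modeled on the chap/sect example, and are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modeled on the slide's <chap>/<sect> example.
DOC = """
<chap title="Introduction To XML">
  some free text
  <sect title="What is XML?">first section</sect>
  <sect title="Elements">
    <sect title="Nesting">elements with the same name can be nested</sect>
  </sect>
  <sect title="Why XML?">third section</sect>
  possibly more free text
</chap>
"""

def outline(element, depth=0):
    """Return (depth, tag, title) for the element and its descendants, in document order."""
    rows = [(depth, element.tag, element.get("title"))]
    for child in element:  # iterating an Element yields its direct sub-elements
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC.strip())
for depth, tag, title in outline(root):
    print("  " * depth + f"{tag}: {title}")
```

The indentation of the printed outline mirrors the nesting depth, which is the ordered-tree view of the document.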


2. XML and XML schema
  • Database side: XML is a new way to organize data
    ■ Relational databases organize data in tables
    ■ XML documents organize data in ordered trees
  • Document side: XML is a semantic markup language
    ■ HTML focuses on presentation
    ■ XML focuses on semantics/structure in the data

  [Figure: tree with a "chap" node and "sect" child nodes]

  <html>
    <h1> Chapter 1… </h1>
    some free text
    <h2> Section 1… </h2>
    some more free text
    <h3> Section 1.1 </h3>
  </html>


3. Relational vs. XML
  • Relational data are well organized: fully structured (more strict)
    ■ E-R modeling is used to model the data structures in the application
    ■ The E-R diagram is converted to relational tables and integrity constraints (relational schemas)
  • XML data are semi-structured (more flexible)
    ■ Schemas may be unfixed or unknown (flexible: anyone can author a document)
    ■ Suitable for data integration (data on the web, data exchange between different enterprises)


3. Relational vs. XML
  • XML is not meant to replace relational database systems
    ■ RDBMSs are well suited to OLTP applications
      ▪ e.g., electronic banking
      ▪ which have 1000+ small transactions per minute
    ■ XML is suitable for data exchange over heterogeneous data sources
      ▪ e.g., Web services
      ▪ allowing them to "talk" to each other


3. Relational vs. XML
  • Advantages of using XML
    ■ Manage large volumes of XML data
    ■ Provide a high-level declarative language
    ■ Efficiently evaluate complex queries
  • XML data management issues
    ■ XML data model
    ■ XML query languages
    ■ XML query processing, optimization and classification
      ▪ This is the branch I am interested in


4. Paper overview
  • XML schema clustering with semantic and hierarchical similarity measures
    ■ This paper presents an XML schema clustering process
      ▪ It organises heterogeneous XML schemas into various groups
    ■ It combines semantic and syntactic relationships
      ▪ to calculate the linguistic similarity between two elements
      ▪ while considering the ancestor-child relationship
    ■ Generalizing a suitable schema class hierarchy
      ▪ using the Xmine methodology
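To make the idea of a linguistic similarity between two element names concrete, here is a minimal sketch. It is not the paper's measure: the SYNONYMS table is a hypothetical stand-in for a real thesaurus lookup (the semantic part), token overlap stands in for the syntactic part, and the ancestor-child relationship is not modeled at all.

```python
import re

# Hypothetical synonym/abbreviation table standing in for a real thesaurus lookup.
SYNONYMS = {"cust": "customer", "client": "customer", "addr": "address"}

def tokens(name):
    """Split an element name such as 'custAddr' or 'client_address' into normalized tokens."""
    parts = re.findall(r"[A-Za-z][a-z]*", name)
    return {SYNONYMS.get(p.lower(), p.lower()) for p in parts}

def linguistic_similarity(name_a, name_b):
    """Jaccard overlap of the normalized token sets of two element names (0.0 to 1.0)."""
    a, b = tokens(name_a), tokens(name_b)
    return len(a & b) / len(a | b) if a | b else 1.0

print(linguistic_similarity("custAddr", "client_address"))
print(linguistic_similarity("customerName", "custAddr"))
```

Under these assumptions, names that differ only by abbreviation or synonym score as identical, while names sharing one of two tokens score in between.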


4. Paper overview
  • Evaluating Structural Similarity in XML Documents
    ■ Develops a dynamic programming algorithm
      ▪ to find this distance for any pair of documents
    ■ It defines a new method for computing the distance
      ▪ between any two XML documents in terms of their structure
      ▪ the lower this distance, the more similar the two documents are in terms of structure
      ▪ and the more likely they are to have been created from the same DTD
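The notion of a structural distance computed by dynamic programming can be illustrated with a much simpler stand-in than the paper's method. The sketch below flattens each document into its pre-order list of tags and computes a classic sequence edit distance over those lists. This is an assumption made for illustration, not the tree-based distance the paper defines, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def tag_sequence(xml_text):
    """Pre-order list of element tags; text and attributes are ignored."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]

def edit_distance(a, b):
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or relabel
        prev = curr
    return prev[-1]

def structural_distance(doc_a, doc_b):
    """Lower values mean the two documents have more similar tag structure."""
    return edit_distance(tag_sequence(doc_a), tag_sequence(doc_b))

BOOK_A = "<book><title/><author/><year/></book>"
BOOK_B = "<book><title/><author/></book>"
ORDER = "<order><item/><item/><price/></order>"

print(structural_distance(BOOK_A, BOOK_B))
print(structural_distance(BOOK_A, ORDER))
```

Flattening to a tag sequence discards parent-child information, so this stand-in only shows the shape of the computation: two book documents come out closer to each other than a book and an order.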


4. Paper overview
  • A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications
    ■ This paper proposes a matching algorithm for measuring the structural similarity
      ▪ between an XML document and a DTD
    ■ By comparing the document structure against the one the DTD requires, the matching algorithm
      ▪ is able to identify commonalities and differences


4. Paper overview
    ■ This paper focuses on five applications of the algorithm:
      (1) the classification of XML documents against a set of DTDs
      (2) the generation of a new schema for a DTD, by extracting structural information during the classification of XML documents
      (3) the development of an XML-based search engine able to answer approximate structural queries
      (4) the selective dissemination of XML documents
      (5) the protection of the contents of documents classified against a set of DTDs of a database, by propagating the authorization policies specified at DTD level
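A rough sketch of what identifying commonalities and differences between a document and a DTD can look like is given below. It is not the paper's algorithm. TOY_DTD is a hypothetical, heavily simplified stand-in for a DTD (required child names only, with no ordering, optionality, or repetition), and the similarity ratio is chosen here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: element name -> set of required child element names.
TOY_DTD = {
    "book": {"title", "author", "year"},
    "author": {"name"},
}

def compare(element, dtd):
    """Return (common, extra, missing) child-name counts for the subtree rooted at element."""
    required = dtd.get(element.tag, set())
    present = {child.tag for child in element}
    common = len(present & required)   # children the DTD asks for and the document has
    extra = len(present - required)    # children in the document that the DTD does not list
    missing = len(required - present)  # children the DTD asks for but the document lacks
    for child in element:
        c, e, m = compare(child, dtd)
        common, extra, missing = common + c, extra + e, missing + m
    return common, extra, missing

def similarity(xml_text, dtd):
    """Share of common components among all components found or required (0.0 to 1.0)."""
    common, extra, missing = compare(ET.fromstring(xml_text), dtd)
    total = common + extra + missing
    return 1.0 if total == 0 else common / total

DOC = "<book><title/><author><name/></author><isbn/></book>"
print(compare(ET.fromstring(DOC), TOY_DTD))
print(similarity(DOC, TOY_DTD))
```

A score like this could rank candidate DTDs for a document, which is the shape of the classification application listed first above.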


4. Paper overview
  • Schema Matching for Transforming Structured Documents
    ■ Understanding the matching problem in the context of structured document transformations
    ■ Developing matching methods whose output serves as the basis for the automatic generation of transformation scripts
    ■ Four basic matching processes
      (1) linguistic matching
      (2) datatype compatibility
      (3) designer type hierarchy
      (4) structural matching


5. My works
  • XML data classification
    ■ Using an XML schema and its XML files
    ■ ID3 algorithm
      ▪ Used as the classification tool on XML data
    ■ It will contribute to XML data preprocessing for data mining
  • Problems
    ■ XML has a hierarchical data type
      ▪ It cannot be represented directly as a table
    ■ Insufficient sample data
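One possible way around the table problem, sketched under stated assumptions: flatten each record into "path -> value" columns and then compute information gain, the quantity ID3 uses to choose the attribute to split on. The sample data, element names, and helper functions are hypothetical and are not the author's implementation. The flattening also loses repeated elements and sibling order, which is part of why the problem is hard.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample records; every leaf value is already categorical.
DATA = """
<orders>
  <order><customer><type>new</type></customer><amount>high</amount><approved>no</approved></order>
  <order><customer><type>new</type></customer><amount>low</amount><approved>no</approved></order>
  <order><customer><type>regular</type></customer><amount>high</amount><approved>yes</approved></order>
  <order><customer><type>regular</type></customer><amount>low</amount><approved>yes</approved></order>
</orders>
"""

def flatten(record):
    """Turn one record into a flat row: leaf path -> text (a repeated path keeps its last value)."""
    row = {}
    def walk(el, path):
        children = list(el)
        if not children:
            row[path] = (el.text or "").strip()
        for child in children:
            walk(child, f"{path}/{child.tag}")
    walk(record, record.tag)
    return row

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

rows = [flatten(order) for order in ET.fromstring(DATA.strip())]
for column in ("order/customer/type", "order/amount"):
    print(column, information_gain(rows, column, "order/approved"))
```

In this toy data the customer type separates the classes completely while the amount carries no information, so ID3 would split on the customer type first.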


References
  • E. Bertino, G. Guerrini, M. Mesiti, A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications, Information Systems 29 (1) (2004) 23–46.
  • A. Boukottaya, C. Vanoirbeek, Schema matching for transforming structured documents. Paper presented at the 2005 ACM Symposium on Document Engineering, Bristol, United Kingdom, November 02–04, 2005.
  • A. Doan, P. Domingos, A. Y. Halevy, Reconciling schemas of disparate sources: a machine-learning approach. Paper presented at ACM SIGMOD, Santa Barbara, California, United States, 2001.
  • S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarities, IEEE Transactions on Knowledge and Data Engineering 17 (2) (2005) 160–175.
  • R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm for heterogenous XML documents. Paper presented at the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 2006.


  • A. Nierman, H. V. Jagadish, Evaluating structural similarity in XML documents. Paper presented at the fifth International Conference on Computational Science (ICCS'05), Wisconsin, USA, December 2002.
  • R. Nayak, W. Iryadi, XML schema clustering with semantics and hierarchical similarity measures, 2006.
  • http://www.w3c.org/xml