A Robust System Architecture For Mining Semistructured Data

By Aby M Mathew, CSE 6331, 11/30/1999

Introduction
• A versatile system architecture for text mining that distinguishes between the structured and unstructured components of the data and maintains each separately.

Motivation
• A digital library can contain a very large number of document concepts. With SQL it is possible to generate quantitative rules based on given criteria.
• What about rules related to a subset, such as:
  – which journals publish articles associated with a given area of interest?

Presentation Organization
• Overview of the IRIS system.
• Differences between structured and unstructured data.
• How the data is stored.
• Algorithm used for rule generation.
• Conclusion.

Overview of the IRIS system
[Figure: architecture diagram with the components GUI, Rule Generator, Concept Library, Database, IDM and Document Collection]

Brief Description Of Individual Components
• Rule Generator - parses the user request received via the GUI and determines an execution strategy.
• Database - contains the structured data, with mappings between tuples and documents.
• Concept library - maintains the unstructured data as concepts, with mappings between concepts and documents.

Brief Description Of Individual Components (contd.)
• IDM (Information Discovery Module)
  – extracts concepts and structured values from a document collection
  – updates the database and the concept library.

Components of the Rule Generator
[Figure: parser, optimizer, processor]
• Parser - accepts the request data and reconditions it for the optimizer.
• Optimizer - uses the constraints and the rule type to generate an efficient execution plan.
• Processor - executes the plans laid out by the optimizer.

Components of the IDM
[Figure: Discoverer, Extractor, Refresher]
• Discoverer - an intelligent agent that determines domains.
• Extractor - populates the database and the concept library, based on the domain knowledge.
• Refresher - helps maintain the consistency of the database and the concept library.

Differences between the two data types
• Structured data type
  – Certain features that form key entities, e.g., Author, Publisher, Date.
• Unstructured data type
  – Blocks of text that cannot be identified as structured, e.g., abstracts, headings, paragraphs.

How is the data stored?
• Structured data is stored using a relational schema that is mapped to a database.
• Unstructured data is stored in a compressed form using an ECH (extended concept hierarchy).
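The two stores could be pictured with the minimal sketch below. It is only an illustration: the table name `documents`, its columns, the sample rows, the helper `doc_ids_for_publisher` and the `concept_library` dictionary are all assumptions made for this example, not the IRIS schema. The ECH is simplified here to a flat concept-to-documents mapping.

```python
import sqlite3

# Hypothetical relational schema: one row of structured values per document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, author TEXT, publisher TEXT, year INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [(1, "Smith", "ACM", 1997), (2, "Jones", "IEEE", 1998), (3, "Lee", "ACM", 1999)],
)


def doc_ids_for_publisher(publisher):
    """Return the ids of documents whose structured 'publisher' value matches."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE publisher = ?", (publisher,)
    )
    return {row[0] for row in rows}


# Hypothetical concept library: each concept maps to the documents it occurs in.
concept_library = {"data mining": {1, 3}, "databases": {2, 3}}

acm_docs = doc_ids_for_publisher("ACM")
```

Keeping both stores keyed by document id is what lets a structured query result and a concept lookup be combined later with plain set operations.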

Extended Concept Hierarchy
• This is a hierarchical form of representing data.
• It is not always constrained to a tree structure.
• Relationships maintain additional links between the entities in the hierarchy.

Example
[Figure: University ECH with the nodes Employees, Admin, Faculty, Full, Associate, Provost and Dean]
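One way to hold such a hierarchy in memory is a graph with labelled edges, sketched below. The class `ConceptHierarchy`, the edge labels `"hierarchy"` and `"link"`, and the particular arrangement of the University nodes are assumptions for illustration; the figure's actual edges are not recoverable from the slide text.

```python
from collections import defaultdict


class ConceptHierarchy:
    """Hypothetical ECH sketch: concepts joined by labelled, directed edges."""

    def __init__(self):
        # concept -> set of (edge label, target concept)
        self.edges = defaultdict(set)

    def add_edge(self, source, target, label):
        self.edges[source].add((label, target))

    def related(self, concept, labels):
        """Concepts reachable from `concept` over one edge with a given label."""
        return {target for label, target in self.edges[concept] if label in labels}


ech = ConceptHierarchy()
# Assumed tree-like part of the University example.
for parent, child in [
    ("Employees", "Admin"), ("Employees", "Faculty"),
    ("Faculty", "Full"), ("Faculty", "Associate"),
    ("Admin", "Provost"), ("Admin", "Dean"),
]:
    ech.add_edge(parent, child, "hierarchy")
# An extra, non-tree relationship (assumed) showing why this is a graph.
ech.add_edge("Dean", "Faculty", "link")
```

Because edges carry labels, a lookup can follow only the hierarchy, only the additional links, or both, which a plain tree could not express.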

Calculation of minimum support (min sup) in ECH
If C1 and C2 are the two concepts found in the document, then
min sup = |documents(C1) ∩ documents(C2)| / |documents(C1) ∪ documents(C2)|
where documents(c) is the set of documents in which concept c occurs and |·| is the number of documents in a set.

Example for calculating min sup
Say concept C1 appears in 500 documents and C2 appears in 600 documents, 100 of which also contain C1. The two concepts together cover 500 + 600 - 100 = 1000 documents, so
min sup = 100 / 1000 = 0.1
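The worked example can be checked with a few lines of Python. This is a sketch that reads the measure as shared documents divided by all documents containing either concept, which is the reading that reproduces 100 / 1000; the function name `support` and the synthetic document ids are assumptions.

```python
def support(docs_c1, docs_c2):
    """Shared documents divided by documents containing either concept."""
    union = docs_c1 | docs_c2
    if not union:
        return 0.0  # neither concept occurs anywhere
    return len(docs_c1 & docs_c2) / len(union)


# Synthetic ids chosen to match the example: 500 and 600 documents, 100 shared.
c1_docs = set(range(0, 500))      # ids 0..499
c2_docs = set(range(400, 1000))   # ids 400..999, overlapping on 400..499

value = support(c1_docs, c2_docs)
```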

Algorithm used for rule generation
• Get the ids of documents containing the structured data value, using SQL statements (set A).
• Get the ids of documents containing the unstructured concept C1, using the ECH (set B).
• C = A ∩ B.
• For each concept Cr related to C1 via an edge P, C or S, get the ids of documents containing Cr, provided the support of Cr and C1 is above min sup (set D).
• E = C ∩ D.
• confidence = (number of elements in E) / (number of elements in C).
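The steps above can be sketched with Python sets. This is an illustrative reading, not the IRIS implementation: it assumes the missing set operators are intersections, that support is shared documents over the union, and that a related concept is kept when its support is at or above the threshold (the slide says "above"). The function name, parameter names and sample ids are hypothetical.

```python
def rule_confidences(structured_ids, concept_ids, related_concepts, min_sup):
    """Return {related concept: confidence} for concepts passing min_sup.

    structured_ids   -- set A: documents matching the structured value
    concept_ids      -- set B: documents containing the concept C1
    related_concepts -- {Cr: documents containing Cr}, Cr related to C1 in the ECH
    """
    c = structured_ids & concept_ids          # C = A ∩ B
    results = {}
    if not c:
        return results                        # confidence is undefined for empty C
    for name, d in related_concepts.items():  # d plays the role of set D
        union = d | concept_ids
        sup = len(d & concept_ids) / len(union) if union else 0.0
        if sup < min_sup:                     # inclusive bound is an assumption
            continue
        e = c & d                             # E = C ∩ D
        results[name] = len(e) / len(c)       # confidence = |E| / |C|
    return results


# Made-up document ids, for illustration only.
a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
related = {"Cr1": {3, 4, 5, 6}, "Cr2": {9}}
confidences = rule_confidences(a, b, related, min_sup=0.2)
```

Here C = {2, 3, 4}; Cr1 shares 3 of 5 documents with C1 and is kept, giving E = {3, 4} and a confidence of 2/3, while Cr2 shares none and is dropped.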

Advantages of Using this system
• Distinguishing between structured and unstructured data helps generate more interesting rules.
• Being domain specific improves accuracy.
• Scalable, as any database can be used as the database component.
• Only meaningful data is stored, giving a compact representation of each document.

Bibliography
• L. Singh, P. Scheuermann & B. Chen, "IRIS: Our prototype rule generation system", 1999.
• L. Singh, P. Scheuermann & B. Chen, "Generating Association Rules from Semi-structured Documents Using an Extended Concept Hierarchy", 1999.