Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)
Samrat Sen
UB - CS 711, Data Mining with Unstructured Data

Goals
• Issues in text mining with unstructured data
• Analysis of data mining products
• Study of a real-life classification problem
• Strategy for solving the problem
12/20/2021 UB - CS 711, Data Mining with Unstructured Data

Issues in Text Mining
• Different from KDD and DM techniques in structured databases
Problems with those techniques:
1. They are concerned with predefined fields
2. They are based on learning from an attribute-value database (example on the next slide)

Issues in Text Mining

Potential Customer Table
Person   Age  Sex  Income   Customer
Ann S    32   F    10,000   yes
Jane G   53   F    20,000   no
Sri S    35   M    65,000   yes
Egor     25   M    10,000   yes

Married-to Table
Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules
If Married(Person, Spouse) and Income(Person) >= 25,000 Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person) Then Potential-Customer(Spouse)

Issues in Text Mining
• Algorithm techniques such as association extraction from indexed data and prototypical document extraction from full text
• Industry-standard data mining tools cannot be used directly; e.g., a usual process needs a Text Transformer, a Text Analyzer, and a Summary Generator

Issues in Text Mining
• The input and output interfaces and the file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:
1. Gain from a positive prediction
2. Loss from an incorrect positive prediction (false positive)
3. Benefit from a correct negative prediction
4. Cost of an incorrect negative prediction (false negative)
5. Cost of project time (a better product/algorithm may come up)
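
The five cost and benefit items above can be combined into a single figure for comparing candidate models. The following is a minimal sketch, not part of the original study: the class names, field names, and the simple linear weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of each prediction outcome on an evaluation set (hypothetical names)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


@dataclass
class Payoffs:
    """Per-outcome values, mirroring items 1-5 on the slide (hypothetical names)."""
    gain_true_pos: float      # 1. gain from a positive prediction
    loss_false_pos: float     # 2. loss from a false positive
    benefit_true_neg: float   # 3. benefit from a correct negative prediction
    cost_false_neg: float     # 4. cost of a false negative
    project_cost: float       # 5. cost of project time


def net_benefit(o: Outcomes, p: Payoffs) -> float:
    """Assumed linear model: gains and benefits minus losses and costs."""
    return (o.true_pos * p.gain_true_pos
            - o.false_pos * p.loss_false_pos
            + o.true_neg * p.benefit_true_neg
            - o.false_neg * p.cost_false_neg
            - p.project_cost)


if __name__ == "__main__":
    outcomes = Outcomes(true_pos=40, false_pos=10, true_neg=45, false_neg=5)
    payoffs = Payoffs(100, 20, 1, 50, 1000)
    print(net_benefit(outcomes, payoffs))
```

Two models evaluated on the same data can then be ranked by this value rather than by accuracy alone, which matters when a false negative costs much more than a false positive.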

Data Mining Products/Tools
• DARWIN, from Oracle
• Intelligent Data Miner, from IBM
• interMedia Text with Oracle Database, with context query feature (theme-based document retrieval)
For more info:
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/

Data Mining Products/Tools
• New specification being proposed by Sun for a Data Mining API *
• SQL Server 2000: data mining and English query writing features
• Verity Knowledge Organizer
For more info:
* http://java.sun.com/aboutJava/communityprocess/jsr_073_dmapi.html#3
Additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html

DARWIN
Functions
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)
Approach
1. Plan
2. Prepare dataset
3. Build and use models

DARWIN
• The problem is defined in terms of data fields and data records
• The fields are classified as follows:
- Categorical and ordered fields
- Predictive fields
- Target fields
• A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file)

DARWIN - Models
• Tree model: based on the classification and regression tree algorithm
• Net model: a feed-forward multilayer neural network
• Match model: memory-based reasoning model, using a k-nearest neighbor algorithm
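
To make the Match model idea concrete, here is a generic k-nearest neighbor sketch. It is not DARWIN's implementation: the function name, the weighted Euclidean distance, and the majority vote are assumptions chosen as a common textbook form, with the per-field weights standing in for the "match weights" that DARWIN optimizes.

```python
from collections import Counter
from math import sqrt


def knn_predict(training, query, k=3, weights=None):
    """Classify `query` by majority vote among its k nearest training records.

    training: list of (feature_tuple, label) pairs kept in memory.
    weights:  optional per-field weights (hypothetical stand-in for match weights).
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance between two feature tuples.
        return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    records = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
               ((8, 8), "b"), ((9, 8), "b")]
    print(knn_predict(records, (1.5, 1.5), k=3))
```

Because the model is just the stored records plus weights, "training" amounts to tuning the weights, which is consistent with the "Optimize match weights" step in the Match model workflow.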

DARWIN – Tree Model
1. Create Tree (from the training data)
2. Test/Evaluate Tree (gives information on error rates of pruned sub-trees)
3. Predict with Tree, using the selected sub-tree (on the input prediction dataset)
4. Analyze Results (on the merged input and output prediction dataset)

DARWIN – Net Model
1. Create Net (produces the neural network model)
2. Train Net on the training dataset (gives information on error rates of pruned sub-trees; produces the trained neural network)
3. Input prediction dataset
4. Analyze Results (on the merged input and output prediction dataset)

DARWIN – Match Model
1. Create Match Model (from the training data)
2. Optimize match weights
3. Predict with Match (on the input prediction dataset)
4. Analyze Results (on the merged input and output prediction dataset)

DARWIN – Analyzing
• Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes
• Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset
• Frequency Count: provides information on the frequency with which particular data values appear in a dataset

DARWIN – Analyzing
• Performance Matrix: can be used to compare simple fields or simple functions of fields
• Sensitivity: provides a model showing the relative importance of attributes used in building a model

DARWIN – Code Generation
• Darwin can generate C, C++, or Java code for a Tree or Net model so that a prediction function can be called from an application program
• Java code can also be generated to embed a model in a Web applet
For more info:
http://technet.oracle.com/docs/products/datamining/doc_index.htm

DARWIN
• For more info:
http://technet.oracle.com/software/products/intermedia/software_index.html
1. Oracle Data Mining data sheet
2. Oracle Data Mining Solutions
Ø http://www.oracle.com/ip/analyze/warehouse/datamining/
Ø http://www.oracle.com/oramag/oracle/98-Jan/fast.html
1. Managing Unstructured Data with Oracle 8
Ø http://technet.oracle.com/products/datamining/
1. Product manuals

Oracle – Intermedia Text
• A ranking technique called theme proving is used
• Documents are grouped into categories and subcategories
• Integrated with the Oracle 8 database
• Absolutely no training or tuning required

Oracle – Intermedia Text
• Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2,000 major categories
- Concepts mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms, and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, such as part of speech

Oracle – Intermedia Text
Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme
- All the ancestor themes are also included in the result
- Theme proving is done before final ranking
Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query

Oracle – Intermedia Text
• Oracle at TREC-8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
Recall at 1000:                    71.57% (3384/4728)
Average precision:                 41.30%
Initial precision (at recall 0.0): 92.79%
Final precision (at recall 1.0):   7.91%

Intermedia Text-Model

Interface Options

Language Selection
• Java for the robot
• PL/SQL for data retrieval

Code Execution

Overview of the System
Diagram components: customer browser, client browser, server process listening at port 80, web server, tag stripper, JDBC, Intermedia Text, Oracle 8i

Intermedia Text: Steps for Building an Application
• Load the documents
• Index the documents
• Issue queries
• Present the documents that satisfy the query

Loading Methods
• Loading methods
– INSERT statements
– SQL*Loader
– ctxsrv: a server daemon process which builds the index at regular intervals
– ctxload utility, used for:
  - Thesaurus import/export
  - Text loading
  - Document updating/exporting

Create and Populate a Simple Table
CREATE TABLE quick (
  quick_id  NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text      VARCHAR2(80) );
INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
COMMIT;

Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;
DRG-10599: column is not indexed
• You must have a Text index on a column before you can do a "contains" query on it

Create the Text Index
CREATE INDEX quick_text ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;
• CTXSYS is the system user for interMedia Text
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

Run a Text Query
SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;
TEXT
----------------------
The cat sat on the mat
• You should regard the CONTAINS function as boolean in meaning
• It is implemented as a number since SQL does not have a boolean datatype
• The only sensible way to use it is with > 0

Run a Text Query
SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
  ORDER BY s;
S  TEXT
-- ---------------------------
7  The dog barked like a dog
4  The fox jumped over the dog
• The better the match, the higher the score
• The value can be used in ORDER BY but has no absolute significance
• The score is zero when the query is not matched

Intermedia Text - Indexing Pipeline
Diagram stages: Database, Datastore, Filter, Sectioner, Lexer, Engine
Diagram data labels: column data, doc data, filtered doc text, plain text, section offsets, tokens, index data
• The first step is creating an index
• Datastore: reads the data out of the table (for a URL datastore, it performs a 'GET')

Intermedia Text - Indexing Pipeline
• Filter: transforms the data to some text type; this is needed because some formats may be binary, as when storing doc, pdf, or HTML types
• Sectioner: converts to plain text, removing tags and invisible information
• Lexer: splits the text into discrete tokens
• Engine: takes the tokens from the lexer, the offsets from the sectioner, and a stoplist to build an index

Intermedia Text - Indexing Pipeline
Example of index creation
Statements:
INSERT INTO docs VALUES (1, 'first document');
INSERT INTO docs VALUES (2, 'second document');
Produces an index:
DOCUMENT  doc 1 position 2, doc 2 position 2
FIRST     doc 1 position 1
SECOND    doc 2 position 1
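
The lexer and engine stages in the example above can be imitated in a few lines. This is an illustrative sketch only, not interMedia Text's implementation: the function names, the regular-expression tokenizer, and the stoplist contents are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical stoplist; the real product ships its own default list.
STOPLIST = {"the", "a", "of"}


def tokenize(text):
    """Lexer stand-in: split plain text into discrete word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(docs, stoplist=STOPLIST):
    """Engine stand-in: map each token to (doc id, word position) postings.

    docs: dict of doc id -> plain text. Positions start at 1 and count
    every token, so skipping a stopword does not shift later positions.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, token in enumerate(tokenize(docs[doc_id]), start=1):
            if token.lower() in stoplist:
                continue
            index[token.upper()].append((doc_id, position))
    return dict(index)


if __name__ == "__main__":
    for token, postings in sorted(build_index({1: "first document",
                                               2: "second document"}).items()):
        print(token, postings)
```

Storing positions as well as document ids is what makes phrase queries such as 'sat on the mat' answerable from the index alone.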

Testing procedure
• Document set from newsgroups
- 122 documents from a text mining site
- Loaded using insert statements
- File datastore used
• Documents (HTML) from browsing
- 20 documents
- Loaded from the server process
- URL datastore used

Newsgroup Results
1. Religion, Atheism: 15, on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc: 17, about operating systems, protocols, installation
3. Comp.graphics: 27, on hardware and software for computer graphics
4. Ice Hockey: 18
5. Computer hardware: 12, on installation of different peripheral devices
6. Mideast.politics: 14, on political developments in the Mideast
7. Science.space: 19, on various space programs, devices, theories

Newsgroup Results
Group                       Retrieved  Wrong  Not Retrieved  Recall / Precision
Science and technology      120        16     1              99% / 78%
Computer Hardware Industry  12         0      5              71% / 100%
Government                  103        26     8              90% / 74%

Newsgroup Results
Group                       Retrieved  Wrong  Not Retrieved  Recall / Precision
Politics                    17         3      0              100% / 82%
Military                    5          1      0              80%
Social Environment          48         2      14             77% / 96%
Religion                    22         3      2              90% / 86%
Islam                       4          0      0              100%
Leisure recreation          22         4      5              78% / 82%

Newsgroup Results
Group                       Retrieved  Wrong  Not Retrieved  Recall / Precision
Sports                      21         1      0              90%
Hockey                      18         0      0              100%

Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
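
The recall and precision figures can be recomputed from the three count columns. The sketch below assumes, since the slides do not say so explicitly, that correct positive predictions equal Retrieved minus Wrong, and that the positive examples are those correct predictions plus the Not Retrieved documents; the function name is hypothetical.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Return (recall, precision) as fractions from the table's count columns.

    Assumptions (not stated on the slides):
      correct positives = retrieved - wrong
      positive examples = correct positives + not_retrieved
    """
    correct = retrieved - wrong
    positives = correct + not_retrieved
    recall = correct / positives if positives else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    # Counts taken from the Government row: 103 retrieved, 26 wrong, 8 not retrieved.
    recall, precision = recall_precision(103, 26, 8)
    print(f"recall={recall:.3f} precision={precision:.3f}")
```

Under these assumptions the Government row works out to roughly 0.906 recall and 0.748 precision, in line with the 90% and 74% listed.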

Query Syntax: Binary Operators
• AND    &   cat & dog
• OR     |   cat | dog
• EQUIV  =   cat = dog
• MINUS  -   cat - dog
• NOT    ~   cat ~ dog
• ACCUM  ,   cat , dog

Semantics: Binary Operators
• The semantics of all the binary operators is defined in terms of SCORE
• However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
– the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
– but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"

The Salton Algorithm
• interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products
• The score for a word is proportional to f (1 + log(N/n)), where
– f is the frequency of the search term in the document
– N is the total number of documents
– n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100

The Salton Algorithm
Assumption
• Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The Salton Algorithm
• This table assumes that only one document in the set contains the query term.

# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4
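
The scoring rule f (1 + log(N/n)) can be tried out numerically. The slides give only the proportionality, so the sketch below makes three labeled assumptions: a scale factor of 3, a base-10 logarithm, and truncation to an integer capped at 100. These were chosen because they reproduce several rows of the table (for example 34 occurrences for a one-document set and 17 for ten documents); they are not confirmed as the product's exact computation, and the function names are hypothetical.

```python
import math

SCALE = 3  # assumed constant of proportionality; not given on the slides


def salton_score(f, total_docs, docs_with_term):
    """Score for one term in one document, as an integer in 0..100.

    f:              frequency of the search term in the document
    total_docs:     N, total number of documents
    docs_with_term: n, number of documents containing the term
    Base-10 log and truncation are assumptions of this sketch.
    """
    if f <= 0 or docs_with_term <= 0:
        return 0
    raw = SCALE * f * (1 + math.log10(total_docs / docs_with_term))
    return min(100, int(raw))


def occurrences_for_full_score(total_docs, docs_with_term=1):
    """Smallest f that reaches a score of 100 for the given document set."""
    f = 1
    while salton_score(f, total_docs, docs_with_term) < 100:
        f += 1
    return f


if __name__ == "__main__":
    for n_docs in (1, 10, 100):
        print(n_docs, occurrences_for_full_score(n_docs))
```

The loop makes the inverse-frequency effect visible: as the document set grows while only one document contains the term, fewer occurrences are needed to reach the maximum score.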

Summary of operators
• Binary operators: & | = - ~ ,
• Built-in expansion: ? $ !
• Thesaurus: BT, BTG, BTP, BTI, NTG, NTP, NTI, PT, RT, SYN, TRSYN, TT

Summary of operators
• Stored query expression: SQE
• Grouping and escaping: () {}
• Special: NEAR, WITHIN, ABOUT

Application Details - Customer Profile Analyzer
The HTTP server (for user web page caching) is started. The Oracle web server is also started.

Log In Screen - Customer & User
The login screen is used by both the customer and the users. The Oracle WebServer takes care of the secure connections, while for the HTTP server the user id is common for the session; no user can invoke a document from the server without a user id.

Customer Interface – Http Server
The user uses the interface provided by the custom HTTP server.

Main User Screen
The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs

Selection of Category and Options
The user chooses a category and other options, such as:
• Generating theme
• Generating gist
• Generating marked-up text
• Date range

Results Page – Gist Generation
This page can be used for drilling down to the actual document, which opens in the browser (generated by the filter option). Theme and gist can also be generated from this screen.

Search Screen
The search screen has advanced options such as fuzzy search and about search. A chain of expressions can be used, with conjunctions ('not', 'or', 'and', etc.) joining the statements.

Conclusion
• New estimation methods are trying to find more meaning from text.
• Industry has great text mining products and is constantly improving the technology.
• Unstructured data mining has a long way to go.