Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)
- Slides: 58
Data Mining with Unstructured Data: A Study and Implementation of Industry Product(s)
Samrat Sen
UB - CS 711, Data Mining with Unstructured Data
Goals
- Issues in text mining with unstructured data
- Analysis of data mining products
- Study of a real-life classification problem
- Strategy for solving the problem
12/20/2021 UB - CS 711, Data Mining with Unstructured Data
Issues in Text Mining
- Text mining differs from KDD and DM techniques for structured databases.
Problems with those techniques:
1. They are concerned with predefined fields.
2. They are based on learning from an attribute-value database (example on the next slide).
Issues in Text Mining
Potential Customer table:
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes
Married To table:
Husband  Wife
Egor     Ann S
Sri H    Jane
Induced rules:
- If Married(Person, Spouse) and Income(Person) >= 25,000 then Potential-Customer(Spouse)
- If Married(Person, Spouse) and Potential-Customer(Person) then Potential-Customer(Spouse)
Issues in Text Mining
- Text mining relies on techniques such as association extraction from indexed data and prototypical document extraction from full text.
- Industry-standard data mining tools cannot be used directly; a typical process also needs a text transformer, a text analyzer, and a summary generator.
Issues in Text Mining
- The input and output interfaces and the file formats may cost time and money.
- Exhaustive domains have to be set up for classification.
- Costs and benefits have to be weighed before model selection:
1. Gain from a correct positive prediction
2. Loss from an incorrect positive prediction (false positive)
3. Benefit from a correct negative prediction
4. Cost of an incorrect negative prediction (false negative)
5. Cost of project time (a better product or algorithm may come up)
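The cost and benefit items listed on this slide can be combined into a single figure per candidate model. The sketch below is only an illustration of that weighing: the `Outcomes` class, the `net_benefit` function, and every count and monetary value are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of each prediction outcome for one candidate model (hypothetical)."""
    true_pos: int   # correct positive predictions
    false_pos: int  # incorrect positive predictions
    true_neg: int   # correct negative predictions
    false_neg: int  # incorrect negative predictions


def net_benefit(o, gain_tp, loss_fp, benefit_tn, cost_fn, project_cost=0.0):
    """Sum the gains and subtract the losses for the five items on the slide."""
    return (o.true_pos * gain_tp
            - o.false_pos * loss_fp
            + o.true_neg * benefit_tn
            - o.false_neg * cost_fn
            - project_cost)


# Made-up per-outcome values and two made-up models.
values = dict(gain_tp=50.0, loss_fp=10.0, benefit_tn=0.0,
              cost_fn=25.0, project_cost=1000.0)
model_a = Outcomes(true_pos=80, false_pos=30, true_neg=870, false_neg=20)
model_b = Outcomes(true_pos=60, false_pos=5, true_neg=895, false_neg=40)

for name, model in (("A", model_a), ("B", model_b)):
    print(name, net_benefit(model, **values))
```

With these invented numbers, model A comes out ahead even though it makes more false positive predictions, because a missed positive is priced higher than a false alarm.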
Data Mining Products/Tools
- DARWIN, from Oracle
- Intelligent Data Miner, from IBM
- Intermedia Text with Oracle Database, with the context query feature (theme-based document retrieval)
For more info:
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
Data Mining Products/Tools
- New specification being proposed by Sun for a Data Mining API *
- SQL Server 2000: data mining and English query writing features
- Verity Knowledge Organizer
For more info:
* http://java.sun.com/aboutJava/communityprocess/jsr_073_dmapi.html#3
Additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html
DARWIN
Functions:
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)
Approach:
1. Plan
2. Prepare the dataset
3. Build and use models
DARWIN
- The problem is defined in terms of data fields and data records.
- The fields are classified as follows:
  - Categorical and ordered fields
  - Predictive fields
  - Target fields
- A DARWIN dataset file has to be created (using a descriptor file) containing all the records in the problem domain.
DARWIN - Models
- Tree model: based on the classification and regression tree algorithm
- Net model: a feed-forward multilayer neural network
- Match model: a memory-based reasoning model, using a k-nearest-neighbor algorithm
DARWIN – Tree Model
Workflow diagram, in order:
1. Training data -> Create Tree
2. Test/Evaluate Tree (information on error rates of pruned sub-trees)
3. Input prediction dataset -> Predict with Tree (using the selected sub-tree)
4. Merged input and output prediction dataset -> Analyze Results
DARWIN – Net Model
Workflow diagram, in order:
1. Create Net -> neural network model
2. Training dataset -> Train Net (information on error rates of pruned sub-trees) -> trained neural network
3. Input prediction dataset -> merged input and output prediction dataset
4. Analyze Results
DARWIN – Match Model
Workflow diagram, in order:
1. Training data -> Create Match Model
2. Optimize match weights
3. Input prediction dataset -> Predict with Match
4. Merged input and output prediction dataset -> Analyze Results
DARWIN – Analyzing
- Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
- Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset.
- Frequency Count: provides information on the frequency with which particular data values appear in a dataset.
DARWIN – Analyzing
- Performance Matrix: can be used to compare simple fields or simple functions of fields.
- Sensitivity: provides a model showing the relative importance of the attributes used in building a model.
DARWIN – Code Generation
- Darwin can generate C, C++, or Java code for a Tree or Net model, so that a prediction function can be called from an application program.
- Java code can also be generated to embed a model in a web applet.
For more info:
http://technet.oracle.com/docs/products/datamining/doc_index.htm
DARWIN
For more info:
- http://technet.oracle.com/software/products/intermedia/software_index.html
- Oracle Data Mining data sheet
- Oracle Data Mining Solutions
  http://www.oracle.com/ip/analyze/warehouse/datamining/
  http://www.oracle.com/oramag/oracle/98-Jan/fast.html
- Managing Unstructured Data with Oracle 8
  http://technet.oracle.com/products/datamining/
- Product manuals
DARWIN (figure)
Oracle – Intermedia Text
- Uses a ranking technique called theme proving.
- Documents are grouped into categories and subcategories.
- Integrated with the Oracle 8 database.
- No training or tuning is required.
Oracle – Intermedia Text
Lexical knowledge base:
- 200,000 concepts from very broad domains
- 2,000 major categories
- Concepts are mapped to one or more words or phrases in canonical form
- Each of these has alternate inflectional variations, acronyms, and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, such as part of speech
Oracle – Intermedia Text
Theme extraction:
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.
- All the ancestor themes are also included in the result.
- Theme proving is done before the final ranking.
Queries:
- Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query
Oracle – Intermedia Text
Oracle at TREC-8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm):
Measure                               Value
Recall at 1000                        71.57% (3384/4728)
Average precision                     41.30%
Initial precision (at recall 0.0)     92.79%
Final precision (at recall 1.0)       7.91%
Intermedia Text - Model (figure)
Interface Options (figure)
Language Selection
- Java for the robot
- PL/SQL for data retrieval
Code Execution (figure)
Overview of the System
Architecture diagram; labeled components: customer browser, client browser, server process listening at port 80, web server, tag stripper, JDBC, Oracle 8i, Intermedia Text.
Intermedia Text
Steps for building an application:
- Load the documents
- Index the documents
- Issue queries
- Present the documents that satisfy the query
Loading Methods
- Insert statements
- SQL*Loader
- Ctxsrv: a server daemon process which builds the index at regular intervals
- Ctxload utility, used for:
  - Thesaurus import/export
  - Text loading
  - Document updating/exporting
Create and Populate a Simple Table
    CREATE TABLE quick (
      quick_id  NUMBER CONSTRAINT quick_pk PRIMARY KEY,
      text      VARCHAR2(80)
    );
    INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
    INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
    INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
    COMMIT;
Run a Text Query
    SELECT text FROM quick
    WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

    DRG-10599: column is not indexed
- You must have a Text index on a column before you can do a "contains" query on it.
Create the Text Index
    CREATE INDEX quick_text ON quick ( text )
    INDEXTYPE IS CTXSYS.CONTEXT;
- CTXSYS is the system user for interMedia Text.
- The INDEXTYPE keyword is a feature of the Extensible Indexing Framework.
Run a Text Query
    SELECT text FROM quick
    WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

    TEXT
    ----------------------
    The cat sat on the mat
- You should regard the CONTAINS function as boolean in meaning.
- It is implemented as a number since SQL does not have a boolean datatype.
- The only sensible way to use it is with > 0.
Run a Text Query
    SELECT SCORE(42) s, text FROM quick
    WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
    ORDER BY s;

    S  TEXT
    -- ---------------------------
    7  The dog barked like a dog
    4  The fox jumped over the dog
- The better the match, the higher the score.
- The value can be used in ORDER BY but has no absolute significance.
- The score is zero when the query is not matched.
Intermedia Text - Indexing Pipeline
Pipeline diagram: Database -> (column data) -> Datastore -> (doc) -> Filter -> (filtered doc text) -> Sectioner -> (plain text) -> Lexer -> (tokens) -> Engine -> (index data). The Sectioner also passes section offsets to the Engine.
- The first step is creating an index.
- Datastore: reads the data out of the table (for a URL datastore it performs a 'GET').
Intermedia Text - Indexing Pipeline
- Filter: transforms the data into some text type. This is needed because some formats may be binary, as when storing doc, pdf, or HTML types.
- Sectioner: converts to plain text, removing tags and invisible information.
- Lexer: splits the text into discrete tokens.
- Engine: takes the tokens from the lexer, the offsets from the sectioner, and a stoplist to build an index.
Intermedia Text - Indexing Pipeline
Example of index creation. The statements
    INSERT INTO docs VALUES (1, 'first document');
    INSERT INTO docs VALUES (2, 'second document');
produce the index
    DOCUMENT  doc 1 position 2, doc 2 position 2
    FIRST     doc 1 position 1
    SECOND    doc 2 position 1
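The index in this example is a positional inverted index: each token maps to the documents and word positions where it occurs. The sketch below shows how the lexer and engine steps could produce it for the two sample rows. It is a simplified illustration in Python, not interMedia Text's implementation; the names `lex` and `build_index` and the tokenization rule are assumptions.

```python
import re
from collections import defaultdict


def lex(text):
    """Lexer step (assumed rule): split plain text into upper-cased word tokens."""
    return [token.upper() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, stoplist=frozenset()):
    """Engine step: map each token to (doc_id, position) pairs, positions from 1.

    `stoplist` holds upper-cased tokens to leave out of the index.
    """
    index = defaultdict(list)
    for doc_id, text in docs:
        for position, token in enumerate(lex(text), start=1):
            if token not in stoplist:
                index[token].append((doc_id, position))
    return dict(index)


docs = [(1, "first document"), (2, "second document")]
index = build_index(docs)
for token in sorted(index):
    print(token, index[token])
```

Storing positions as well as document ids is what makes phrase queries such as 'sat on the mat' possible, since the engine can check that the words are adjacent.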
Testing procedure
- Document set from newsgroups
  - 122 documents from a text mining site
  - Loaded using insert statements
  - File datastore used
- Documents (HTML) from browsing
  - 20 documents
  - Loaded from the server process
  - URL datastore used
Newsgroup Results
1. Religion, Atheism: 15, on the Bible, Islam, religious beliefs
2. Comp-os-ms-windows-misc: 17, about operating systems, protocols, installation
3. Comp.graphics: 27, on hardware and software for computer graphics
4. Ice Hockey: 18
5. Computer hardware: 12, on installation of different peripheral devices
6. Mideast.politics: 14, on political developments in the Mideast
7. Science.space: 19, on various space programs, devices, theories
Newsgroup Results
Group                        Retrieved  Wrong  Not retrieved  Recall  Precision
Science and technology       120        16     1              99%     78%
Computer Hardware Industry   12         0      5              71%     100%
Government                   103        26     8              90%     74%
Newsgroup Results
Group                Retrieved  Wrong  Not retrieved  Recall  Precision
Politics             17         3      0              100%    82%
Military             5          1      0              80% (only one value given)
Social Environment   48         2      14             77%     96%
Religion             22         3      2              90%     86%
Islam                4          0      0              100% (only one value given)
Leisure recreation   22         4      5              78%     82%
Newsgroup Results
Group    Retrieved  Wrong  Not retrieved  Recall  Precision
Sports   21         1      0              90% (only one value given)
Hockey   18         0      0              100% (only one value given)
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
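The recall and precision definitions can be applied directly to the table columns. The sketch below assumes that correct positive predictions equal Retrieved minus Wrong, and that the positive examples are those correct predictions plus the Not Retrieved documents; the slides do not state this mapping, so treat it as an interpretation. The function name `recall_precision` is hypothetical.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Return (recall, precision) from the three count columns of the table.

    Assumed mapping: correct = retrieved - wrong,
    positive examples = correct + not_retrieved.
    """
    correct = retrieved - wrong
    positives = correct + not_retrieved
    recall = correct / positives if positives else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


# Counts from the Government row: 103 retrieved, 26 wrong, 8 not retrieved.
recall, precision = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
print(f"recall={recall:.1%} precision={precision:.1%}")
```

Under this reading the Government row gives recall 77/85 and precision 77/103, which agrees with the 90% and 74% reported for that group if the percentages were truncated rather than rounded.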
Query Syntax: Binary Operators
Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog
Semantics: Binary Operators
- The semantics of all the binary operators are defined in terms of SCORE.
- However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
  - The score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently.
  - But when "word 1" occurs N times in document D, its score is lower than when "word 2" occurs N times in document D, if "word 1" occurs more often in the whole document set than "word 2".
The Salton Algorithm
- interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products.
- The score for a word is proportional to
      f ( 1 + log ( N / n ) )
  where
  - f is the frequency of the search term in the document,
  - N is the total number of documents,
  - n is the number of documents which contain the search term.
- The score is converted into an integer in the range 0 - 100.
The Salton Algorithm
Assumption:
- Inverse frequency scoring assumes that terms occurring frequently across a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
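A small numeric sketch makes the inverse frequency assumption concrete. It computes only the quantity the score is said to be proportional to, f(1 + log(N/n)). The base-10 logarithm is an assumption, since the slides do not give the base, and the proportionality constant and the conversion to the 0 - 100 range are left out. The function name `term_weight` is hypothetical.

```python
import math


def term_weight(f, N, n):
    """Unscaled weight f * (1 + log10(N / n)); zero if the term is absent.

    f: occurrences of the term in the document
    N: total number of documents
    n: number of documents containing the term
    The log base is assumed; scaling to 0-100 is not modeled.
    """
    if f == 0 or n == 0:
        return 0.0
    return f * (1 + math.log10(N / n))


# Same in-document frequency, different rarity across a 1000-document set.
rare = term_weight(f=3, N=1000, n=10)      # term appears in 10 documents
common = term_weight(f=3, N=1000, n=1000)  # term appears in every document
print(rare, common)
```

The two calls use the same in-document frequency, so any difference between the weights comes from how widely the term is spread over the document set.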
The Salton Algorithm
This table assumes that only one document in the set contains the query term.
# of documents in document set   Occurrences of term in document needed to score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4
Summary of operators
- Binary operators: & | = - ~ ,
- Built-in expansion: ? $ !
- Thesaurus: BT, BTG, BTP, BTI, NTG, NTP, NTI, PT, RT, SYN, TRSYN, TT
Summary of operators
- Stored query expression: SQE
- Grouping and escaping: () {}
- Special: NEAR, WITHIN, ABOUT
Application Details - Customer Profile Analyzer
- The HTTP server (for user web page caching) is started.
- The Oracle web server is also started.
Log In Screen - Customer & User
- The log in screen is used both by the customer and by the users.
- The Oracle web server takes care of the secure connections.
- For the HTTP server, the user id is common for the session; no user can invoke a document from the server without a user id.
Customer Interface – HTTP Server
- The user uses the interface provided by the custom HTTP server.
Main User Screen
- The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
Selection of Category and Options
The user chooses a category and other options, such as:
- Generating theme
- Generating gist
- Generating marked-up text
- Date range
Results Page – Gist Generation
- This page can be used for drilling down to the actual document, which opens in the browser (generated by the filter option).
- Theme and gist can be generated from this screen.
Search Screen
- The search screen has advanced options such as fuzzy search and about search.
- A chain of expressions can be used, with operators (such as 'not', 'or', 'and') joining the statements.
Conclusion
- New estimation methods are trying to find more meaning in text.
- Industry has capable text mining products and is constantly improving the technology.
- Unstructured data mining still has a long way to go.