Data Mining Motivation Necessity is the Mother of
- Slides: 20
Data Mining Motivation: “Necessity is the Mother of Invention”
• Automated data collection tools and mature database technology have led to tremendous amounts of stored data.
• We are drowning in data, but starving for knowledge!
• Solution: Data mining
  – Extract interesting rules, patterns, and constraints
  – Reduce volume and raise information/knowledge levels
What Is Data Mining?
• Data mining:
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• Alternative names:
  – Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data prospecting, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – (Deductive) query processing.
Applications
• Database analysis and decision support
  – Market analysis and management
    • Target marketing, customer relationship management, market basket analysis, market segmentation
  – Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (newsgroups, email, documents) and Web analysis
  – Intelligent query answering
More Applications
• Sports
  – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
• Astronomy
  – 22 quasars discovered with the help of data mining
• Internet Web Surf-Aid
  – IBM Surf-Aid applies data mining algorithms to Web access logs to discover customer preferences and behaviors, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline. Databases → Data Cleaning/Integration (missing data, outliers, noise, errors) → Data Warehouse → Selection (feature extraction, attribute selection) → Task-relevant Data → Data Mining (classification, clustering, ARM) → Pattern Evaluation]
Association Rule Mining: The “Walmart” Example
Rule: {Diaper, Milk} => Beer

TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

Support = count{Diaper, Milk, Beer} / |D| = 2/5 = 0.4
Confidence = count{Diaper, Milk, Beer} / count{Diaper, Milk} = 2/3 ≈ 0.67
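The support and confidence figures for this rule can be checked directly against the five transactions. The following is a minimal Python sketch, not a full association rule miner; the helper names `count`, `support`, and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the slide's table (TID -> set of items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Beer", "Diaper", "Bread", "Eggs"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Bread", "Diaper", "Milk"},
}


def count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def support(antecedent, consequent, db):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, db) / len(db)


def confidence(antecedent, consequent, db):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent, db) / count(antecedent, db)


lhs, rhs = {"Diaper", "Milk"}, {"Beer"}
print(f"support    = {support(lhs, rhs, transactions):.2f}")
print(f"confidence = {confidence(lhs, rhs, transactions):.2f}")
```

Transactions 3, 4, and 5 contain both Diaper and Milk, and of those only 3 and 4 also contain Beer, which is where the ratios 2/5 and 2/3 come from.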
Precision Ag example: Find image antecedents that imply high yield
[Figure: TIFF image and yield map]
• High green reflectance => high yield (obvious)
• High (NearInfraRed – Red) => high yield (higher confidence)
Grasshopper Infestation Prediction
• Grasshoppers caused significant economic loss last year.
• These insects are likely to return this year.
• Early prediction of the infestation is a key step in reducing damage.
Association rule mining on remotely sensed imagery holds significant potential for early detection. How do we derive a signature of initial infestation from the RGB bands?
Gene Regulation Pathway Discovery example
• Results of clustering may indicate that nine genes are involved in a pathway.
• High-confidence rule mining on that cluster will discover the relationships among the genes in which the expression of one gene (e.g., Gene 2) is regulated by others. Other genes (e.g., Gene 4 and Gene 7) may not be directly involved in regulating Gene 2 and can therefore be excluded.
[Figure: clustering yields a cluster of Gene 1 through Gene 9; ARM is then applied to that cluster]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on the following]
• Database technology
• Statistics
• Machine learning
• Information science
• Visualization
• Other disciplines
Spatial Data Formats (Cont.)
Example: a 2 x 2 image with two 8-bit bands

BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)

BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)

BSQ format (2 files)
  Band 1: 254 127 14 193
  Band 2: 37 240 200 19
BIL format (1 file)
  254 127 37 240 14 193 200 19
BIP format (1 file)
  254 37 127 240 14 200 193 19
bSQ format (16 files, one per bit position of each band; B11 is the most significant bit of band 1)
  B11: 1 0 0 1   B12: 1 1 0 1   B13: 1 1 0 0   B14: 1 1 0 0
  B15: 1 1 1 0   B16: 1 1 1 0   B17: 1 1 1 0   B18: 0 1 0 1
  B21: 0 1 1 0   B22: 0 1 1 0   B23: 1 1 0 0   B24: 0 1 0 1
  B25: 0 0 1 0   B26: 1 0 0 0   B27: 0 0 0 1   B28: 1 0 0 1
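The four layouts can be derived mechanically from the two band grids. The following Python sketch assumes the 2 x 2 pixel values shown on the slide, and assumes that bit file Bij holds bit j of band i with j = 1 as the most significant bit. The function names are illustrative and do not come from the slides.

```python
# 2x2 image with two 8-bit bands, values taken from the slide.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]


def to_bsq(bands):
    """Band sequential: one flat (row-major) list per band."""
    return [[v for row in band for v in row] for band in bands]


def to_bil(bands):
    """Band interleaved by line: row r of every band, then row r + 1."""
    rows = len(bands[0])
    return [v for r in range(rows) for band in bands for v in band[r]]


def to_bip(bands):
    """Band interleaved by pixel: all band values of one pixel together."""
    rows, cols = len(bands[0]), len(bands[0][0])
    return [band[r][c] for r in range(rows) for c in range(cols) for band in bands]


def to_bit_sequential(bands, bits=8):
    """bSQ: one bit list per (band, bit position).

    Assumption: bit position 1 is the most significant bit.
    """
    files = {}
    for b, flat in enumerate(to_bsq(bands), start=1):
        for j in range(1, bits + 1):
            shift = bits - j
            files[f"B{b}{j}"] = [(v >> shift) & 1 for v in flat]
    return files


print(to_bsq(bands))
print(to_bil(bands))
print(to_bip(bands))
bsq_bits = to_bit_sequential(bands)
print(len(bsq_bits), bsq_bits["B11"], bsq_bits["B28"])
```

Splitting each band into one file per bit position is what lets every bit plane be stored and compressed on its own, which the P-tree slides build on.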
Peano Count Tree (P-tree)
• P-trees are a lossless, compressed representation of data, organized by recursive quadrant decomposition.
• NDSU holds patents on P-tree technology.
An example of Ptree
[Figure: a bit image and its P-tree. The root count is 55; the four level-1 quadrant counts are 16, 8, 15, and 16, and mixed quadrants are subdivided further.]
• Peano or Z-ordering
• Quadrant
• Root Count
An example of Ptree
[Figure: the P-tree with root count 55, annotated with level, fan-out, and quadrant IDs]
• Pure (Pure-1/Pure-0) quadrant
• Root Count
– Level
– Fan-out
– QID (Quadrant ID)
Example: pixel (7, 1) = (111, 001) in binary has QID 10.10.11 = 2.2.3
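The ideas named on this slide (root count, pure quadrants, Peano ordering, quadrant IDs) can be illustrated with a small sketch. This is not the patented P-tree implementation. It is a plain recursive quadrant count written for illustration, and it uses a hypothetical 4 x 4 bit image rather than the image from the slide. The names `build_ptree` and `quadrant_id` are assumptions.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return a nested (count, children) node for a square bit image.

    Children are listed in Peano (Z) order: upper-left, upper-right,
    lower-left, lower-right. Pure quadrants (all 0s or all 1s) are not
    subdivided, which is where the compression comes from.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, [])
    half = size // 2
    children = [
        build_ptree(bits, r0 + dr, c0 + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    return (count, children)


def quadrant_id(row, col, levels):
    """Interleave row and column bits, most significant first, into a QID."""
    digits = []
    for shift in range(levels - 1, -1, -1):
        r_bit = (row >> shift) & 1
        c_bit = (col >> shift) & 1
        digits.append(str(2 * r_bit + c_bit))
    return ".".join(digits)


# Hypothetical 4x4 bit image (not the image from the slide).
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
root_count, children = build_ptree(image)
print(root_count, [child[0] for child in children])

# Pixel (7, 1) in an 8x8 image: row 111, column 001.
print(quadrant_id(7, 1, levels=3))
```

In this toy image the upper-left and lower-right quadrants are pure-1, so they are stored as a single count with no children. Only the two mixed quadrants are expanded to the bit level.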
Tuple Count Cube (T-cube)
The (v1, v2, v3)th cell of the T-cube contains the Root Count of P(v1, v2, v3) = P_{1,v1} AND P_{2,v2} AND P_{3,v3}
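To make the cell definition concrete, the sketch below fills a T-cube for three 1-bit bands. It assumes flat, uncompressed bit lists as stand-ins for the value P-trees, so "AND then root count" becomes "count the pixels set in every mask". The pixel data and the function names are hypothetical.

```python
from itertools import product


def value_mask(band, value):
    """Stand-in for the value P-tree P_{i,v}: 1 where the band equals `value`."""
    return [1 if x == value else 0 for x in band]


def root_count_of_and(masks):
    """Root count of the ANDed masks = number of pixels set in every mask."""
    return sum(1 for bits in zip(*masks) if all(bits))


def build_tcube(bands, values):
    """Map each tuple (v1, ..., vn) to the root count of P_{1,v1} AND ... AND P_{n,vn}."""
    cube = {}
    for combo in product(values, repeat=len(bands)):
        masks = [value_mask(band, v) for band, v in zip(bands, combo)]
        cube[combo] = root_count_of_and(masks)
    return cube


# Hypothetical data: three 1-bit bands over six pixels.
b1 = [0, 0, 1, 1, 0, 1]
b2 = [0, 1, 1, 1, 0, 0]
b3 = [1, 1, 0, 1, 1, 0]
tcube = build_tcube([b1, b2, b3], values=(0, 1))
print(tcube[(0, 0, 1)])
print(sum(tcube.values()))
```

Every pixel falls into exactly one cell, so the cell counts add up to the number of pixels.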
High confidence Association Rules
• Assume: minimum confidence threshold 80%, minimum support threshold 10%
• Start with 1-bit values and 2 bands, B1 and B2

T-cube counts, with row and column sums and the 80% confidence thresholds on those sums:

              B1=0   B1=1   sum   threshold
B2=0           25     15     40     32
B2=1            5     19     24     19.2
sum            30     34
threshold      24     27.2

C: B1={0} => B2={0}, c = 25/30 = 83.3%
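The threshold check on this slide can be reproduced in a few lines. The sketch below reads the slide's grid as a 2 x 2 T-cube with counts 25, 15, 5, and 19 (row sums 40 and 24, column sums 30 and 34) and tests every single-antecedent rule against both thresholds. The function name `mine_rules` and the dictionary layout are assumptions, not the authors' implementation.

```python
# T-cube counts read from the slide: key is (B1 value, B2 value).
tcube = {(0, 0): 25, (1, 0): 15, (0, 1): 5, (1, 1): 19}
MIN_CONF = 0.80
MIN_SUP = 0.10


def mine_rules(tcube, min_conf, min_sup):
    """Return (antecedent, consequent, confidence, support) tuples
    for the rules that pass both thresholds."""
    total = sum(tcube.values())
    rules = []
    for (v1, v2), n in tcube.items():
        sup = n / total
        if sup < min_sup:
            continue
        # Marginal sums: pixels with B1 = v1, and pixels with B2 = v2.
        b1_sum = sum(c for (a, _), c in tcube.items() if a == v1)
        b2_sum = sum(c for (_, b), c in tcube.items() if b == v2)
        if n / b1_sum >= min_conf:
            rules.append((f"B1={{{v1}}}", f"B2={{{v2}}}", n / b1_sum, sup))
        if n / b2_sum >= min_conf:
            rules.append((f"B2={{{v2}}}", f"B1={{{v1}}}", n / b2_sum, sup))
    return rules


rules = mine_rules(tcube, MIN_CONF, MIN_SUP)
for lhs, rhs, conf, sup in rules:
    print(f"{lhs} => {rhs}  c = {conf:.1%}  s = {sup:.1%}")
```

Comparing a cell count with 80% of its row or column sum, as the slide's threshold values do, is the same test as `n / sum >= 0.80`. The closest miss here is B2={1} => B1={1}, where 19 falls just under the threshold of 19.2.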
The End Thank you |: ~)