Data Mining Motivation: Necessity is the Mother of Invention

Data Mining Motivation: “Necessity is the Mother of Invention”
• Automated data collection tools and mature database technology have led to tremendous amounts of stored data.
• We are drowning in data, but starving for knowledge!
• Solution: Data mining
  – Extract interesting rules, patterns, and constraints
  – Reduce volume, raise information/knowledge levels

What Is Data Mining?
• Data mining:
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• Alternative names:
  – Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data prospecting, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  – (Deductive) query processing

Applications
• Database analysis and decision support
  – Market analysis and management: target marketing, customer relationship management, market basket analysis, market segmentation
  – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (newsgroups, email, documents) and Web analysis
  – Intelligent query answering

More Applications
• Sports
  – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
• Astronomy
  – 22 quasars were discovered with the help of data mining
• Internet Web Surf-Aid
  – IBM Surf-Aid applies data mining algorithms to Web access logs to discover customer preferences and behaviors, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD pipeline, from raw databases to evaluated patterns]
• Databases
• Data cleaning/integration: missing data, outliers, noise, errors
• Data warehouse
• Selection: feature extraction, attribute selection
• Task-relevant data
• Data mining: classification, clustering, ARM
• Pattern evaluation

Association Rule Mining: The “Walmart” Example
Rule: {Diaper, Milk} => Beer
TID   Items
1     Bread, Milk
2     Beer, Diaper, Bread, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Bread, Diaper, Milk
$$\text{Support} = \frac{\text{count}(\{\text{Diaper, Milk, Beer}\})}{|D|} = \frac{2}{5} = 0.4$$
$$\text{Confidence} = \frac{\text{count}(\{\text{Diaper, Milk, Beer}\})}{\text{count}(\{\text{Diaper, Milk}\})} = \frac{2}{3} \approx 0.67$$
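
The support and confidence of this rule can be checked against the five transactions with a short script. This is a minimal sketch in Python; the helper names `count`, `support`, and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the slide, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Milk"},
    2: {"Beer", "Diaper", "Bread", "Eggs"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Bread", "Diaper", "Milk"},
}


def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain the whole rule."""
    return count(transactions, antecedent | consequent) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also hold the consequent."""
    both = count(transactions, antecedent | consequent)
    return both / count(transactions, antecedent)


rule_support = support(TRANSACTIONS, {"Diaper", "Milk"}, {"Beer"})
rule_confidence = confidence(TRANSACTIONS, {"Diaper", "Milk"}, {"Beer"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

{Diaper, Milk} occurs in transactions 3, 4, and 5, and Beer joins it in transactions 3 and 4, which gives a support of 2/5 and a confidence of 2/3.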

Precision Ag example: Find image antecedents that imply high yield
[Figure: TIFF image alongside the corresponding yield map]
• High green reflectance => high yield (obvious)
• High (NearInfraRed – Red) => high yield (higher confidence)

Grasshopper Infestation Prediction
• Grasshoppers caused significant economic loss last year.
• These insects are likely to visit again this year.
• Early prediction of the infestation is a key step to decrease damage.
Association rule mining on remotely sensed imagery holds significant potential to achieve early detection. How do we identify the signature of an initial infestation from the RGB bands?

Gene Regulation Pathway Discovery example
• Results of clustering may indicate that nine genes are involved in a pathway.
• High-confidence rule mining on that cluster will discover the relationships among the genes in which the expression of one gene (e.g., Gene 2) is regulated by others. Other genes (e.g., Gene 4 and Gene 7) may not be directly involved in regulating Gene 2 and can therefore be excluded.
[Figure: a cluster containing Gene 1 through Gene 9, followed by the gene relationships found by ARM on that cluster]

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on the following fields]
• Database technology
• Statistics
• Machine learning
• Information science
• Visualization
• Other disciplines

Spatial Data Formats (Cont.)
A 2 × 2 image with two bands:
BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)
BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)
• BSQ format (2 files)
  – Band 1: 254 127 14 193
  – Band 2: 37 240 200 19
• BIL format (1 file): 254 127 37 240 14 193 200 19
• BIP format (1 file): 254 37 127 240 14 200 193 19
• bSQ format (16 files): B11 through B18 and B21 through B28, one file per bit position of each band
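
The four layouts differ only in the order in which the same eight pixel values (or their bits) are written out. The following Python sketch derives each layout from the 2 × 2, two-band image on the slide. The function names are illustrative, and the bSQ part assumes that bit 1 is the most significant bit, so that file B11 holds the top bit of every band-1 pixel; the slide text does not state the bit order.

```python
# The image from the slide, indexed as BANDS[band][row][col].
BANDS = [
    [[254, 127], [14, 193]],  # band 1
    [[37, 240], [200, 19]],   # band 2
]


def to_bsq(bands):
    """Band sequential: one file per band, pixels in row-major order."""
    return [[v for row in band for v in row] for band in bands]


def to_bil(bands):
    """Band interleaved by line: row r of band 1, row r of band 2, ..."""
    n_rows = len(bands[0])
    return [v for r in range(n_rows) for band in bands for v in band[r]]


def to_bip(bands):
    """Band interleaved by pixel: all band values of one pixel together."""
    n_rows, n_cols = len(bands[0]), len(bands[0][0])
    return [band[r][c]
            for r in range(n_rows) for c in range(n_cols) for band in bands]


def to_bit_planes(bands, n_bits=8):
    """Bit sequential: one file per (band, bit position).

    Assumption: bit position 1 is the most significant bit.
    """
    planes = {}
    for b, pixels in enumerate(to_bsq(bands), start=1):
        for j in range(1, n_bits + 1):
            planes[f"B{b}{j}"] = [(v >> (n_bits - j)) & 1 for v in pixels]
    return planes


bsq = to_bsq(BANDS)
bil = to_bil(BANDS)
bip = to_bip(BANDS)
planes = to_bit_planes(BANDS)
print(bil)            # [254, 127, 37, 240, 14, 193, 200, 19]
print(bip)            # [254, 37, 127, 240, 14, 200, 193, 19]
print(planes["B11"])  # [1, 0, 0, 1]
```

Under that bit-order assumption, each of the 16 bSQ files is a 4-bit vector, one bit per pixel, which is the input from which a P-tree is built.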

Peano Count Tree (P-tree)
• P-trees are a lossless, compressed representation of data in a recursive, quadrant-oriented structure.
• NDSU holds patents on P-tree technology.

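An example of Ptree
[Figure: an 8 × 8 bit image and its P-tree]
Quadrant counts in the tree:
• Root count: 55
• Level 1: 16, 8, 15, 16
• Level 2, under the quadrant with count 8: 3, 0, 4, 1
• Level 2, under the quadrant with count 15: 4, 4, 3, 4
• Mixed level-2 quadrants (counts 3, 1, 3) expand to their individual bits
Terms illustrated: Peano or Z-ordering, quadrant, root count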
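
The counts on this slide can be reproduced by recursively splitting a bit image into quadrants and stopping at pure quadrants. The sketch below is a minimal Python illustration, not the patented implementation. The 8 × 8 image in it is an assumption: the slide's own image did not survive extraction, so `IMAGE` is a hypothetical bit pattern chosen to give the same counts (55; 16, 8, 15, 16; and so on). The tuple-based node layout is also illustrative.

```python
# Hypothetical 8 x 8 bit image (an assumption, chosen so that its
# quadrant counts match the ones listed on the slide).
IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(bits, top=0, left=0, size=None):
    """Return a node as (count, children) for one square quadrant.

    children is None for a pure quadrant (all 0s or all 1s) and for a
    single pixel. Otherwise it lists the four sub-quadrants in Peano
    (Z) order: upper-left, upper-right, lower-left, lower-right.
    """
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[top][left], None)
    half = size // 2
    children = [
        build_ptree(bits, top + dr, left + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    total = sum(child[0] for child in children)
    if total == 0 or total == size * size:
        return (total, None)  # pure quadrant: children are not stored
    return (total, children)


root = build_ptree(IMAGE)
print(root[0])                          # 55
print([child[0] for child in root[1]])  # [16, 8, 15, 16]
```

Pure quadrants carry only their count, which is where the compression comes from: the two level-1 quadrants with count 16 need no subtree at all.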

An example of Ptree
[Figure: an 8 × 8 bit image and its P-tree, with quadrants numbered 0 to 3 at each level]
• Pure (Pure-1/Pure-0) quadrant
• Root Count
• Level
• Fan-out
• QID (Quadrant ID): the pixel at (7, 1), in binary (111, 001), has QID 2.2.3, in binary 10.10.11
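
The QID of a pixel follows from interleaving the bits of its two coordinates, most significant bit first. The sketch below assumes that (7, 1) is given as (row, column) and that the row bit comes first in each pair; both are readings of the slide that happen to reproduce its 2.2.3 result, and the function name `quadrant_id` is illustrative.

```python
def quadrant_id(row, col, levels):
    """Peano (Z-order) path to pixel (row, col) in a 2**levels-square image.

    At each level, one row bit and one column bit (most significant
    first) are paired into a quadrant number from 0 to 3.
    """
    path = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        path.append(2 * row_bit + col_bit)
    return path


path = quadrant_id(7, 1, levels=3)
print(".".join(str(q) for q in path))            # 2.2.3
print(".".join(format(q, "02b") for q in path))  # 10.10.11
```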

Tuple Count Cube (T-cube)
The $(v_1, v_2, v_3)$th cell of the T-cube contains the Root Count of
$$P(v_1, v_2, v_3) = P_{1,v_1} \;\text{AND}\; P_{2,v_2} \;\text{AND}\; P_{3,v_3}$$
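
As a rough illustration of the definition, the sketch below fills a T-cube by ANDing per-band value masks and counting the 1s that survive. Flat 0/1 lists stand in for the P-trees here, so this shows what each cell counts rather than how P-trees are ANDed. The three 1-bit bands are hypothetical data, and the names `value_mask` and `tcube` are illustrative.

```python
from itertools import product


def value_mask(band, value):
    """1 where the band holds `value`, else 0 (stands in for P_{i,value})."""
    return [1 if v == value else 0 for v in band]


def tcube(bands, values=(0, 1)):
    """Map each value tuple (v1, ..., vn) to its tuple count."""
    cube = {}
    for combo in product(values, repeat=len(bands)):
        masks = [value_mask(band, v) for band, v in zip(bands, combo)]
        # Pixel-wise AND across the bands, then count the surviving 1s.
        cube[combo] = sum(1 for bits in zip(*masks) if all(bits))
    return cube


# Hypothetical 1-bit values for three bands over the same 8 pixels.
b1 = [0, 0, 1, 1, 0, 1, 0, 1]
b2 = [0, 1, 1, 1, 0, 0, 0, 1]
b3 = [1, 1, 0, 1, 1, 0, 1, 1]

cube = tcube([b1, b2, b3])
print(cube[(0, 0, 1)])     # 3
print(sum(cube.values()))  # 8
```

Every pixel falls into exactly one cell, so the cell counts always sum to the number of pixels.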

High confidence Association Rules
• Assume minimum confidence threshold 80%, minimum support threshold 10%
• Start with 1-bit values and 2 bands, B1 and B2
Tuple counts for each value pair (label i,v means band i has value v; each threshold is 80% of the corresponding sum):
             1,0    1,1    sums   thresholds
2,0          25     15     40     32
2,1           5     19     24     19.2
sums         30     34
thresholds   24     27.2
C: B1={0} => B2={0}, c = 25/30 = 83.3%
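
The rule on this slide can be recovered from the four tuple counts alone. The sketch below is a minimal Python illustration. It assumes the cell layout read from the flattened figure (25 tuples with B1=0 and B2=0, 15 with B1=1 and B2=0, 5 with B1=0 and B2=1, 19 with B1=1 and B2=1), and the function name `high_confidence_rules` is illustrative.

```python
# Tuple counts keyed by (B1 value, B2 value), as read from the slide.
COUNTS = {(0, 0): 25, (1, 0): 15, (0, 1): 5, (1, 1): 19}
MIN_CONF = 0.80
MIN_SUPP = 0.10


def high_confidence_rules(counts, min_conf, min_supp):
    """Return (rule, confidence) pairs that meet both thresholds."""
    total = sum(counts.values())
    rules = []
    for (v1, v2), n in counts.items():
        if n / total < min_supp:
            continue  # the value pair itself is too rare
        # Marginal sums: all tuples that share the antecedent's value.
        b1_sum = sum(c for (a, _), c in counts.items() if a == v1)
        b2_sum = sum(c for (_, b), c in counts.items() if b == v2)
        if n / b1_sum >= min_conf:
            rules.append((f"B1={{{v1}}} => B2={{{v2}}}", n / b1_sum))
        if n / b2_sum >= min_conf:
            rules.append((f"B2={{{v2}}} => B1={{{v1}}}", n / b2_sum))
    return rules


rules = high_confidence_rules(COUNTS, MIN_CONF, MIN_SUPP)
for rule, conf in rules:
    print(f"{rule}  c = {conf:.1%}")
```

Comparing a cell against 80% of its row or column sum, as the slide's threshold figures do, is the same test as `n / sum >= 0.80`. The cell with count 19 narrowly misses its threshold of 19.2, so only B1={0} => B2={0} qualifies.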

The End Thank you |: ~)