Eamonn Keogh: Data Mining

What is data mining?
• Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.
• In my lab, we tend to look at data and problems that no one else looks at.

Data Mining People
• Eamonn Keogh
• Vagelis Hristidis
• Vassilis Tsotras
• Chinya Ravishankar
• Michael Pazzani
• Christian Shelton (AI)
• Stefano Lonardi (Bioinformatics)

My Ph.D. Students
• Jessica Lin (Ph.D. 2005, George Mason University)
• Chotirat (Ann) Ratanamahatana (Ph.D. 2005, Chulalongkorn University)
• Li Wei (Ph.D. 2006, Google)
• Xiaopeng Xi (Ph.D. 2007, Yahoo)
• Dragomir Yankov (Ph.D. 2008, Yahoo)
• Lexiang Ye (Ph.D. 2010, Google)
• Xiaoyue (Elaine) Wang (Ph.D. 2010, Nokia)
• Jin-Wien Shieh (Ph.D. 2010, Microsoft)
• Qiang Zhu (Ph.D. 2011, stumbleupon.com)
• Abdullah Mueen (Ph.D. 2012, Microsoft)

• Bilson Campana (Ph.D., going to Google at Xmas)
• Thanawin (Art) Rakthanmanon (Ph.D. ongoing)
• Bing Hu (Ph.D. ongoing)
• Yuan Hao (Ph.D. ongoing)
• Jesin Zakaria (Ph.D. ongoing)
• Yipeng Chen (Ph.D. ongoing)

[Figure: leaf outlines labeled "stinging nettles" and "false nettles". Shapelet Dictionary: shapelet I, 5.1. Leaf Decision Tree: node I; yes: 0, false nettles; no: 1, stinging nettles.]

Of course, this is a decision tree; we want to eventually do clustering. However, in general, features that are good for classification are good for clustering.
• To do: on a small labeled subset of the data, learn a dictionary of shapelets. Code the large unlabeled dataset with reference to that dictionary.
• The shapelet decision tree classifier achieves an accuracy of 80.0%; the rotation-invariant one-nearest-neighbor classifier achieves 68.0%.
[Figure: Decision Tree for Arrowheads. Training data (subset) for the classes Clovis, Avonlea and Mix. Shapelet Dictionary: I (Clovis) 11.24; II (Avonlea) 85.47. Arrowhead Decision Tree: nodes I and II, leaves 0, 1 and 2.]
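A minimal sketch of how a series could be coded against a shapelet dictionary, assuming the common definition in which the distance from a series to a shapelet is the smallest Euclidean distance over all equal-length windows. The function names, the toy series and the threshold of 0.5 are hypothetical; the slides do not say how the distances are normalized or how the thresholds are learned.

```python
import math

def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    equal-length window of the series."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, d)
    return best

def encode(series, dictionary):
    """Code a series as its distance to every shapelet in the dictionary."""
    return [subsequence_distance(series, s) for s in dictionary]

def decide(series, shapelet, threshold, near_label, far_label):
    """One decision-tree node: near_label if some window of the series
    lies within the threshold of the shapelet, otherwise far_label."""
    if subsequence_distance(series, shapelet) < threshold:
        return near_label
    return far_label

# Toy data (hypothetical): a bump shapelet and two short series.
bump = [0.0, 1.0, 0.0]
with_bump = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

codes = encode(with_bump, [bump, [1.0, 1.0]])
label_a = decide(with_bump, bump, 0.5, "class A", "class B")
label_b = decide(flat, bump, 0.5, "class A", "class B")
```

The vectors returned by `encode` are the kind of feature that could later be passed to a clustering algorithm, which is the goal stated above.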

There now exist perhaps tens of millions of digitized pages of historical manuscripts, dating back to the 12th century, that feature one or more heraldic shields.
• The images are often stained, faded or torn.

Wouldn’t it be great if we could automatically hyperlink all similar shields to each other? For example, here we could link two occurrences of the Von Sax family shield. To do this, we need to consider shape, color and texture. Let’s just consider shape for now…
[Figure: pages from the Manesse Codex, an illuminated manuscript in codex form, copied and illustrated between 1304 and 1340 in Zurich.]

Indexing and Mining Rock Art
• Rock art is found on every continent except Antarctica. Australia may have 100 million examples.
• To date, computer science has had little impact on the analysis of rock art.
• A decade ago, Walt et al. summed up the state of petroglyph research by noting, “Complete-site and cross-site research thus remains impossible, incomplete, or impressionistic.”

If we assume that we have high-quality binary images of rock art, then we can do clustering, classification, indexing and motif discovery.
• One challenge is designing distance measures. For example, we would like to find [image] and [image] similar, even though one is solid and one is hollow.
[Figure: example petroglyphs labeled Atlatls, Anthropomorphs and Bighorn Sheep.]
*Zhu, Wang, Keogh, Lee (2009). Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs. SIGKDD 2009.
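To illustrate the solid-versus-hollow problem, here is a toy sketch that compares two binary images pixel by pixel and then by their edge pixels only. This is not the Generalized Hough Transform method of the cited paper; it is a much simpler stand-in, with hypothetical function names and toy images, and it assumes the two images are the same size and already aligned.

```python
def edge_pixels(image):
    """Return the (row, col) foreground pixels that touch background
    or the image border (4-neighborhood)."""
    rows, cols = len(image), len(image[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#":
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or image[nr][nc] != "#":
                    edges.add((r, c))
                    break
    return edges

def pixel_distance(a, b):
    """Count the pixels where two same-sized binary images disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def edge_distance(a, b):
    """Count the edge pixels present in one image but not the other."""
    return len(edge_pixels(a) ^ edge_pixels(b))

# Toy images (hypothetical): a solid square and a hollow square.
solid = ["#####", "#####", "#####", "#####", "#####"]
hollow = ["#####", "#...#", "#...#", "#...#", "#####"]

raw = pixel_distance(solid, hollow)       # penalizes the missing interior
by_edges = edge_distance(solid, hollow)   # ignores the interior
```

On these toy squares the raw pixel comparison reports a large difference while the edge comparison reports none. Real petroglyph images would also need tolerance to translation, scale and noise, which this sketch does not attempt.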

Why Insects Matter I
• Because they eat/destroy $40 billion+ worth of food each year.
One Example Crop/Insect: Apple Maggot (Rhagoletis pomonella)
• Apple maggots cause two types of injury: dimpling and tunneling. Dimpling occurs around the site where eggs are laid, causing the flesh to stop growing, resulting in a sunken, misshapen, dimpled area. Tunneling, done by the larvae (maggots) eating in the fruit, causes the pulp to break down, discolor, and start to rot. The tunnels are often enlarged by bacterial decay. Damaged fruit eventually becomes soft and rotten and cannot be used.
• Surround WP Crop Protectant against insects: derived from Kaolin clay, a natural mineral, it forms a barrier that acts to control insect pests. Effective & safe, but very expensive.
• Carbaryl is an insecticide that is widely used agriculturally. Effective, but likely a human carcinogen, and it kills honey bees and other pollinators.
[1] http://npic.orst.edu/factsheets/carbgen.pdf
[2] http://www.maine.gov/agriculture/pesticides/gotpests/bugs/factsheets/apple-maggot-cornell.pdf

Why Insects Matter II
• Because they kill over one million people each year.

Our Sensor
• One second of audio from our sensor.
• The Common Eastern Bumble Bee (Bombus impatiens) takes about one tenth of a second to pass the laser.
[Figure: audio trace annotated "Background noise", "Bee begins to cross laser" and "Bee has passed through the laser".]

[Figure: frequency spectrum of the recording, Frequency (Hz) on the horizontal axis, with a peak at 705 Hz, shown alongside spectra for Culex quinquefasciatus, Aedes aegypti and Bombus impatiens. Annotation: almost certainly an Aedes aegypti.]
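A peak such as the one at 705 Hz can be located by transforming the audio to the frequency domain and taking the strongest bin. The sketch below does this with a direct discrete Fourier transform on a synthetic tone. The 4000 Hz sample rate, the 0.2 second window and the signal itself are hypothetical stand-ins, not the sensor's real settings or data, and the step of matching the peak against per-species spectra is not shown.

```python
import math

def magnitude_spectrum(signal, sample_rate, max_hz):
    """Return (frequency, magnitude) pairs from a direct DFT, up to max_hz."""
    n = len(signal)
    resolution = sample_rate / n  # spacing between frequency bins, in Hz
    spectrum = []
    k = 1  # skip bin 0, which only holds the constant offset
    while k < n // 2 and k * resolution <= max_hz:
        angle = 2 * math.pi * k / n
        re = sum(x * math.cos(angle * t) for t, x in enumerate(signal))
        im = sum(x * math.sin(angle * t) for t, x in enumerate(signal))
        spectrum.append((k * resolution, math.hypot(re, im)))
        k += 1
    return spectrum

def peak_frequency(signal, sample_rate, max_hz=1000.0):
    """Frequency of the strongest bin at or below max_hz."""
    spectrum = magnitude_spectrum(signal, sample_rate, max_hz)
    return max(spectrum, key=lambda pair: pair[1])[0]

# Synthetic stand-in for a recording: a 705 Hz tone plus a weaker 120 Hz hum.
SAMPLE_RATE = 4000  # hypothetical, in Hz
N_SAMPLES = 800     # 0.2 seconds at the rate above, so bins are 5 Hz apart
signal = [
    0.1 * math.sin(2 * math.pi * 705 * t / SAMPLE_RATE)
    + 0.02 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
    for t in range(N_SAMPLES)
]
peak = peak_frequency(signal, SAMPLE_RATE)
```

A direct DFT is slow for long recordings; a fast Fourier transform from a numerical library would be the usual choice in practice.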

Eamonn Keogh
Computer Science & Engineering Department
University of California – Riverside
eamonn@cs.ucr.edu