Overview
1. Data mining: classification and clustering
2. Various distance metrics
   - Minkowski, Manhattan, Euclidean, Max, Canberra, Cord, and HOBbit distance
   - Neighborhoods and decision boundaries
3. P-trees and their properties
4. k-nearest neighbor classification
   - Closed-KNN using Max and HOBbit distance
5. k-clustering
   - overview of existing algorithms
   - our new algorithm
   - computation of mean and variance from the P-trees

Data Mining
Extracting knowledge from a large amount of data.
[Figure: Information Pyramid. Raw data at the base; data mining distills it into useful information at the top (sometimes 1 bit: Y/N). More data volume = less information.]
Functionalities: feature selection, association rule mining, classification and prediction, cluster analysis, outlier analysis.

Classification
Predicting the class of a data object; also called supervised learning.
Training data (class labels are known):

Feature 1 | Feature 2 | Feature 3 | Class
a1        | b1        | c1        | A
a2        | b2        | c2        | A
a3        | b3        | c3        | B

Sample with unknown class (a, b, c) -> Classifier -> predicted class of the sample.

Types of Classifier
Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network.
Lazy classifier: uses the raw training data, e.g. k-nearest neighbor.

Clustering
The process of grouping objects into classes, with the objective that the data objects are
- similar to the objects in the same cluster
- dissimilar to the objects in the other clusters.
[Figure: a two-dimensional space showing 3 clusters.]
Clustering is often called unsupervised learning or unsupervised classification, because the class labels of the data objects are unknown.

Distance Metric
Measures the dissimilarity between two data points. A metric is a function, d, of two n-dimensional points X and Y, such that d(X, Y) is
- positive definite: if $X \ne Y$, $d(X, Y) > 0$; if $X = Y$, $d(X, Y) = 0$
- symmetric: $d(X, Y) = d(Y, X)$
- subject to the triangle inequality: $d(X, Y) + d(Y, Z) \ge d(X, Z)$

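Various Distance Metrics
Minkowski distance or $L_p$ distance: $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Manhattan distance ($p = 1$): $d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
Euclidean distance ($p = 2$): $d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Max distance ($p = \infty$): $d_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|$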

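An Example
A two-dimensional space with X = (2, 1), Y = (6, 4), and Z = (6, 1), the right-angle corner of triangle XZY:
Manhattan: $d_1(X, Y) = XZ + ZY = 4 + 3 = 7$
Euclidean: $d_2(X, Y) = XY = 5$
Max: $d_\infty(X, Y) = \max(XZ, ZY) = XZ = 4$
So $d_1 \ge d_2 \ge d_\infty$. For any positive integer p, $d_p \ge d_{p+1}$.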
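The example values can be checked with a few lines of Python. This is a minimal sketch, assuming the standard Minkowski definition $d_p(X, Y) = (\sum_i |x_i - y_i|^p)^{1/p}$ with $p = \infty$ read as the max distance; the function name `minkowski` is chosen here for illustration.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points; p = math.inf gives the max distance."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)           # Manhattan: 4 + 3
d2 = minkowski(X, Y, 2)           # Euclidean: sqrt(16 + 9)
dmax = minkowski(X, Y, math.inf)  # Max: max(4, 3)
print(d1, d2, dmax)
```

The three results are 7, 5, and 4, matching the slide and illustrating that the distance does not grow as p increases.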

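Some Other Distances
Canberra distance: $d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$
Squared cord distance: $d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$
Squared chi-squared distance: $d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$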
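A rough sketch of these three distances follows. It assumes the commonly used forms for non-negative data (Canberra: sum of |x_i - y_i| / (x_i + y_i); squared cord: sum of (sqrt(x_i) - sqrt(y_i))^2; squared chi-squared: sum of (x_i - y_i)^2 / (x_i + y_i)). Skipping terms where x_i + y_i = 0 is a convention assumed here, not something the slides state.

```python
import math

def canberra(x, y):
    # Assumed convention: a term with x_i + y_i == 0 (a 0/0 case) contributes nothing.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_chord(x, y):
    # Requires non-negative coordinates, e.g. band values in 0..255.
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    # Same 0/0 convention as canberra().
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)

X, Y = (2, 1), (6, 4)
print(canberra(X, Y), squared_chord(X, Y), squared_chi_squared(X, Y))
```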

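HOBbit Similarity
Higher Order Bit (HOBbit) similarity is the number of consecutive matching bits, starting from the most significant (leftmost) bit:
$HOBbitS(A, B) = \max \{ s : 0 \le s \le m, \; a_i = b_i \text{ for all } 1 \le i \le s \}$
A, B: two scalars (integers)
$a_i$, $b_i$: ith bit of A and B (left to right)
m: number of bits

Bit position: 1 2 3 4 5 6 7 8
x1: 0 1 1 0 0 1
y1: 0 1 1 1 0 1
x2: 0 1 1 1 0 1
y2: 0 1 0 0

HOBbitS(x1, y1) = 3
HOBbitS(x2, y2) = 4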

HOBbit Distance
The HOBbit distance between two scalar values A and B:
$d_v(A, B) = m - HOBbitS(A, B)$
In the previous example, HOBbitS(x1, y1) = 3 and HOBbitS(x2, y2) = 4, so with m = 8:
$d_v(x_1, y_1) = 8 - 3 = 5$
$d_v(x_2, y_2) = 8 - 4 = 4$
The HOBbit distance between two n-dimensional points X and Y:
$d_h(X, Y) = \max_{i=1}^{n} d_v(x_i, y_i)$
In our example (considering 2-dimensional data): $d_h(X, Y) = \max(5, 4) = 5$
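The HOBbit similarity and distance can be sketched with an XOR: the highest set bit of A XOR B marks the first position (from the left) where the two values differ. The function names and the sample values 105 and 125 below are illustrative assumptions, not taken from the slides.

```python
def hobbit_similarity(a, b, m=8):
    """Number of matching bits of a and b, counted from the most significant of m bits."""
    if not (0 <= a < (1 << m) and 0 <= b < (1 << m)):
        raise ValueError("values must fit in m bits")
    # bit_length() of the XOR is the position (from the right) of the first mismatch.
    return m - (a ^ b).bit_length()

def hobbit_distance(a, b, m=8):
    """Scalar HOBbit distance: m minus the HOBbit similarity."""
    return m - hobbit_similarity(a, b, m)

def hobbit_distance_points(x, y, m=8):
    """HOBbit distance between two points: the max over per-dimension distances."""
    return max(hobbit_distance(a, b, m) for a, b in zip(x, y))

# 105 = 01101001b and 125 = 01111101b share the first 3 bits (hypothetical values).
print(hobbit_similarity(105, 125), hobbit_distance(105, 125))
print(hobbit_distance_points((105, 40), (125, 41)))
```

With these values the similarity is 3 and the distance is 5; the two-dimensional distance is max(5, 1) = 5.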

HOBbit Distance Is a Metric
- HOBbit distance is positive definite: if $X = Y$, $d_h(X, Y) = 0$; if $X \ne Y$, $d_h(X, Y) > 0$
- HOBbit distance is symmetric
- HOBbit distance satisfies the triangle inequality

Neighborhood of a Point
The neighborhood of a target point, T, is a set of points, S, such that $X \in S$ if and only if $d(T, X) \le r$.
If X is a point on the boundary, $d(T, X) = r$.
[Figure: neighborhoods of T, each of width 2r, under the Manhattan, Euclidean, Max, and HOBbit distances.]

Decision Boundary
The decision boundary between points A and B is the locus of the points X satisfying $d(A, X) = d(B, X)$.
[Figure: boundary D between A and B, separating regions R1 and R2.]
The decision boundary for the HOBbit distance is perpendicular to the axis that makes the max distance.
[Figure: decision boundaries for Manhattan, Euclidean, and Max distance, shown for the two cases labeled > 45° and < 45°.]

Notations
P1 & P2 : P1 AND P2
P1 | P2 : P1 OR P2
P' : COMPLEMENT of P
rc(P) : root count of P-tree P
N : number of pixels
n : number of bands
m : number of bits
Pi,j : basic P-tree for band i, bit j
Pi(v) : value P-tree for value v of band i
Pi([v1, v2]) : interval P-tree for interval [v1, v2] of band i
P^0 : pure-0 tree, a P-tree whose root node is pure 0
P^1 : pure-1 tree, a P-tree whose root node is pure 1

Properties of P-trees
1.-3. [formulas not preserved]
4. rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
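Properties 4 to 7 can be sanity-checked on a simplified model. The sketch below assumes a P-tree can be stood in for by a flat bit vector (a Python integer with one bit per pixel), so that the root count is a population count; a real P-tree is a compressed tree, and the band values used here are hypothetical.

```python
def rc(p):
    """Root count of the flat stand-in: the number of 1 bits."""
    return bin(p).count("1")

def value_mask(band, v):
    """Stand-in for the value P-tree Pi(v): bit k is 1 iff pixel k has value v."""
    mask = 0
    for k, val in enumerate(band):
        if val == v:
            mask |= 1 << k
    return mask

band = [3, 5, 3, 7, 5, 3]  # hypothetical band values for 6 pixels
p3, p5 = value_mask(band, 3), value_mask(band, 5)

# Property 5: value masks of two different values never overlap.
print(rc(p3 & p5))
# Property 7: so the root count of their OR is the sum of their root counts.
print(rc(p3 | p5), rc(p3) + rc(p5))
# Property 6 (inclusion-exclusion) on two overlapping masks.
a, b = 0b1100, 0b0110
print(rc(a | b), rc(a) + rc(b) - rc(a & b))
```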

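P-tree Header
Header of a P-tree file, used to make a generalized P-tree structure. Fields, in order:
- Format code
- Fanout
- Number of levels
- Root count
- Length of the body in bytes
The header is followed by the body of the P-tree. Field widths shown on the slide: 1 word, 2 words, 4 words.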

k-Nearest Neighbor Classification
1) Select a suitable value for k.
2) Determine a suitable distance metric.
3) Find the k nearest neighbors of the sample using the selected metric.
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs.
5) Assign the plurality class to the sample to be classified.
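The five steps can be sketched as a short function. This is a plain in-memory illustration, not the P-tree based method described later; the names `knn_classify` and `manhattan` and the training points are assumptions made for the example.

```python
from collections import Counter

def knn_classify(train, sample, k, dist):
    """train: list of (point, label) pairs; dist: a distance function on two points."""
    # Step 3: the k training points closest to the sample.
    neighbors = sorted(train, key=lambda pl: dist(pl[0], sample))[:k]
    # Step 4: vote on the class labels of the nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    # Step 5: the plurality class is assigned to the sample.
    return votes.most_common(1)[0][0]

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical training data with two classes.
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B"), ((2, 2), "A")]
predicted = knn_classify(train, (1, 2), k=3, dist=manhattan)
print(predicted)
```

Because the sort cuts off at exactly k points, ties at the k-th distance are broken by input order, which is the arbitrary choice that Closed-KNN avoids.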

Closed-KNN
T is the target pixel. With k = 3, to find the third nearest neighbor, traditional KNN arbitrarily selects one point from the boundary line of the neighborhood.
Closed-KNN includes all points on the boundary.
Closed-KNN yields higher classification accuracy than traditional KNN.
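A minimal sketch of the closed neighborhood, assuming it means "every training point whose distance to the target does not exceed the distance of the k-th nearest neighbor". The function names and the sample points are illustrative.

```python
def closed_knn(train, sample, k, dist):
    """Return every (point, label) pair within the distance of the k-th nearest neighbor."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    scored = sorted(((dist(p, sample), p, label) for p, label in train),
                    key=lambda t: t[0])
    radius = scored[k - 1][0]  # distance of the k-th nearest neighbor
    # Keep all points on or inside the boundary, so ties are never dropped.
    return [(p, label) for d, p, label in scored if d <= radius]

def max_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical points: three of them tie at max distance 2 from the target.
train = [((0, 1), "A"), ((1, 0), "A"), ((2, 0), "B"),
         ((0, 2), "B"), ((2, 2), "B"), ((5, 5), "A")]
neighbors = closed_knn(train, (0, 0), k=3, dist=max_distance)
print(len(neighbors))
```

Here k = 3 yields five neighbors, because the third nearest neighbor lies on a boundary shared with two other points.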

Searching Nearest Neighbors
We begin searching by finding the exact matches.
Let the target sample be T = <v1, v2, v3, …, vn>.
The initial neighborhood is the point T.
We expand the neighborhood along each dimension: along dimension i, [vi] is expanded to the interval [vi - ai, vi + bi], for some positive integers ai and bi.
Continue expansion until there are at least k points in the neighborhood.

HOBbit Similarity Method for KNN
In this method, we match bits of the target to the training data. First we look for matches in all 8 bits of each band (exact matching).
Let bi,j = jth bit of the ith band of the target pixel. Define
Pti,j = Pi,j, if bi,j = 1
      = P'i,j, otherwise
and
Pvi,1-j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
Pnn = Pv1,1-8 & Pv2,1-8 & Pv3,1-8 & … & Pvn,1-8
If rc(Pnn) < k, update
Pnn = Pv1,1-7 & Pv2,1-7 & Pv3,1-7 & … & Pvn,1-7
and continue matching one bit fewer per band until rc(Pnn) ≥ k.
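The bit-matching search can be sketched on a simplified model. The code below assumes each basic P-tree Pi,j is replaced by a flat bit vector over the training pixels (a Python integer), so AND, complement, and root count become bitwise operations; the pixel values are hypothetical and the function names are chosen for illustration.

```python
def rc(p):
    """Root count of the flat stand-in: the number of 1 bits."""
    return bin(p).count("1")

def basic_masks(pixels, n_bands, m=8):
    """masks[i][j]: bit k is 1 iff bit j (0 = most significant) of band i of pixel k is 1."""
    masks = [[0] * m for _ in range(n_bands)]
    for k, pixel in enumerate(pixels):
        for i in range(n_bands):
            for j in range(m):
                if (pixel[i] >> (m - 1 - j)) & 1:
                    masks[i][j] |= 1 << k
    return masks

def hobbit_neighbors(pixels, target, k, m=8):
    """Match the top `bits` bits of every band, relaxing one bit at a time."""
    n_bands = len(target)
    full = (1 << len(pixels)) - 1
    masks = basic_masks(pixels, n_bands, m)
    for bits in range(m, -1, -1):  # 8 matching bits, then 7, ..., then 0
        pnn = full
        for i in range(n_bands):
            for j in range(bits):
                target_bit = (target[i] >> (m - 1 - j)) & 1
                # Pt(i,j): the mask itself if the target bit is 1, else its complement.
                pnn &= masks[i][j] if target_bit else (~masks[i][j] & full)
        if rc(pnn) >= k:
            return pnn, bits
    return full, 0  # only reached when k exceeds the number of pixels

pixels = [(105, 20), (104, 21), (107, 23), (111, 20), (40, 20)]
pnn, bits = hobbit_neighbors(pixels, (105, 20), k=3)
print(bin(pnn), bits)
```

For this data the search stops after two relaxations, with 6 matching bits per band and the first three pixels in the neighborhood.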

An Analysis of HOBbit Method
Let the ith band value of the target T be vi = 105 = 01101001b.
Exact match: [01101001] = [105, 105]
1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
- Does not expand evenly on both sides: the target is 105, but the center of [104, 107] is (104 + 107) / 2 = 105.5.
- Expands by powers of 2.
- Computationally very cheap.
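The interval covered after each expansion follows directly from masking off low-order bits. A small sketch, with `hobbit_interval` as a name assumed for illustration:

```python
def hobbit_interval(v, dropped, m=8):
    """Interval of m-bit values that share the top (m - dropped) bits with v."""
    if not 0 <= dropped <= m:
        raise ValueError("dropped must be between 0 and m")
    low = (v >> dropped) << dropped      # low-order bits cleared
    high = low | ((1 << dropped) - 1)    # low-order bits set
    return low, high

for dropped in range(4):
    low, high = hobbit_interval(105, dropped)
    print(dropped, (low, high), (low + high) / 2)
```

For the target 105 this gives [105, 105], [104, 105], [104, 107], and [104, 111]: the width doubles at each step and the target drifts away from the center of the interval.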

Perfect Centering Method
The Max distance metric provides a better neighborhood by
- keeping the target in the center
- expanding by 1 on both sides
Initial neighborhood P-tree (exact matching):
Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
If rc(Pnn) < k:
Pnn = P1([v1-1, v1+1]) & P2([v2-1, v2+1]) & … & Pn([vn-1, vn+1])
If rc(Pnn) < k:
Pnn = P1([v1-2, v1+2]) & P2([v2-2, v2+2]) & … & Pn([vn-2, vn+2])
Computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy.
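The expansion loop can be sketched directly on raw pixel values. This illustration assumes a linear scan in place of the interval P-tree ANDs, so it shows the neighborhood that is selected rather than how the P-tree implementation computes it; the names and pixel values are hypothetical.

```python
def perfect_centering_neighbors(pixels, target, k, max_value=255):
    """Grow a max-distance ball around the target by 1 per step until it holds >= k pixels."""
    for r in range(max_value + 1):
        # A pixel is inside iff every band lies in [v_i - r, v_i + r].
        inside = [p for p in pixels
                  if all(abs(a - b) <= r for a, b in zip(p, target))]
        if len(inside) >= k:
            return inside, r
    return list(pixels), max_value  # only reached when k exceeds the number of pixels

pixels = [(105, 20), (104, 21), (107, 23), (111, 20), (40, 20)]
neighbors, radius = perfect_centering_neighbors(pixels, (105, 20), k=3)
print(neighbors, radius)
```

The target stays at the center of every interval, and all pixels at the final radius are included, which matches the closed neighborhood idea.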

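Finding the Plurality Class
Let Pc(i) be the value P-tree for class i. The plurality class is the class with the most pixels in the neighborhood Pnn:
Plurality class = arg max_i rc(Pc(i) & Pnn)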
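A sketch of the vote, assuming the plurality class is the class i that maximizes rc(Pc(i) & Pnn) and that each P-tree is modeled as a flat bit vector over the training pixels. The labels and the neighborhood mask are hypothetical.

```python
def rc(p):
    """Root count of the flat stand-in: the number of 1 bits."""
    return bin(p).count("1")

def class_masks(labels):
    """Stand-in for the value P-trees Pc(i): one bitmask per class label."""
    masks = {}
    for k, label in enumerate(labels):
        masks[label] = masks.get(label, 0) | (1 << k)
    return masks

def plurality_class(masks, pnn):
    """Class whose mask overlaps the neighborhood mask in the most pixels."""
    # On a tie, max() keeps the class that appears first in the dictionary.
    return max(masks, key=lambda c: rc(masks[c] & pnn))

labels = [2, 2, 5, 5, 5, 7]   # yield class of each training pixel
pnn = 0b011110                # neighborhood: pixels 1 to 4
winner = plurality_class(class_masks(labels), pnn)
print(winner)
```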

Performance
Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contains 6 bands: Red, Green, and Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label).
Band values range from 0 to 255 (8 bits).
8 classes, or levels, of yield values are considered: 0 to 7.

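Performance - Accuracy
[Figure: classification accuracy, 1997 dataset.]

Performance - Accuracy (cont.)
[Figure: classification accuracy, 1998 dataset.]

Performance - Time
[Figure: classification time, 1997 dataset; both axes in logarithmic scale.]

Performance - Time (cont.)
[Figure: classification time, 1998 dataset; both axes in logarithmic scale.]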