- Slides: 29
Overview
1. Data mining: classification and clustering
2. Various distance metrics
   - Minkowski, Manhattan, Euclidean, Max, Canberra, chord, and HOBbit distance
   - Neighborhoods and decision boundaries
3. P-trees and their properties
4. k-nearest neighbor classification
   - Closed-KNN using Max and HOBbit distance
5. k-clustering
   - Overview of existing algorithms
   - Our new algorithm
   - Computation of mean and variance from the P-trees

Data Mining
- Extracting knowledge from a large amount of data
- Information pyramid: raw data (base) -> data mining -> useful information (sometimes 1 bit: Y/N)
- More data volume = less information
- Functionalities: feature selection, association rule mining, classification and prediction, cluster analysis, outlier analysis

Classification
- Predicting the class of a data object; also called supervised learning
- Training data: class labels are known

  Feature 1 | Feature 2 | Feature 3 | Class
  a1        | b1        | c1        | A
  a2        | b2        | c2        | A
  a3        | b3        | c3        | B

- Sample with unknown class: (a, b, c) -> classifier -> predicted class of the sample

Types of Classifier
- Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network
- Lazy classifier: uses the raw training data, e.g. k-nearest neighbor

Clustering
- The process of grouping objects into classes, with the objective that the data objects are
  - similar to the objects in the same cluster
  - dissimilar to the objects in the other clusters
- [Figure: a two-dimensional space showing 3 clusters]
- Clustering is often called unsupervised learning or unsupervised classification: the class labels of the data objects are unknown

Distance Metric
- Measures the dissimilarity between two data points.
- A metric is a function d of two n-dimensional points X and Y such that:
  - d(X, Y) is positive definite: if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0
  - d(X, Y) is symmetric: d(X, Y) = d(Y, X)
  - d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)

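Various Distance Metrics
- Minkowski distance or L_p distance: $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
- Manhattan distance (p = 1): $d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
- Euclidean distance (p = 2): $d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Max distance (p = ∞): $d_\infty(X, Y) = \max_{1 \le i \le n} |x_i - y_i|$
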
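An Example
- A two-dimensional space with X = (2, 1), Y = (6, 4), and Z = (6, 1) the corner point, so XZ = 4 and ZY = 3
- Manhattan: d_1(X, Y) = XZ + ZY = 4 + 3 = 7
- Euclidean: d_2(X, Y) = XY = 5
- Max: d_∞(X, Y) = max(XZ, ZY) = XZ = 4
- Hence d_1 ≥ d_2 ≥ d_∞. For any positive integer p, d_p ≥ d_{p+1}.
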
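The example above can be checked with a few lines of Python. This is a minimal sketch; the function name `minkowski` is chosen here for illustration and is not from the slides.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points; p = math.inf gives the Max distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)           # Manhattan
d2 = minkowski(X, Y, 2)           # Euclidean
dmax = minkowski(X, Y, math.inf)  # Max
print(d1, d2, dmax)
```

For these two points the three values come out in the order d1 ≥ d2 ≥ dmax, as stated on the slide.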
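Some Other Distances
- Canberra distance: $d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$
- Squared chord distance: $d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$
- Squared chi-squared distance: $d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$
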
HOBbit Similarity
- Higher Order Bit (HOBbit) similarity: $\mathrm{HOBbitS}(A, B) = \max \{ s : 0 \le s \le m,\ a_i = b_i \text{ for all } 1 \le i \le s \}$
- A, B: two scalars (integers)
- a_i, b_i: ith bit of A and B (left to right)
- m: number of bits
- Example with m = 8: x1 and y1 agree on bit positions 1 to 3 and differ at position 4, so HOBbitS(x1, y1) = 3; x2 and y2 agree on positions 1 to 4 and differ at position 5, so HOBbitS(x2, y2) = 4

HOBbit Distance
- The HOBbit distance between two scalar values A and B: d_v(A, B) = m - HOBbitS(A, B)
- The previous example (m = 8): HOBbitS(x1, y1) = 3 and HOBbitS(x2, y2) = 4, so d_v(x1, y1) = 8 - 3 = 5 and d_v(x2, y2) = 8 - 4 = 4
- The HOBbit distance between two points X and Y: $d_h(X, Y) = \max_{1 \le i \le n} d_v(x_i, y_i)$
- In our example (considering 2-dimensional data): d_h(X, Y) = max(5, 4) = 5

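The HOBbit similarity and distance can be sketched with integer bit operations: the scalar distance equals the bit length of `A XOR B`, provided both values fit in m bits. The function names and the four 8-bit test values below are assumptions made for this sketch; the values were picked so that the similarities come out as 3 and 4, matching the slide's results.

```python
def hobbit_similarity(a, b, m=8):
    """Number of high-order bits on which the m-bit integers a and b agree."""
    return m - (a ^ b).bit_length()

def hobbit_distance_scalar(a, b, m=8):
    """d_v(a, b) = m - HOBbitS(a, b)."""
    return m - hobbit_similarity(a, b, m)

def hobbit_distance(x, y, m=8):
    """d_h(X, Y): maximum scalar HOBbit distance over all dimensions."""
    return max(hobbit_distance_scalar(a, b, m) for a, b in zip(x, y))

# Illustrative values (assumed, not taken from the slides).
x1, y1 = 0b01101001, 0b01111101   # agree on the first 3 bits
x2, y2 = 0b01011101, 0b01010000   # agree on the first 4 bits

print(hobbit_similarity(x1, y1), hobbit_similarity(x2, y2))
print(hobbit_distance((x1, x2), (y1, y2)))
```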
HOBbit Distance Is a Metric
- HOBbit distance is positive definite: if X = Y, d_h(X, Y) = 0; if X ≠ Y, d_h(X, Y) > 0
- HOBbit distance is symmetric: d_h(X, Y) = d_h(Y, X)
- HOBbit distance satisfies the triangle inequality: d_h(X, Y) + d_h(Y, Z) ≥ d_h(X, Z)

Neighborhood of a Point
- The neighborhood of a target point T is the set of points S such that X ∈ S if and only if d(T, X) ≤ r
- If X is a point on the boundary, d(T, X) = r
- [Figure: neighborhoods of T, each of extent 2r, under the Manhattan, Euclidean, Max, and HOBbit metrics]

Decision Boundary
- The decision boundary between points A and B is the locus of the points X satisfying d(A, X) = d(B, X)
- [Figure: points A and B with regions R1 and R2 separated by the decision boundary D]
- The decision boundary for HOBbit distance is perpendicular to the axis that makes the max distance
- [Figure: decision boundaries for Manhattan, Euclidean, and Max distance; panels labeled > 45 and < 45]

Notations
- P1 & P2: P1 AND P2
- P1 | P2: P1 OR P2
- P´: COMPLEMENT of P
- rc(P): root count of P-tree P
- N: number of pixels
- n: number of bands
- m: number of bits
- Pi,j: basic P-tree for band i, bit j
- Pi(v): value P-tree for value v of band i
- Pi([v1, v2]): interval P-tree for interval [v1, v2] of band i
- P0: the pure0-tree, a P-tree whose root node is pure 0
- P1: the pure1-tree, a P-tree whose root node is pure 1

Properties of P-trees
- Properties 1 to 3 (with sub-items a to d): [formulas not preserved]
- 4. rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0
- 5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
- 6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
- 7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2

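The root-count properties 5 to 7 above can be checked on a toy example. The sketch below does not build real P-trees: as an assumption for illustration, each P-tree is replaced by a flat bit mask (a Python integer with one bit per pixel), and `root_count` and `value_mask` are hypothetical helper names.

```python
def root_count(mask):
    """Number of 1 bits, standing in for rc(P)."""
    return bin(mask).count("1")

def value_mask(band, v):
    """Bit p is set when pixel p of the band has value v (stand-in for Pi(v))."""
    return sum(1 << p for p, x in enumerate(band) if x == v)

band = [3, 5, 3, 7, 5, 3]          # one band, six pixels (made-up values)
p3, p5 = value_mask(band, 3), value_mask(band, 5)

# Property 5: value masks for different values never overlap.
overlap = root_count(p3 & p5)
# Property 7: so the root count of their OR is the sum of the root counts.
union = root_count(p3 | p5)

# Property 6 (inclusion-exclusion) on two arbitrary masks.
a, b = 0b1100, 0b0110
lhs = root_count(a | b)
rhs = root_count(a) + root_count(b) - root_count(a & b)
print(overlap, union, lhs, rhs)
```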
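P-tree Header
- Header of a P-tree file, used to make a generalized P-tree structure
- Header fields, in order: format code, fanout, number of levels, root count, length of the body in bytes
- Field widths shown in the figure: 1 word, 2 words, 4 words
- The header is followed by the body of the P-tree
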
k-Nearest Neighbor Classification
1) Select a suitable value for k
2) Determine a suitable distance metric
3) Find the k nearest neighbors of the sample using the selected metric
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs
5) Assign the plurality class to the sample to be classified

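The five steps above can be sketched as a plain scan over the training data. This is a minimal illustration, not the P-tree based method of these slides; the training points, the choice of Manhattan distance, and the function names are assumptions.

```python
from collections import Counter

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def knn_classify(train, sample, k, dist):
    """train: list of (point, label). Returns the plurality label among the k nearest points."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], sample))[:k]  # step 3
    votes = Counter(label for _, label in neighbors)                   # step 4
    return votes.most_common(1)[0][0]                                  # step 5

train = [((1, 1), "A"), ((2, 1), "A"), ((2, 2), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (1, 2), k=3, dist=manhattan))
print(knn_classify(train, (9, 9), k=3, dist=manhattan))
```

Note that slicing with `[:k]` breaks ties at the k-th distance arbitrarily (by sort order), which is the behavior Closed-KNN is meant to avoid.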
Closed-KNN
- T is the target pixel.
- With k = 3, to find the third nearest neighbor, traditional KNN arbitrarily selects one point from the boundary line of the neighborhood.
- Closed-KNN includes all points on the boundary.
- Closed-KNN yields higher classification accuracy than traditional KNN.

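A minimal sketch of the closed neighborhood idea: instead of cutting the sorted list at k, keep every point whose distance does not exceed the k-th smallest distance. The data, the use of Max distance, and the function names are assumptions made for illustration.

```python
def max_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def closed_knn(train, sample, k, dist):
    """Return every (point, label) whose distance is <= the k-th smallest distance."""
    ranked = sorted((dist(p, sample), p, label) for p, label in train)
    if len(ranked) <= k:
        return [(p, label) for _, p, label in ranked]
    radius = ranked[k - 1][0]      # distance of the k-th nearest neighbor
    return [(p, label) for d, p, label in ranked if d <= radius]

train = [((5, 5), "A"), ((6, 5), "A"), ((5, 7), "B"),
         ((7, 5), "B"), ((3, 6), "B"), ((9, 9), "A")]
neighbors = closed_knn(train, (5, 5), k=3, dist=max_distance)
print(len(neighbors), neighbors)
```

Here three points tie at the boundary distance 2, so the closed neighborhood for k = 3 holds five points rather than three.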
Searching Nearest Neighbors
- We begin searching by finding the exact matches.
- Let the target sample be T = <v1, v2, v3, …, vn>.
- The initial neighborhood is the point T.
- We expand the neighborhood along each dimension: along dimension i, [vi] is expanded to the interval [vi - ai, vi + bi], for some positive integers ai and bi.
- Continue expanding until there are at least k points in the neighborhood.

HOBbit Similarity Method for KNN
- In this method, we match bits of the target to the training data.
- First we look for matches in all 8 bits of each band (exact matching).
- Let bi,j = jth bit of the ith band of the target pixel.
- Define Pti,j = Pi,j if bi,j = 1, and Pti,j = P´i,j otherwise.
- Pvi,1-j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
- Pnn = Pv1,1-8 & Pv2,1-8 & Pv3,1-8 & … & Pvn,1-8
- If rc(Pnn) < k, update Pnn = Pv1,1-7 & Pv2,1-7 & Pv3,1-7 & … & Pvn,1-7

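The bit-matching loop above can be sketched as follows. As an assumption for illustration, each basic P-tree is replaced by a flat bit mask (a Python integer with one bit per training pixel), so AND, complement, and root count become integer operations; real P-trees are compressed tree structures. Bit index j is 0-based here (0 = highest-order bit), whereas the slide counts from 1. All names and pixel values are hypothetical.

```python
M = 8  # bits per band value

def basic_masks(pixels, n_bands):
    """masks[i][j]: bit p is set when bit j of band i of pixel p is 1."""
    masks = [[0] * M for _ in range(n_bands)]
    for p, pixel in enumerate(pixels):
        for i in range(n_bands):
            for j in range(M):
                if (pixel[i] >> (M - 1 - j)) & 1:
                    masks[i][j] |= 1 << p
    return masks

def hobbit_neighbors(pixels, target, k):
    """Indices of pixels matching the target on the longest bit prefix that yields >= k matches."""
    n_bands, full = len(target), (1 << len(pixels)) - 1
    masks = basic_masks(pixels, n_bands)
    for prefix in range(M, -1, -1):            # match 8 bits, then 7, then 6, ...
        pnn = full
        for i in range(n_bands):
            for j in range(prefix):
                bit = (target[i] >> (M - 1 - j)) & 1
                pnn &= masks[i][j] if bit else (~masks[i][j] & full)
        if bin(pnn).count("1") >= k:           # root count reached k
            return [p for p in range(len(pixels)) if (pnn >> p) & 1]
    return list(range(len(pixels)))            # fewer than k pixels in total

pixels = [(105, 40), (104, 41), (107, 43), (200, 40), (105, 40)]
print(hobbit_neighbors(pixels, (105, 40), k=3))
```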
An Analysis of HOBbit Method
- Let the ith band value of the target T be vi = 105 = 01101001b
- Exact matching: [01101001] = [105, 105]
- 1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
- 2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
- Does not expand evenly on both sides: the target is 105, but the center of [104, 107] is (104 + 107) / 2 = 105.5
- Expands by powers of 2
- Computationally very cheap

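The intervals in this analysis follow from masking off low-order bits. A short sketch (the helper name `hobbit_interval` is an assumption):

```python
def hobbit_interval(v, dropped_bits):
    """Interval of values that share all but the lowest `dropped_bits` bits with v."""
    mask = (1 << dropped_bits) - 1
    return v & ~mask, v | mask

for dropped in range(4):
    low, high = hobbit_interval(105, dropped)
    print(dropped, (low, high), "center:", (low + high) / 2)
```

Each expansion doubles the interval width, and the target 105 is generally not at the center of the interval.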
Perfect Centering Method
- The Max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides.
- Initial neighborhood P-tree (exact matching): Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
- If rc(Pnn) < k: Pnn = P1([v1-1, v1+1]) & P2([v2-1, v2+1]) & … & Pn([vn-1, vn+1])
- If rc(Pnn) < k: Pnn = P1([v1-2, v1+2]) & P2([v2-2, v2+2]) & … & Pn([vn-2, vn+2])
- Computationally costlier than the HOBbit similarity method, but gives slightly better classification accuracy

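The expansion above can be sketched with a direct scan in place of interval P-tree ANDing (an assumption made to keep the example short; the function name and pixel values are hypothetical). The radius r grows by 1 per step, and a pixel is inside when every band lies within [vi - r, vi + r], that is, when its Max distance to the target is at most r.

```python
def perfect_centering_neighbors(pixels, target, k, max_value=255):
    """Smallest radius r, and the pixel indices within Max distance r, with at least k pixels inside."""
    for r in range(max_value + 1):
        inside = [p for p, pixel in enumerate(pixels)
                  if all(abs(a - t) <= r for a, t in zip(pixel, target))]
        if len(inside) >= k:
            return r, inside
    return max_value, list(range(len(pixels)))  # fewer than k pixels in total

pixels = [(105, 40), (104, 41), (107, 43), (200, 40), (105, 40)]
print(perfect_centering_neighbors(pixels, (105, 40), k=3))
print(perfect_centering_neighbors(pixels, (105, 40), k=4))
```

Because the result includes every pixel within the final radius, the neighborhood is closed in the sense of Closed-KNN.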
Finding the Plurality Class
- Let Pc(i) be the value P-tree for class i.
- Plurality class = $\arg\max_i \; rc(Pnn \,\&\, Pc(i))$

Performance
- Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
- The data contains 6 bands: Red, Green, and Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label).
- Band values range from 0 to 255 (8 bits).
- 8 classes, or levels of yield values, are considered: 0 to 7.

Performance - Accuracy
- 1997 dataset: [chart not preserved]
- 1998 dataset: [chart not preserved]

Performance - Time
- 1997 dataset: [chart not preserved; both axes in logarithmic scale]
- 1998 dataset: [chart not preserved; both axes in logarithmic scale]