NEAREST NEIGHBORS CSC 576: Data Mining
Today…
• Measures of Similarity
• Distance Measures
• Nearest Neighbors
Similarity and Dissimilarity Measures
Used by a number of data mining techniques:
• Nearest neighbors
• Clustering
• …
How to measure “proximity”?
Proximity: similarity or dissimilarity between two objects
• Similarity: numerical measure of the degree to which two objects are alike
  Usually in range [0, 1]: 0 = no similarity, 1 = complete similarity
• Dissimilarity: numerical measure of the degree to which two objects are different
Feature Space
Abstract n-dimensional space
• Each instance is plotted in the feature space
• One axis for each descriptive feature
• Difficult to visualize when # of features > 3
As the differences between the values of the descriptive features grow, so too does the distance between the points in the feature space that represent these instances.
Distance Metric
Dissimilarities between Data Objects
A common measure of the proximity between two objects is the Euclidean distance:

  d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )

where x and y are two data objects with n dimensions.
• In high school, we typically used this for calculating the distance between two objects when there were only two dimensions.
• Defined for one dimension, two dimensions, three dimensions, …, any n-dimensional space
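A minimal sketch of the distance above in plain Python (the function name is our own):

```python
import math

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two n-dimensional points."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# The familiar two-dimensional (high school) case:
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```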
Example
Dissimilarities between Data Objects
Typically the Euclidean distance is used as a first choice when applying nearest neighbors and clustering.
Other distance metrics:
• Generalized by the Minkowski distance metric:

  d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^(1/r)

  where r is a parameter, and x and y are two data objects with n dimensions.
Dissimilarities between Data Objects
Minkowski distance metric:
• r = 1: L1 norm, “Manhattan” / “taxicab” distance
• r = 2: L2 norm, Euclidean distance
The larger the value of r, the more emphasis is placed on the features with large differences in values, because those differences are raised to the power of r.
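The two special cases can be sketched with one generic function (a plain-Python illustration, not a library API):

```python
def minkowski_distance(x, y, r):
    """Minkowski (L_r) distance: r = 1 is Manhattan, r = 2 is Euclidean."""
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

print(minkowski_distance((0, 0), (3, 4), r=1))  # 7.0  (Manhattan)
print(minkowski_distance((0, 0), (3, 4), r=2))  # 5.0  (Euclidean)
```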
Distance Matrix
Once a distance metric is chosen, the proximity between all of the objects in the dataset can be computed.
• Can be represented in a distance matrix: pairwise distances between points
Distance Matrix
• L1 norm distance: “Manhattan” distance
• L2 norm distance: Euclidean distance
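A distance matrix can be sketched in a few lines of plain Python; the four 2-D points below are made up for illustration:

```python
def manhattan(x, y):
    """L1 ("Manhattan") distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def distance_matrix(points, dist):
    """All pairwise distances, as a nested list (symmetric, zero diagonal)."""
    return [[dist(p, q) for q in points] for p in points]

points = [(0, 2), (2, 0), (3, 1), (5, 1)]
for row in distance_matrix(points, manhattan):
    print(row)
```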
Using Weights
So far, all attributes are treated equally when computing proximity.
In some situations, some features are more important than others
• Decision is up to the analyst
Modified Minkowski distance definition to include weights:

  d(x, y) = ( Σ_{k=1}^{n} w_k · |x_k − y_k|^r )^(1/r)
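The weighted definition is a one-line change to the plain Minkowski sketch; the weight vector here is an arbitrary illustration of an analyst's choice:

```python
def weighted_minkowski(x, y, w, r):
    """Minkowski distance with analyst-chosen per-feature weights w."""
    return sum(wk * abs(xk - yk) ** r for wk, xk, yk in zip(w, x, y)) ** (1 / r)

# Weighting the second feature twice as heavily as the first (r = 2):
print(weighted_minkowski((0, 0), (3, 4), w=(1, 2), r=2))  # sqrt(1*9 + 2*16) = sqrt(41)
```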
Eager Learner Models
So far in this course, we’ve performed prediction by:
1. Downloading or constructing a dataset
2. Learning a model
3. Using the model to classify/predict test instances
Sometimes called eager learners:
• Designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
Lazy Learner Models
Opposite strategy:
• Delay the process of modeling the training data until it is necessary to classify/predict a test instance.
• Example: nearest neighbors
Nearest Neighbors
k-nearest neighbors
• k = parameter, chosen by the analyst
• For a given test instance, use the k “closest” points (nearest neighbors) for performing classification
• “closest” points: defined by some proximity metric, such as Euclidean distance
Algorithm Can’t have a CS class without pseudocode!
Requires three things:
1. The set of stored records
2. A distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
1. Compute its distance to the other training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
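The three classification steps above can be sketched in plain Python; the toy training set is made up for illustration:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (features, label) records; majority vote of k nearest."""
    # 1. Compute the distance from the query to every stored record,
    # 2. then keep only the k nearest neighbors.
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    # 3. Majority vote over the neighbors' class labels.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 7), "B")]
print(knn_classify(train, (2, 2), k=3))  # "A"
```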
Definition of Nearest Neighbor
The k-nearest neighbors of a given example x are the k points that are closest to x.
• Classification changes depending on the chosen k
• Majority voting tie scenario:
  • Randomly choose the classification?
  • For binary problems, usually an odd k is used to avoid ties.
Voronoi Tessellations and Decision Boundaries
When k-NN is searching for the nearest neighbor, it is partitioning the abstract feature space into a Voronoi tessellation
• Each region belongs to an instance
• Each region contains all the points in the space whose distance to that instance is less than the distance to any other instance
Decision Boundary: the boundary between regions of the feature space in which different target levels will be predicted. • Generate the decision boundary by aggregating the neighboring regions that make the same prediction.
One of the great things about nearest neighbor algorithms is that we can add in new data to update the model very easily.
What’s up with the top-right instance? Is it noise?
The decision boundary is likely not ideal because of id 13.
k-NN is a set of local models, each defined by a single instance
• Sensitive to noise!
How to mitigate noise?
• Choose a higher value of k.
Different Values of k
Which is the ideal value of k?
[Figure: decision boundaries for k = 1, 3, 5, and 15, compared against the ideal decision boundary]
Setting k to a high value is riskier with an imbalanced dataset.
• The majority target level begins to dominate the feature space.
Choose k by running evaluation experiments on a training or validation set.
Choosing the right k
If k is too small, sensitive to noise points in the training data
• Susceptible to overfitting
If k is too large, the neighborhood may include points from other classes
• Susceptible to misclassification
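The "evaluate candidate k values on a validation set" idea can be sketched as follows; the tiny train/validation split and the candidate list [1, 3, 5] are entirely hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    near = sorted(train, key=lambda r: math.dist(r[0], query))[:k]
    return Counter(lbl for _, lbl in near).most_common(1)[0][0]

# Hypothetical labeled data, held out into a small validation set.
train = [((1, 1), "A"), ((2, 2), "A"), ((3, 1), "A"),
         ((8, 8), "B"), ((9, 7), "B"), ((7, 9), "B")]
valid = [((2, 1), "A"), ((8, 9), "B")]

def accuracy(k):
    return sum(knn_predict(train, q, k) == y for q, y in valid) / len(valid)

# Pick the candidate k with the best validation accuracy.
best_k = max([1, 3, 5], key=accuracy)
print(best_k, accuracy(best_k))
```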
Computational Issues?
Computation can be costly if the number of training examples is large.
Efficient indexing techniques are available to reduce the amount of computation needed when finding the nearest neighbors of a test example
• Sorting training instances?
Majority Voting
Every neighbor has the same impact on the classification.
Distance-weighted:
• Far-away neighbors have a weaker impact on the classification.
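A distance-weighted vote can be sketched as below; the 1/d² weighting is one common choice (not the only one), and the toy data is made up. Note that a plain majority vote with k = 3 would pick "B" here, but the single close "A" neighbor outweighs the two distant "B" neighbors:

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k):
    """Distance-weighted k-NN vote: each neighbor contributes 1/d**2."""
    near = sorted(train, key=lambda r: math.dist(r[0], query))[:k]
    scores = defaultdict(float)
    for feats, label in near:
        d = math.dist(feats, query)
        scores[label] += 1.0 / (d * d + 1e-9)  # epsilon guards against d == 0
    return max(scores, key=scores.get)

train = [((1, 1), "A"), ((4, 4), "B"), ((5, 5), "B")]
print(weighted_knn(train, (2, 2), k=3))  # "A"
```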
Scaling Issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example, with three dimensions:
• height of a person may vary from 1.5 m to 1.8 m
• weight of a person may vary from 90 lb to 300 lb
• income of a person may vary from $10K to $1M
Income will dominate if these variables aren’t standardized.
Standardization
Treat all features “equally” so that one feature doesn’t dominate the others.
Common treatments for all variables:
• Standardize each variable: mean = 0, standard deviation = 1
• Normalize each variable: min = 0, max = 1
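Both treatments sketched in plain Python (this version uses the population standard deviation; the income values are invented to echo the previous slide's ranges):

```python
def standardize(values):
    """z-scores: shift/scale so the result has mean 0, std. deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def normalize(values):
    """Min-max rescaling into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [10_000, 50_000, 120_000, 1_000_000]
print(normalize(incomes))    # smallest -> 0.0, largest -> 1.0
print(standardize(incomes))  # mean of result is 0, std. deviation is 1
```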
Query / test instance to classify:
• Salary = 56000
• Age = 35
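Putting scaling and k-NN together for this query, as a plain-Python sketch. The training records and their labels below are entirely hypothetical (the slide gives only the query), and each feature is min-max normalized using the training ranges so that salary (tens of thousands) doesn't swamp age (tens):

```python
import math
from collections import Counter

# Hypothetical training set: (salary, age) -> label; not from the slides.
train = [((65000, 44), "no"), ((15000, 23), "yes"),
         ((75000, 38), "no"), ((31000, 27), "yes"),
         ((58000, 41), "no"), ((24000, 30), "yes")]
query = (56000, 35)

# Per-feature min and max, computed from the training data only.
cols = list(zip(*(feats for feats, _ in train)))
lo = [min(c) for c in cols]
hi = [max(c) for c in cols]

def scale(point):
    """Min-max normalize one point using the training ranges."""
    return [(v - l) / (h - l) for v, l, h in zip(point, lo, hi)]

k = 3
near = sorted(train, key=lambda r: math.dist(scale(r[0]), scale(query)))[:k]
print(Counter(lbl for _, lbl in near).most_common(1)[0][0])  # "no"
```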
Final Thoughts on Nearest Neighbors
Nearest-neighbors classification is part of a more general technique called instance-based learning
• Uses specific instances for prediction, rather than a model
Nearest neighbors is a lazy learner
• Performing the classification can be relatively computationally expensive
• No model is learned up-front
Classifier Comparison
Eager learners (decision trees, SVMs):
• Model building: potentially slow
• Classifying a test instance: fast
Lazy learners (nearest neighbors):
• Model building: fast (because there is none!)
• Classifying a test instance: slow
Classifier Comparison
Eager learners (decision trees, SVMs):
• find a global model that fits the entire input space
Lazy learners (nearest neighbors):
• classification decisions are made locally (small k values), and are more susceptible to noise
Footnotes
In many cases, the initial dataset is not needed once similarities and dissimilarities have been computed
• “transforming the data into a similarity space”
References
• Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
• Data Science from Scratch, 1st edition, Grus
• Data Mining and Business Analytics in R, 1st edition, Ledolter
• An Introduction to Statistical Learning, 1st edition, James et al.
• Discovering Knowledge in Data, 2nd edition, Larose et al.
• Introduction to Data Mining, 1st edition, Tan et al.