Data Preprocessing Outlier Detection Feature Selection Slides adapted
- Slides: 43
Data Preprocessing Outlier Detection & Feature Selection Slides adapted from Richard Souvenir and David Kauchak
Where are Features from Gene expression data Expression Microarray
Where are Features from Face image data
Where are Features from Text/Document Data Terms C D 1 D 2 12 0 …. …… 6 Sports 0 11 …. …… 16 … DM 3 10 …. …… 28 Travel … Documents T 1 T 2 …. …… TN … Emails … Web Pages Jobs Internet n ACM Portal IEEE Xplore Pub. Med n Digital Libraries Task: To classify unlabeled documents into categories Challenge: thousands of terms
UCI Machine Learning Repository http: //archive. ics. uci. edu/ml/datasets. html
Provided features Predicting the age of abalone from physical measurements Name / Data Type / Measurement Unit / Description --------------- Sex / nominal / -- / M, F, and I (infant) Length / continuous / mm / Longest shell measurement Diameter / continuous / mm / perpendicular to length Height / continuous / mm / with meat in shell Whole weight / continuous / grams / whole abalone Shucked weight / continuous / grams / weight of meat Viscera weight / continuous / grams / gut weight (after bleeding) Shell weight / continuous / grams / after being dried Rings / integer / -- / +1. 5 gives the age in years
Provided features Predicting breast cancer recurrence 1. Class: no-recurrence-events, recurrence-events 2. age: 10 -19, 20 -29, 30 -39, 40 -49, 50 -59, 60 -69, 70 -79, 80 -89, 90 -99. 3. menopause: lt 40, ge 40, premeno. 4. tumor-size: 0 -4, 5 -9, 10 -14, 15 -19, 20 -24, 25 -29, 30 -34, 35 -39, 40 -44, 45 -49, 50 -54, 55 -59. 5. inv-nodes: 0 -2, 3 -5, 6 -8, 9 -11, 12 -14, 15 -17, 18 -20, 21 -23, 24 -26, 27 -29, 30 -32, 33 -35, 36 -39. 6. node-caps: yes, no. 7. deg-malig: 1, 2, 3. 8. breast: left, right. 9. breast-quad: left-up, left-low, right-up, right-low, central. 10. irradiated: yes, no.
Provided features vs. Extracted features • Provided features • Data has been collected (biology, medicine) • Usually cannot collect more features • We can often just use these provided features • Extracted features (given raw data, extract/generate features) • • Certain image data text data audio data … Soft tissue found in T-rex fossil Find may reveal details about cells and blood vessels of dinosaurs Thursday, March 24, 2005 Posted: 3: 14 PM EST Health may be concern when giving kids cell phones WASHINGTON (AP) -- For more than a century, the study of dinosaurs Wednesday, March 23, 2005 Posted: 11: 14 AM EST has SEATTLE, been. Wall limited to fossilized bones. Now, should researchers have before recovered Street gears (AP) up for Washington -- jobs Parents think twice giving 70 -million-year-old soft tissue, including what may be blood in to Saturday, March 26, 2005: 11: 41 AM EST a middle-schooler's demands for a cell phone, somevessels scientists and say, cells, because from a potential Tyrannosaurus rex. health long-term risksmoon remain unclear. Probe finds atmosphere on Saturn NEWThursday, March 17, 2005 Posted: 11: 17 AM EST YORK (CNN/Money) - Investors on Inflation Watch 2005 have a big week to look forward to --(Reuters) or be wary of --space depending how you LOS ANGELES, California -- The probeon Cassini look discovered at it. a significant atmosphere around Saturn's moon Enceladus during two recent passes close by, the Jet Propulsion Laboratory said on Wednesday
Current learning model training data (labeled examples) Terrain Unicycletype Weather Go-For-Ride? Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES n r a le model/ classifier feature 1 X 2 X 3 X 4 … … Xn feature 2 feature 3 … featured label y 1 y 2. . . . yn
Pre-process training data (labeled examples) Terrain Unicycletype Weather Go-For-Ride? Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Trail Normal Mountain Snowy NO YES pr p e c o r t a d s s e a Terrain Unicycletype Weather Go-For-Ride? Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES n r a le model/ classifier “better” training data What types of preprocessing might we want to do?
Outlier detection vs feature selection Relevant features feature 1 X 2 X 3 X 4 … … Xn feature 2 feature 3 … featured label y 1 y 2 outliers . . yn
Outlier detection What is an outlier?
Outlier detection An example that is inconsistent with the other examples What types of inconsistencies?
Outlier detection An example that is inconsistent with the other examples - extreme feature values (or far away from other samples) - examples with the same feature values but different labels
Outlier detection An example that is inconsistent with the other examples - extreme feature values (or far away from other samples) - examples with the same feature values but different labels Fix?
Outlier detection An example that is inconsistent with the other examples - extreme feature values (or far away from other samples) - examples with the same feature values but different labels
Removing extreme outliers • Throw out examples with extreme values in one dimension • Throw out examples very far away from any other example • Say, from its neighbors • Train a probabilistic model on the data and throw out “very unlikely” examples • This is an entire field of study by itself! • Called outlier or anomaly detection.
Quick statistics recap What are the mean, standard deviation, and variance of data?
feature 1 Quick statistics recap feature 2 feature 3 … Univariate data x 1 x 2 featured multivariate data x 1 x 2 x 3. . xn xn variance: a measure of how much variation there is in the data. Calculated as: standard deviation: square root of the variance (written as σ) How can these help us with outliers?
Outlier detection Outside 3 times variance Probability smaller than 0. 3%
Outliers in a single dimension Even if the data isn’t actually distributed normally, this is still often reasonable
Kernel bandwidth Density
Outliers for machine learning Some good practices: - Throw out conflicting examples - Throw out any examples with obviously extreme feature values (i. e. many, many standard deviations away) - Check for erroneous feature values (e. g. negative values for a feature that can only be positive) - Let the learning algorithm/other pre-processing handle the rest
So far… 1. 2. Throw outlier examples Which features to us
Feature selection Good features provide us information that helps predict labels. However, not all features are good • Feature pruning is the process of removing “bad” features • Feature selection is the process of selecting “good” features Why we need feature selection? - to help build more accurate, faster, and easier to understand classifiers. • • Performance: enhancing generalization ability alleviating curse of dimensionality Efficiency: speeding up learning process Interpretability: model that is easier to understand
Removing Noisy/useless features • What are useless features? • Binary features: • remove “rare” features, i. e. features that only occur (or don’t occur) a very small number of times feature 1 feature 2 feature 3 … featured No X 1 2. 1 No • remove features that have low variance X 2 2. 05 X 3 No X 4 2. 02 No … 2. 01 • In both cases, can either use thresholds, throw away lowest x%, use … 1. 99 development data, etc. No … 2. 03 Xn 1. 98 Yes • Real-valued features:
So far… 1. 2. 3. Throw outlier examples Remove noisy features Pick “good” features
Feature selection Let’s look at the problem from the other direction, that is, selecting good features. What are good features? How can we pick/select them?
Good features A good feature correlates well with the label 1 0 1 1 0 F 1 F 2 F 3 1 0 1 1 0 0 1 1 1 0 … How can we identify this? - training error (like for DT) - correlation model - statistical test - probabilistic test - …
Filter method for feature selection - for each feature f, determine its relevance score by: - calculate the training error (only feature f is used) - Or calculate the correlation of feature (f) with the label (y) - Or calculate the class separability of this feature (classification problems) - rank each feature by this value - pick top k, top x%, etc. - can use a development set to help pick k or x
Scatter plots showing the correlation between feature(f) and label(y). y f X 2 X 1
So far… 1. 2. 3. Throw outlier examples Remove noisy features Pick “good” features
Feature normalization Length Weight Color Label 4 4 0 Apple 40 4 0 Apple 5 5 1 Apple 50 5 1 Apple 7 6 1 Banana 70 6 1 Banana 4 3 0 Apple 40 3 0 Apple 6 7 1 Banana 60 7 1 Banana 5 8 1 Banana 50 8 1 Banana 5 6 1 Apple 50 6 1 Apple Would our classifiers (k-NN and perceptron) learn the same models on these two data sets? What about for linear regression?
Feature normalization Length Weight Color Label 4 4 0 Apple 40 4 0 Apple 5 5 1 Apple 50 5 1 Apple 7 6 1 Banana 70 6 1 Banana 4 3 0 Apple 40 3 0 Apple 6 7 1 Banana 60 7 1 Banana 5 8 1 Banana 50 8 1 Banana 5 6 1 Apple 50 6 1 Apple k-NN: NO! The distances are biased based on feature magnitude.
Feature normalization Length Weight Label 4 4 Apple 7 5 Apple 5 8 Banana Length Weight Label 40 4 Apple 70 5 Apple 50 8 Banana Which of the two examples are closest to the first?
Feature normalization Length Weight Label 4 4 Apple 7 5 Apple 5 8 Banana Length Weight Label 40 4 Apple 70 5 Apple 50 8 Banana
Feature normalization Length Weight Color Label 4 4 0 Apple 40 4 0 Apple 5 5 1 Apple 50 5 1 Apple 7 6 1 Banana 70 6 1 Banana 4 3 0 Apple 40 3 0 Apple 6 7 1 Banana 60 7 1 Banana 5 8 1 Banana 50 8 1 Banana 5 6 1 Apple 50 6 1 Apple perceptron: NO! The classification and weight update are based on the magnitude of the feature value
Feature normalization Length Weight Color Label 4 4 0 Apple 40 4 0 Apple 5 5 1 Apple 50 5 1 Apple 7 6 1 Banana 70 6 1 Banana 4 3 0 Apple 40 3 0 Apple 6 7 1 Banana 60 7 1 Banana 5 8 1 Banana 50 8 1 Banana 5 6 1 Apple 50 6 1 Apple Linear regression: Yes! The model coefficient W naturally absorbs the re-scaling
Feature normalization Length Weight Color Label 40 4 0 Apple 50 5 1 Apple 70 6 1 Banana 40 3 0 Apple 60 7 1 Banana 50 8 1 Banana 50 6 1 Apple Modify all values for a given feature
Normalize each feature For each feature (over all examples): Center: adjust the values so that the mean of that feature is 0. How do we do this?
Normalize each feature For each feature (over all examples): Center: adjust the values so that the mean of that feature is 0: subtract the mean from all values Rescale/adjust feature values to avoid magnitude bias. Ideas?
Normalize each feature • Terrain Unicycletype Weather Go-For-Ride? Trail Normal Rainy NO Road Normal Sunny YES Trail Mountain Sunny YES Road Mountain Rainy YES Trail Normal Snowy NO Road Normal Rainy YES Road Mountain Snowy YES Trail Normal Sunny NO Road Normal Snowy NO Trail Mountain Snowy YES
So far… 1. 2. 3. 4. Throw outlier examples Remove noisy features Pick “good” features Normalize feature values 1. 2. center data scale data (either variance or absolute)
- Data integration in data preprocessing
- Data preprocessing
- Outlier data mining
- Outlier
- Sequential feature selection
- Aggregation in data preprocessing
- Neural network data preprocessing
- Major tasks in data preprocessing
- High leverage point vs outlier
- Why is the median useful
- Box plot applet
- Outlier analysis adalah
- What is the outlier test
- What is the outlier test
- Logistic regression outliers
- What are the third and first quartiles of the data
- Cách loại bỏ outlier trong spss
- Zxzy statistics
- "outlier property group"
- "outlier property group"
- Lower outlier boundary calculator
- Graclus
- Rich feature hierarchies for accurate object detection
- A small child slides down the four frictionless slides
- Each of the boxes shown is pulled for 10 m
- Feature dataset vs feature class
- Isolated feature combined feature effects
- Text preprocessing steps
- Document preprocessing
- Finite element example
- Image preprocessing
- Preprocessing fem
- Preprocessing in image processing
- Password hashing and preprocessing
- Password hashing and preprocessing
- Dti preprocessing
- Information gain feature selection
- Information gain feature selection
- Sequential feature selection
- Weka feature selection
- This passage is adapted from jane austen
- How are red blood cells adapted
- Adapted with permission from
- In what ways have the highland maya adapted to modern life?