An Analysis of Data Anomalies in Data Mining

An Analysis of Data Anomalies in Data Mining and Knowledge Discovery in Data
The 2008 International Conference on Data Mining, Las Vegas
Mary C. Malone, Comcast Cable Corporation, USA
James Dullea, Villanova University, Pennsylvania, USA
2010-12-06, 김현숙 (Kim Hyun-sook)

Outline
1. Definitions
2. Types
  - Data Acquisition Anomalies
    - Errors in Manual Data Entry
  - Data Semantic Anomalies
  - Outliers
    - Outliers as Noise
    - Outliers as Interesting Data
3. Detecting
  - Applications for Detecting Data Anomalies
4. Managing
  - Data Cleaning
  - Outlier Mining

Definitions
- English-language definition
  - Data that deviates from a common rule
  - Data that is irregular, somehow different, and peculiar enough that it is not easily classified
- Other definitions
  - Frequent events are considered normal; anomalies are considered rare.
  - Events that deviate substantially from a known (explicit or implicit) model of some domain
  - Noise in the form of irrelevant or meaningless data
  - "Anomalous data" and "erroneous data" are used synonymously.
  - A considerable part of data has quality problems.
    - These problems are labeled as errors, anomalies, or even dirtiness.
- The percentage of anomalous data in a data set is estimated to be relatively small, about 5%.

Types
- Classified according to how they originate within a data source
  - Single-source anomalies: occur within a single data source
  - Multi-source anomalies: occur when different data sources are combined to form a whole
- Classified by a definition of dirty data that uses a "successive hierarchical refinement" approach
  - Missing data
  - Not missing but wrong data
  - Not missing and not wrong but unusable data

Types of Data Anomalies (2/2)
- Five categories defined by Dullea
  - Data acquisition anomalies
  - Data semantic anomalies
  - Maverick values
  - Predictor issues
  - Organization and granularity issues
- Classification scheme defined by Müller
  - Syntax
  - Semantics
  - Coverage
- We incorporate two different approaches for classifying data anomalies as an outline: data acquisition anomalies, data semantic anomalies, and maverick values, augmented with the classifications defined in Müller.

Acquisition
- Errors in Manual Data Entry
  - Missing values in a data tuple
    - Ex) When reporting sex, M or F are the only given choices, but neither choice is selected, leaving the field blank.
  - Empty values
    - The field may or may not have a value, but because the respondent was unsure of the correct value, it was left blank.
    - Ex) zip codes
  - Improper or incorrect input
    - Occurs as a result of typographical errors or incorrect data entry.
    - Ex) PATIENT_ID or SSN
  - Bad questions
    - Ex) "What color are your grandfather's eyes?" Everyone has two grandfathers.
  - Respondent guessing
    - The person entering the data "guesses."
    - Ex) A Cleveland State University study showed 50% of majors as "undecided" for both the spring and fall 1999 terms.

Acquisition
- Such errors can be prevented by creating better data entry systems:
  - defining more accurate questions in surveys,
  - employing better data collection forms, such as radio buttons that enforce mutually exclusive choices on a respondent for a selection such as M or F,
  - using postal systems to check zip codes against input addresses.
- But because of human error, such as typos, some data entry problems are unavoidable.
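
The idea of constraining input at entry time can be sketched in Python. This is a minimal illustration under stated assumptions, not a method from the slides: the field names `sex` and `zip`, the function `validate_entry`, and the five-digit US ZIP format are all hypothetical choices made for the example.

```python
import re

ALLOWED_SEX = {"M", "F"}              # mutually exclusive choices, like radio buttons
ZIP_PATTERN = re.compile(r"\d{5}")    # assumed format: five-digit US ZIP code


def validate_entry(record):
    """Return a list of problems found in one manually entered record."""
    problems = []

    sex = (record.get("sex") or "").strip()
    if not sex:
        problems.append("sex: missing value")
    elif sex not in ALLOWED_SEX:
        problems.append("sex: improper input")

    zip_code = (record.get("zip") or "").strip()
    if not zip_code:
        problems.append("zip: missing value")
    elif not ZIP_PATTERN.fullmatch(zip_code):
        problems.append("zip: improper input")

    return problems
```

A check like this catches blank fields and malformed values, but it cannot catch a well-formed value that is simply wrong, which is why some data entry problems remain unavoidable.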

Acquisition
- Syntactic errors involve erroneous formatting and values used for representing the entities in a database.
  - Lexical error: the table is expected to have five columns, but some or all of the rows contain only four columns.
  - A missing or empty value can result in:
    - a semantic anomaly: a contradiction between values that violates a dependency between attributes
    - a coverage anomaly: the missing value belongs to an attribute that has been assigned a NOT NULL constraint in the database
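
The lexical and coverage cases above can be checked mechanically. The following Python sketch assumes rows arrive as plain lists and that the set of NOT NULL attributes is known; the function name `find_syntactic_anomalies` and its parameters are hypothetical.

```python
def find_syntactic_anomalies(rows, columns, not_null):
    """Split problems into lexical errors and coverage anomalies.

    rows:     list of rows, each a list of values
    columns:  expected column names, in order
    not_null: names of columns that must always have a value
    """
    lexical, coverage = [], []
    for index, row in enumerate(rows):
        if len(row) != len(columns):
            lexical.append(index)            # row does not match the expected layout
            continue
        for name, value in zip(columns, row):
            if name in not_null and value in (None, ""):
                coverage.append((index, name))   # required attribute has no value
    return lexical, coverage
```

For a five-column table, a row with only four values is reported as a lexical error, and a complete row with an empty required attribute is reported as a coverage anomaly.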

Acquisition
- Data Semantic Anomalies
  - Occur when the mapping between the data and its meaning is corrupted, thus altering the meaning of the data in some way.
  - Attributes with the same name but different meanings, or different names and the same meaning
    - Ex) YY or Y for year
  - The naming of one attribute depends on the value of another attribute
    - Ex) the relationship between AGE and DATE_OF_BIRTH for a tuple representing persons
  - Overloaded attributes, where an attribute has more than one meaning
    - Ex) a clothing code in a store, "V 00403 sp 33100502" -> (V) = style, (004) = the color BLUE, etc.
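
The AGE and DATE_OF_BIRTH dependency lends itself to a simple consistency check. This Python sketch assumes a record is a dictionary holding an integer `AGE` and a `datetime.date` `DATE_OF_BIRTH`; the helper names `age_on` and `age_conflicts` are hypothetical.

```python
from datetime import date


def age_on(date_of_birth, as_of):
    """Age in whole years on the date `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def age_conflicts(record, as_of):
    """True when the stored AGE contradicts the stored DATE_OF_BIRTH."""
    return record["AGE"] != age_on(record["DATE_OF_BIRTH"], as_of)
```

The check only reports that the two attributes disagree; deciding which one is wrong still needs outside knowledge.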

- Outlier
  - An observation that lies an abnormal distance from other values in a random sample from a population.
  - In a database, outliers appear as objects that do not conform to the general behavior of the data.
  - Most data mining methods consider outliers to be noise, causing problems for classification or clustering.
- Outlier detection methods
  - Statistical tests based on a probability model of the data
  - Distance measures that identify objects lying a significantly large distance from any other cluster in the data
  - Deviation-based methods that examine differences in the main characteristics of objects in a group to identify outliers
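
The statistical and distance-based ideas can be sketched in a few lines of Python. Both functions are illustrative assumptions rather than the methods used in the paper: the z-score rule stands in for "a statistical test based on a probability model" (it assumes roughly normal data), and the neighbor-count rule is one common way to express "a large distance from the rest of the data". The names, the default `threshold=2.5`, and the parameters `radius` and `min_neighbors` are hypothetical.

```python
import math
import statistics


def zscore_outliers(values, threshold=2.5):
    """Statistical test: flag values far from the mean, in standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []                     # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


def distance_outliers(points, radius, min_neighbors):
    """Distance-based test: flag points with too few other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers
```

Note that in small samples a single extreme value inflates the standard deviation, so the z-score threshold has to be chosen with the sample size in mind.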

- Dullea defines outliers as maverick values
  - They occur at the extreme end of a distribution and vary greatly from the distribution mean.
  - They occur when one or two values strongly bias an average (ex: the average in a group).
  - These maverick values are considered to be detrimental noise that causes the analysis to suffer.
  - The goal in most data mining applications, especially classifiers and predictors, is to detect and remove these outliers from the data.
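
A small worked example shows how one maverick value biases a group average. The numbers below are invented for illustration only.

```python
import statistics

salaries = [40, 42, 45, 47, 50]        # hypothetical group, in thousands
with_maverick = salaries + [400]       # the same group plus one extreme value

mean_before = statistics.mean(salaries)           # 44.8
mean_after = statistics.mean(with_maverick)       # 104
median_before = statistics.median(salaries)       # 45
median_after = statistics.median(with_maverick)   # 46.0
```

The mean more than doubles while the median barely moves, which is why a single extreme value can distort any analysis built on averages.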

- The data is considered interesting only when values occur that fall outside the range of what is expected.
  - Ex) computer network intrusion detection systems
    - The systems first establish a baseline of normal network traffic data.
    - The network is then monitored against this baseline.
    - When network data changes significantly in comparison to the baseline, it indicates possible intrusion.
    - Thus the salient characteristics of the outlying data are what make it useful, valued, and interesting in such systems.
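
The baseline-then-monitor cycle described above can be sketched as follows. This is a deliberately simplified assumption: real intrusion detection systems model many traffic features, whereas this example tracks a single numeric measurement. The class name `BaselineMonitor` and the `tolerance` parameter are hypothetical.

```python
import statistics


class BaselineMonitor:
    """Learn a baseline from normal traffic, then flag significant departures."""

    def __init__(self, baseline_samples, tolerance=3.0):
        self.mean = statistics.fmean(baseline_samples)
        self.spread = statistics.pstdev(baseline_samples)
        self.tolerance = tolerance     # allowed deviation, in standard deviations

    def is_suspicious(self, observation):
        """True when the observation departs significantly from the baseline."""
        if self.spread == 0:
            return observation != self.mean
        return abs(observation - self.mean) / self.spread > self.tolerance
```

For example, a monitor trained on request counts near 100 per minute would accept 103 but flag 150 as a possible intrusion.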

- Types of anomaly detection systems
  - Real time: detect anomalies as they occur, logging issues for later analysis or sometimes sounding alarms that require immediate human intervention.
  - Offline: detect anomalies and manage them within an iterative cycle of detection and cleaning.

- Outlier detection methods
  - Statistical and distance-based: depend on analyzing the overall distribution of the data.
  - Density-based: used when data is not uniformly distributed and therefore cannot benefit from the statistical and distance-based methods.
  - Deviation-based: identifies outliers by examining the main characteristics in a group and how objects deviate from those characteristics.
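
To illustrate why density-based methods help with unevenly distributed data, the Python sketch below compares each point's local sparseness with that of its nearest neighbors. It is a simplified ratio inspired by local-density approaches, not a full implementation of any published algorithm; the function names and the parameter `k` are hypothetical, and the input must contain more than `k` points.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def density_scores(points, k=2):
    """Ratio of a point's mean k-NN distance to that of its neighbors.

    Scores near 1 mean the point is about as dense as its neighborhood;
    much larger scores mean it sits in a sparser spot than its neighbors.
    """
    mean_dist = [
        sum(d for d, _ in knn(points, i, k)) / k for i in range(len(points))
    ]
    scores = []
    for i in range(len(points)):
        neighbor_mean = sum(mean_dist[j] for _, j in knn(points, i, k)) / k
        if neighbor_mean == 0:
            scores.append(1.0 if mean_dist[i] == 0 else float("inf"))
        else:
            scores.append(mean_dist[i] / neighbor_mean)
    return scores
```

With a tight cluster spaced 1 unit apart, a loose cluster spaced 4 units apart, and a stray point about 3 units from the tight cluster, a single global radius wide enough for the loose cluster would also accept the stray point, while the density ratio singles it out.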

- Examples of Data Cleaning
  - Missing or empty values
    - Are often replaced with a mean, median, or midrange value determined by the data expert.
  - Converting names and addresses
    - Some tools help convert names and addresses in several countries, and some can even detect and correct wrongly entered street addresses.
  - Discrepancies between attributes
    - Discrepancies such as those between AGE and DATE_OF_BIRTH, which cause data semantic anomalies, can be handled using data profiling techniques that deduce the correct values by creating metadata, or data about the data.
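
Replacing missing values with a mean, median, or midrange can be sketched as below. The sketch assumes missing entries are represented as `None` in a list of numbers; the function name `impute` and the `strategy` parameter are hypothetical.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or midrange of known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")

    if strategy == "mean":
        fill = statistics.fmean(known)
    elif strategy == "median":
        fill = statistics.median(known)
    elif strategy == "midrange":
        fill = (min(known) + max(known)) / 2   # halfway between the extremes
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [fill if v is None else v for v in values]
```

For the ages `[20, None, 30, 70]` the three strategies fill in 40.0, 30, and 45.0 respectively, which shows why the choice is left to the data expert.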

- Outlier Mining
  - The term used to describe systems that focus on detecting or mining for outliers as indicators of change in the normal behavior of a system.
  - Often, the outliers are used to trigger alarms that require some immediate response or system intervention.
  - Applications
    - Credit card fraud
    - Severe weather prediction
    - Computer network intrusion
    - Outbreaks of disease

- Described some of the most commonly known data anomaly types and compared some of them
- Described how manual data entry systems are highly susceptible to human error
  - Ex) poorly designed data entry systems
- An outlier is either
  - noise that should be removed, or
  - an indicator or alarm that is sought as a goal in outlier mining
- Data cleaning is necessary
  - Noisy data may result in unreliable classifiers or predictors
- Not much literature is available on anomaly detection and management, and what is available is mostly highly case specific, so opportunities for research in this area are wide.