Data Mining and machine learning Lecture 2 Some

  • Slides: 80
Download presentation
Data Mining (and machine learning) Lecture 2: Some common data processing tasks David Corne,

Data Mining (and machine learning) Lecture 2: Some common data processing tasks David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Today • • • About classification A simple classification method (1 -NN) Normalisation of

Today • • • About classification A simple classification method (1 -NN) Normalisation of instances and fields Discretization of fields Coursework B David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Some notation A dataset has N records {R 1, R 2, …, RN}, …

Some notation A dataset has N records {R 1, R 2, …, RN}, … each of which has F fields {A 1, A 2, …, AF}. Field AF is the target field – the one we want to predict. v(i, j) is the value of field Aj in record Ri So, for example, what is David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Some more terminology Sometimes the values of a field are numeric. We call this

Some more terminology Sometimes the values of a field are numeric. We call this a numeric field (!) Sometimes the values of a field are words or categories. E. g. {red, blue, green} {true, false} {male, female} {low, medium, high} this type of field is called a categorical field, or a nominal field. David Corne, , Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

So, what is this mostly about? You have some data: 3 7 2 9

So, what is this mostly about? You have some data: 3 7 2 9 1 A 5 8 2 9 2 B 4 1 1 1 3 B 6 8 2 1 4 B David Corne, , Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

So, what is this mostly about? You have some data: 3 7 2 9

So, what is this mostly about? You have some data: 3 7 2 9 1 A 5 8 2 9 2 B 4 1 1 1 3 B 6 8 2 1 4 B You get a new line of data, but it is not classified 3 6 1 8 1 ? David Corne Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

So, what is this mostly about? You have some data: 3 7 2 9

So, what is this mostly about? You have some data: 3 7 2 9 1 A 5 8 2 9 2 B 4 1 1 1 3 B 6 8 2 1 4 B You get a new line of data, but it is not classified 3 6 1 8 1 ? You need to predict the class value David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Broadly how it’s done In order to be able to predict the class value

Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Broadly how it’s done In order to be able to predict the class value

Broadly how it’s done In order to be able to predict the class value of any new line of data that comes along, this is what happens: Get your already-classified dataset Run a machine learning method on these data, which produces a classifier Use the classifier to make predictions of the class value for new instances David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

E. g. here is some data we saw before 3 7 2 9 1

E. g. here is some data we saw before 3 7 2 9 1 A 5 8 2 9 2 B 4 1 1 1 3 B 6 8 2 1 4 B If we run a machine learning method like RIPPER, the classifer that comes out is a rule (or a set of rules), e. g. : If (A 1<5) AND (A 5<2) THEN class is A If (A 2>1) THEN class is B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

E. g. here is some data we saw before 3 7 2 9 1

E. g. here is some data we saw before 3 7 2 9 1 A 5 8 2 9 2 B 4 1 1 1 3 B 6 8 2 1 4 B If instead we do machine learning with a Neural Network, or Naive Bayes, or Regression, etc …, then the output classifier is (more or less) a thresholded mathematical function of the attribute values. The type of function varies a lot between methods, but, it might be e. g. : With the convention that: if this is < 0 the class is A, otherwise B David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Or the classifier could be a decision tree, such as: yes A 4<5 ?

Or the classifier could be a decision tree, such as: yes A 4<5 ? Class is A A 1>2 ? no Class is B no A 4>7 ? Class is B David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Applications Just a reminder: you should already be developing an awareness of this from

Applications Just a reminder: you should already be developing an awareness of this from previous lectures, from browsing data repositories, and from using your imagination. Business, Commerce, Engineering, Medicine, and Science provide endless important applications for this process, in which: • there are existing classified datasets, that we can learn from • and for which producing a classifier that can predict class values for new data instances is highly valuable. David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

An Important Issue Overfitting: It is very possible to learn a classifier that gives

An Important Issue Overfitting: It is very possible to learn a classifier that gives accurate performance on the existing classified dataset, but which performs very poorly in its predictions on new, previously unseen data. I will have more to say on overfitting, and how to combat it, in later lectures. David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

The main issue for today Dataset preparation: There are various things we can and

The main issue for today Dataset preparation: There are various things we can and should do to our dataset so that we make it easier, or indeed possible, to apply our chosen machine learning method. E. g. we may want to apply the ID 3 Decision Tree learning algorithm – but this only works with categorical data. We may want to apply a Neural Network, but this needs the data to be all numeric, and for all values to be relatively small (e. g. between -1 and 1). Etc. . David Corne Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Useful interlude: 1 -NN classification • Almost the simplest possible classification method. • The

Useful interlude: 1 -NN classification • Almost the simplest possible classification method. • The `classifier’ is just the dataset itself • Predict the class of a new instance new by finding the instance c in the dataset that is closest to new, and predicting that it will have the same class value as c David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

How to work out distance between two instances? Depends on the data, and on

How to work out distance between two instances? Depends on the data, and on your common sense. More on that later; but for now let’s say we are only dealing with numeric data. One way to work out distance is the SSD (sum-squared distance) If you have two records a and b, then the distance between them can be given by: Just add up the squared differences of the fields, omitting the class field. Why don’t I take the square root of this? David Corne, , Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Visualisation of 1 -NN Suppose you have data with two numeric fields and one

Visualisation of 1 -NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2 D points, coloured by the class. Suppose this is the pre-classified dataset: David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Visualisation of 1 -NN Suppose you have data with two numeric fields and one

Visualisation of 1 -NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2 D points, coloured by the class. Suppose this is the pre-classified dataset: A new unclassified instance comes along – what is the class? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Visualisation of 1 -NN Suppose you have data with two numeric fields and one

Visualisation of 1 -NN Suppose you have data with two numeric fields and one class field, which is either red or green. We can treat the data as 2 D points, coloured by the class. Suppose this is the pre-classified dataset: The way you answer this visually is very similar to the way 1 -NN works David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Life is difficult / interesting Just an aside to point out that, in many

Life is difficult / interesting Just an aside to point out that, in many interesting and important classification problems, the situation is more like this: David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

The performance of 1 -NN classification Calculate it for any dataset of R records

The performance of 1 -NN classification Calculate it for any dataset of R records like this: Initialise accuracy to 0. For each record r Find the closest other record, o, in the dataset If the class value of o is the same as the class value of r, than add 1 to accuracy Return 100 * accuracy/R Thereby, you obtain a percentage accuracy result David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Scaling, Normalisation and Discretization So we will learn these things: Ways to do normalisation

Scaling, Normalisation and Discretization So we will learn these things: Ways to do normalisation and scaling (which may be necessary or sensible, depending on how the data were generated, and on what DM/ML methods you want to use) Ways to do discretization (for turning numeric fields into categorical fields) David Corne, , Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Normalisation You have data about word-counts for specific words on web pages. You want

Normalisation You have data about word-counts for specific words on web pages. You want to be able to predict whether the page is about sport or business. “winner” “game” “team” “sales” “run” Page content 16 22 81 75 10 business 12 14 44 16 12 business 4 7 20 0 2 sport 2 3 7 6 1 ? ? ? What category would 1 -NN predict for the 4 th record? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Normalisation The closest record to the 4 th is the 3 rd, so it

Normalisation The closest record to the 4 th is the 3 rd, so it would be predicted to be sport. But this is probably wrong. Why? “winner” “game” “team” “sales” “run” Page content 16 22 81 75 10 business 12 14 44 16 12 business 4 7 20 0 2 sport 2 3 7 6 1 ? ? ? David Corne, r, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Normalisation The closest record to the 4 th is the 3 rd, so it

Normalisation The closest record to the 4 th is the 3 rd, so it would be predicted to be sport. But this is probably wrong. Why? “winner” “game” “team” “sales” “run” Page content 16 22 81 75 10 business 12 14 44 16 12 business 4 7 20 0 2 sport 2 3 7 6 1 ? ? ? How could you pre-process the data to help ML methods? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Record-wise Normalisation In these data, the sensible thing to do is transform each record

Record-wise Normalisation In these data, the sensible thing to do is transform each record so that it is normalised by the total number of words. This makes them more comparable, each providing a “fingerprint” in terms of the relative proportions of the probe words. Sometimes this is necessary, sometimes it is useful – it just takes common sense. David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Record-wise normalised version Each row is simply scaled so that he numeric fields add

Record-wise normalised version Each row is simply scaled so that he numeric fields add up to 1 “winner” “game” “team” “sales” “run” Page content 0. 0784 0. 1078 0. 397 0. 368 0. 049 business 0. 12 0. 14 0. 45 0. 16 0. 12 business 0. 12 0. 21 0. 61 0 0. 061 sport 0. 105 0. 158 0. 368 0. 316 0. 053 ? ? ? 1 -NN now shows record 4 closer to 1 and 2 David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Record-wise Normalisation Using the notation in slide 3 – given a dataset of N

Record-wise Normalisation Using the notation in slide 3 – given a dataset of N numeric fields, how would you represent recordwise normalization of record r ? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Min-Max Normalisation This is columnwise normalisation, done separately for each field. For each numeric

Min-Max Normalisation This is columnwise normalisation, done separately for each field. For each numeric attribute, scale it so that each value is in a specific range [a, b]. E. g. usually [0, 1], sometimes [-1, 1], etc. E. g. if we use [0, 1], we do this for each record r. Where min(i) is the smallest value for field i in the dataset, max(i) is the largest, and range(i) is max(i) – min(i). David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

When min-max normalisation might be useful Height (mm) Weight (kg) IQ Mark in DMML

When min-max normalisation might be useful Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 ? ? ? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 ? ? ? What is the closest record to record 5? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 ? ? ? What is the closest record to record 5? Will this be a good predictor for DMML Mark? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 ? ? ? What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901

Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 ? ? ? What is the closest record to record 5? Will this be a good predictor for DMML Mark? Which field will be most important for predicting Mark? How does min-max normalisation help? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Discretization Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775

Discretization Height (mm) Weight (kg) IQ Mark in DMML module 1856 1502 1904 1775 1901 75 56 86 61 81 95 101 112 110 92 70 82 90 90 65 We can’t run the ID 3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Discretization Height (mm) Weight (kg) IQ Mark in DMML module tall short tall medium

Discretization Height (mm) Weight (kg) IQ Mark in DMML module tall short tall medium tall heavy light heavy medium smart medium A A+ A+ A+ B We can run the ID 3 decision tree algorithm on this. David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Discretization is simply a process that converts a numerical field into a class field.

Discretization is simply a process that converts a numerical field into a class field. How might you do this? David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Equal Width Binning (EWB) EWB(5), for example, means you discretize into 5 values, where

Equal Width Binning (EWB) EWB(5), for example, means you discretize into 5 values, where each value has equal `width’. If field f values range from 0 to 100, then each bin has width 20. In the converted dataset, we can just label this bins 1, 2, …, 5, or we can use appropriate linguistic terms, such as: {very low, medium, high, very high} David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Equal Frequency Binning (EFB) EFB(3), for example, means you discretize into 3 values, where

Equal Frequency Binning (EFB) EFB(3), for example, means you discretize into 3 values, where each bin covers around the same amount of data. David Corne, , Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Contrived Example for EWB(2) Height (mm) Weight IQ (kg) DMM L mark 1856 75

Contrived Example for EWB(2) Height (mm) Weight IQ (kg) DMM L mark 1856 75 95 70 high low 1502 74 101 82 low high 1904 77 112 90 high 1775 61 110 90 high low 1901 76 92 70 high low high low David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Other discretization methods There are others, which almost all differ from EWB and EFB

Other discretization methods There are others, which almost all differ from EWB and EFB in one respect: they use class value information in attempt to find bins that make good sense with regard to classification. We may look at those later David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Finally: Missing values Often, a dataset has missing values for some fields in some

Finally: Missing values Often, a dataset has missing values for some fields in some records. E. g. a certain blood test was not taken for some patients. A questionnaire response for some question was unreadable, etc… David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Coursework B There are several ways to deal with missing values. Find out yourself

Coursework B There are several ways to deal with missing values. Find out yourself Do some research with google, or google scholar (or your favourite other search engine). Write a report of only ~100 words, entitled “Methods for dealing with missing values in datasets” Provide at least 2 relevant references (to papers or URLs) at the end of your report. The whole thing should only take 30 mins— 1 hr David Corne, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Next week Basic statistics Histograms A little more normalisation and discretization Coursework 1 David

Next week Basic statistics Histograms A little more normalisation and discretization Coursework 1 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Extra slides on Data Quality / Data Cleaning I am not teaching Data Quality

Extra slides on Data Quality / Data Cleaning I am not teaching Data Quality / Data Cleaning as part of F 21 DL this year. But here are some slides you may be interested in anyway David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

On Data Quality Suppose you have a database sitting in front of you, and

On Data Quality Suppose you have a database sitting in front of you, and I ask ``Is it a good quality database? ’’ What is your answer? What does quality depend on? Note: this is about the data themselves, not the system in use to access it. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date,

A Conventional Definition of Data Quality Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent ; meaning … David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality, continued … Accurate: This refers to how the

A Conventional Definition of Data Quality, continued … Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table? Barratt Burns Carter Davies John Robert Laura Michael 22 24 20 12 Maths CS Physics CS BSc MSc BSc Male Female Male David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality, continued … Complete: This refers to whether or

A Conventional Definition of Data Quality, continued … Complete: This refers to whether or not the database really contains everything it is supposed to contain. E. g. a patient’s medical records should contain references to all medication prescribed to date for that patient. The BBC TV Licensing DB should contain an entry for every address in the country. Does it? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality, continued … Unique: Every separate datum appears only

A Conventional Definition of Data Quality, continued … Unique: Every separate datum appears only once. How many `Data Quality errors’ can you find in the following table, and what types are they? Surname Firstname Do. B Driving test passed: Smith J. Jack Jock 17/12/85 17/12/95 17/12/05 17/12/2005 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality, continued … Up-to-date: The data are kept up

A Conventional Definition of Data Quality, continued … Up-to-date: The data are kept up to date. The post office recently changed my postcode from EX 6 8 RA to EX 6 8 NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance? Can you think of a DB where it doesn’t matter whether or not the data are kept up to date? ? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Conventional Definition of Data Quality, continued … Consistent: The data contains no logical

A Conventional Definition of Data Quality, continued … Consistent: The data contains no logical errors or impossibilities. It makes sense in and of itself. Why is the following mini DB inconsistent? Date 23 rd Nov 24 th Nov 25 th Nov Sales £ 25, 609 £ 26, 202 £ 28, 936 Returns £ 1, 003 £ 1, 601 £ 1, 178 Net income £ 24, 506 £ 24, 601 £ 25, 758 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Note: This definition of data quality is not much use, since there is no

Note: This definition of data quality is not much use, since there is no way to measure DQ sensibly Completeness: How will we know? ? Uniqueness: It is hard to tell whether two entries are similar, or duplicates! Up-to-date-ness: How do we know? Consistent: consistency errors can be very hard to find, especially in a very large DB The database research `world’ is actively engaged in finding ways to measure data quality sensibly. In the meantime, we just use common sense to avoid dirty data at all points of the DQ continuum David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

The Data Quality Continuum It’s rare that a datum is entered once into a

The Data Quality Continuum It’s rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, into which errors can arise at each and every stage. The continuum is: – – – Data gathering Data delivery Data storage Data integration Data retrieval Data analysis So, if we want to monitor DQ, we need to monitor it at each of these stages David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

DQ Continuum: Example This is an example I am familiar with, helping to illustrate

DQ Continuum: Example This is an example I am familiar with, helping to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It’s a charity funded by various governments. Their role is to be the repository for recording all earthquake events on the planet. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

DQ Continuum: ISC example: gathering Data gathering centres ISC David Corne, and Nick Taylor,

DQ Continuum: ISC example: gathering Data gathering centres ISC David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

See 2006 earthquake data via my dmml page David Corne, and Nick Taylor, Heriot-Watt

See 2006 earthquake data via my dmml page David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

DQ Continuum: ISC example: delivery Raw seismograph data from local collection points to DG

DQ Continuum: ISC example: delivery Raw seismograph data from local collection points to DG centres. Email or ftp to ISC; some centres provide raw data, some provide interpreted data (e. g. maybe won’t send some data if they believe it in error in the first place) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

DQ Continuum: ISC example: integration The ISC’s role is actually to figure out where

DQ Continuum: ISC example: integration The ISC’s role is actually to figure out where and when the Earth tremors were (there are hundreds per month) based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand brain, and record their findings in archival CDs David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

DQ Continuum: ISC example: retrieval/analysis You can get a CD from ISC anytime, for

DQ Continuum: ISC example: retrieval/analysis You can get a CD from ISC anytime, for the earth tremor activity on any particular day. I’m not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

The ISC DQ Continuum Where might there occur errors, of: Accuracy? Completeness? Uniqueness? Timeliness?

The ISC DQ Continuum Where might there occur errors, of: Accuracy? Completeness? Uniqueness? Timeliness? Consisency? What else is important in this case? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Where DQ problems occur (gathering) • Manual data entry (how can we improve this?

Where DQ problems occur (gathering) • Manual data entry (how can we improve this? ) • Lack of uniform standards format and content. • Duplicates arising from parallel entry • Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints. • Measurement errors. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Where DQ problems occur (delivery) • Multiple hops from source to DB – problems

Where DQ problems occur (delivery) • Multiple hops from source to DB – problems can happen anywhere • Inappropriate pre-processing (e. g. removing some `small’ seismograph readings before sending on to ISC; rounding up or down, when the destination needs more accurate data). • Transmission problems: buffer overflows, checks (did all files arrive, and all correctly? ) David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Where DQ problems occur (storage) • Poor, out of date or inappropriate metadata •

Where DQ problems occur (storage) • Poor, out of date or inappropriate metadata • Missing timestamps • conversion to storage format (e. g. to excel files, to higher/lower precision David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Where DQ problems occur (integration) This is the business of combining datasets – e.

Where DQ problems occur (integration) This is the business of combining datasets – e. g. from different parts of a company, from (previously) different companies following an acquisition; from different government agencies, etc. • Different keys, different fields, different formats • Different definitions (`customer’, `income’, …) • Sociological factors: reluctance to share! David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Where DQ problems occur (retrieval/analysis) The problem here is usually the quality of DBs

Where DQ problems occur (retrieval/analysis) The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because: • The source DB is not properly understood! • Straightforward mistakes in the queries that retrieve the relevant data. E. g. A database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

What Keeps DBs Dirty A good DBMS will have built in tools for: Consistency

What Keeps DBs Dirty A good DBMS will have built in tools for: Consistency in data types Consistency in field values Constraints and checks that deal with Null values, Outliers, Duplication. Automatic timestamps Powerful query language (makes retrieval logic errors less likely) … so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

… all this: • Consistency constraints are often not applied, or are applied! –

… all this: • Consistency constraints are often not applied, or are applied! – suppose height is not allowed to go over 2 metres in a school student DB – My postcode problem • The data are just too numerous, complex and illunderstood. `Cleaning it’ would cost too much! • Undetectable problems: incorrect values, missing entries • Metadata not maintained properly David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Single Source vs Multiple Source Schema Level v Instance Level One useful way to

Single Source vs Multiple Source Schema Level v Instance Level One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether the problems are the sort we can get if we have just one source of data, or whether the problem arises directly from trying to combine data from multiple sources Problems can also be schema level or instance level David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Single Source / Schema level examples Scope Problem Unclean Notes attribute Illegal values Do.

Single Source / Schema level examples Scope Problem Unclean Notes attribute Illegal values Do. B=30. 02. 88 Values out of range record Violated attribute dependencies Car-owner = No, make = Toyota Make should clearly have a Null value here. Record type Source Uniqueness violations Name= Jo Smith, NUS no. = 3067 Name= Ed Brown, NUS no. = 2124 NUS no. s should be unique Referential integrity violation Name= D Corne, Office = EM G. 92 Where is G. 92 ? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Single Source / Instance level examples Scope Problem Unclean Notes attribute missing values, mis-spellings,

Single Source / Instance level examples Scope Problem Unclean Notes attribute missing values, mis-spellings, abbreviations, Misfields, Embedded vals Top speed = 0 mph, Dummy entries – values unavailable at entry time, human error record Violated attribute dependencies City = Edinburgh. Postcode = EX 6 Record type Word transposition, Name= Jo Smith, Name = Carr, Jim Name= J. Smith, Name = Joe Smith Name = Jo Smith, Do. B = 17/12/62 Name = Jo Smith, Do. B = 17/11/62 Source Wrong references Duplicates, contradictions Title = Dark Side of the Moan Full. Name = J. Smith Colour = 160 mph Phone = “Dan Jones 0207 308653” Name= D Corne, Office = EM G. 46 exists, but is not my office. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Multiple Source Problems/ Instance and Schema level examples The Smiths buy books and music

Multiple Source Problems/ Instance and Schema level examples The Smiths buy books and music online from company A: Customer ID Name Street City Sex 102 Luke Smith 5 Chewie Rd Dawlish, Devon 0 175 Leia Smith Chewie St, 5 Dawlish 1 They also buy books and music online from company B: Client ID Last. Name Other names Phone Gender 23 Smith Luke Michael 378988 Male 35 Smith Leia S. +44(0)1626 378988 F David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

When Companies A and B merge, various problems arise when they merge their DBs

When Companies A and B merge, various problems arise when they merge their DBs Combining customer fields and client fields – are they really the same things? How to ensure that Company A’s customer 37 and Company B’s client 37 get separate entries in the new DB. Are Luke Smith and Luke Michael Smith the same person? ? Do Luke and Leia live at the same address? Etc … A forced `fast resolution’ to these problems will usually lead to errors in the new `integrated’ DB David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

A Special but Common type of Problem: Semantic Complexity (SC) is the state of

A Special but Common type of Problem: Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent. E. g. Local Police DB keep record of all crimes in an area, where the key is the victim’s name. When someone who was a victim moves to a different area, they remove all records relating to that person. The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Semantic Complexity: Missing/Default Values One source of semantic complexity is the different meanings that

Semantic Complexity: Missing/Default Values One source of semantic complexity is the different meanings that missing values can have. E. g. Suppose the histogram of value types in mobile phone no. field is: David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

What does NULL mean? A. This record is of someone who does not have

What does NULL mean? A. This record is of someone who does not have a mobile phone? B. This record is of someone who has a mobile phone, but chose not to supply the number? C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL? Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types. What about the All zero and All nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, 0000000 for type B and 9999999 for type C. The above relate to a quite simple form of semantic complexity – but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Cleaning via basic data analysis Data Profiling: examine the instances to see how the

Cleaning via basic data analysis Data Profiling: examine the instances to see how the attributes vary. E. g. Automatically generate a histogram of values for that attribute. How does the histogram help us in finding problems in this case? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

What problems does this analysis alert us to? David Corne, and Nick Taylor, Heriot-Watt

What problems does this analysis alert us to? David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html

Acknowledgements I adapted the data quality/cleaning material from various sources, most notably: • A

Acknowledgements I adapted the data quality/cleaning material from various sources, most notably: • A ppt presentation called `Data Quality and Data Cleaning: An Overview’ by Tamrapani Dasu and Theodore Johnson, at AT & T Labs • A paper called `Data Cleaning: Problems and Current Approaches’, by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany. My thanks to these researchers for making their materials freely available online. David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail. com These slides and related resources: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html