Data Mining and machine learning DM Lecture 3

Data Mining (and machine learning) DM Lecture 3: Basic Statistics for data miners
David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Overview of My Lectures
All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• 25/9: Overview of DM (and of these 8 lectures)
• 02/10: Data Cleaning - usually a necessary first step for large amounts of data
• 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
• 16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
• NO THURSDAY LECTURE OCTOBER 23rd
• 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
• NO THURSDAY LECTURE November 6th
• 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
• 20/11: Regression - the simplest algorithm for predicting data/class values
• 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

Today you will see
The most important theorem in science

Statistical Data Mining
• Definitions
  – Population, Sample, Statistic
• Simple Statistics
  – Mean, Mode, Median
  – Range, Variance, Standard Deviation
• Probability Distributions
  – Normal distribution

Fundamental Statistics Definitions
• A Population is the total collection of all items/individuals/events under consideration
• A Sample is that part of a population which has been observed or selected for analysis. E.g. all students is a population; students at HWU is a sample; this class is a sample, etc.
• A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean). The reason for doing this is almost always to estimate (i.e. make a good guess about) that characteristic in the population

E.g.
• This class is a sample from the population of students at HWU (it can also be considered as a sample of other populations - like what?)
• One statistic of this sample is your mean weight. Suppose that is 65 kg, i.e. this is the sample mean.
• Is 65 kg a good estimate for the mean weight of the population?
• Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population?

Some Simple Statistics
• The Mean (average) is the sum of the values in a sample divided by the number of values
• The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest)
• The Mode is the value that appears most frequently in a sample
• The Range is the difference between the smallest and largest values in a sample
• The Variance is a measure of the dispersion of the values in a sample - how closely the observations cluster around the mean of the sample
• The Standard Deviation is the square root of the variance of a sample
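As a rough illustration of the definitions above, here is a minimal Python sketch using only the standard library. The sample values are made up for the example and are not data from the lecture; the variance used is the divide-by-n form.

```python
import statistics

# Hypothetical sample (made-up values, not data from the lecture).
sample = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(sample)            # sum of the values / number of values
median = statistics.median(sample)        # midpoint of the ordered values
mode = statistics.mode(sample)            # most frequent value
value_range = max(sample) - min(sample)   # largest minus smallest
variance = statistics.pvariance(sample)   # mean squared deviation from the mean
std_dev = statistics.pstdev(sample)       # square root of the variance

print(mean, median, mode, value_range, variance, std_dev)
```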

Standard Deviation and other moments
• The m-th moment about the mean (μ) of a sample is:
$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^m$$
where n is the number of items in the sample.
• The first moment (m = 1) is 0!
• The second moment (m = 2) is the variance
• (and: square root of the variance is the standard deviation)
• The third moment can be used in tests for skewness
• The fourth moment can be used in tests for kurtosis
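The moments described above can be sketched in a few lines of Python. The function name `central_moment` and the sample values are assumptions made for illustration; the standardised skewness and kurtosis shown are the usual moment ratios, which the slides do not spell out.

```python
def central_moment(values, m):
    """m-th moment about the mean: (1/n) * sum((x - mu) ** m)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** m for x in values) / n

# Hypothetical sample (made-up values).
sample = [2, 3, 3, 5, 7, 10]

m1 = central_moment(sample, 1)  # zero, up to floating-point rounding
m2 = central_moment(sample, 2)  # the variance (divide-by-n form)
m3 = central_moment(sample, 3)  # positive here: the long tail is on the right
m4 = central_moment(sample, 4)

# Commonly used standardised measures built from the moments.
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2  # equals 3 for an exact Normal distribution

print(m1, m2, skewness, kurtosis)
```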

Distributions / Histograms
A Normal (aka Gaussian) distribution (image from Mathworld)

Distributions / Histograms
Uniform distributions. Every possible value tends to be equally likely

Probability Distributions
• If a population is expected to match a standard probability distribution then a wealth of statistical knowledge and results can be brought to bear on its analysis
• Many standard statistical techniques are based on the assumption that the underlying distribution of a population is Normal (Gaussian)
• Statistical tests have been developed to determine whether a sampled population is normally distributed

An important aside ...
This is the standard deviation of a sample. Std is the square root of:
$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$
This is slightly different, called the sample standard deviation. Sample Std is the square root of:
$$\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$$
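To make the difference concrete, the following sketch computes both versions by hand and compares them with Python's `statistics.pstdev` (divide by n) and `statistics.stdev` (divide by n - 1). The sample values are hypothetical.

```python
import math
import statistics

# Hypothetical sample (made-up values).
sample = [2, 3, 3, 5, 7, 10]
n = len(sample)
mu = sum(sample) / n
sum_sq = sum((x - mu) ** 2 for x in sample)  # sum of squared deviations

std_n = math.sqrt(sum_sq / n)                # divide by n
std_n_minus_1 = math.sqrt(sum_sq / (n - 1))  # divide by n - 1

print(std_n, statistics.pstdev(sample))
print(std_n_minus_1, statistics.stdev(sample))
```

The n - 1 version is always slightly larger, and the gap shrinks as the sample grows.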

A closer look at the normal distribution
This is the Normal distribution (ND) with mean μ and standard deviation σ

More than just a pretty bell shape
Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12
Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std
So, we can say with some confidence, for example, that 99.7% of the population lies between 1.44 and 2.16 (i.e. within three standard deviations of the mean: 1.8 ± 3 × 0.12)
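The 99.7% figure can be checked with a short sketch using `statistics.NormalDist` from the Python standard library. It assumes, as the slide does, that the population is Normal with the sample's mean and std.

```python
from statistics import NormalDist

mean, std = 1.8, 0.12  # values from the slide

# Three standard deviations either side of the mean.
low, high = mean - 3 * std, mean + 3 * std

# Fraction of a Normal(mean, std) population lying between low and high.
dist = NormalDist(mu=mean, sigma=std)
coverage = dist.cdf(high) - dist.cdf(low)

print(low, high, coverage)
```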

Date        Sales      Returns    Net income
23rd Nov    £25,609    £1,003     £24,506
24th Nov    £26,202    £1,601     £24,601
25th Nov    £28,936    £1,178     £25,758

The Central Limit Theorem
Sir Francis Galton (Natural Inheritance, 1889) described the Central Limit Theorem as:
“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (from the wikipedia article)

The more tosses of the coin in each experiment, the closer the distribution of the number of heads is to a Normal distribution. Same with:
• distribution of the sum of two dice
• distributions of heights, weights, hours watching TV, etc.
(from the wikipedia article)
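The coin-tossing experiment can be simulated with a small sketch. The function name `heads_counts`, the seed, and the experiment sizes are arbitrary choices for illustration. For a Normal distribution roughly 68% of values lie within one standard deviation of the mean, so the last figure printed gives a crude feel for how bell-shaped the counts are.

```python
import random
import statistics

def heads_counts(tosses_per_experiment, experiments, rng):
    """Number of heads seen in each of several coin-tossing experiments."""
    return [
        sum(rng.random() < 0.5 for _ in range(tosses_per_experiment))
        for _ in range(experiments)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable

# 2000 experiments of 100 fair-coin tosses each (arbitrary sizes).
counts = heads_counts(100, 2000, rng)

mean = statistics.fmean(counts)
std = statistics.pstdev(counts)

# Fraction of experiments whose head count is within one std of the mean.
within_one_std = sum(abs(c - mean) <= std for c in counts) / len(counts)

print(mean, std, within_one_std)
```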

The Central Limit Theorem is this:
As more and more samples are taken from a population, the distribution of the sample means conforms to a normal distribution
• The average of the samples more and more closely approximates the average of the entire population
• A very powerful and useful theorem
• The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it and to test for divergence from it due to skewness and kurtosis
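A minimal simulation sketch of this statement follows. It assumes a deliberately skewed, made-up population (exponential), and the helper names `sample_means` and `skewness` are hypothetical. The means of larger samples should cluster more tightly around the population mean and show much less skew than the means of tiny samples.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (exponential, mean about 1).
population = [rng.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

def sample_means(sample_size, num_samples):
    """Mean of each of num_samples random samples of the given size."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

def skewness(values):
    """Third moment about the mean divided by variance ** 1.5."""
    mu = statistics.fmean(values)
    m2 = statistics.fmean([(x - mu) ** 2 for x in values])
    m3 = statistics.fmean([(x - mu) ** 3 for x in values])
    return m3 / m2 ** 1.5

means_small = sample_means(2, 2000)   # samples of size 2
means_large = sample_means(50, 2000)  # samples of size 50

print(pop_mean, statistics.fmean(means_large))
print(skewness(means_small), skewness(means_large))
```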

Remember, MUCH of science relies on making guesses about populations
The CLT helps us make the guesses reasonable rather than crazy.
Assuming a normal distribution, the stats of a sample tell us lots about the stats of the population
And, assuming a normal distribution helps us detect errors and outliers - how?
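One possible answer to the "how?" above, sketched under the assumption that the data are roughly Normal: values more than about three standard deviations from the mean should be very rare, so they are worth checking as possible errors. The function name `flag_outliers`, the threshold of 3, and the heights data are all made up for illustration, and this is only one of several reasonable approaches.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` stds from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) > threshold * std]

# Hypothetical heights in metres; the last entry looks like a data-entry
# error (18.1 typed instead of 1.81).
heights = [
    1.62, 1.70, 1.75, 1.81, 1.68, 1.77, 1.73, 1.85, 1.66, 1.79,
    1.72, 1.74, 1.80, 1.69, 1.76, 1.83, 1.71, 1.78, 1.65, 18.1,
]

suspects = flag_outliers(heights)
print(suspects)
```

Note that a large outlier inflates the mean and std it is judged against, so in very small samples this simple rule can fail to flag anything.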

Testing for Normality: the χ² goodness-of-fit test
This is the classic test of whether a data sample is normally distributed or not
• We first group our data into k classes so that we can form a frequency distribution (the number of data items in each class)
• We calculate the mean and standard deviation of our sample and define a normal distribution based on these values.
• We now need to see if the number of data items in each of our classes matches the number predicted by the normal distribution

The normal distribution - with mean μ and std σ
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
This tells you how to calculate the probability density (relative frequency) for any value x
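As a small sketch, the Normal density can be written directly in Python and checked against `statistics.NormalDist.pdf` from the standard library. The function name `normal_pdf` is chosen for illustration, and the mean 1.8 and std 0.12 are the example values used earlier in the lecture.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and std sigma at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 1.8, 0.12

at_mean = normal_pdf(1.8, mu, sigma)     # the peak of the bell curve
at_3_sigma = normal_pdf(2.16, mu, sigma)  # far out in the tail

print(at_mean, NormalDist(mu, sigma).pdf(1.8))
print(at_3_sigma, NormalDist(mu, sigma).pdf(2.16))
```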

The goodness of fit test simply measures the difference between the bars and the curve - adding up the squared difference for each bar, each divided by the count the curve predicts for that bar.
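The test can be sketched roughly as follows, using only the Python standard library. This is an illustrative sketch, not the lecture's own code: the function name `chi_square_normal_fit`, the equal-width classes, the choice to let the two end classes absorb the tails, and the k - 3 degrees of freedom (the usual convention when the mean and std are estimated from the data) are all assumptions. Turning the statistic into a p-value needs a χ² table or a statistics library, which is left out here.

```python
import random
from statistics import NormalDist, fmean, pstdev

def chi_square_normal_fit(data, k):
    """Return (chi-square statistic, degrees of freedom) for a Normal fit."""
    fitted = NormalDist(fmean(data), pstdev(data))
    lo, hi = min(data), max(data)
    width = (hi - lo) / k

    # Observed count in each of the k equal-width classes (the "bars").
    observed = [0] * k
    for x in data:
        idx = min(int((x - lo) / width), k - 1)
        observed[idx] += 1

    # Expected count in each class under the fitted Normal (the "curve").
    # The first and last classes also absorb the tails beyond the data.
    chi2 = 0.0
    for i in range(k):
        left = fitted.cdf(lo + i * width) if i > 0 else 0.0
        right = fitted.cdf(lo + (i + 1) * width) if i < k - 1 else 1.0
        expected = len(data) * (right - left)
        chi2 += (observed[i] - expected) ** 2 / expected
    return chi2, k - 3

rng = random.Random(2)  # fixed seed so the run is repeatable
normal_data = [rng.gauss(0, 1) for _ in range(1000)]
uniform_data = [rng.random() for _ in range(1000)]

chi2_normal, dof = chi_square_normal_fit(normal_data, 8)
chi2_uniform, _ = chi_square_normal_fit(uniform_data, 8)

print(chi2_normal, chi2_uniform, dof)
```

A larger statistic means a worse fit, so the uniform data should score much higher than the Normal data.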

We can also test for skewness and kurtosis, using higher order moments

The take-home lesson (for those new to statistics)
Your data contains 100 values for x, and you have good reason to believe that x is normally distributed. Thanks to the Central Limit Theorem, you can:
  – Make a lot of good estimates about the statistics of the population
  – Find outliers and spot other problems in the data
It's better to test for Normality though, and also test for skewness and kurtosis, so that you can say: “probably around 0.3% of people use their mobile for >8 hrs per day, although the sample is somewhat skewed to the left so this may be an underestimate ...”

Next week - an actual Data Mining Algorithm!