Chapter 4 Numerical Descriptive Techniques

4.2 Measures of Central Location
§ Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g. average)
• Variability or spread

4.2 Measures of Central Location
§ The measure of central location reflects the locations of all the actual data points.
§ How?
• With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.

The Arithmetic Mean
§ This is the most popular and useful measure of central location.
Mean = (Sum of the observations) / (Number of observations)

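The Arithmetic Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.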

The Arithmetic Mean
• Example 4.1: The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
• Example 4.2: Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
$\mu = \frac{42.19 + 38.45 + \dots + 45.77}{N} = 43.59$

The Median
§ The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
• Example 4.3: Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is halfway between the two middle observations, 8 and 9, that is 8.5.
• Comment: Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle observation, 8.

The Mode
§ The Mode of a set of observations is the value that occurs most frequently.
§ A set of data may have one mode (or modal class), or two or more modes.
§ The modal class: For large data sets the modal class is much more relevant than a single-value mode.

The Mode
§ Example 4.5: Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
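
The three measures of central location for Example 4.1 can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative only and are not part of the original slides.

```python
from statistics import mean, median, multimode

# Internet hours reported by the 10 adults in Example 4.1
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

sample_mean = mean(hours)        # sum of the observations / number of observations
sample_median = median(hours)    # even n: midpoint of the two middle observations
sample_modes = multimode(hours)  # every value tied for the highest frequency

# Dropping the longest time (33) leaves an odd number of observations,
# so the median becomes the single middle observation.
median_of_nine = median(sorted(hours)[:-1])

print(sample_mean, sample_median, sample_modes, median_of_nine)
```

`multimode` is used rather than `mode` because a data set may have two or more modes; it returns all of them as a list.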

Relationship among Mean, Median, and Mode
§ If a distribution is symmetrical, the mean, median and mode coincide.
§ If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
• A positively skewed distribution (“skewed to the right”): typically mode < median < mean.
• A negatively skewed distribution (“skewed to the left”): typically mean < median < mode.

The Geometric Mean
§ This is a measure of the average growth rate.
§ Let $R_i$ denote the rate of return in period $i$ ($i = 1, 2, \dots, n$). The geometric mean of the returns $R_1, R_2, \dots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period $n$ as do the actual returns for the $n$ periods.

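The Geometric Mean
§ For the given series of rates of return, the nth period return is calculated by: $(1 + R_1)(1 + R_2) \cdots (1 + R_n)$
§ If the rate of return was $R_g$ in every period, the nth period return would be calculated by: $(1 + R_g)^n$
§ $R_g$ is selected such that $(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, that is, $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$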
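
The definition of the geometric mean can be sketched as follows. The function name and the two-period return series (+100% followed by -50%) are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import math

def geometric_mean_return(returns):
    """Constant per-period rate R_g such that (1 + R_g)^n equals the product of (1 + R_i)."""
    growth = math.prod(1.0 + r for r in returns)   # terminal wealth per unit invested
    return growth ** (1.0 / len(returns)) - 1.0

# Hypothetical series: the investment doubles, then loses half its value.
returns = [1.00, -0.50]

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(arithmetic, geometric)
```

In this hypothetical series the arithmetic mean suggests growth of 25% per period even though the terminal wealth equals the initial investment, which is why the geometric mean is the appropriate measure of average growth.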

4.3 Measures of Variability
§ Measures of central location fail to tell the whole story about the distribution.
§ A question of interest still remains unanswered: How much are the observations spread out around the mean value?

4.3 Measures of Variability
Observe two hypothetical data sets:
• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.

The Range
§ The range of a set of observations is the difference between the largest and smallest observations.
§ Its major advantage is the ease with which it can be computed.
§ Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
But, how do all the observations spread out? The range cannot assist in answering this question.
[Figure: the range shown as the distance from the smallest observation to the largest observation]

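The Variance
§ This measure reflects the dispersion of all the observations.
§ The variance of a population of size $N$, $x_1, x_2, \dots, x_N$, whose mean is $\mu$ is defined as $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
§ The variance of a sample of $n$ observations $x_1, x_2, \dots, x_n$ whose mean is $\bar{x}$ is defined as $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$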

Why not use the sum of deviations?
Consider two small populations:
• A: 8, 9, 10, 11, 12
• B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
• A: 8 – 10 = –2, 9 – 10 = –1, 10 – 10 = 0, 11 – 10 = +1, 12 – 10 = +2; Sum = 0
• B: 4 – 10 = –6, 7 – 10 = –3, 10 – 10 = 0, 13 – 10 = +3, 16 – 10 = +6; Sum = 0
The sum of deviations is zero for both populations and therefore is not a good measure of dispersion.

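The Variance
Let us calculate the variance of the two populations:
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!!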
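
The comparison of the two small populations (A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16) can be reproduced with a minimal sketch. The helper name `sum_of_deviations` is an assumption made for this illustration; `pvariance` is the population variance from Python's standard library.

```python
from statistics import pvariance

pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def sum_of_deviations(values):
    """Sum of (x - mean); positive and negative deviations cancel out."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)

dev_a, dev_b = sum_of_deviations(pop_a), sum_of_deviations(pop_b)
var_a, var_b = pvariance(pop_a), pvariance(pop_b)   # average squared deviation

print(dev_a, dev_b)   # both sums vanish, so they cannot rank dispersion
print(var_a, var_b)   # the variance ranks B as more dispersed than A
```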

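The Variance
Which data set has a larger dispersion? Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets.
[Figure: dot plots of data set A (observations at 1 and 3, mean 2) and data set B (observations at 1 and 5, mean 3)]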

The Variance
$\text{Sum}_A = (1-2)^2 + \dots + (1-2)^2 + (3-2)^2 + \dots + (3-2)^2 = 10$
$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$
$\text{Sum}_A > \text{Sum}_B$. This is inconsistent with the observation that set B is more dispersed.

The Variance
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked. Sum_A contains ten unit terms, so set A has N = 10 observations:
$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$
$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$

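The Variance
§ Example 4.7
• The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
§ Solution
$\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
$s^2 = \frac{(17-14)^2 + (15-14)^2 + \dots + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$ jobs$^2$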

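The Variance – Shortcut method
$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
For the sample 17, 15, 23, 7, 9, 13 of Example 4.7: $s^2 = \frac{1}{6-1}\left[1342 - \frac{84^2}{6}\right] = \frac{166}{5} = 33.2$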
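
As a rough check, the sketch below computes the sample variance of the Example 4.7 data (17, 15, 23, 7, 9, 13) in two ways. It assumes that the "shortcut method" refers to the standard computational identity s² = [Σx² − (Σx)²/n]/(n − 1); the variable names are illustrative.

```python
# Number of jobs the six students of Example 4.7 applied for
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)

# Definition: average squared deviation from the sample mean, with n - 1 in the denominator
x_bar = sum(jobs) / n
var_definition = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)

# Shortcut (assumed identity): needs only the sum and the sum of squares
sum_x = sum(jobs)
sum_x2 = sum(x * x for x in jobs)
var_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(x_bar, var_definition, var_shortcut)
```

Both routes give the same value; the shortcut avoids computing each deviation, which mattered more for hand calculation than it does in code.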

Standard Deviation
§ The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Standard Deviation
§ Example 4.8
• To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
• The distances were recorded.
• Which 7-iron is more consistent?

The Standard Deviation
§ Example 4.8 – solution
[Excel printout from the “Descriptive Statistics” sub-menu]
The innovative club is more consistent and, because the means are close, is considered the better club.

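Interpreting Standard Deviation
§ The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
§ The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
• $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements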

Interpreting Standard Deviation
§ Example 4.9: A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.

Interpreting Standard Deviation
Example 4.9 – solution
§ The empirical rule can be applied (bell shaped histogram).
§ Describing the return distribution
• Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]
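
The intervals of Example 4.9 follow mechanically from the mean and standard deviation. A minimal sketch, with a hypothetical helper name and the approximate shares of the empirical rule hard-coded:

```python
# Approximate shares stated by the empirical rule for k = 1, 2, 3 standard deviations
EMPIRICAL_SHARES = {1: 0.68, 2: 0.95, 3: 0.997}

def empirical_rule_intervals(mean, sd):
    """Return {k: (lower, upper, approximate share)}; valid only for bell-shaped data."""
    return {k: (mean - k * sd, mean + k * sd, share)
            for k, share in EMPIRICAL_SHARES.items()}

# Example 4.9: mean return 10%, standard deviation 8%
intervals = empirical_rule_intervals(10, 8)
for k, (low, high, share) in intervals.items():
    print(f"about {share:.1%} of the returns lie between {low}% and {high}%")
```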

Chebysheff’s Theorem
§ The proportion of observations in any sample that lie within $k$ standard deviations of the mean is at least $1 - 1/k^2$ for $k > 1$.
§ This theorem is valid for any set of measurements (sample, population) of any shape!!

k   Interval                                  Chebysheff                      Empirical Rule
1   $(\bar{x} - s, \bar{x} + s)$              at least 0%  (1 – 1/1²)         approximately 68%
2   $(\bar{x} - 2s, \bar{x} + 2s)$            at least 75%  (1 – 1/2²)        approximately 95%
3   $(\bar{x} - 3s, \bar{x} + 3s)$            at least 89%  (1 – 1/3²)        approximately 99.7%

Chebysheff’s Theorem
§ Example 4.10
• The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
Solution
• At least 75% of the salaries lie between $22,000 and $34,000 [28000 – 2(3000), 28000 + 2(3000)]
• At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 – 3(3000), 28000 + 3(3000)]
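
The bound and the salary intervals of Example 4.10 can be sketched as below. The function names are assumptions made for this illustration.

```python
def chebysheff_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound 1 - 1/k^2 is informative only for k > 1")
    return 1 - 1 / k ** 2

def k_sd_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

# Example 4.10: mean salary $28,000, standard deviation $3,000, skewed histogram
for k in (2, 3):
    low, high = k_sd_interval(28_000, 3_000, k)
    print(f"at least {chebysheff_bound(k):.1%} of salaries lie between ${low:,} and ${high:,}")
```

Because the histogram is positively skewed, the empirical rule does not apply here; Chebysheff's bound is weaker but holds for any shape.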

The Coefficient of Variation
§ The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = s / \bar{x}$
Population coefficient of variation: $CV = \sigma / \mu$
§ This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
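
A small sketch shows how the same standard deviation reads differently against different means. The function name and the three mean values are hypothetical and serve only as illustration.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return sd / mean

# The same standard deviation of 10 measured against three hypothetical means
cv_by_mean = {mean: coefficient_of_variation(10, mean) for mean in (50, 100, 500)}
print(cv_by_mean)
```

The larger the mean, the smaller the proportionate variation that a fixed standard deviation represents.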

4.4 Measures of Relative Standing and Box Plots
§ Percentile
• The pth percentile of a set of measurements is the value for which
  - p percent of the observations are less than that value
  - (100 – p) percent of all the observations are greater than that value.
• Example: Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

Quartiles
§ Commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1, = 25th percentile
• Second (middle) quartile, Q2, = 50th percentile
• Third quartile, Q3, = 75th percentile
• Ninth (upper) decile = 90th percentile

Quartiles
§ Example: Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Quartiles
§ Solution
Sort the observations: 2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30 (15 observations)
The first quartile:
• At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
• At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right hand side.
• The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Location of Percentiles
§ Find the location of any percentile using the formula $L_P = (n + 1)\frac{P}{100}$, where $L_P$ is the location of the $P$th percentile.
§ Example 4.11: Calculate the 25th, 50th, and 75th percentiles of the data in Example 4.1.

Location of Percentiles
§ Example 4.11 – solution
• After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
• The 25th percentile is at location (10 + 1)(25/100) = 2.75.
• The 2.75th location lies three quarters of the way from location 2 (value 0) to location 3 (value 5), which translates to the value 0 + (.75)(5 – 0) = 3.75.

Location of Percentiles
§ Example 4.11 – solution continued
The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

Location of Percentiles
§ Example 4.11 – solution continued
The 75th percentile is one quarter of the distance between the eighth observation (14) and the ninth observation (22), that is 14 + .25(22 – 14) = 16.
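
The worked solution of Example 4.11 can be reproduced with a short sketch. It assumes the location formula L_P = (n + 1)P/100, which is consistent with the 2.75th location used for the 25th percentile of 10 observations. The function name and the clamping of locations that fall outside the data are choices made for this illustration, not part of the slides.

```python
def percentile(values, p):
    """Pth percentile by the location method, interpolating between neighbours."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100          # 1-based, possibly fractional position
    if location <= 1:                     # assumption: clamp to the smallest observation
        return data[0]
    if location >= n:                     # assumption: clamp to the largest observation
        return data[-1]
    lower = int(location)                 # whole part of the location
    fraction = location - lower           # how far to move toward the next observation
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# Internet hours from Example 4.1
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
q1, q2, q3 = (percentile(hours, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

Statistical packages use several slightly different percentile conventions, so results from other tools may differ from this location method on small samples.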

Quartiles and Variability
§ Quartiles can provide an idea about the shape of a histogram.
[Figure: a positively skewed histogram and a negatively skewed histogram, each with Q1, Q2 and Q3 marked]

Interquartile Range
§ This is a measure of the spread of the middle 50% of the observations.
§ A large value indicates a large spread of the observations.
Interquartile range = Q3 – Q1

Box Plot
§ This is a pictorial display that provides the main descriptive measures of the data set:
• L - the largest observation
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest observation
[Figure: box spanning Q1 to Q3 with the median Q2 marked inside; each whisker extends from the box toward S or L, no farther than 1.5(Q3 – Q1)]

Box Plot
§ Example 4.14 (Xm02-01)
• Smallest = 0, Q1 = 9.275, Median = 26.905, Q3 = 84.9425, Largest = 119.63
• Left hand boundary = 9.275 – 1.5(IQR) = –104.226
• Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
• No outliers are found.
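
The boundaries in Example 4.14 follow from the quartiles alone. A minimal sketch, with an illustrative function name; the smallest (0) and largest (119.63) observations are taken from the values shown on the slide.

```python
def outlier_fences(q1, q3):
    """Left and right boundaries located 1.5 interquartile ranges outside the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles reported for Example 4.14
left, right = outlier_fences(9.275, 84.9425)

# Only the extremes need checking: if they are inside, every observation is.
smallest, largest = 0, 119.63
outliers = [x for x in (smallest, largest) if x < left or x > right]

print(round(left, 3), round(right, 4), outliers)
```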

Box Plot
§ Additional Example - GMAT scores
Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS).
• Smallest = 449, Q1 = 512, Median = 537, Q3 = 575, Largest = 788
• 512 – 1.5(IQR) = 417.5
• 575 + 1.5(IQR) = 669.5

Box Plot
GMAT - continued
[Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575 and whiskers from 449 to 669.5; 25% of the scores lie below the box, 50% inside it, 25% above it]
§ Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.

Box Plot
GMAT - continued
The histogram is positively skewed.
[Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575 and whiskers from 449 to 669.5, with the 25% / 50% / 25% shares marked]

Box Plot
§ Example 4.15 (Xm04-15)
• A study was organized to compare the quality of service in 5 drive-through restaurants.
• Interpret the results.
§ Example 4.15 – solution
• Minitab box plot

Box Plot
[Figure: box plots of service times for Jack in the Box, Hardee’s, McDonald’s, Wendy’s and Popeyes, with annotations “Times are symmetric” and “Times are positively skewed”]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the shortest and most consistent.

4.5 Measures of Linear Relationship
§ The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• Covariance - is there any pattern to the way two variables move together?
• Coefficient of correlation - how strong is the linear relationship between two variables?

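Covariance
Population covariance: $\text{COV}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
• $\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
• $N$ is the population size.
Sample covariance: $\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• $\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y).
• $n$ is the sample size.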

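Covariance
§ Compare the following three sets. In each set $\bar{x} = 5$ and $\bar{y} = 20$.

Set 1
xi   yi   (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2    13   -3        -7        21
6    20    1         0         0
7    27    2         7        14
Cov(x, y) = 35/2 = 17.5

Set 2
xi   yi   (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2    27   -3         7        -21
6    20    1         0          0
7    13    2        -7        -14
Cov(x, y) = -35/2 = -17.5

Set 3
xi   yi   (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2    20   -3         0          0
6    27    1         7          7
7    13    2        -7        -14
Cov(x, y) = -7/2 = -3.5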
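
The three sets share the x values 2, 6, 7 and differ only in how the y values 13, 20, 27 are paired with them. A short sketch of the sample covariance, with an illustrative function name, makes the comparison reproducible:

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

xs = [2, 6, 7]
cov_same_direction = sample_covariance(xs, [13, 20, 27])   # y rises with x
cov_opposite = sample_covariance(xs, [27, 20, 13])         # y falls as x rises
cov_weak = sample_covariance(xs, [20, 27, 13])             # no clear pattern

print(cov_same_direction, cov_opposite, cov_weak)
```

The sign of the result gives the direction of the relationship; the magnitude depends on the units of measurement, which is why the coefficient of correlation is used to judge strength.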

Covariance
§ If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
§ If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
§ If the two variables are unrelated, the covariance will be close to zero.

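The coefficient of correlation
§ This coefficient answers the question: How strong is the association between X and Y?
Population coefficient of correlation: $\rho = \frac{\text{COV}(X, Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{\text{cov}(x, y)}{s_x s_y}$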

The coefficient of correlation
• $\rho$ or $r$ = +1: strong positive linear relationship, COV(X, Y) > 0
• $\rho$ or $r$ = 0: no linear relationship, COV(X, Y) = 0
• $\rho$ or $r$ = -1: strong negative linear relationship, COV(X, Y) < 0

The coefficient of correlation
§ If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
§ If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
§ No straight line relationship is indicated by a coefficient close to zero.

The coefficient of correlation and the covariance – Example 4.16
§ Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
§ Solution
• We believe GMAT affects GPA. Thus
  - GMAT is labeled X
  - GPA is labeled Y

The coefficient of correlation and the covariance – Example 4.16

Student   x       y       x²          y²      xy
1         599     9.6     358,801     92.16   5,750.4
2         689     8.8     474,721     77.44   6,063.2
3         584     7.4     341,056     54.76   4,321.6
4         631     10      398,161     100     6,310
…         …       …       …           …       …
11        593     8.8     351,649     77.44   5,218.4
12        683     8       466,489     64      5,464
Total     7,587   106.4   4,817,755   957.2   67,559.2

cov(x, y) = (1/(12 – 1))[67,559.2 – (7,587)(106.4)/12] = 26.16
s_x = {(1/(12 – 1))[4,817,755 – (7,587)²/12]}^.5 = 43.56
s_y = (calculated similarly to s_x) = 1.12
r = cov(x, y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362
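
The hand calculation can be checked from the column totals alone (n = 12, Σx = 7,587, Σy = 106.4, Σx² = 4,817,755, Σy² = 957.2, Σxy = 67,559.2). The sketch below uses the shortcut forms shown in the solution; the variable names are illustrative.

```python
import math

# Column totals from the Example 4.16 table
n = 12
sum_x, sum_y = 7_587, 106.4
sum_x2, sum_y2, sum_xy = 4_817_755, 957.2, 67_559.2

# Shortcut formulas: each uses only sums, not the individual deviations
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Carrying full precision through the division gives a coefficient that differs in the fourth decimal from the value obtained with the rounded inputs 26.16, 43.56 and 1.12.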

The coefficient of correlation and the covariance – Example 4.16 – Excel
§ Use the Covariance option in Data Analysis.
§ If your version of Excel returns the population covariance and variances, multiply each one by n/(n – 1) to obtain the corresponding sample values.
§ Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix, population values
        GPA      GMAT
GPA     1.15
GMAT    23.98    1739.52

Sample values (population values × 12/(12 – 1))
        GPA      GMAT
GPA     1.25
GMAT    26.16    1897.62

The coefficient of correlation and the covariance – Example 4.16 – Excel
§ Interpretation
• The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
• The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.

The Least Squares Method
§ We are seeking a line that best fits the data when two variables are (presumably) related to one another.
§ We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized: minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• $y_i$: the actual y value of point i
• $\hat{y}_i$: the y value of point i calculated from the equation $\hat{y}_i = b_0 + b_1 x_i$

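The Least Squares Method
[Figure: scatter plot of Y against X with candidate lines and the errors between the data points and each line]
Different lines generate different errors, thus different sums of squares of errors. There is a line that minimizes the sum of squared errors.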

The Least Squares Method
The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squares of errors are calculated from the data:
$b_1 = \frac{\text{cov}(x, y)}{s_x^2}$, $b_0 = \bar{y} - b_1 \bar{x}$

The Least Squares Method
§ Example 4.17
• Find the least squares line for Example 4.16 (Xm04-16.xls)
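
A minimal sketch of the least squares calculation follows. It assumes the standard closed-form coefficients b1 = cov(x, y)/s_x² and b0 = ȳ − b1·x̄. Because the full Example 4.16 data set is not reproduced here, the sketch is run on the small first set from the covariance comparison (x = 2, 6, 7; y = 13, 20, 27); the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Intercept b0 and slope b1 of the line minimising the sum of squared errors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    if var_x == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    b1 = cov_xy / var_x          # slope: covariance over the variance of x
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

xs, ys = [2, 6, 7], [13, 20, 27]
b0, b1 = least_squares_line(xs, ys)

# Sum of squared errors of the fitted line, for comparison with any other candidate line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(b0, b1, sse)
```

Any other choice of intercept and slope produces a larger sum of squared errors on the same points, which is the defining property of the least squares line.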