# Chapter 4 Numerical Descriptive Techniques

• Slides: 67

4.2 Measures of Central Location

• Usually, we focus our attention on two types of measures when describing population characteristics:
  • Central location (e.g. average)
  • Variability or spread
• The measure of central location reflects the locations of all the actual data points.

4.2 Measures of Central Location

• The measure of central location reflects the locations of all the actual data points.
• How?
  • With one data point, clearly the central location is at the point itself.
  • With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
  • But if the third data point appears on the left hand-side of the midrange, it should “pull” the central location to the left.

The Arithmetic Mean

• This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

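The Arithmetic Mean

Sample mean, where $n$ is the sample size:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean, where $N$ is the population size:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$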

The Arithmetic Mean

• Example 4.1: The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \cdots + 22}{10} = 11.0$$

• Example 4.2: Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{42.19 + 38.45 + \cdots + 45.77}{N} = 43.59$$
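The mean in Example 4.1 can be checked with a short Python sketch. The variable names are illustrative and not part of the slides.

```python
# Reported Internet hours for the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Arithmetic mean: sum of the observations divided by the number of observations.
mean_hours = sum(hours) / len(hours)
print(mean_hours)
```

The observations sum to 110, so the result agrees with the 11.0 hours reported in the example.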

The Median

• The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
• Example 4.3: Find the median of the time on the Internet for the 10 adults of Example 4.1.
  • Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle values, (8 + 9)/2 = 8.5.
• Comment: Suppose only 9 adults were sampled (exclude, say, the longest time, 33).
  • Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.
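A minimal sketch of the median rule for odd and even sample sizes, applied to the data of Example 4.1. The helper name `median` is an assumption made for illustration.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

median_ten = median(hours)                              # even number of observations
median_nine = median([h for h in hours if h != 33])     # longest time excluded
print(median_ten, median_nine)
```

With all ten observations the two middle values are 8 and 9; with the longest time removed the single middle value is 8.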

The Mode

• The mode of a set of observations is the value that occurs most frequently.
• A set of data may have one mode (or modal class), or two or more modes.
• For large data sets the modal class is much more relevant than a single-value mode.

The Mode

• Example 4.5: Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
• Solution
  • All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
  • Is this a good measure of central location?
  • The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
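The mode in Example 4.5 can be found by counting occurrences. The sketch below uses `collections.Counter` from the Python standard library and returns every value that reaches the highest count, so it also covers data sets with two or more modes.

```python
from collections import Counter

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Count how often each value occurs, then keep the value(s) with the highest count.
counts = Counter(hours)
highest = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest)
print(modes, highest)
```

Only the value 0 occurs twice, so it is the single mode even though it sits at the edge of the data.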

Relationship among Mean, Median, and Mode

• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
  • A positively skewed distribution (“skewed to the right”): mode < median < mean.
  • A negatively skewed distribution (“skewed to the left”): mean < median < mode.

The Geometric Mean

• This is a measure of the average growth rate.
• Let $R_i$ denote the rate of return in period $i$ ($i = 1, 2, \ldots, n$). The geometric mean of the returns $R_1, R_2, \ldots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period $n$ as do the actual returns for the $n$ periods.

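The Geometric Mean

For the given series of rates of return, the $n$th period return is calculated by:

$$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$$

If the rate of return was $R_g$ in every period, the $n$th period return would be calculated by:

$$(1 + R_g)^n$$

$R_g$ is selected such that the two are equal:

$$(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n), \qquad R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$$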
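The definition of the geometric mean can be sketched in Python. The two returns below (+100% followed by -50%) are hypothetical values chosen for illustration and do not come from the slides; the function name is likewise an assumption.

```python
def geometric_mean_return(returns):
    """Constant per-period return that yields the same terminal wealth as the actual returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r          # terminal wealth per unit invested
    return growth ** (1.0 / len(returns)) - 1.0

returns = [1.00, -0.50]            # hypothetical: +100% in period 1, -50% in period 2

r_geometric = geometric_mean_return(returns)
r_arithmetic = sum(returns) / len(returns)
print(r_geometric, r_arithmetic)
```

For these illustrative returns the investment ends where it started, which the geometric mean (0%) reflects, while the arithmetic mean (25%) overstates the growth.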

4.3 Measures of Variability

• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered: How much are the observations spread out around the mean value?

4.3 Measures of Variability

Observe two hypothetical data sets:

• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.

The Range

• The range of a set of observations is the difference between the largest and smallest observations.
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• But how do all the observations spread out? The range cannot assist in answering this question.

[Figure: number line marking the smallest observation, the largest observation, and the range between them]

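The Variance

• This measure reflects the dispersion of all the observations.
• The variance of a population of size $N$, $x_1, x_2, \ldots, x_N$, whose mean is $\mu$ is defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

• The variance of a sample of $n$ observations $x_1, x_2, \ldots, x_n$ whose mean is $\bar{x}$ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$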

Why not use the sum of deviations?

Consider two small populations, A and B. The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

| Population A | Deviation from the mean | Population B | Deviation from the mean |
|---|---|---|---|
| 8 | 8 – 10 = –2 | 4 | 4 – 10 = –6 |
| 9 | 9 – 10 = –1 | 7 | 7 – 10 = –3 |
| 10 | 10 – 10 = 0 | 10 | 10 – 10 = 0 |
| 11 | 11 – 10 = +1 | 13 | 13 – 10 = +3 |
| 12 | 12 – 10 = +2 | 16 | 16 – 10 = +6 |
| | Sum = 0 | | Sum = 0 |

Can the sum of deviations be a good measure of dispersion? The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.
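The comparison of populations A and B can be reproduced with a short sketch. The helper names are illustrative; the population variance follows the definition of the average squared deviation.

```python
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def mean(values):
    return sum(values) / len(values)

def sum_of_deviations(values):
    m = mean(values)
    return sum(x - m for x in values)

def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

for name, data in (("A", pop_a), ("B", pop_b)):
    print(name, sum_of_deviations(data), population_variance(data))
```

The sum of deviations is zero for both populations, while the squared deviations separate them: the variance of B is nine times that of A.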

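The Variance

Let us calculate the variance of the two populations:

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!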

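The Variance

Let us calculate the sum of squared deviations for both data sets. Which data set has a larger dispersion?

[Figure: data set A, with observations at 1 and 3 around a mean of 2; data set B, with observations at 1 and 5 around a mean of 3]

Data set B is more dispersed around the mean.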

The Variance

$$\text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$$

$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$

$\text{Sum}_A > \text{Sum}_B$. This is inconsistent with the observation that set B is more dispersed.

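The Variance

However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked. Set A has 10 observations (each contributes a squared deviation of 1 to the sum of 10) and set B has 2:

$$\sigma_A^2 = \frac{\text{Sum}_A}{N} = \frac{10}{10} = 1 \qquad \sigma_B^2 = \frac{\text{Sum}_B}{N} = \frac{8}{2} = 4$$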

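The Variance

• Example 4.7: The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
• Solution

$$\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$

$$s^2 = \frac{(17-14)^2 + (15-14)^2 + \cdots + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2 \text{ jobs}^2$$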
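The arithmetic of Example 4.7 can be checked with a few lines of Python. This sketch applies the sample variance definition directly (dividing by n - 1); the function name is illustrative.

```python
jobs = [17, 15, 23, 7, 9, 13]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

mean_jobs = sum(jobs) / len(jobs)
variance_jobs = sample_variance(jobs)
print(mean_jobs, variance_jobs)
```

The squared deviations add up to 166, so the sample variance is 166/5.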

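The Variance – Shortcut method

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$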
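A sketch of the shortcut computation, assuming the usual form of the shortcut (sum of squares minus the squared sum divided by n, all over n - 1) and using the job-application data of Example 4.7. Only the two running totals are needed, not the individual deviations.

```python
jobs = [17, 15, 23, 7, 9, 13]

def sample_variance_shortcut(values):
    """Shortcut form: needs only the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

shortcut_variance = sample_variance_shortcut(jobs)
print(sum(jobs), sum(x * x for x in jobs), shortcut_variance)
```

Here the sum of the observations is 84 and the sum of their squares is 1342, which gives (1342 - 1176)/5, the same value as the definition-based calculation.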

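Standard Deviation

• The standard deviation of a set of observations is the square root of the variance.

$$\text{Sample standard deviation: } s = \sqrt{s^2} \qquad \text{Population standard deviation: } \sigma = \sqrt{\sigma^2}$$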

Standard Deviation

• Example 4.8
  • To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
  • The distances were recorded.
  • Which 7-iron is more consistent?

The Standard Deviation

• Example 4.8 – solution

[Excel printout from the “Descriptive Statistics” submenu]

The innovative club is more consistent and, because the means are close, is considered a better club.

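Interpreting Standard Deviation

• The standard deviation can be used to
  • compare the variability of several distributions
  • make a statement about the general shape of a distribution.
• The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
  • $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
  • $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
  • $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements.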

Interpreting Standard Deviation

• Example 4.9: A statistics practitioner wants to describe the way returns on investment are distributed.
  • The mean return = 10%
  • The standard deviation of the return = 8%
  • The histogram is bell shaped.

Interpreting Standard Deviation

• Example 4.9 – solution
  • The empirical rule can be applied (bell shaped histogram).
  • Describing the return distribution:
    • Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
    • Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
    • Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]
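The three intervals of Example 4.9 follow mechanically from the mean and the standard deviation. A minimal sketch, with an illustrative function name and the approximate coverages of the empirical rule stored as labels:

```python
def empirical_rule_intervals(mean, std):
    """Intervals of 1, 2 and 3 standard deviations around the mean,
    paired with the approximate coverage stated by the empirical rule."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {k: (mean - k * std, mean + k * std, label) for k, label in coverage.items()}

# Example 4.9: mean return 10%, standard deviation 8%.
intervals = empirical_rule_intervals(10, 8)
for k, (low, high, label) in intervals.items():
    print(f"about {label} of returns between {low}% and {high}%")
```

The rule applies only because the histogram in the example is bell shaped; for other shapes the coverages may differ considerably.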

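Chebysheff’s Theorem

• The proportion of observations in any sample that lie within $k$ standard deviations of the mean is at least $1 - 1/k^2$ for $k > 1$.
• This theorem is valid for any set of measurements (sample, population) of any shape!

| $k$ | Interval | Chebysheff | Empirical Rule |
|---|---|---|---|
| 1 | $(\bar{x} - s, \bar{x} + s)$ | at least 0% $(1 - 1/1^2)$ | approximately 68% |
| 2 | $(\bar{x} - 2s, \bar{x} + 2s)$ | at least 75% $(1 - 1/2^2)$ | approximately 95% |
| 3 | $(\bar{x} - 3s, \bar{x} + 3s)$ | at least 89% $(1 - 1/3^2)$ | approximately 99.7% |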

Chebysheff’s Theorem

• Example 4.10: The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are \$28,000 and \$3,000, respectively. What can you say about the salaries at this chain?
• Solution
  • At least 75% of the salaries lie between \$22,000 and \$34,000 [28000 – 2(3000), 28000 + 2(3000)].
  • At least 88.9% of the salaries lie between \$19,000 and \$37,000 [28000 – 3(3000), 28000 + 3(3000)].
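The bounds in Example 4.10 can be reproduced with a short sketch. The function names are illustrative; the bound is the theorem's 1 - 1/k² and is defined here only for k > 1, matching the statement of the theorem.

```python
def chebysheff_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the theorem is stated for k > 1")
    return 1 - 1 / k ** 2

def k_sigma_interval(mean, std, k):
    return mean - k * std, mean + k * std

# Example 4.10: mean salary 28,000 and standard deviation 3,000.
for k in (2, 3):
    low, high = k_sigma_interval(28000, 3000, k)
    print(f"at least {chebysheff_lower_bound(k):.1%} of salaries between {low} and {high}")
```

Because the salary histogram is positively skewed, the empirical rule does not apply, and these weaker but always-valid bounds are used instead.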

The Coefficient of Variation

• The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

$$\text{Sample coefficient of variation: } cv = \frac{s}{\bar{x}} \qquad \text{Population coefficient of variation: } CV = \frac{\sigma}{\mu}$$

• This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is much larger.

4.4 Measures of Relative Standing and Box Plots

• Percentile
  • The $p$th percentile of a set of measurements is the value for which
    • $p$ percent of the observations are less than that value
    • $(100 - p)$ percent of all the observations are greater than that value.
  • Example: Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score and 40% lie above it.

Quartiles

• Commonly used percentiles
  • First (lower) decile = 10th percentile
  • First (lower) quartile, Q1 = 25th percentile
  • Second (middle) quartile, Q2 = 50th percentile
  • Third quartile, Q3 = 75th percentile
  • Ninth (upper) decile = 90th percentile

Quartiles

• Example: Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Quartiles

• Solution: Sort the observations (15 observations): 2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
• The first quartile
  • At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left hand side.
  • At most (.75)(15) = 11.25 observations should appear above the first quartile. Check the 11 observations on the right hand side.
  • The one observation left unchecked, the fourth (5), is the first quartile.
• Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Location of Percentiles

• Find the location of any percentile using the formula

$$L_P = (n + 1)\frac{P}{100}$$

where $L_P$ is the location of the $P$th percentile.

• Example 4.11: Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1.

Location of Percentiles

• Example 4.11 – solution
  • After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
  • The location of the 25th percentile is $L_{25} = (10 + 1)\frac{25}{100} = 2.75$, three quarters of the way from location 2 (value 0) to location 3 (value 5).
  • The 2.75th location translates to the value $0 + (.75)(5 - 0) = 3.75$.

Location of Percentiles

• Example 4.11 – solution continued

$$L_{50} = (10 + 1)\frac{50}{100} = 5.5$$

The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

Location of Percentiles

• Example 4.11 – solution continued

$$L_{75} = (10 + 1)\frac{75}{100} = 8.25$$

The 75th percentile is one quarter of the distance between the eighth and ninth observations, that is 14 + .25(22 – 14) = 16.
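The three percentiles of Example 4.11 can be computed with a sketch of the location method, taking the location as (n + 1)P/100 and interpolating linearly between neighbouring observations. The function name and the handling of locations that fall outside the data (clamping to the smallest or largest observation) are assumptions made for illustration.

```python
def percentile_by_location(values, p):
    """Percentile via the location (n + 1) * p / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    whole = int(location)            # position of the observation just below
    fraction = location - whole      # how far to move toward the next observation
    if whole < 1:                    # assumption: clamp below the first observation
        return ordered[0]
    if whole >= n:                   # assumption: clamp above the last observation
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
p25 = percentile_by_location(hours, 25)
p50 = percentile_by_location(hours, 50)
p75 = percentile_by_location(hours, 75)
print(p25, p50, p75)
```

Statistical software often uses slightly different interpolation conventions, so library results for the same data may not match these values exactly.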

Quartiles and Variability

• Quartiles can provide an idea about the shape of a histogram.

[Figure: positions of Q1, Q2 and Q3 for a positively skewed histogram and for a negatively skewed histogram]

Interquartile Range

• This is a measure of the spread of the middle 50% of the observations.
• A large value indicates a large spread of the observations.

Interquartile range = Q3 – Q1

Box Plot

• This is a pictorial display that provides the main descriptive measures of the data set:
  • L – the largest observation
  • Q3 – the upper quartile
  • Q2 – the median
  • Q1 – the lower quartile
  • S – the smallest observation

[Figure: box plot with the box spanning Q1, Q2 and Q3, whiskers toward S and L, and a distance of 1.5(Q3 – Q1) marked on each side of the box]

Box Plot

• Example 4.14 (Xm02-01)
  • Left hand boundary = 9.275 – 1.5(IQR) = -104.226
  • Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
  • No outliers are found.

[Figure: box plot with S = 0, Q1 = 9.275, Q2 = 26.905, Q3 = 84.9425, L = 119.63]

Box Plot

• Additional example – GMAT scores: Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS).
  • Left hand boundary = 512 – 1.5(IQR) = 417.5
  • Right hand boundary = 575 + 1.5(IQR) = 669.5

[Figure: box plot with S = 449, Q1 = 512, Q2 = 537, Q3 = 575, L = 788]
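The boundaries used in the two box plot examples are Q1 - 1.5(IQR) and Q3 + 1.5(IQR). A minimal sketch using the quartiles and extreme observations quoted for Example 4.14 and for the GMAT data; the function names are illustrative.

```python
def outlier_boundaries(q1, q3):
    """Left and right boundaries at 1.5 interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def has_outliers(smallest, largest, q1, q3):
    """True if the smallest or largest observation falls outside the boundaries."""
    low, high = outlier_boundaries(q1, q3)
    return smallest < low or largest > high

# Example 4.14: S = 0, Q1 = 9.275, Q3 = 84.9425, L = 119.63
low_414, high_414 = outlier_boundaries(9.275, 84.9425)
outliers_414 = has_outliers(0, 119.63, 9.275, 84.9425)

# GMAT scores: S = 449, Q1 = 512, Q3 = 575, L = 788
low_gmat, high_gmat = outlier_boundaries(512, 575)
outliers_gmat = has_outliers(449, 788, 512, 575)

print(low_414, high_414, outliers_414)
print(low_gmat, high_gmat, outliers_gmat)
```

For Example 4.14 both extremes lie inside the boundaries, whereas the largest GMAT score (788) lies beyond the right hand boundary of 669.5 and would be flagged under this rule.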

Box Plot

GMAT – continued

[Figure: box plot from 449 to 669.5 with Q1 = 512, Q2 = 537, Q3 = 575; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3]

• Interpreting the box plot results
  • The scores range from 449 to 788.
  • About half the scores are smaller than 537, and about half are larger than 537.
  • About half the scores lie between 512 and 575.
  • About a quarter lie below 512 and a quarter above 575.

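Box Plot

GMAT – continued

The histogram is positively skewed.

[Figure: box plot from 449 to 669.5 with Q1 = 512, Q2 = 537, Q3 = 575, shown against the histogram; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3]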

Box Plot

• Example 4.15 (Xm04-15)
  • A study was organized to compare the quality of service in 5 drive-through restaurants.
  • Interpret the results.
• Example 4.15 – solution
  • Minitab box plot

Box Plot

[Figure: Minitab box plots of service times for Jack in the Box, Hardee’s, McDonald’s, Wendy’s and Popeyes, with the annotations “Times are symmetric” and “Times are positively skewed”]

• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the …
• Wendy’s service time appears to be the shortest and most consistent.

4.5 Measures of Linear Relationship

• The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
  • Covariance: is there any pattern to the way two variables move together?
  • Coefficient of correlation: how strong is the linear relationship between two variables?

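Covariance

$$\text{Population covariance} = \text{COV}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

$\mu_x$ ($\mu_y$) is the population mean of the variable $X$ ($Y$). $N$ is the population size.

$$\text{Sample covariance} = \text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$\bar{x}$ ($\bar{y}$) is the sample mean of the variable $X$ ($Y$). $n$ is the sample size.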

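Covariance

• Compare the following three sets. In each set $\bar{x} = 5$ and $\bar{y} = 20$.

Set 1: Cov(x, y) = 17.5

| $x_i$ | $y_i$ | $(x_i - \bar{x})$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|
| 2 | 13 | -3 | -7 | 21 |
| 6 | 20 | 1 | 0 | 0 |
| 7 | 27 | 2 | 7 | 14 |

Set 2: Cov(x, y) = -17.5

| $x_i$ | $y_i$ | $(x_i - \bar{x})$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|
| 2 | 27 | -3 | 7 | -21 |
| 6 | 20 | 1 | 0 | 0 |
| 7 | 13 | 2 | -7 | -14 |

Set 3: Cov(x, y) = -3.5

| $x_i$ | $y_i$ | $(x_i - \bar{x})$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|
| 2 | 20 | -3 | 0 | 0 |
| 6 | 27 | 1 | 7 | 7 |
| 7 | 13 | 2 | -7 | -14 |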
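The three covariances can be verified with a sketch of the sample covariance (cross-products of deviations divided by n - 1). The function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [2, 6, 7]
cov_set1 = sample_covariance(x, [13, 20, 27])   # y rises with x
cov_set2 = sample_covariance(x, [27, 20, 13])   # y falls as x rises
cov_set3 = sample_covariance(x, [20, 27, 13])   # weaker, mixed pattern
print(cov_set1, cov_set2, cov_set3)
```

The same three y values appear in every set; only their pairing with x changes, and the sign and size of the covariance follow the pairing.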

Covariance

• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
• If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.

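The coefficient of correlation

$$\text{Population coefficient of correlation: } \rho = \frac{\text{COV}(X, Y)}{\sigma_x \sigma_y} \qquad \text{Sample coefficient of correlation: } r = \frac{\text{cov}(x, y)}{s_x s_y}$$

• This coefficient answers the question: How strong is the association between X and Y?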

The coefficient of correlation

[Figure: scale for $\rho$ or $r$ running from +1 down to -1]

• Close to +1: strong positive linear relationship, COV(X, Y) > 0
• $\rho$ or $r$ = 0: no linear relationship, COV(X, Y) = 0
• Close to -1: strong negative linear relationship, COV(X, Y) < 0

The coefficient of correlation

• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.

The coefficient of correlation and the covariance – Example 4.16

• Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
• Solution
  • We believe GMAT affects GPA. Thus
    • GMAT is labeled X
    • GPA is labeled Y

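The coefficient of correlation and the covariance – Example 4.16

| Student | $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|---|
| 1 | 599 | 9.6 | 358,801 | 92.16 | 5,750.4 |
| 2 | 689 | 8.8 | 474,721 | 77.44 | 6,063.2 |
| 3 | 584 | 7.4 | 341,056 | 54.76 | 4,321.6 |
| 4 | 631 | 10 | 398,161 | 100 | 6,310 |
| … | … | … | … | … | … |
| 11 | 593 | 8.8 | 351,649 | 77.44 | 5,218.4 |
| 12 | 683 | 8 | 466,489 | 64 | 5,464 |
| Total | 7,587 | 106.4 | 4,817,755 | 957.2 | 67,559.2 |

$$\text{cov}(x, y) = \frac{1}{12 - 1}\left[67{,}559.2 - \frac{(7587)(106.4)}{12}\right] = 26.16$$

$$s_x = \left\{\frac{1}{12 - 1}\left[4{,}817{,}755 - \frac{(7587)^2}{12}\right]\right\}^{.5} = 43.56$$

$$s_y = 1.12 \quad \text{(calculated in the same way as } s_x\text{)}$$

$$r = \frac{\text{cov}(x, y)}{s_x s_y} = \frac{26.16}{(43.56)(1.12)} = .5362$$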
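The hand calculation for Example 4.16 can be repeated from the column totals alone. This sketch uses the shortcut forms for the covariance and the variances with the totals quoted in the table; the variable names are illustrative.

```python
import math

# Column totals for the 12 students of Example 4.16.
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4_817_755, 957.2, 67_559.2

# Shortcut forms: subtract the product (or square) of the sums divided by n.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Carrying full precision through the calculation gives a coefficient of about .5365; rounding the covariance and standard deviations to two decimals first gives .5362.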

The coefficient of correlation and the covariance – Example 4.16 – Excel

• Use the Covariance option in Data Analysis.
• If your version of Excel returns the population covariance and variances, multiply each one by $n/(n-1)$ to obtain the corresponding sample values.
• Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix, population values:

| | GPA | GMAT |
|---|---|---|
| GPA | 1.15 | |
| GMAT | 23.98 | 1739.52 |

Sample values, obtained by multiplying by $12/(12 - 1)$:

| | GPA | GMAT |
|---|---|---|
| GPA | 1.25 | |
| GMAT | 26.16 | 1897.66 |

The coefficient of correlation and the covariance – Example 4.16 – Excel

• Interpretation
  • The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
  • The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.

The Least Squares Method

• We are seeking a line that best fits the data when two variables are (presumably) related to one another.
• We define the “best fit line” as a line for which the sum of squared differences between it and the data points is minimized:

$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual y value of point $i$ and $\hat{y}_i = b_0 + b_1 x_i$ is the y value of point $i$ calculated from the equation of the line.

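The Least Squares Method

[Figure: scatter of points in the X–Y plane with candidate lines; the errors are the vertical distances between the points and a line]

Different lines generate different errors, and thus different sums of squares of errors. There is a line that minimizes the sum of squared errors.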

The Least Squares Method

The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squares of errors are calculated from the data:

$$b_1 = \frac{\text{cov}(x, y)}{s_x^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$$

The Least Squares Method

• Example 4.17: Find the least squares line for Example 4.16 (Xm04-16.xls).
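A sketch of how the least squares coefficients for Example 4.17 could be obtained from the column totals of Example 4.16, assuming the usual formulas (slope equal to the sample covariance divided by the sample variance of x, intercept equal to the mean of y minus the slope times the mean of x). The variable names are illustrative, and the output has not been compared with the spreadsheet solution.

```python
# Column totals for the 12 students of Example 4.16 (x = GMAT, y = GPA).
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_xy = 4_817_755, 67_559.2

cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)   # sample covariance
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)       # sample variance of x

b1 = cov_xy / var_x                    # slope
b0 = sum_y / n - b1 * (sum_x / n)      # intercept: y-bar minus slope times x-bar

def predict(gmat):
    """GPA predicted by the fitted line for a given GMAT score."""
    return b0 + b1 * gmat

print(round(b0, 4), round(b1, 5))
```

A property worth checking for any least squares line is that it passes through the point of means; here `predict(sum_x / n)` returns the mean GPA.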