Chapter 4 Numerical Descriptive Techniques
- Slides: 67
4.2 Measures of Central Location
§ Usually, we focus our attention on two types of measures when describing population characteristics:
• Central location (e.g., the average)
• Variability or spread
§ The measure of central location reflects the locations of all the actual data points.
§ How?
• With one data point, the central location is clearly at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• But if a third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.
The Arithmetic Mean
§ This is the most popular and useful measure of central location.
$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$
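§ Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
§ Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.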
• Example 4.1 The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{0 + 7 + \cdots + 22}{10} = 11.0$
• Example 4.2 Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
$\mu = \frac{42.19 + 38.45 + \cdots + 45.77}{N} = 43.59$
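The calculation in Example 4.1 can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the slides.

```python
# Hours on the Internet reported by the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


print(arithmetic_mean(hours))  # 110 / 10
```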
The Median
§ The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
• Example 4.3 Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle observations, (8 + 9)/2 = 8.5.
• Comment Suppose only 9 adults were sampled (exclude, say, the longest time, 33).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle observation, 8.
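The odd and even cases of Example 4.3 can be sketched as follows. The helper name `median` is an illustrative choice; Python's standard `statistics.median` would give the same results.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(median(hours))               # even number of observations
print(median(sorted(hours)[:-1]))  # longest time (33) excluded, odd number
```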
The Mode
§ The mode of a set of observations is the value that occurs most frequently.
§ A set of data may have one mode (or modal class), or two or more modes.
§ For large data sets, the modal class is much more relevant than a single-value mode.
§ Example 4.5 Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
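A short sketch of the mode calculation for Example 4.5. The helper `modes` is a hypothetical name; it returns every value tied for the highest frequency, since a data set may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(modes(hours))  # only the value 0 occurs twice
```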
Relationship among Mean, Median, and Mode
§ If a distribution is symmetrical, the mean, median, and mode coincide.
§ If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
• A positively skewed distribution (“skewed to the right”): typically mode < median < mean.
• A negatively skewed distribution (“skewed to the left”): typically mean < median < mode.
The Geometric Mean
§ This is a measure of the average growth rate.
§ Let $R_i$ denote the rate of return in period $i$ ($i = 1, 2, \ldots, n$). The geometric mean of the returns $R_1, R_2, \ldots, R_n$ is the constant $R_g$ that produces the same terminal wealth at the end of period $n$ as do the actual returns for the $n$ periods.
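§ For the given series of rates of return, the nth period return is calculated by: $(1 + R_1)(1 + R_2) \cdots (1 + R_n)$
§ If the rate of return was $R_g$ in every period, the nth period return would be calculated by: $(1 + R_g)^n$
§ $R_g$ is selected such that $(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, that is, $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$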
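The definition of the geometric mean return can be sketched in Python. The two-period return series below (+100%, then -50%) is a hypothetical illustration, not data from the slides; it shows how the arithmetic mean of the returns can suggest growth when the terminal wealth is unchanged.

```python
import math


def geometric_mean_return(returns):
    """Constant per-period rate giving the same terminal wealth as the actual returns."""
    growth = math.prod(1 + r for r in returns)  # terminal wealth per unit invested
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: the investment doubles, then loses half its value.
returns = [1.0, -0.5]
print(sum(returns) / len(returns))     # arithmetic mean of the returns
print(geometric_mean_return(returns))  # geometric mean of the returns
```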
4.3 Measures of variability
§ Measures of central location fail to tell the whole story about the distribution.
§ A question of interest still remains unanswered: How much are the observations spread out around the mean value?
§ Observe two hypothetical data sets:
• Small variability: the average value provides a good representation of the observations in the data set.
• Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.
§ The range
• The range of a set of observations is the difference between the largest and smallest observations.
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• But how do all the observations spread out? The range cannot assist in answering this question.
(figure: a number line marking the smallest observation, the largest observation, and the range between them)
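The Variance
• This measure reflects the dispersion of all the observations.
• The variance of a population of size $N$, $x_1, x_2, \ldots, x_N$, whose mean is $\mu$ is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• The variance of a sample of $n$ observations $x_1, x_2, \ldots, x_n$ whose mean is $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$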
Why not use the sum of deviations?
§ Consider two small populations:
• A: 8, 9, 10, 11, 12
• B: 4, 7, 10, 13, 16
§ The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.
§ Can the sum of deviations be a good measure of dispersion?
• A: (8 - 10) + (9 - 10) + (10 - 10) + (11 - 10) + (12 - 10) = -2 - 1 + 0 + 1 + 2 = 0
• B: (4 - 10) + (7 - 10) + (10 - 10) + (13 - 10) + (16 - 10) = -6 - 3 + 0 + 3 + 6 = 0
§ The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.
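The Variance
§ Let us calculate the variance of the two populations (A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16, both with mean 10):
$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
§ Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!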
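The comparison of the two small populations (A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16) can be reproduced with a short sketch. The function names are illustrative only; the code shows that the plain sum of deviations vanishes for both populations while the population variance separates them.

```python
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Sum of (x - mean); always zero up to rounding, so useless as a spread measure."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


for name, pop in (("A", pop_a), ("B", pop_b)):
    print(name, sum_of_deviations(pop), population_variance(pop))
```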
§ Let us calculate the sum of squared deviations for both data sets. Which data set has a larger dispersion?
• Data set A takes the values 1 and 3 (mean 2); data set B takes the values 1 and 5 (mean 3).
• Data set B is more dispersed around the mean.
§ $\text{Sum}_A = (1-2)^2 + \cdots + (1-2)^2 + (3-2)^2 + \cdots + (3-2)^2 = 10$
§ $\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$
§ $\text{Sum}_A > \text{Sum}_B$. This is inconsistent with the observation that set B is more dispersed.
§ However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:
$\sigma_A^2 = \text{Sum}_A / N_A = 10/10 = 1$ (the ten squared deviations of 1 in $\text{Sum}_A$ come from ten observations)
$\sigma_B^2 = \text{Sum}_B / N_B = 8/2 = 4$
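§ Example 4.7
• The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
§ Solution
$\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(17-14)^2 + (15-14)^2 + \cdots + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$ jobs$^2$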
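The Variance: Shortcut method
$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
For Example 4.7: $s^2 = \frac{1}{6-1}\left[1342 - \frac{84^2}{6}\right] = \frac{1342 - 1176}{5} = 33.2$ jobs$^2$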
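Both routes to the sample variance can be compared on the data of Example 4.7 (17, 15, 23, 7, 9, 13). This is a sketch with illustrative function names; the shortcut uses only the sum of the observations and the sum of their squares.

```python
jobs = [17, 15, 23, 7, 9, 13]


def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


print(sample_variance(jobs))
print(sample_variance_shortcut(jobs))
```

The two functions agree up to floating-point rounding; the shortcut avoids computing each deviation but can lose precision when the mean is large relative to the spread.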
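Standard Deviation
§ The standard deviation of a set of observations is the square root of the variance.
• Sample standard deviation: $s = \sqrt{s^2}$
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$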
§ Example 4.8
• To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
• The distances were recorded.
• Which 7-iron is more consistent?
§ Example 4.8, solution
(figure: Excel printout, from the “Descriptive Statistics” sub-menu)
• The innovative club is more consistent and, because the means are close, is considered the better club.
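Interpreting Standard Deviation
§ The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
§ The empirical rule: If a sample of observations has a mound-shaped distribution, the interval
• $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
• $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
• $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements.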
§ Example 4.9 A statistics practitioner wants to describe the way returns on investment are distributed.
• The mean return = 10%
• The standard deviation of the return = 8%
• The histogram is bell shaped.
§ Example 4.9, solution
• The empirical rule can be applied (bell-shaped histogram).
§ Describing the return distribution
• Approximately 68% of the returns lie between 2% and 18% [10 - 1(8), 10 + 1(8)]
• Approximately 95% of the returns lie between -6% and 26% [10 - 2(8), 10 + 2(8)]
• Approximately 99.7% of the returns lie between -14% and 34% [10 - 3(8), 10 + 3(8)]
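The intervals of Example 4.9 follow mechanically from the mean and standard deviation. The sketch below hard-codes the three empirical-rule percentages; the function name and the returned dictionary layout are illustrative choices, and the rule itself only applies to mound-shaped data.

```python
def empirical_rule_intervals(mean, std):
    """Map k = 1, 2, 3 to (lower, upper, approximate percent of observations)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * std, mean + k * std, pct) for k, pct in coverage.items()}


# Example 4.9: mean return 10%, standard deviation 8%.
for k, (low, high, pct) in empirical_rule_intervals(10, 8).items():
    print(f"about {pct}% of returns lie between {low}% and {high}%")
```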
Chebysheff’s Theorem
§ The proportion of observations in any sample that lie within $k$ standard deviations of the mean is at least $1 - 1/k^2$ for $k > 1$.
§ This theorem is valid for any set of measurements (sample, population) of any shape!

k   Interval                            Chebysheff                      Empirical Rule
1   $(\bar{x} - s, \bar{x} + s)$        at least 0% $(1 - 1/1^2)$       approximately 68%
2   $(\bar{x} - 2s, \bar{x} + 2s)$      at least 75% $(1 - 1/2^2)$      approximately 95%
3   $(\bar{x} - 3s, \bar{x} + 3s)$      at least 89% $(1 - 1/3^2)$      approximately 99.7%
§ Example 4.10
• The annual salaries of the employees of a chain of computer stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000, respectively. What can you say about the salaries at this chain?
§ Solution
• At least 75% of the salaries lie between $22,000 and $34,000 [28000 - 2(3000), 28000 + 2(3000)]
• At least 88.9% of the salaries lie between $19,000 and $37,000 [28000 - 3(3000), 28000 + 3(3000)]
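The bound 1 - 1/k² and the corresponding salary intervals of Example 4.10 can be sketched as follows. The function names are hypothetical; the bound is a guaranteed minimum, not an estimate of the actual proportion.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def k_sigma_interval(mean, std, k):
    """Interval extending k standard deviations on either side of the mean."""
    return mean - k * std, mean + k * std


# Example 4.10: mean salary $28,000, standard deviation $3,000.
for k in (2, 3):
    low, high = k_sigma_interval(28000, 3000, k)
    print(f"at least {chebyshev_bound(k):.1%} of salaries lie between ${low:,} and ${high:,}")
```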
The Coefficient of Variation
§ The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
• Sample coefficient of variation: $cv = s / \bar{x}$
• Population coefficient of variation: $CV = \sigma / \mu$
§ This coefficient provides a proportionate measure of variation. A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
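A minimal sketch of the coefficient of variation. The function name and the two mean values (100 and 500) are illustrative assumptions; the point is that the same standard deviation of 10 yields different proportionate variation depending on the mean.

```python
def coefficient_of_variation(std, mean):
    """Standard deviation expressed as a proportion of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std / mean


# Same standard deviation, two illustrative means.
print(coefficient_of_variation(10, 100))
print(coefficient_of_variation(10, 500))
```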
4.4 Measures of Relative Standing and Box Plots
§ Percentile
• The pth percentile of a set of measurements is the value for which
  • p percent of the observations are less than that value
  • (100 - p) percent of all the observations are greater than that value.
• Example: Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.
Quartiles
§ Commonly used percentiles
• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Second (middle) quartile, Q2 = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile
§ Example Find the quartiles of the following set of measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8
§ Solution
Sort the observations (15 observations): 2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29, 30
The first quartile:
• At most (.25)(15) = 3.75 observations should appear below the first quartile. Check the first 3 observations on the left-hand side.
• At most (.75)(15) = 11.25 observations should appear above the first quartile. Check 11 observations on the right-hand side.
• The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
Location of Percentiles
§ Find the location of any percentile using the formula
$L_P = (n + 1)\frac{P}{100}$, where $L_P$ is the location of the $P$th percentile.
§ Example 4.11 Calculate the 25th, 50th, and 75th percentiles of the data in Example 4.1.
§ Example 4.11, solution
• After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.
• $L_{25} = (10 + 1)\frac{25}{100} = 2.75$
• The 2.75th location lies three quarters of the way between the second observation (0) and the third observation (5), which translates to the value 0 + (.75)(5 - 0) = 3.75.
§ Example 4.11, solution continued
• $L_{50} = (10 + 1)\frac{50}{100} = 5.5$
• The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is, 8.5.
§ Example 4.11, solution continued
• $L_{75} = (10 + 1)\frac{75}{100} = 8.25$
• The 75th percentile is one quarter of the distance between the eighth observation (14) and the ninth observation (22), that is, 14 + .25(22 - 14) = 16.
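The location-and-interpolation procedure of Example 4.11 can be sketched in Python, assuming the location formula (n + 1)P/100 that is consistent with the worked locations 2.75, 5.5, and 8.25. The function name `percentile` and the clamping at the ends of the data are illustrative choices; statistical packages often use slightly different interpolation conventions and may return other values.

```python
def percentile(values, p):
    """Percentile by location (n + 1) * p / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    k = int(location)        # position of the observation just below the location
    fraction = location - k  # how far to move toward the next observation
    if k < 1:
        return ordered[0]    # location falls before the first observation
    if k >= n:
        return ordered[-1]   # location falls at or beyond the last observation
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
for p in (25, 50, 75):
    print(p, percentile(hours, p))
```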
Quartiles and Variability
§ Quartiles can provide an idea about the shape of a histogram.
(figures: a positively skewed histogram, in which Q2 lies closer to Q1 than to Q3, and a negatively skewed histogram, in which Q2 lies closer to Q3 than to Q1)
Interquartile Range
§ This is a measure of the spread of the middle 50% of the observations.
§ A large value indicates a large spread of the observations.
Interquartile range = Q3 - Q1
Box Plot
§ This is a pictorial display that provides the main descriptive measures of the data set:
• L - the largest observation
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest observation
(figure: a box from Q1 to Q3 with a line at Q2, and whiskers of length at most 1.5(Q3 - Q1) reaching toward S and L)
§ Example 4.14 (Xm02-01)
• S = 0, Q1 = 9.275, Q2 = 26.905, Q3 = 84.9425, L = 119.63
• Left-hand boundary = 9.275 - 1.5(IQR) = -104.226
• Right-hand boundary = 84.9425 + 1.5(IQR) = 198.4438
• No outliers are found.
§ Additional example: GMAT scores
• Create a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS).
• S = 449, Q1 = 512, Q2 = 537, Q3 = 575, L = 788
• Left-hand boundary = 512 - 1.5(IQR) = 417.5
• Right-hand boundary = 575 + 1.5(IQR) = 669.5
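The boundary calculations in Example 4.14 and in the GMAT example use the same rule: 1.5 interquartile ranges beyond each quartile. A minimal sketch, with illustrative function names, that takes the quartiles and extreme observations quoted above:

```python
def outlier_fences(q1, q3):
    """Left-hand and right-hand boundaries: 1.5 IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def has_outliers(smallest, largest, q1, q3):
    """True if the smallest or largest observation falls outside the fences."""
    low, high = outlier_fences(q1, q3)
    return smallest < low or largest > high


print(outlier_fences(9.275, 84.9425), has_outliers(0, 119.63, 9.275, 84.9425))
print(outlier_fences(512, 575), has_outliers(449, 788, 512, 575))
```

For the GMAT data the largest score (788) lies beyond the right-hand boundary, so the upper whisker stops short of it and the extreme scores are flagged as outliers.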
§ GMAT, continued: interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.
§ GMAT, continued
• The histogram is positively skewed.
(figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575 and whiskers from 449 to 669.5, marking the lower 25%, middle 50%, and upper 25% of the scores)
§ Example 4.15 (Xm04-15)
• A study was organized to compare the quality of service in 5 drive-through restaurants.
• Interpret the results.
§ Example 4.15, solution
• Minitab box plot
(figure: Minitab box plots of service times for Jack in the Box, Hardee’s, McDonald’s, Wendy’s, and Popeyes; one plot is annotated “Times are symmetric” and another “Times are positively skewed”)
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the shortest and most consistent.
4.5 Measures of Linear Relationship
§ The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
• Covariance: is there any pattern to the way two variables move together?
• Coefficient of correlation: how strong is the linear relationship between two variables?
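Covariance
§ Population covariance: $\mathrm{COV}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
• $\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). $N$ is the population size.
§ Sample covariance: $\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• $\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). $n$ is the sample size.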
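§ Compare the following three sets (in each, $\bar{x} = 5$ and $\bar{y} = 20$):

Set 1
$x_i$   $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
2       13      -3                -7                21
6       20       1                 0                 0
7       27       2                 7                14
cov(x, y) = 35/2 = 17.5

Set 2
$x_i$   $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
2       27      -3                 7               -21
6       20       1                 0                 0
7       13       2                -7               -14
cov(x, y) = -35/2 = -17.5

Set 3
$x_i$   $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
2       20      -3                 0                 0
6       27       1                 7                 7
7       13       2                -7               -14
cov(x, y) = -7/2 = -3.5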
§ If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
§ If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
§ If the two variables are unrelated, the covariance will be close to zero.
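The three data sets compared above share the same x values (2, 6, 7) and the same y values (13, 20, 27) in different pairings. A short sketch of the sample covariance, with an illustrative function name, shows how the pairing alone changes the sign and size of the result:

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
print(sample_covariance(x, [13, 20, 27]))  # y rises with x
print(sample_covariance(x, [27, 20, 13]))  # y falls as x rises
print(sample_covariance(x, [20, 27, 13]))  # no consistent pattern
```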
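The coefficient of correlation
§ This coefficient answers the question: How strong is the association between X and Y?
• Population coefficient of correlation: $\rho = \frac{\mathrm{COV}(X, Y)}{\sigma_x \sigma_y}$
• Sample coefficient of correlation: $r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$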
§ Values of $\rho$ or $r$:
• close to +1: strong positive linear relationship (COV(X, Y) > 0)
• equal to 0: no linear relationship (COV(X, Y) = 0)
• close to -1: strong negative linear relationship (COV(X, Y) < 0)
§ If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
§ If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
§ No straight-line relationship is indicated by a coefficient close to zero.
The coefficient of correlation and the covariance: Example 4.16
§ Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
§ Solution
• We believe GMAT affects GPA. Thus
  • GMAT is labeled X
  • GPA is labeled Y
§ Example 4.16, solution continued

Student   x       y       $x^2$       $y^2$     xy
1         599     9.6     358,801     92.16     5,750.4
2         689     8.8     474,721     77.44     6,063.2
3         584     7.4     341,056     54.76     4,321.6
4         631     10      398,161     100       6,310
…
11        593     8.8     351,649     77.44     5,218.4
12        683     8       466,489     64        5,464
Total     7,587   106.4   4,817,755   957.2     67,559.2

$\mathrm{cov}(x, y) = \frac{1}{12 - 1}\left[67{,}559.2 - \frac{(7587)(106.4)}{12}\right] = 26.16$
$s_x = \left\{\frac{1}{12 - 1}\left[4{,}817{,}755 - \frac{(7587)^2}{12}\right]\right\}^{.5} = 43.56$
$s_y = 1.12$ (calculated in the same way as $s_x$)
$r = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{26.16}{(43.56)(1.12)} = .5362$
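The shortcut calculations for Example 4.16 need only the column totals, so they can be reproduced without the individual student rows. In this sketch the variable and function names are illustrative; the totals are the ones reported in the Total row of the table.

```python
import math

# Column totals for the 12 students of Example 4.16.
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4_817_755, 957.2, 67_559.2


def shortcut_covariance(n, sum_x, sum_y, sum_xy):
    """Sample covariance from totals: (sum xy - (sum x)(sum y) / n) / (n - 1)."""
    return (sum_xy - sum_x * sum_y / n) / (n - 1)


def shortcut_variance(n, total, total_sq):
    """Sample variance from totals: (sum of squares - (sum)^2 / n) / (n - 1)."""
    return (total_sq - total ** 2 / n) / (n - 1)


cov_xy = shortcut_covariance(n, sum_x, sum_y, sum_xy)
s_x = math.sqrt(shortcut_variance(n, sum_x, sum_x2))
s_y = math.sqrt(shortcut_variance(n, sum_y, sum_y2))
r = cov_xy / (s_x * s_y)
print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Working with unrounded intermediate values gives a correlation that differs in the fourth decimal place from one computed with the rounded figures 26.16, 43.56, and 1.12.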
§ Example 4.16, Excel
• Use the Covariance option in Data Analysis.
• If your version of Excel returns the population covariance and variances, multiply each one by n/(n - 1) to obtain the corresponding sample values.
• Use the Correlation option to produce the correlation matrix.

Variance-Covariance Matrix
                      GPA variance   GPA-GMAT covariance   GMAT variance
Population values     1.15           23.98                 1739.52
Sample values         1.25           26.16                 1897.62
(sample value = population value × 12/(12 - 1))
§ Interpretation
• The covariance (26.16) indicates that GMAT score and performance in the MBA program are positively related.
• The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.
The Least Squares Method
§ We are seeking a line that best fits the data when two variables are (presumably) related to one another.
§ We define the “best fit line” as a line for which the sum of squared differences between it and the data points is minimized:
minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• $y_i$ is the actual y value of point i.
• $\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i calculated from the equation.
(figure: scatter of points in the X-Y plane with a fitted line and the vertical errors between the points and the line)
§ Different lines generate different errors, and thus different sums of squared errors.
§ There is a line that minimizes the sum of squared errors.
§ The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squared errors are calculated from the data:
$b_1 = \frac{\mathrm{cov}(x, y)}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
§ Example 4.17
• Find the least squares line for Example 4.16 (Xm04-16.xls).
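One way to work Example 4.17 is with the standard least-squares formulas, slope = cov(x, y) / s_x² and intercept = ȳ - slope · x̄, applied to the column totals of Example 4.16 (n = 12, Σx = 7,587, Σy = 106.4, Σx² = 4,817,755, Σxy = 67,559.2). This is a sketch under that assumption; the function name and return layout are illustrative, and the slides' own solution is not reproduced here.

```python
def least_squares_from_totals(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (b0, b1) of the least squares line y-hat = b0 + b1 * x from column totals."""
    cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)  # sample covariance
    var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)      # sample variance of x
    b1 = cov_xy / var_x
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1


# Totals for GMAT (x) and GPA (y) from Example 4.16.
b0, b1 = least_squares_from_totals(12, 7587, 106.4, 4_817_755, 67_559.2)
print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```

The slope is positive, in line with the positive covariance and correlation found for the same data.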