CIS 2033
Based on Textbook: A Modern Introduction to Probability and Statistics, 2007
Instructor: Dr. Longin Jan Latecki
Slides: QUINCY R WALKER
Chapter 16: Exploratory data analysis: numerical summaries
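16.1 The Center of the Data Set
Center of the data = sample mean or sample median.
Mean: $\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where n = the sample size.
Median: the middle element of the sorted data (the average of the two middle elements when n is even).
Example: for the data 43, 41, 41, 42, 43, 58, 41 the sample mean is 309/7 ≈ 44.1 and the sample median is 42.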
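As a quick check of the example, the following Python sketch computes both measures of center with the standard library's statistics module. The variable names are illustrative and not taken from the slides.

```python
import statistics

# Data from the example on this slide.
data = [43, 41, 41, 42, 43, 58, 41]

sample_mean = statistics.mean(data)      # sum of the values divided by n
sample_median = statistics.median(data)  # middle value of the sorted data

print(f"n = {len(data)}")
print(f"sample mean   = {sample_mean:.1f}")
print(f"sample median = {sample_median}")
```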
Outliers
An outlier is an observation that is numerically distant from the rest of the data. The sample mean is sensitive to outliers; the sample median is much less so.
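To illustrate why outliers matter, here is a small sketch that compares the mean and median of the data set 43, 41, 41, 42, 43, 58, 41 with and without the value 58. Treating 58 as the outlier is an assumption made for illustration.

```python
import statistics

data = [43, 41, 41, 42, 43, 58, 41]
# Assumption for illustration: 58 is the observation far from the rest.
without_outlier = [x for x in data if x != 58]

# How far each measure of center moves when the outlier is dropped.
mean_shift = statistics.mean(data) - statistics.mean(without_outlier)
median_shift = statistics.median(data) - statistics.median(without_outlier)

print(f"mean changes by   {mean_shift:.2f}")
print(f"median changes by {median_shift:.2f}")
```

In this sketch the mean moves by roughly 2.3 while the median moves by 0.5.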
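Variability in a Data Set
Sample variance: $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$
Sample standard deviation: $s_n = \sqrt{s_n^2}$
where n = the sample size and $\bar{x}_n$ = the sample mean.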
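A minimal sketch of the two measures of spread, assuming the usual sample variance with the n - 1 divisor. The data set is the one from the sample mean example, and the cross-check against the statistics module is only there to confirm the hand-written formula.

```python
import math
import statistics

data = [43, 41, 41, 42, 43, 58, 41]
n = len(data)
xbar = sum(data) / n

# Sample variance with the n - 1 divisor (assumed convention).
sample_var = sum((x - xbar) ** 2 for x in data) / (n - 1)
sample_std = math.sqrt(sample_var)

# statistics.variance and statistics.stdev use the same n - 1 divisor.
assert math.isclose(sample_var, statistics.variance(data))
assert math.isclose(sample_std, statistics.stdev(data))

print(f"sample variance           = {sample_var:.3f}")
print(f"sample standard deviation = {sample_std:.3f}")
```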
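Variability cont.
Median of absolute deviations (MAD): the median of the absolute deviations of a sample from its median.
$\mathrm{MAD}(x_1, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$
where $\mathrm{Med}_n$ = the median of the sample.
Absolute deviation: the absolute value of the distance of a point $x_i$ in the data set from the median, $|x_i - \mathrm{Med}_n|$.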
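The definition translates almost directly into code. The sketch below uses a hypothetical helper named mad and applies it to the data set 43, 41, 41, 42, 43, 58, 41.

```python
import statistics

def mad(sample):
    """Median of the absolute deviations from the sample median."""
    med = statistics.median(sample)
    abs_dev = [abs(x - med) for x in sample]  # distance of each point from the median
    return statistics.median(abs_dev)

data = [43, 41, 41, 42, 43, 58, 41]
print("MAD =", mad(data))
```

Because the MAD is built from medians, the single large value 58 has little influence on it.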
Empirical quantiles
The order statistics consist of the same elements as the original dataset $x_1, x_2, \ldots, x_n$, but in ascending order. Denote by $x_{(k)}$ the kth element in the ordered list.
The pth quantile of a distribution with cumulative distribution function F is $F^{\mathrm{inv}}(p)$. The pth empirical quantile $q_n(p)$ is the corresponding value computed from the order statistics of the sample.
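The slide text does not preserve the formula for the empirical quantile, so the sketch below assumes one common convention: linear interpolation between neighboring order statistics at position p(n + 1). The function name empirical_quantile and the error handling are illustrative choices, not part of the slides.

```python
import math

def empirical_quantile(sample, p):
    """pth empirical quantile by linear interpolation of order statistics.

    Assumed convention: k = floor(p * (n + 1)), alpha = p * (n + 1) - k,
    q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)).
    """
    ordered = sorted(sample)          # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    # The interpolation needs x_(k) and, when alpha > 0, x_(k+1) to exist.
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("p is too close to 0 or 1 for this sample size")
    if alpha == 0:
        return ordered[k - 1]         # x_(k); Python lists are 0-indexed
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

data = [43, 41, 41, 42, 43, 58, 41]
for p in (0.25, 0.50, 0.75):
    print(f"q_n({p:.2f}) = {empirical_quantile(data, p)}")
```

Other interpolation conventions exist and can give slightly different values for small samples.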
Quartiles
• Lower quartile: $q_n(0.25)$
• Median (middle quartile): $q_n(0.50)$
• Upper quartile: $q_n(0.75)$
• Interquartile range (IQR): IQR = $q_n(0.75) − q_n(0.25)$
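The quartiles and the IQR can be computed with statistics.quantiles from the Python standard library (Python 3.8 or later). This sketch assumes that the "exclusive" method, which interpolates at position p(n + 1) in the sorted data, is the intended convention. The data set is 43, 41, 41, 42, 43, 58, 41.

```python
import statistics

data = [43, 41, 41, 42, 43, 58, 41]

# n=4 requests the three cut points that split the data into quarters.
# method="exclusive" interpolates at position p * (n + 1) in the sorted data.
lower, median, upper = statistics.quantiles(data, n=4, method="exclusive")
iqr = upper - lower

print(f"lower quartile = {lower}")
print(f"median         = {median}")
print(f"upper quartile = {upper}")
print(f"IQR            = {iqr}")
```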
The box-and-whisker plot
• Advantages:
  • Compact representation of statistical data
  • Shows quartiles, median and outliers
• Disadvantages:
  • Shows less detail than other graphical displays of the dataset
  • Histogram and kernel density estimate are more informative displays of a single dataset
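The numbers a boxplot draws can be computed without any plotting library. The sketch below assumes the common convention that whiskers extend to the most extreme observations within 1.5 times the IQR of the box, and that points beyond are marked as outliers. The helper name boxplot_summary and the whisker parameter are hypothetical.

```python
import statistics

def boxplot_summary(sample, whisker=1.5):
    """Return the values a box-and-whisker plot would draw.

    Assumption: whiskers reach the most extreme observations within
    whisker * IQR of the box; anything beyond is flagged as an outlier.
    """
    ordered = sorted(sample)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "lower_whisker": inside[0],
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }

summary = boxplot_summary([43, 41, 41, 42, 43, 58, 41])
for name, value in summary.items():
    print(f"{name:>14}: {value}")
```

For this data set the only value flagged as an outlier is 58.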
Using boxplots to compare several datasets
Boxplots become useful if we want to compare several sets of data in a simple graphical display.