CIS 2033
Based on Textbook: A Modern Introduction to Probability and Statistics. 2007
Slides: QUINCY R WALKER
Modified by the instructor: Dr. Longin Jan Latecki
Chapter 16 Exploratory data analysis: numerical summaries
16.1 The Center of the Data Set
Center of the data = sample mean: $\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where $n$ = the sample size.
Example: the sample mean of the following data is approximately 44.1: 43, 41, 41, 42, 43, 58, 41
Outliers
An outlier is an observation that is numerically distant from the rest of the data. The sample median is more robust than the sample mean in the presence of outliers.
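To see the robustness claim in numbers, here is a minimal sketch that reuses the seven values from the sample-mean example and treats 58 as the outlier. The variable names are illustrative, and dropping the value is only a way to compare; it is not a recommendation to delete outliers.

```python
from statistics import mean, median

# The seven values from the sample-mean example; 58 is treated as the outlier.
data = [43, 41, 41, 42, 43, 58, 41]
without_outlier = [x for x in data if x != 58]

# How far does each summary move when the outlier is removed?
mean_shift = mean(data) - mean(without_outlier)
median_shift = median(data) - median(without_outlier)

print(f"mean:   {mean(data):.2f} -> {mean(without_outlier):.2f}")
print(f"median: {median(data)} -> {median(without_outlier)}")
print(f"shift in mean = {mean_shift:.2f}, shift in median = {median_shift:.2f}")
```

The mean moves several times further than the median, which is what "more robust" means here.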
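Variability in a Data Set
Variance: $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$
Standard deviation: $s_n = \sqrt{s_n^2}$
where $n$ is the number of observations in the sample and $\bar{x}_n$ is the sample mean. Why we choose the factor 1/(n − 1) instead of 1/n will be explained later (in Chapter 19).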
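A short sketch of the sample variance and standard deviation with the 1/(n − 1) factor, written out by hand so the formula is visible. The function names are illustrative; the data are the seven values from the sample-mean example. Python's `statistics.variance` uses the same n − 1 divisor, so it serves as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """Sample variance with the 1/(n - 1) factor."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [43, 41, 41, 42, 43, 58, 41]
print(f"variance = {sample_variance(data):.3f}")
print(f"std dev  = {sample_std(data):.3f}")
# Cross-check against the standard library, which also divides by n - 1.
print(f"statistics.variance = {statistics.variance(data):.3f}")
```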
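Variability cont.
Median of Absolute Deviations (MAD): the median of the absolute deviations of a sample, $\mathrm{MAD}(x_1, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$.
$\mathrm{Med}_n$ = median of the sample.
Absolute deviation: $|x_i - \mathrm{Med}_n|$, the distance of a point $x_i$ in the data set from the sample median.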
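The MAD can be computed in two steps: take the sample median, then take the median of the absolute deviations from it. The sketch below does this for the seven values from the sample-mean example; the function name `mad` is illustrative.

```python
from statistics import median


def mad(xs):
    """Median of the absolute deviations from the sample median."""
    med = median(xs)
    deviations = [abs(x - med) for x in xs]
    return median(deviations)


data = [43, 41, 41, 42, 43, 58, 41]
print("median =", median(data))
print("MAD    =", mad(data))
```

Because both steps use medians, the single large value 58 has little influence on the result.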
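Empirical quantiles
The order statistics consist of the same elements as the original dataset $x_1, x_2, \ldots, x_n$, but in ascending order. Denote by $x_{(k)}$ the kth element in the ordered list. Then the pth empirical quantile is $q_n(p) = x_{(k)} + \alpha (x_{(k+1)} - x_{(k)})$, where $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$.
The pth empirical quantile corresponds to the pth quantile of a cdf, $F^{\mathrm{inv}}(p)$, where $F$ is the cumulative distribution function of the data.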
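A minimal sketch of an empirical quantile built from the order statistics. It assumes the interpolation convention q_n(p) = x_(k) + α(x_(k+1) − x_(k)) with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k; other conventions exist and give slightly different values. The function name and the choice to raise an error for p too close to 0 or 1 are assumptions of this sketch.

```python
import math


def empirical_quantile(xs, p):
    """p-th empirical quantile by linear interpolation between order statistics."""
    ordered = sorted(xs)          # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    pos = p * (n + 1)
    k = math.floor(pos)           # index of the lower order statistic (1-based)
    alpha = pos - k               # fractional part used for interpolation
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("p is outside the range covered by this convention")
    if alpha == 0:
        return ordered[k - 1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])


data = [43, 41, 41, 42, 43, 58, 41]
median_value = empirical_quantile(data, 0.5)
print("q_n(0.5) =", median_value)
# Hypothetical four-point sample to show the interpolation step.
print("q_n(0.5) for [1, 2, 3, 4] =", empirical_quantile([1, 2, 3, 4], 0.5))
```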
Quartiles
• Lower quartile: $q_n(0.25)$
• Upper quartile: $q_n(0.75)$
• Interquartile Range (IQR): IQR = $q_n(0.75) - q_n(0.25)$
• Median (middle quartile): $q_n(0.50)$
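The quartiles and the IQR listed above can be sketched with Python's standard library. This assumes that `statistics.quantiles` with `method="exclusive"`, which places the pth quantile at position p(n + 1) in the ordered sample, matches the definition of q_n(p) used in the course; the data are the seven values from the sample-mean example.

```python
from statistics import quantiles

data = [43, 41, 41, 42, 43, 58, 41]

# n=4 splits the data into four groups and returns the three cut points.
lower, med, upper = quantiles(data, n=4, method="exclusive")
iqr = upper - lower

print("lower quartile =", lower)
print("median         =", med)
print("upper quartile =", upper)
print("IQR            =", iqr)
```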
The box-and-whisker plot
• Advantages:
  • Good representation of statistical data
  • Shows quartiles, median and outliers
• Disadvantages:
  • Poor graphical display of the dataset
  • Histogram and kernel density estimate are more informative displays of a single dataset
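The numbers a boxplot displays can be computed without drawing anything. The sketch below assumes the common convention that whiskers reach to the most extreme observations within 1.5 × IQR of the box and that anything beyond is drawn as an outlier; the slides do not state the multiplier, so `whisker=1.5` is an assumption, as are the function name and dictionary keys.

```python
from statistics import quantiles


def boxplot_summary(xs, whisker=1.5):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR convention."""
    q1, med, q3 = quantiles(xs, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme point inside the fences
        "whisker_high": max(inside),
        "outliers": sorted(x for x in xs if x < low_fence or x > high_fence),
    }


summary = boxplot_summary([43, 41, 41, 42, 43, 58, 41])
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

Running the same function on several datasets gives the side-by-side numbers that a multi-boxplot display shows graphically.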
Using boxplots to compare several datasets
Boxplots become useful if we want to compare several sets of data in a simple graphical display.