CIS 2033. Based on Textbook: A Modern Introduction to Probability and Statistics. 2007. Slides: Quincy R. Walker. Modified by the instructor: Dr. Longin Jan Latecki. Chapter 16: Exploratory data analysis: numerical summaries

16.1 The Center of the Data Set. Center of the data = sample mean: $\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where n = the sample size. Example: the sample mean of the following data is approximately 44.1: 43, 41, 41, 42, 43, 58, 41
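As a quick check of the arithmetic, the sample mean of the seven values listed on this slide can be computed directly. This is a minimal sketch; the helper name `sample_mean` is chosen here for illustration and does not come from the slides.

```python
# The seven observations listed on the slide.
data = [43, 41, 41, 42, 43, 58, 41]

def sample_mean(xs):
    """Return (x_1 + ... + x_n) / n for a non-empty sequence."""
    n = len(xs)
    if n == 0:
        raise ValueError("sample mean is undefined for an empty dataset")
    return sum(xs) / n

mean = sample_mean(data)   # 309 / 7
print(round(mean, 2))      # 44.14
```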

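Outliers. An outlier is an observation that is numerically distant from the rest of the data. The sample median (the middle value of the ordered data, or the average of the two middle values when n is even) is more robust than the sample mean in the presence of outliers.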
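The robustness claim can be illustrated on the seven values from the previous slide by dropping the one value that sits far from the rest (58) and seeing how much each summary moves. Treating 58 as the outlier is an assumption made for this sketch.

```python
import statistics

data = [43, 41, 41, 42, 43, 58, 41]
# Assumption for illustration: 58 is the outlying observation.
without_outlier = [x for x in data if x != 58]

mean_all = statistics.mean(data)                   # 309 / 7
mean_trim = statistics.mean(without_outlier)       # 251 / 6
median_all = statistics.median(data)               # middle of 41,41,41,42,43,43,58
median_trim = statistics.median(without_outlier)   # average of 41 and 42

mean_shift = abs(mean_all - mean_trim)
median_shift = abs(median_all - median_trim)

print(median_all, median_trim)   # 42 41.5
print(mean_shift > median_shift) # True
```

The mean moves by more than two units when the single extreme value is removed, while the median moves by half a unit.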

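Variability in a Data Set. Variance: $s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2$. Standard deviation: $s_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}$, where n is the number of observations in the sample and $\bar{x}_n$ is the sample mean. Why we choose the factor 1/(n−1) instead of 1/n will be explained later (in Chapter 19).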
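A minimal sketch of the sample variance and standard deviation with the 1/(n−1) factor, applied to the seven example values from Section 16.1. The function name `sample_variance` is illustrative; Python's `statistics.variance` and `statistics.stdev` use the same 1/(n−1) convention and serve as a cross-check.

```python
import math
import statistics

data = [43, 41, 41, 42, 43, 58, 41]

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

variance = sample_variance(data)
std_dev = math.sqrt(variance)

# Cross-check against the standard library (same 1/(n - 1) factor).
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))

print(round(variance, 2), round(std_dev, 2))  # 38.14 6.18
```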

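Variability cont. Median of absolute deviations (MAD): the median of the absolute deviations of a sample, $\mathrm{MAD}(x_1, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$, where $\mathrm{Med}_n$ = median of the sample. Absolute deviation: the distance $|x_i - \mathrm{Med}_n|$ of a point $x_i$ in the data set from the median.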
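The two-step definition (take the median, then take the median of the absolute deviations from it) can be sketched as follows on the seven example values from Section 16.1. The function name is illustrative only.

```python
import statistics

data = [43, 41, 41, 42, 43, 58, 41]

def median_absolute_deviation(xs):
    """Median of |x_i - Med_n|, where Med_n is the sample median."""
    med = statistics.median(xs)
    deviations = [abs(x - med) for x in xs]
    return statistics.median(deviations)

# Median is 42, so the absolute deviations are 1, 1, 1, 0, 1, 16, 1.
mad = median_absolute_deviation(data)
print(mad)  # 1
```

The single large deviation (16, from the value 58) does not affect the result, in contrast to the standard deviation.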

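Empirical quantiles. The order statistics consist of the same elements as the original dataset $x_1, x_2, x_3, \ldots, x_n$, but in ascending order. Denote by $x_{(k)}$ the kth element in the ordered list. Then, for $0 < p < 1$, the pth empirical quantile is $q_n(p) = x_{(k)} + \alpha\,(x_{(k+1)} - x_{(k)})$, where $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$. The pth empirical quantile corresponds to the pth quantile of a cdf, $F^{\mathrm{inv}}(p)$, where $F$ is the cumulative distribution function of the data.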
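A minimal sketch of an empirical quantile computed from the order statistics, assuming the textbook's interpolation rule: take position p(n+1) in the ordered list, with integer part k and fractional part α, and interpolate linearly between the kth and (k+1)th ordered values. The function name and the range check (rejecting p for which the needed order statistics do not exist) are choices made here, not part of the slides.

```python
import math

def empirical_quantile(xs, p):
    """Interpolate between order statistics at position p * (n + 1)."""
    ordered = sorted(xs)          # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)      # integer part
    alpha = position - k          # fractional part
    if k < 1 or k > n or (k == n and alpha > 0):
        # Assumption: refuse to extrapolate beyond x_(1) or x_(n).
        raise ValueError("p is outside the range covered by the sample")
    if alpha == 0:
        return ordered[k - 1]     # lists are 0-indexed, x_(k) is ordered[k - 1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

data = [43, 41, 41, 42, 43, 58, 41]
median = empirical_quantile(data, 0.5)   # position 4, so x_(4)
print(median)  # 42
```

For p = 0.6 the position is 4.8, so the result lies 80% of the way from the 4th ordered value (42) to the 5th (43), approximately 42.8.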

Quartiles • Lower quartile: $q_n(0.25)$ • Upper quartile: $q_n(0.75)$ • Interquartile range (IQR): IQR = $q_n(0.75) - q_n(0.25)$ • Median (middle quartile): $q_n(0.50)$
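The three quartiles and the IQR can be sketched with Python's standard library. `statistics.quantiles` with its default `method="exclusive"` treats the ith of n sorted points as the i/(n+1) quantile, which matches interpolating at position p(n+1); other software may use different conventions and give slightly different quartiles. The data are the seven example values from Section 16.1.

```python
import statistics

data = [43, 41, 41, 42, 43, 58, 41]

# n=4 asks for the three cut points that split the data into quarters.
lower_q, median, upper_q = statistics.quantiles(data, n=4, method="exclusive")
iqr = upper_q - lower_q

print(lower_q, median, upper_q)  # 41.0 42.0 43.0
print(iqr)                       # 2.0
```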

The box-and-whisker plot • Advantages: • Compact summary of a dataset • Shows quartiles, median and outliers • Disadvantages: • Hides the detailed shape of the distribution of a single dataset • Histogram and kernel density estimate are more informative displays of a single dataset
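The quantities a box-and-whisker plot displays can be computed without drawing anything. The sketch below assumes the common convention that whiskers extend to the most extreme observations within 1.5 × IQR of the box and that anything beyond is marked as an outlier; the slide itself does not state the whisker rule, so `whisker_factor` and the function name are assumptions made here.

```python
import statistics

def boxplot_summary(xs, whisker_factor=1.5):
    """Return the elements a box plot would draw for the sample xs."""
    q1, med, q3 = statistics.quantiles(xs, n=4)   # default "exclusive" method
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "whisker_low": min(inside),    # most extreme points inside the fences
        "whisker_high": max(inside),
        "outliers": sorted(x for x in xs if x < lower_fence or x > upper_fence),
    }

data = [43, 41, 41, 42, 43, 58, 41]
summary = boxplot_summary(data)
print(summary["outliers"])  # [58]
```

For these seven values the box runs from 41 to 43, both whiskers end at the box edges, and 58 is the only point flagged as an outlier.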

Using boxplots to compare several datasets. Boxplots become useful when we want to compare several sets of data in a single, simple graphical display.