Describing Data

Where does data come from? • Chapter 2 of Van Belle has a nice description of study types. • In theory, if you can gather data on EVERYONE of interest (e.g., all people with a disease), you are not doing statistics; you are describing parameters in the population. • In reality, you only sample a fraction of the population of interest. The people who could have been included in your sample are called the sampling frame.

Abusing a Sampling Frame • Look VERY carefully at the selection criteria for a study. • If you randomize enough people into drug and placebo groups, you can find effects if they exist in the population. Right? Wrong! • If the sampling frame for a study does not include people at risk, or only includes people who are at far less risk than the population in general, you cannot find differences, regardless of randomization. – People with high risk of cardiovascular problems should be kept out of the study so the differences in the rate of cardiovascular problems between the Vioxx patients and the others “would not be evident.” • Mathews A, Martinez B (November 1, 2004) E-mails suggest Merck knew Vioxx's dangers at early stage. Wall Street Journal.

Organizing Data • When you collect data, you store it in a grid/matrix where each row represents one measurement time on one individual and the columns represent different types of information. You may have a column for last name and another for CD4 count. The values in the columns vary from row to row (aka from record to record). Therefore, the columns are called variables.
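
The grid described above maps onto an R data frame. This is a minimal sketch; the column names and every value are hypothetical and were invented for illustration.

```r
# Hypothetical long-format data: one row per measurement time per individual.
# The names and values below are invented for illustration only.
visits <- data.frame(
  last_name = c("Smith", "Smith", "Jones", "Jones"),
  visit     = c(1, 2, 1, 2),
  cd4_count = c(350, 410, 520, 495),
  stringsAsFactors = FALSE
)

str(visits)     # each column is a variable with its own type
nrow(visits)    # number of records (rows)
```

Each person appears once per visit, so the same last name can occur on several rows.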

Types of Variables • Computer programmers differentiate between lots of different types of variables. – They pay attention to the differences between whole numbers vs. lots of decimals and single letters vs. long strings of characters because they want to make the columns use as little space as possible. • Statistical programmers and statisticians think about character variables (letters and words which they call strings of letters) vs. categorical factors vs. numeric variables because there are some things you just don’t want to do to a bunch of letters (like get an average).

Taxonomy of Variables • In 1946 Stevens suggested a taxonomy of variable types. Each type affords different summary statistics and graphics. – Nominal • named categories – Ordinal • ordered categories but distances between categories are not equal – Interval • ordered categories with equal distance between the points – Ratio • continuous scale with meaningful ratios and a meaningful zero • You will think a lot about nominal, ordinal and continuous variables.
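
One way to see the taxonomy in practice is through how R stores each type. This is a sketch only; the variable names and values are hypothetical.

```r
# Hypothetical values, invented for illustration.
blood_type <- factor(c("A", "O", "B", "O", "AB"))              # nominal
severity   <- factor(c("mild", "severe", "moderate", "mild"),
                     levels = c("mild", "moderate", "severe"),
                     ordered = TRUE)                            # ordinal
temp_c     <- c(36.6, 37.2, 38.1)                               # interval
weight_kg  <- c(61.5, 72.0, 80.3)                               # ratio

table(blood_type)       # frequency counts suit nominal data
severity < "severe"     # order comparisons work on ordered factors
median(weight_kg)       # numeric summaries suit interval and ratio data
```

Declaring the level order explicitly matters: without it, R sorts factor levels alphabetically.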

Another Popular Taxonomy • Categorical – binary (2 categories) – nominal (+ more categories) – ordinal (+ order matters) • Quantitative – discrete (+ numerical) – continuous (+ uninterrupted)

Describing Data • For every variable you play with, you want to know two things: its variability and its central tendency. • Never EVER use a numeric summary of data without a plot. A good plot shows you both the variability and central tendency at once.

Same Mean, Different Variability • Data A: Mean = 15.5, S = 3.338 • Data B: Mean = 15.5, S = 0.926 • Data C: Mean = 15.5, S = 4.570 (Figure: three dot plots, Data A, Data B and Data C, on a common axis running from 11 to 21.) Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Central Tendency • Mean – The arithmetic mean is the “add up the values and divide by N” formula (number of records). There are other means! • Median – Order the data from low to high and take the middle value or the average of the middle 2 values if you have an even number of records. • Mode – The most frequently occurring value
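
The three definitions above can be checked in R. This is a small sketch on hypothetical data; `stat_mode` is a helper written for this example and is not a built-in function.

```r
x <- c(2, 3, 3, 5, 7, 10)   # hypothetical data

mean(x)     # arithmetic mean: sum(x) / length(x)
median(x)   # even N, so the average of the two middle values (3 and 5)

# Base R's mode() reports the storage type, not the statistical mode,
# so tabulate the values and keep the most frequent one(s).
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)
```

If several values tie for most frequent, `stat_mode` returns all of them.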

Variability • The actual values… • Range – Limits • IQR – Difference between 75th and 25th percentiles • The absolute deviation • The standard deviation/variance

Rat brain weights in 4 treatments • (Figure: original plot.) • (Figure: alternate plot.) Bars show the mean and dots indicate each animal.

The Average Variability • It frequently makes sense to use the mean to describe the average value, but the average deviation from the mean is always zero (give or take rounding error) because the positive and negative differences cancel. There are alternatives. – First, calculate the differences between the observed and mean values and then take the absolute value (strip off the negative signs). Calculate the average of those values. This is the average absolute deviation. – First, calculate the differences between the observed and mean values and square these differences. Calculate the average of those values. This is the variance. Its square root is the standard deviation.
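
The calculations on this slide can be worked through in R. This is a sketch on hypothetical data; note that R's own `var()` divides by N - 1 rather than N.

```r
x   <- c(4, 8, 6, 5, 7)          # hypothetical data, mean is 6
dev <- x - mean(x)               # -2  2  0 -1  1

mean(dev)                        # 0: positive and negative deviations cancel
mean(abs(dev))                   # average absolute deviation: 6 / 5 = 1.2
sum(dev^2) / length(x)           # average squared deviation: 10 / 5 = 2
sum(dev^2) / (length(x) - 1)     # same sum divided by N - 1: 10 / 4 = 2.5
var(x)                           # R's var() uses the N - 1 denominator
sd(x)                            # square root of var(x)
```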

The Joys of Excel • All those lovely extra digits and still rounding error. (Screenshot: Excel worksheet with columns for Average Difference, Absolute Difference, Variance and Standard Deviation.)

Errr, ummm… Why the N-1? • The denominator is actually the degrees of freedom. – It considers the fact that you have already included one estimate (the mean) in the formula for the variance. Basically, you bump up the estimated variability a bit because you guessed on the mean. – You use up one DF for every parameter estimate in a formula. – Why call it degrees of freedom? You can vary most of the data going into a formula and still get the same answer.

Why call it degrees of freedom? • Say you have 5 numbers and the mean is 10. What must the total have been? The sum is 50 (5 × 10). • Degrees of freedom is the sample size, N, minus the number of parameters, P, estimated from the data. • We can freely vary 4 of the 5 numbers and still come up with the same mean, because the fifth number is then forced to make the total 50. • The DF on a mean with sample size N is N − 1.
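
The five-number example can be sketched in R. The four "free" values below are arbitrary and hypothetical; any four numbers work.

```r
n           <- 5
target_mean <- 10
free_values <- c(3, 14, 8, 11)   # any four numbers you like (hypothetical)

# The total must be n * mean = 50, so the fifth value is forced.
last_value <- n * target_mean - sum(free_values)
x <- c(free_values, last_value)

last_value   # 14
mean(x)      # 10
```

Changing any of the four free values changes `last_value`, but never the mean.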

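The Variance Formula • Variance = (sum of the squared differences between each value and the mean) / (N − 1), or if you prefer hieroglyphics… $S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$ • A bar over a variable means the mean.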

Secret Decoder Ring • S² = Sample variance • S = Sample standard dev. • σ² = Population (true or theoretical) variance • σ = Population standard dev. • X̄ = Sample mean • µ = Population mean • IQR = interquartile range (middle 50%)

Nominal Data • If a variable represents categories, summarize with frequency counts. • Graph it with a dot plot or bar graph. • Pie charts are all bad. Waffle plots are better. (Figure: data on the number of hospice referrals received from physicians after a visit by a hospice marketing nurse.)

Bar plots are not too good. • Look at the ink-to-information ratio…. Three numbers are shown with LOTS of ink.

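Dot Plots in R
library(gdata)      # provides read.xls()
# Backslashes in an R string must be doubled (or use forward slashes).
hospice = read.xls("C:\\Projects\\classes\\hrp223-2007\\hospice.xls")
library(lattice)    # provides dotplot() and trellis.par.set()
# Enlarge the plotting symbols and the text.
trellis.par.set(list(fontsize = list(points = 20)))
trellis.par.set(list(fontsize = list(text = 25)))
# One dot per practice, positioned at its frequency count.
dotplot(table(hospice$Practice), xlim = c(-1, 21), xlab = "Frequency Count")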

Bad Plots • Pies are great for twisting the truth. The false 3rd dimension makes the front piece look bigger. I can’t tell if there is a difference in the sizes. Rotating the pie can affect your judgment of the piece sizes. NEVER trust a glossy pie.

Ordinal Data • Summary tables can include cumulative percentages and similar plots. • The data is ordered, so get your figure categories in the same order. (Figure: Serum Samples in Each Trimester.)
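
A summary table with cumulative percentages can be sketched in R as follows. The trimester values are hypothetical and are not the serum sample data from the figure.

```r
# Hypothetical observations, invented for illustration.
trimester <- factor(c("first", "second", "second", "third", "first",
                      "third", "third", "second", "third", "first"),
                    levels = c("first", "second", "third"), ordered = TRUE)

counts  <- table(trimester)              # counts follow the level order
percent <- 100 * prop.table(counts)
summary_table <- data.frame(
  trimester   = names(counts),
  n           = as.vector(counts),
  percent     = as.vector(percent),
  cum_percent = cumsum(as.vector(percent))
)
summary_table
```

Setting `levels` explicitly is what keeps the table, and any plot built from it, in trimester order instead of alphabetical order.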

Interval and Ratio Data • People automatically draw histograms to describe data that is on a continuous scale. Histograms show you the shape of the empirical distribution but they do nothing to convey things like the mean, median or quantiles. They also have issues where re-binning the data changes perception.
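
The re-binning issue can be seen without drawing anything. This is a sketch on hypothetical values; `hist()` with `plot = FALSE` returns the bin counts only.

```r
# Hypothetical right-skewed values, invented for illustration.
x <- c(0.4, 0.5, 0.5, 0.6, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1.1, 1.7)

# Same data, two binnings; plot = FALSE returns the counts without drawing.
wide   <- hist(x, breaks = seq(0, 2, by = 1.0), plot = FALSE)
narrow <- hist(x, breaks = seq(0, 2, by = 0.5), plot = FALSE)

wide$counts     # two bins:  10 2
narrow$counts   # four bins: 3 7 1 1
```

With two bins the data look like a single lump near zero; with four, the long right tail starts to show.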

Mean, median, mode? The same data rendered by R and SAS affords different interpretations about a bimodal distribution, and good luck finding the median or mean.

Use Boxplots • (Figure: annotated box plot.) Labels from top to bottom: upper fence (75th percentile + 1.5 × IQR), 75th percentile, median, mean, 25th percentile, lower fence (25th percentile − 1.5 × IQR).

Box Plots and Histograms: for Continuous Variables • To show the distribution (shape, center, range, variation) of continuous variables, use both box plots and histograms.

Histogram of SI • Bins of size 0.1 • Note the “right skew” (Figure: histogram; x-axis SI from 0.0 to 2.0, y-axis Percent from 0.0 to 25.0.)

Box Plot: Shock Index Units • maximum (1.7) • Outliers: points beyond the upper “whisker” • “whisker”: Q3 + 1.5 IQR = 0.8 + 1.5(0.25) = 1.175 • 75th percentile (0.8) • median (0.66) • 25th percentile (0.55) • interquartile range (IQR) = 0.8 − 0.55 = 0.25 • minimum (or Q1 − 1.5 IQR) (Figure: box plot of SI, axis from 0.0 to 2.0.)
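
The fence arithmetic on this slide can be reproduced in R. The quartiles come from the slide; the `si` vector is hypothetical and only illustrates how a point gets flagged.

```r
q1 <- 0.55                      # 25th percentile from the slide
q3 <- 0.80                      # 75th percentile from the slide

iqr         <- q3 - q1          # 0.25
upper_fence <- q3 + 1.5 * iqr   # 1.175
lower_fence <- q1 - 1.5 * iqr   # 0.175

# Hypothetical observations: anything beyond a fence is drawn as an outlier.
si <- c(0.45, 0.55, 0.66, 0.80, 1.10, 1.70)
si[si > upper_fence | si < lower_fence]   # 1.7
```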

100 bins (too much detail)

2 bins (too little detail)

Box Plot: Shock Index Units • Also shows the “right skew” (Figure: box plot of SI, axis from 0.0 to 2.0.)

Box Plot: Age • More symmetric (Figure: box plot of AGE in years, axis from 0.0 to 100.0, labeled from top to bottom: maximum, 75th percentile, median, 25th percentile, minimum; the box spans the interquartile range.)

Histogram: Age • Not skewed, but not bell-shaped either… (Figure: histogram; x-axis AGE (Years) from 0.0 to 100.0, y-axis Percent from 0.0 to 14.0.)

Numeric Summaries • You can always calculate the mean, median, mode and standard deviation on continuous data but you don’t always want to. • The mean and standard deviation may not be good descriptions of the data if you have outliers, skewed data or a bimodal distribution.

Leukemia Onset Age • Say you are studying a disease whose age of onset is bimodal, like leukemia. You can describe it with a mean but you are not representing the data. (Figure: bimodal distribution of onset age with the mean marked.)
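
A small R sketch shows why the mean misrepresents bimodal data. The ages below are hypothetical and are not real leukemia data.

```r
# Hypothetical onset ages with a childhood cluster and an older-adult cluster.
onset_age <- c(3, 4, 4, 5, 6, 7, 62, 65, 68, 70, 71, 73)

mean(onset_age)      # 36.5
median(onset_age)    # 34.5
range(onset_age)     # 3 73

# Count the observations within 20 years of the mean.
sum(abs(onset_age - mean(onset_age)) < 20)   # 0
```

Both the mean and the median land in the gap between the two clusters, where no patient actually sits.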

Density Function • In theory, there is a continuous density function that describes the pattern in the histogram. The most famous is the bell-shaped curve but there are others that are at least as important. – Is the density shape Gaussian, skewed, bimodal, exponential or something weirder? – Does it contain outliers? – Are there data points that don’t make sense?

Thoughts on Outliers • Work like crazy to identify them. • Do analyses with and without them and see if the inferences change. • If one data point changes the inferences and you decide to exclude it, be sure to include the value in your plots with a special plotting symbol. • True outlier values bring Nobel prizes. • Statistics based on ranks or percentiles are relatively insensitive to outliers. The median income for Washington state was $48,397 in 2000 but the mean was $96,200.
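
The insensitivity of the median can be sketched in R. The incomes below are hypothetical and are not the Washington state figures.

```r
# Hypothetical household incomes in dollars; the last value is an extreme outlier.
income <- c(30000, 38000, 42000, 48000, 55000, 61000, 2000000)

mean(income)          # pulled far upward by the single large value
median(income)        # 48000, no matter how large the outlier is

# Repeat the summaries without the outlier and compare.
mean(income[-7])      # about 45667
median(income[-7])    # 45000
```

Dropping one point moves the mean by hundreds of thousands of dollars but moves the median by only 3000.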

Mean and SD • The mean and the SD play a huge role in statistics because they describe the normal curve. Much more on this later, but… • No matter what µ and σ are, the area under the normal curve between µ − σ and µ + σ is about 68%; the area between µ − 2σ and µ + 2σ is about 95%; and the area between µ − 3σ and µ + 3σ is about 99.7%. Almost all values fall within 3 standard deviations.

68-95-99.7 Rule (Figure: normal curve with regions labeled 68% of the data, 95% of the data and 99.7% of the data.)
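
The rule can be checked with R's normal distribution function `pnorm()`. The mean of 120 and SD of 15 in the second half are arbitrary, hypothetical values chosen to show that the areas do not depend on them.

```r
k    <- 1:3
area <- pnorm(k) - pnorm(-k)    # area within k SDs of the mean, standard normal
round(100 * area, 1)            # 68.3 95.4 99.7

# The same areas hold for any mean and SD (the values here are arbitrary).
mu    <- 120
sigma <- 15
area_any <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
            pnorm(mu - k * sigma, mean = mu, sd = sigma)
round(100 * area_any, 1)        # 68.3 95.4 99.7
```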

Huff – How to Lie with Statistics • Worry about stretched or broken/split axes. • If people use “images” to display numbers, they are trying to exaggerate. They scale the height of the image to the number, but what actually grows is the AREA. • Nobody would use areas to show a one-dimensional measurement like size. Nobody would design a program that represents data like this. Right? Nobody…

…except Microsoft. Expect lies when you see 3D effects on plots or pie charts. Exploded pie charts are great for lying. Area/bubble charts are GREAT for hiding differences. Read William Cleveland's books Visualizing Data and The Elements of Graphing Data.

Trust nothing you can’t see. • If a study has a clinically interesting effect with a statistically interesting p-value, it had better have a clear graphic! – Lots more on p-values later. • A good graphic will show the effect with a point estimate (mean, for example) and the variation (standard deviation).