RESEARCH STATISTICS
Jobayer Hossain, Ph.D.
Larry Holmes, Jr., Ph.D.
October 16, 2008

Data Summarization
• In research, the first step of data analysis is to describe the distribution of the variables included in the study.
• Describing the data helps us to:
– Get a quick overall idea of the study
– Get a quick idea of the differences among the groups being compared
– Check whether the demographic and other prognostic variables that may unduly influence the outcome are balanced across groups

Is the balance of the distribution of prognostic factors in comparing groups important?
• Example: Primary biliary cirrhosis trial (a chronic and fatal liver disease)
– A randomized double-blind trial
– Study treatment groups: Azathioprine vs placebo
– Objective: To compare the survival time of the two treatment groups
– Primary end point: Time to death from randomization
Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting, by Duolao Wang and Ameet Bakhai

Is the balance of the distribution of prognostic factors in comparing groups important?
• Example (contd.)
• Bilirubin is a strong predictor of survival time.
Table 1: Summary statistics for bilirubin level (µmol/L) at baseline
• Is the baseline imbalanced?
• You may expect a higher mortality rate in the Azathioprine group. Why?
• How does it affect the primary end point of survival time?

Is the balance of the distribution of prognostic factors in comparing groups important?
Table 3: Adjusted and unadjusted hazard ratios of death from the Cox proportional hazards model
• Before adjustment for the covariate bilirubin, there was no significant difference between the two treatment groups (p-value = 0.455).
• After adjustment for bilirubin, a significant difference was found between the treatment groups (p-value < 0.001).

Looking at Data
• How are the data distributed?
– Where is the center?
– What is the range?
– What is the shape of the distribution (symmetric, skewed)?
• Are there outliers?
• Are there data points that don’t make sense?

Distribution of a variable
The distribution of a variable tells us what values the variable takes and how often it takes these values. For example, the ages of 26 pediatric patients aged 1 to 6 at AIDHC are distributed as follows:
Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2

Statistical Description/Summarization of Data
• Statistics describes the distribution of a numeric set of data by its
– Center (mean, median, mode, etc.)
– Variability (standard deviation, range, etc.)
– Shape (skewness, kurtosis, etc.)
• Statistics describes the distribution of a categorical set of data by
– Frequency, percentage, or proportion of each category

Statistical Description/Summarization of Data
• Examples of numerical and categorical variables
– Numerical variables: age, blood pressure (systolic and diastolic), time, weight, height, BMI (body mass index)
– Categorical variables: treatment group, disease status, race, gender, blood type (O, A, B, AB), age group (such as 1-5 years, 6-9 years, etc.)

Statistical Presentation of Data
There are two types of statistical presentation of data: graphical and numerical.
Graphical presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.
Bar diagrams and pie charts are used for categorical variables. Histograms, stem-and-leaf plots, and boxplots are used for numerical variables.

Statistical Presentation of Data
• Statistics presents the data either graphically or numerically.
• In graphical presentation, we look for the overall pattern (distribution) and for striking deviations from that pattern.
• An individual value that falls outside the overall pattern is called an outlier.
– The overall pattern of numerical data is usually described by the shape, center, and spread of the data. Commonly used graphs are the histogram, stem-and-leaf plot, and boxplot.
– The overall pattern of categorical data is usually described by frequencies and percentages. Commonly used graphs are the bar plot and pie chart.

Data Presentation – Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall in each category.
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          15/60 = 0.250    25.0
2                 25          25/60 = 0.417    41.7
3                 20          20/60 = 0.333    33.3
Total             60          1.00             100
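The proportion and percent columns of a frequency table follow directly from the counts. A minimal Python sketch, using the three treatment-group counts from the table above (the variable names are illustrative and not part of the original slides):

```python
# Counts per treatment group, taken from the frequency table.
counts = {1: 15, 2: 25, 3: 20}

total = sum(counts.values())  # total number of subjects

# Proportion = count / total; percent = 100 * proportion.
proportions = {group: n / total for group, n in counts.items()}
percents = {group: 100 * p for group, p in proportions.items()}

for group in counts:
    print(f"Group {group}: n={counts[group]}, "
          f"proportion={proportions[group]:.3f}, "
          f"percent={percents[group]:.1f}")
```

The proportions should always sum to 1 and the percents to 100, which is a quick check on any hand-built table.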

Data Presentation – Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in each category.
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          15/60 = 0.250    25.0
2                 25          25/60 = 0.417    41.7
3                 20          20/60 = 0.333    33.3
Total             60          1.00             100

Data Presentation – Categorical Variable (Frequency Distribution)
Consider a data set of 26 children aged 1-6 years. The frequency distribution of the variable ‘age’ can be tabulated as follows:
Frequency distribution of age:
Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2
Grouped frequency distribution of age:
Age Group  1-2  3-4  5-6
Frequency  8    12   6

Data Presentation – Categorical Variable (Frequency Distribution)
Cumulative frequency of the data on the previous slide:
Age                   1  2  3   4   5   6
Frequency             5  3  7   5   4   2
Cumulative Frequency  5  8  15  20  24  26
Grouped:
Age Group             1-2  3-4  5-6
Frequency             8    12   6
Cumulative Frequency  8    20   26
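A cumulative frequency is simply a running total of the frequencies. The sketch below reproduces both tables from the age data; it is only one way to do it, and the variable names are illustrative.

```python
from itertools import accumulate

ages = [1, 2, 3, 4, 5, 6]
frequencies = [5, 3, 7, 5, 4, 2]

# Running total of the frequencies gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

# Grouped distribution: combine adjacent ages (1-2, 3-4, 5-6).
grouped = [frequencies[i] + frequencies[i + 1]
           for i in range(0, len(frequencies), 2)]
grouped_cumulative = list(accumulate(grouped))

print(cumulative)          # cumulative frequency by single year of age
print(grouped)             # frequency by age group
print(grouped_cumulative)  # cumulative frequency by age group
```

The last cumulative value equals the total number of observations (26 here), which is a useful consistency check.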

Data Presentation – Numerical Variable
Histogram: The overall pattern can be described by its shape, center, and spread. The following age distribution is right skewed. The center lies between 80 and 100. There are no outliers.
Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

Graphical Presentation – Numerical Variable
• Boxplot:
– A boxplot is a graph of the five-number summary. The central box spans the quartiles.
– A line within the box marks the median.
– Lines extending above and below the box mark the smallest and the largest observations (i.e., the range).
– Outlying samples may additionally be plotted outside the range.

Graphical Presentation – Numerical Variable
Box-Plot: The box contains the middle 50% of the data. The upper and lower whiskers contain the top 25% and the bottom 25% of the ordered data, respectively.
Figure 3: Distribution of Age (box plot marking, from top to bottom, the maximum, 75th percentile, median, 25th percentile, and minimum)
The shape of the distribution is right skewed, as the upper part of the box and the upper whisker are longer than the corresponding lower parts.

Side-by-Side Boxplot
[Figure: side-by-side boxplots for Trt 1, Trt 2, and Trt 3]

Box Plot: Age of patients
[Figure: box plot of age in years (axis from 0.0 to 100.0), labeled with the maximum, 75th percentile, median, 25th percentile, minimum, and interquartile range]

Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set of observations and the extent to which the central value characterizes the whole set of data. Measures of central value such as the mean or median must be coupled with measures of data dispersion (e.g., average distance from the mean) to indicate how well the central value characterizes the data as a whole.
To understand how well a central value characterizes a set of observations, consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50, but the observations in data set A lie farther from the mean than those in data set B. Thus, the mean is a better representation of data set B than of data set A.
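The comparison of data sets A and B can be made concrete by computing the average distance from the mean for each. This is a small illustrative sketch; the helper names are assumptions for the example, not something defined in the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

def mean_abs_deviation(values):
    """Average absolute distance of the observations from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])

data_a = [30, 50, 70]
data_b = [40, 50, 60]

for name, data in (("A", data_a), ("B", data_b)):
    print(f"Data set {name}: mean = {mean(data):.1f}, "
          f"average distance from mean = {mean_abs_deviation(data):.2f}")
```

Both sets share the same mean, yet set A has twice the average distance from it, which is why the mean describes set B more faithfully.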

Methods of Center Measurement
A measure of center summarizes the overall level of a dataset. Commonly used measures are the mean, median, mode, geometric mean, etc.
Mean: The sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

Methods of Center Measurement
Median: The middle value in an ordered sequence of observations. To find the median, we order the data set and then take the middle value. With an even number of observations, the median is the average of the two middle values.
For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data, giving {3, 5, 6, 7, 9}, and then choose the middle value, 6. If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, the sorted sequence is {2, 3, 5, 6, 7, 9} and the median is the average of the two middle values, (5 + 6) / 2 = 5.5.
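The sort-then-pick-the-middle procedure translates directly into code. The function below is a minimal sketch of that procedure, written out by hand for clarity (Python's standard `statistics.median` does the same job).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair

print(median([9, 3, 6, 7, 5]))     # odd number of observations
print(median([9, 3, 6, 7, 5, 2]))  # even number of observations
```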

Mean or Median
• The mean is affected by outliers, but the median is not.
[Figure: two dot plots on a 0 to 10 axis; the first is labeled Mean = 3, Median = 3, and the second is labeled Mean = 4]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g., family income.
For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these four observations is (30+40)/2 = 35. Here 3 of the 4 observations lie between 20 and 40, so the mean of 270 fails to give a realistic picture of the major part of the data. It is pulled upward by the extreme value 990.
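The effect of the extreme value 990 can be checked with Python's standard `statistics` module. The sketch compares the mean and median with and without the outlier; the list names are illustrative.

```python
from statistics import mean, median

typical = [20, 30, 40]
with_outlier = typical + [990]

# Without the outlier, mean and median agree.
print(mean(typical), median(typical))

# With the outlier, the mean jumps while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

Adding one extreme observation moves the mean from 30 to 270, while the median only moves from 30 to 35.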

Methods of Center Measurement
Mode: The value that is observed most frequently. The mode is undefined for sequences in which no observation is repeated. A variable with a single mode is unimodal; one with two modes is bimodal.

A bimodal histogram
[Figure: histogram with two modal classes marked]
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

Methods of Variability Measurement
Variability (or dispersion) measures the amount of scatter in a dataset. Commonly used measures are the range, variance, standard deviation, interquartile range, coefficient of variation, etc.
Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100 - 2) = 98. It is a crude measure of variability.

Methods of Variability Measurement
Variance: The variance of a set of observations is the average of the squared deviations of the observations from their mean, where for sample data the sum of squares is divided by n - 1. In symbols, the variance of the n observations $x_1, x_2, \ldots, x_n$ is
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is
\[ s^2 = \frac{(5-5)^2 + (7-5)^2 + (3-5)^2}{3-1} = \frac{8}{2} = 4 \]
Standard Deviation (SD): The square root of the variance. The SD of the above example is $\sqrt{4} = 2$.
If the distribution is bell shaped (symmetric), then the range is approximately 6 × SD.
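The 5, 7, 3 example can be verified in a few lines. This sketch assumes the sample (n - 1) divisor, since that is the divisor that reproduces the SD of 2 quoted on the slide; the function names are illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(values))

data = [5, 7, 3]
print(sample_variance(data))  # variance of the slide's example
print(sample_sd(data))        # standard deviation of the slide's example
```

Python's built-in `statistics.variance` and `statistics.stdev` also use the n - 1 divisor, so they should give the same results for this data.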

Standard deviation of different distributions with the same center
[Figure: three dot plots on an 11 to 21 axis with the same mean and different spreads]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

Std Dev of Shock Index
[Figure: histogram of shock index (SI, axis from 0.0 to 2.0) against count (axis from 0.0 to 250.0)]
The standard deviation is a measure of the “average” scatter around the mean.
Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 1.4/6 = 0.23.
Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb

Methods of Variability Measurement
Quartiles: Quartiles are values that divide the sorted dataset into four equal parts, so that each part contains 25% of the sorted data.
In notation, the qth quartile is the ((n+1)/4)·q th observation of the ordered data, where q is the desired quartile (1, 2, or 3) and n is the number of observations.
• The first quartile (Q1) is the value below which 25% of the observations fall and above which 75% fall. It is the median of the first half of the ordered dataset.
• The second quartile (Q2) is the median of the data.
• The third quartile (Q3) is the value below which 75% of the observations fall and above which 25% fall. It is the median of the second half of the ordered dataset.

Methods of Variability Measurement
An example with 15 numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
Here Q1 is the ((15+1)/4)·1 = 4th observation of the ordered data, which is 11.
• The first quartile is Q1 = 11.
• The second quartile is Q2 = 40 (this is also the median).
• The third quartile is Q3 = 61.
Interquartile range: The difference between Q3 and Q1. The interquartile range of this example is 61 - 11 = 50. The middle half of the ordered data lies between 11 and 61.

Methods of Variability Measurement
[Figure: positions of Q1, Q2, and Q3, with 25% of the data in each section, for symmetric, right-skewed, and left-skewed distributions]

Deciles and Percentiles
• Deciles: If the data are ordered and divided into 10 equal parts, the cut points are called deciles.
• Percentiles: If the data are ordered and divided into 100 equal parts, the cut points are called percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th percentile is Q3.
• Suppose PC = ((n+1)/100)·p, where n is the number of observations and p is the desired percentile. If PC is an integer, then the pth percentile of a data set is the (PC)th observation of the ordered data. Otherwise, let PI be the integer part of PC and f the fractional part of PC. Then
pth percentile = OI + (OII - OI) × f
where OI is the (PI)th observation and OII is the (PI+1)th observation of the ordered data.
• For example, consider the following ordered set of data: 3, 5, 7, 8, 9, 11, 13, 15. Here n = 8, so PC = (9/100)·p. For the 25th percentile, PC = 2.25 (not an integer), so PI = 2 and f = 0.25, and the 25th percentile = 5 + (7 - 5) × 0.25 = 5.5.
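The percentile rule above can be sketched as a short function. This follows the (n+1)·p/100 position rule from the slide; how to handle positions that fall before the first or after the last observation is not covered in the slides, so clamping to the minimum and maximum is an assumption made here.

```python
def percentile(data, p):
    """p-th percentile using the (n + 1) * p / 100 position rule."""
    ordered = sorted(data)
    n = len(ordered)
    pc = (n + 1) * p / 100   # 1-based position in the ordered data
    pi = int(pc)             # integer part of the position
    f = pc - pi              # fractional part of the position
    if pi < 1:               # assumption: clamp to the smallest observation
        return ordered[0]
    if pi >= n:              # assumption: clamp to the largest observation
        return ordered[-1]
    lower, upper = ordered[pi - 1], ordered[pi]
    return lower + (upper - lower) * f  # interpolate between neighbours

small = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(small, 25))

fifteen = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
print([percentile(fifteen, p) for p in (25, 50, 75)])
```

Other software may use different interpolation rules, so percentiles from a statistics package can differ slightly from this hand calculation.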

Coefficient of Variation
• Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually expressed in percent.
\[ \text{Coefficient of Variation} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \]
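As a quick illustration, the coefficient of variation can be computed for the 5, 7, 3 data used in the variance example (mean 5, SD 2). The sketch uses the sample standard deviation from Python's `statistics` module; the function name is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

cv = coefficient_of_variation([5, 7, 3])
print(f"CV = {cv:.1f}%")
```

Because it is unit-free, the coefficient of variation is convenient for comparing the spread of variables measured on different scales.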

Five Number Summary
• Five Number Summary: The five-number summary of a distribution consists of the smallest observation (minimum), the first quartile (Q1), the median (Q2), the third quartile (Q3), and the largest observation (maximum), written in order from smallest to largest.
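A five-number summary is easy to assemble in code. The sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) with `method="exclusive"`, which places quartiles using (n+1)-based positions; the data are the 15 numbers from the quartile example, and the function name is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the data."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)
```

These five values are exactly what a boxplot draws: the ends of the whiskers, the edges of the box, and the line inside it.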

Choosing a Summary
• The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers. The mean and standard deviation are reasonable for symmetric distributions that are free of outliers.
• In real life we can’t always expect the data to be symmetric. It is common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data. Other summary statistics, such as Q1, Q3, and the coefficient of variation, can be included if they are considered important for describing the data.

Shape of Data
• The shape of the data is measured by
– Skewness
– Kurtosis

Skewness
• Measures the asymmetry of the data
– Positive or right skewed: longer right tail
– Negative or left skewed: longer left tail
– Symmetric: bell shaped

[Figure: left-skewed and right-skewed histograms]
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

Bell-shaped Histograms
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

Kurtosis Formula

Kurtosis relates to the relative flatness or peakedness of a distribution. A standard normal distribution (blue line: µ = 0; σ = 1) has kurtosis = 0. A distribution like that illustrated with the red curve has kurtosis > 0, with a lower peak relative to its tails.
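The exact formulas shown on the slides are not recoverable here, so the following sketch uses the common moment-based definitions as an assumption: skewness is the third central moment divided by the 1.5 power of the second, and excess kurtosis is the fourth central moment divided by the square of the second, minus 3 (so that a normal distribution scores 0). Spreadsheet functions typically apply small-sample adjustments, so their output can differ from these values. The two small data sets are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # illustrative symmetric data
right_skewed = [1, 2, 3, 4, 10]  # illustrative data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness signals a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.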

Normal Distribution
• The normal distribution is a density curve based on the following formula. It is completely defined by two parameters: the mean µ and the standard deviation σ.
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
• A density function describes the overall pattern of a distribution. The total area under the curve is always 1.0.
• The normal distribution is symmetrical.
• The mean, median, and mode are all the same.

Normal Distribution
The 68-95-99.7 Rule
In the normal distribution with mean µ and standard deviation σ, approximately:
• 68% of the observations fall within σ of the mean µ.
• 95% of the observations fall within 2σ of the mean µ.
• 99.7% of the observations fall within 3σ of the mean µ.
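The three percentages in the rule can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2), where erf is the error function available in Python's `math` module. The function name below is illustrative.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * prob_within(k):.2f}%")
```

The exact values are slightly different from the rounded 68, 95, and 99.7 figures, which is why the rule is stated as an approximation.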

68-95-99.7 Rule
[Figure: normal curve with regions covering 68%, 95%, and 99.7% of the data]
Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb

Normal Distribution
• Standardizing and z-Scores
If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is
\[ z = \frac{x - \mu}{\sigma} \]
A standardized value is often called a z-score. If x is a normal variable with mean µ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.

Normal Distribution
• Let $x_1, x_2, \ldots, x_n$ be n independent normal random variables, each with mean µ and standard deviation σ. Then their sum $\sum x_i$ is also normal, with mean nµ and standard deviation σ√n.
• The distribution of the mean $\bar{x}$ is also normal, with mean µ and standard deviation σ/√n.
• The standardized score of the mean is
\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
• The mean of this standardized random variable is 0 and its standard deviation is 1.
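The σ/√n result is what the "Standard Error" line in the earlier age summary table reports (SD 30.22979318, count 60, standard error 3.902649518). The sketch below recomputes it, with the sample SD standing in for σ as an assumption; the function names and the values in the last call are hypothetical.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard deviation of the sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)

def z_score_of_mean(sample_mean, mu, sigma, n):
    """Standardized score of a sample mean: (mean - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / standard_error(sigma, n)

# Values from the age summary table (sample SD used in place of sigma).
se = standard_error(30.22979318, 60)
print(f"standard error = {se:.6f}")

# Hypothetical values, purely to show the call.
print(z_score_of_mean(sample_mean=52, mu=50, sigma=10, n=25))
```

Because the standard error shrinks with √n, quadrupling the sample size only halves the uncertainty in the mean.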

SPSS demo- Data Summarization Categorical variable
Frequencies/percentages:
• Analyze -> Frequencies -> Select variables (sex, grp, shades, ped) -> OK

SPSS demo- Data Summarization Categorical variable

SPSS demo- Bar Chart
• Analyze -> Frequencies -> Select variables (sex, grp, shades, ped), then select the Charts option -> Select chart type (bar, histogram, pie chart) and select percentages or frequencies -> Continue -> OK
• Or
• Graphs -> Bar -> Select type (simple, clustered, stacked) -> Define -> Select what the bars represent (N of cases, % of cases) -> Select the variable for the category axis (e.g., grp) and click Titles to write titles -> Continue -> OK

SPSS demo- Bar Chart

SPSS demo- Data Summarization Numerical variable
• Analyze -> Descriptive Statistics -> Descriptives -> Select variable(s) (e.g., age, hgt) and click the button to transfer the variable(s) to the other window, then select Options -> Continue -> OK
• Or
• Analyze -> Compare Means -> Select variable(s) for the dependent list (age, hgt) and the independent list (grp, sex), then select Options -> Continue -> OK

SPSS demo- Data Summarization Numerical variable

SPSS demo – Boxplots
• Graphs -> Boxplot -> Simple -> Define -> Select variables (e.g., PLUC_pre) and category axis (e.g., grp) -> OK

MS Excel demo: Summary Statistics, Categorical Variable
• Frequency: Type bins -> Insert -> Function -> Statistical -> FREQUENCY -> Select ranges for data (grp) and bins -> place the cursor to the left of the equal sign and then press Ctrl, Shift, and Enter simultaneously.
• Pie Chart: Select frequency -> Chart -> Pie -> Series: write category labels (1, 2, 3) -> Next -> click Titles and write a title, click Data Labels and select Show percent, then click Next.

MS Excel demo: Summary Statistics, Numerical Variable

Questions