Statistics Review I Class 14

WHAT WOULD YOU LIKE TO KNOW?

STATISTICS AS VOX POPULI, THE VOICE OF THE PEOPLE

STATISTICAL SKILLS AND DISCOVERY

CLASS OVERVIEW
Levels of Measurement
Measures of Centrality and Dispersion
* Centrality (mean, median, mode)
* Dispersion (range, variance, std. deviation, std. error)
* Z scores and Z distribution
Confidence Intervals
Exploring Data Sets
* Reasons
* Methods (histograms, features of distributions)
Dealing with Outliers

LEVELS OF MEASUREMENT
1. Categorical
2. Ordinal
3. Continuous
   a. Interval
   b. Ratio
   c. Discrete

Categorical Variables
1. Refer to categories: human, cat, eggplant
2. All or none: Can’t be 1 third human, 2 thirds eggplant
3. Numbers serve as labels, not values: 1 = human, 2 = eggplant. “1” is not less than “2”; human is not less than eggplant
4. Common kinds of categorical variables: gender, race, major
5. Binary: Only two values: Yes/No, Day/Night, present/absent
6. Non-Binary: Multiple values. Animal, vegetable, mineral; Democrat, Republican, Independent
7. Nominal: Values are known signifiers: 805 Train = Train between Denver and Chicago; 352 Smith Hall = 52nd office on third floor

Ordinal Variables
Numeric values refer to the ordering of things
Rankings: 1 = First place, 2 = Second place
Chronology: 1 = occurred first, 2 = occurred second, etc.
Numeric values DO NOT indicate how much “1” differs from “2”
Bike race: 1st place (27.24); 2nd place (27.28); 3rd place (33.10)
Grant ranking:
Rank  Score
1.    99.89
2.    92.63  winners
3.    89.76
4.    89.75
5.    88.84
6.    79.48  losers

CONTINUOUS VARIABLES
Interval: Most stat tests rely on interval data. Equal intervals represent equal differences.
Discrete: Virtually same as "interval" but there is a finite range of values, as in Likert scales.
“How happy are you with your cell phone service?” 1 = Not at all, 2 = Barely, 3 = Somewhat, 4 = Very, 5 = Greatly
Ratio: Ratios of values on scale are meaningful. Must have meaningful “0” point.
Likert scale, above, is NOT ratio, b/c 1:2 ≠ 2:4 (a rating of 4 is not twice a rating of 2).
Temperature in Kelvin, RT, number of yawns in class ARE ratio.

GUESS THAT VARIABLE
Example: 1 = female, 2 = male. Variable: Categorical, binary
Example: 32.75 miles per gallon. Variable: Ratio
Example: 1 = slightly tired, 2 = moder. tired, 3 = very tired. Variable: Interval
Example: 352 Smith Hall. Variable: Categorical-Nominal
Example: Top 4 Reasons to Learn Stats: 1. Necessary for career 2. Source of serenity 3. Great. Variable: Ordinal

Distress and Disclosure: A Sample Experiment That Never Occurred!!!
Hyp: Increased anxiety leads to more disclosure.
Ss see scary movie or neutral movie.
Ss asked to rate how scary they found the movie.
Ss write about thoughts and feelings movie created.
DV: Number of words written after scary vs. neutral movie

Measures of Centrality
MODE: Most frequent value, occurrence
MEDIAN: Middle-most value; 50% above/below
MEAN: Arithmetic average
How many words written after seeing scary movie?
Number of words written: 2, 2, 3, 5, 8
MODE = 2
MEDIAN = 3
MEAN = 4 [(2 + 2 + 3 + 5 + 8) / 5 = 4]
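The three centrality measures for the word counts above can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
import statistics

# Number of words written after the scary movie (from the slide)
words_written = [2, 2, 3, 5, 8]

mode = statistics.mode(words_written)      # most frequent value
median = statistics.median(words_written)  # middle-most value
mean = statistics.mean(words_written)      # arithmetic average

print("mode:", mode, "median:", median, "mean:", mean)
```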

Relations Btwn Mean, Median, Mode
Number of words written.
N = 5: 1, 2, 2, 3, 8. Mode = 2.0, Median = 2.0, Mean = 3.2
N = 10: 1, 2, 3, 3, 3, 4, 5, 5, 6, 8. Mode = 3.0, Median = 3.5, Mean = 4.0
N = 20: 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 8. Median = 4.0, Mean = 4.35
How does change in N affect rel. btwn Mean, Median, and Mode?
If true distribution is normal, then as sample increases mean, median, and mode converge.

MEASURES OF DISPERSION
N = 20: 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 8
Median = 4.0, Mean = 4.35
Range: Difference between highest score and lowest score. Max diff. btwn lowest / highest: 8 − 1 = 7 = range
Deviation (from mean), AKA “Error”: Difference between individual score and mean. 8 − 4.35 = 3.65 = 8’s deviation
Sum of deviations: (1 − 4.35) + (2 − 4.35) + ... + (7 − 4.35) + (8 − 4.35) = 0. Useless!
Sum of Squared Errors (SS): (1 − 4.35)² + (2 − 4.35)² + ... + (7 − 4.35)² + (8 − 4.35)² = 87.00. Useful!
Why squared? To get a meaningful index of average dispersion.
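A minimal sketch of why raw deviations are "useless" and squared deviations are "useful". Because the N = 20 list is not fully shown on the slide, this example assumes the N = 10 word counts from the earlier slide (mean = 4.0), so the SS here is 38, not 87.

```python
# N = 10 word counts from the earlier slide (assumed here for illustration)
scores = [1, 2, 3, 3, 3, 4, 5, 5, 6, 8]

mean = sum(scores) / len(scores)

# Range: highest score minus lowest score
score_range = max(scores) - min(scores)

# Deviation ("error") of each score from the mean
deviations = [x - mean for x in scores]

# Raw deviations always cancel out, so their sum carries no information
sum_of_deviations = sum(deviations)

# Squaring removes the sign, so the sum grows with dispersion
sum_of_squares = sum(d ** 2 for d in deviations)

print(score_range, sum_of_deviations, sum_of_squares)
```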

Variance and Standard Deviation
We need to get an estimate of average dispersion from mean, just like the mean gives an estimate of average score.
Variance = s² = Average deviation in sample = SS / (N − 1) = 87 / (20 − 1) = 4.58
Two problems with variance:
1) Units, based on sq’d deviations, are not relatable to actual scores.
2) Variance tends to be a large, unwieldy number.
Standard Deviation = s = sq. root of variance = √4.58 = 2.14
In a normal distribution:
1 sd above and below mean = 68% of distribution
2 sd above and below mean = 95% of distribution
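The slide's arithmetic can be reproduced in a few lines. This is a sketch using the slide's SS = 87 and N = 20; the second part assumes the N = 10 word counts from the earlier slide and Python's `statistics` module, which also divides by N − 1.

```python
import math
import statistics

# Values taken from the slide
ss = 87.0
n = 20

variance = ss / (n - 1)       # s squared, about 4.58
std_dev = math.sqrt(variance)  # s, about 2.14

# Same calculation from raw scores (N = 10 data, assumed for illustration)
sample = [1, 2, 3, 3, 3, 4, 5, 5, 6, 8]
sample_variance = statistics.variance(sample)  # uses N - 1 in the denominator
sample_sd = statistics.stdev(sample)

print(round(variance, 2), round(std_dev, 2))
print(round(sample_variance, 2), round(sample_sd, 2))
```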

Z Scores and Z Distribution
DV 1: “How anxious were you during movie?” Mean = 4.23, SD = 2.71 (discrete data)
DV 2: Number of words written about movie. Mean = 28.71, SD = 11.65 (ratio data)
Issue: How do we compare anxiety with word production?
Z-score conversion: Effect is to convert different metrics into a common metric.
Z = (X − X̄) / s
Sub. 24: anxious = 3; words = 22
Z_anxious = (3 − 4.23) / 2.71 = −.45
Z_words = (22 − 28.71) / 11.65 = −.58
Z distribution is normal, mean = 0, SD = 1
SPSS: Descriptives, “Save standardized values as variables”
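The z-score conversion for Subject 24 can be sketched as follows. The function name `z_score` is hypothetical; the means, SDs, and raw scores come from the slide.

```python
def z_score(x, mean, sd):
    """Express a raw score as its distance from the mean in SD units."""
    return (x - mean) / sd

# Subject 24: anxious = 3, words = 22 (values from the slide)
z_anxious = z_score(3, mean=4.23, sd=2.71)
z_words = z_score(22, mean=28.71, sd=11.65)

print(round(z_anxious, 2), round(z_words, 2))
```

Both scores are now on the same metric, so Subject 24 can be described as slightly further below average on word production than on anxiety.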

Standard Error of the Mean
Sample mean (X̄) estimates true population mean (µ).
Many sample means from same population will vary.
Standard Error of the Mean (SE) = the average amount that sample means vary around true mean.
If sample n ≥ 30, SE can be estimated based on s (std. deviation) and sample n.
Formula for SE: SE = s / √n
Movie anxiety study: DV = reported anxiety; n = 43, s = 2.71
SE = 2.71 / √43 = 0.41
Note: SE is much smaller than SD.
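A short sketch of the SE formula using the movie anxiety numbers from the slide. The function name `standard_error` is illustrative.

```python
import math

def standard_error(sd, n):
    """Estimate the standard error of the mean from a sample SD and sample size."""
    return sd / math.sqrt(n)

# Movie anxiety study: n = 43, s = 2.71 (values from the slide)
se = standard_error(sd=2.71, n=43)

print(round(se, 2))
```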

CONFIDENCE INTERVALS
Issue: How do we know if the sample mean is a good estimate of the true mean? In other words, how do we estimate a mean’s accuracy?
Confidence Intervals (CI) estimate accuracy of sample means.
CI shows boundary values (highest & lowest) w/n which true mean is likely to occur.
Conventional boundary captures true mean 95% of time.
Calculation:
Upper boundary = X̄ + (1.96 * SE)
Lower boundary = X̄ − (1.96 * SE)
Movie anxiety study: X̄ = 4.23, SE = 0.41
Lower CI = 4.23 − (1.96 * 0.41) = 4.23 − .80 = 3.43
Upper CI = 4.23 + (1.96 * 0.41) = 4.23 + .80 = 5.03
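The 95% boundaries can be computed with a small helper. This is a sketch using the slide's rounded values (mean = 4.23, SE = 0.41); the function name `confidence_interval` is hypothetical.

```python
def confidence_interval(mean, se, z=1.96):
    """Return (lower, upper) boundaries; z = 1.96 gives the conventional 95% CI."""
    margin = z * se
    return mean - margin, mean + margin

# Movie anxiety study (values from the slide)
lower, upper = confidence_interval(mean=4.23, se=0.41)

print(round(lower, 2), round(upper, 2))
```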

GRAPHIC REPRESENTATION OF CI
Error bars overlap: means are likely from same distribution. Differences are not meaningful.
Error bars DON’T overlap: means are likely from different distributions. Differences are meaningful.

GRAPHICALLY EXPLORING DATA USING CENTRALITY AND DISPERSION
Why explore data?
1. Get a general sense or feel for your data.
2. Determine if distribution is normal, skewed, kurtotic, or multi-modal (more on these soon).
3. Identify outliers.
4. Identify possible data entry errors.

DATA BUGS ARE A HAZARD: KNOW WHAT'S IN YOUR DATA!
[Figure: data values 12, 19, 17, 14, 17, 13, 17, 15; displayed total = 147]

Normally Distributed Data Set SPSS output: Note similarity between mean, median, mode

Skewed Distribution
Positive Skew: Possible "floor effect"
Negative Skew: Possible "ceiling effect"

Kurtosis
Neuroticism Measure: Positive kurtosis, “leptokurtic”
Problems? "Normativity bias?" DV doesn't discriminate. IV wasn't impactful.
Drinks Per Week: Negative kurtosis, “platykurtic”
Problems? Distinctiveness bias? IV and/or DV too ambiguous. Population too diverse.
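Skew and kurtosis can be quantified from the central moments of a distribution. The sketch below uses the simple population (uncorrected) moment formulas, which is an assumption; SPSS reports sample-adjusted estimates, so its values will differ somewhat. The data sets and function names are hypothetical.

```python
def central_moment(scores, k):
    """Average of the k-th power of deviations from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** k for x in scores) / len(scores)

def skewness(scores):
    """Positive = long right tail; negative = long left tail."""
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5

def excess_kurtosis(scores):
    """Positive = leptokurtic (peaked); negative = platykurtic (flat)."""
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3

# Hypothetical data for illustration
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```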

Bimodality
Note: What clues in the “statistics” output suggest that the distribution may be bimodal?
Bimodality suggests 2 (or more) populations.
Multimodal: More than two modes.

Outliers

BOX AND WHISKER GRAPH
Top 25%
Upper Quartile
Median (50%)
Lower Quartile
Bottom 25%
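The quartile boundaries that define the box can be sketched with `statistics.quantiles` (Python 3.8+). The data are the N = 10 word counts from the earlier slide, assumed here for illustration; different software uses slightly different quartile conventions, so exact cut points may not match SPSS.

```python
import statistics

# N = 10 word counts from the earlier slide (assumed for illustration)
scores = [1, 2, 3, 3, 3, 4, 5, 5, 6, 8]

# n=4 splits the data into four equal-probability groups,
# returning the three cut points between them
lower_quartile, median, upper_quartile = statistics.quantiles(scores, n=4)

# The box spans the middle 50% of scores
interquartile_range = upper_quartile - lower_quartile

print(lower_quartile, median, upper_quartile, interquartile_range)
```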

BOX AND WHISKER GRAPH, AND DATA CHECKING
Detecting Skew
Detecting Outliers (figure label: subject number)

DEALING WITH OUTLIERS
1. Check raw data: Entry problem? Coding problem?
2. Remove the outlier:
   a. Must be at least 2.5 SD from the mean (some say 3 SD).
   b. Must declare deletions in publications.
   c. Try to identify reason for outlier (e.g., other anomalous responses).
3. Transform data: Convert data to a metric that reduces deviation. (More on this in next slide.)
4. Change the score to a more conservative one (Field, 2009):
   a. Next highest plus 1
   b. 2 SD or 3 SD above (or below) the mean.
   c. ISN’T THIS CHEATING? No (says Field) b/c omitting the (untransformed) score may bias the outcome. Again, report this.
5. Run more subjects!
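Steps 2a and 4a above can be sketched as follows. The scores are hypothetical, the function names are illustrative, and the replacement helper only handles outliers on the high end; it is a sketch of the idea, not a full outlier procedure.

```python
import statistics

def flag_outliers(scores, cutoff=2.5):
    """Return scores lying more than `cutoff` SDs from the mean."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [x for x in scores if abs(x - mean) / sd > cutoff]

def replace_with_next_highest_plus_one(scores, outliers):
    """Replace high outliers with the next highest (non-outlier) score plus 1."""
    kept = [x for x in scores if x not in outliers]
    replacement = max(kept) + 1
    return [replacement if x in outliers else x for x in scores]

# Hypothetical scores with one extreme value
scores = [4, 5, 5, 6, 5, 4, 6, 5, 5, 5, 4, 6, 30]

outliers = flag_outliers(scores)
adjusted = replace_with_next_highest_plus_one(scores, outliers)

print(outliers, adjusted)
```

Whichever option is used, the slide's advice still applies: check the raw data first and report any change.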

Data Transformations
1. Log Transformation (log(X)): Converting scores into log X reduces positive skew and draws in scores on the far right side of the distribution. NOTE: This only works on sets where the lowest value is greater than 0. Easy fix: add a constant to all values.
2. Square Root Transformation (√X): Sq. roots reduce large numbers more than small ones, so will pull in extreme outliers.
3. Reciprocal Transformation (1/X): Dividing 1 by each score reduces large values. BUT, remember that this effectively reverses valence, so that scores above the mean flip over to below the mean, and vice versa. Fix: First, do a preliminary transform by changing each score to highest score minus the target score. Do it all at same time
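The three transformations can be sketched with Python's `math` module. Several choices here are assumptions and not from the slide: base-10 logs, the hypothetical scores, the function names, and the extra `+ 1` in the reflected reciprocal, which is added only so that the highest score does not cause division by zero.

```python
import math

def log_transform(scores, constant=0.0):
    """Base-10 log; pass a constant if the lowest score is not above 0."""
    return [math.log10(x + constant) for x in scores]

def sqrt_transform(scores):
    """Square root shrinks large scores more than small ones."""
    return [math.sqrt(x) for x in scores]

def reciprocal_transform(scores):
    """1/X shrinks large scores but reverses their order."""
    return [1.0 / x for x in scores]

def reflected_reciprocal(scores):
    """Reflect each score (highest minus score) before taking 1/X.

    The +1 is an added assumption to avoid dividing by zero
    for the highest score.
    """
    highest = max(scores)
    return [1.0 / (highest - x + 1) for x in scores]

# Hypothetical positively skewed scores
scores = [1, 2, 3, 4, 100]

logged = log_transform(scores)
rooted = sqrt_transform(scores)
recip = reciprocal_transform(scores)
reflected = reflected_reciprocal(scores)

print(logged, rooted, recip, reflected)
```

In this example the plain reciprocal puts the scores in descending order, while the reflected version keeps the original ordering.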