DATA ANALYSIS & BASIC STATISTICS XIAO WU XIAO.WU@YALE.EDU
PURPOSE OF THIS WORKSHOP • Statistics as a useful tool to analyze results • Basic terminology and most commonly used tests • Exposure to more advanced statistical tools
WHY DO WE NEED STATISTICS? • Summary • Classification • Interpretation • Pattern searching • Abnormality identification • Prediction • Interpolation • Extrapolation
SUMMARY • Mean, median, mode • Variance, standard deviation • Max, min values and range • Quartiles http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
EXAMPLE Firm A • Mean: $5,800 • Median: $4,000 • SD: $7,270 • 3rd Quartile: $4,000 • 1st Quartile: $500 Firm B • Mean: $5,000 • Median: $5,000 • SD: $203 • 3rd Quartile: $5,175 • 1st Quartile: $4,825
EXAMPLE
#     Firm A salary ($)    Firm B salary ($)
1     20000                4650
2     4000                 4700
3     4000                 4750
4     500                  4800
5     500                  4850
6                          4900
7                          4950
8                          5000
9                          5050
10                         5100
11                         5150
12                         5200
13                         5250
14                         5300
15                         5350
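The summary statistics for the two firms can be recomputed from the salary table. The sketch below assumes the first five salaries belong to Firm A and the fifteen evenly spaced salaries to Firm B (this matches the stated means), uses the population standard deviation, and takes quartiles by linear interpolation between sorted values. The helper names `quartile` and `summarize` are illustrative. Under these assumptions Firm A reproduces the SD of about $7,270; for Firm B the same formula gives roughly $216 rather than $203, so that figure may come from a different calculation.

```python
import statistics

firm_a = [20000, 4000, 4000, 500, 500]
firm_b = list(range(4650, 5400, 50))  # 4650, 4700, ..., 5350 (15 salaries)


def quartile(values, q):
    """Quantile q (0..1) by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return ordered[lower]
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])


def summarize(values):
    """Collect the descriptive statistics listed on the SUMMARY slide."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),  # population SD (assumption)
        "q1": quartile(values, 0.25),
        "q3": quartile(values, 0.75),
        "range": max(values) - min(values),
    }


for name, salaries in [("Firm A", firm_a), ("Firm B", firm_b)]:
    print(name, summarize(salaries))
```

The two firms have similar means, but the median, SD, and quartiles show that Firm A's mean is driven by a single very high salary.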
CLASSIFICATION Identification of variables • Independent vs. dependent • Numeric vs. categorical • Categorical: nominal or ordinal • Numeric: continuous or discrete
PATTERN SEARCHING • Distribution of data • Some commonly used distributions • Uniform • Binomial • Poisson • … • Central limit theorem http://www.mathwave.com/img/art/graphs_pdf2.gif
UNIFORM • Every outcome has an equal chance • Example: • Flipping a coin • Rolling a die • What if you need to flip multiple times?
BINOMIAL • Two outcomes, probability p and 1 - p • Multiple trials: n • Example: • Flipping a coin 100 times • Germination of multiple seeds https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
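For the coin-flipping example, the binomial probabilities can be computed directly from the formula P(k) = C(n, k) p^k (1 - p)^(n - k). This is a minimal sketch; the function name `binomial_pmf` and the choice of a fair coin (p = 0.5) are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.5  # 100 flips of a fair coin (assumed p)

exactly_50 = binomial_pmf(50, n, p)
at_least_60 = sum(binomial_pmf(k, n, p) for k in range(60, n + 1))

print(f"P(exactly 50 heads)  = {exactly_50:.4f}")
print(f"P(at least 60 heads) = {at_least_60:.4f}")
```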
POISSON • Counts of rare, independent events • Each with probability, or average rate p • Example: radioactive decay http://kaffee.50webs.com/Science/images/alpha_decay.gif
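A short sketch of the Poisson probabilities for event counts, using the formula P(k) = rate^k e^(-rate) / k!. The average rate of 3 decays per interval is a made-up value for illustration, and `poisson_pmf` is an illustrative name; the slide calls the rate p.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average
    number of events per interval is `rate`."""
    return rate**k * exp(-rate) / factorial(k)


rate = 3.0  # hypothetical average number of decays per interval

for k in range(7):
    print(f"P({k} decays) = {poisson_pmf(k, rate):.4f}")
```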
THE MOST IMPORTANT DISTRIBUTION
NORMAL DISTRIBUTION • Central limit theorem • The mean of many independent samples from almost any distribution (with finite variance) is approximately normally distributed • Large sample size → normal distribution Parameters: • mean • standard deviation https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
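The central limit theorem can be checked with a small simulation. The sketch below averages 30 rolls of a fair die (a uniform, clearly non-normal distribution) and repeats this 5000 times; the sample size, repetition count, and seed are arbitrary choices. In theory the averages should center on 3.5 with a standard deviation near sqrt(35/12)/sqrt(30), about 0.31, and roughly 68% of them should fall within one standard deviation of the center, as for a normal distribution.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def mean_of_rolls(n_rolls):
    """Average of n_rolls fair six-sided die rolls (uniform on 1..6)."""
    return statistics.mean([random.randint(1, 6) for _ in range(n_rolls)])


n_rolls = 30
sample_means = [mean_of_rolls(n_rolls) for _ in range(5000)]

center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)

print(f"mean of sample means: {center:.3f}")
print(f"SD of sample means:   {spread:.3f}")
print(f"share within 1 SD:    {within_one_sd:.2f}")
```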
PATTERN SEARCHING Hypothesis testing • Difference between two populations • Z-test or t-test? • What does p-value mean? • Family-wise error – Bonferroni correction • More than two possibilities • Chi square test • Fisher’s exact test • More than two variables • ANOVA
EXAMPLE 1 SAT score is related to gender • Null hypothesis • Alternative hypothesis (3 possibilities) • One or two tail? • Z or T test? • p = 0.07, conclusion?
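The three possible alternative hypotheses (not equal, greater, less) lead to different p-values for the same test statistic. The sketch below assumes a z-test and a hypothetical statistic z = 1.81, chosen only because its two-tailed p-value is close to the 0.07 in the example; no real SAT data is used, and `tail_p_values` is an illustrative name.

```python
from statistics import NormalDist


def tail_p_values(z):
    """p-values of a z statistic under the three alternative hypotheses."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return {
        "two-tailed (means differ)": 2 * (1 - std_normal.cdf(abs(z))),
        "one-tailed (group 1 higher)": 1 - std_normal.cdf(z),
        "one-tailed (group 1 lower)": std_normal.cdf(z),
    }


z = 1.81      # hypothetical test statistic
alpha = 0.05  # conventional significance level

results = tail_p_values(z)
for alternative, p in results.items():
    decision = "reject" if p < alpha else "fail to reject"
    print(f"{alternative}: p = {p:.3f} -> {decision} the null hypothesis")
```

With a two-tailed test, p near 0.07 does not reject the null hypothesis at the 0.05 level, while the same statistic would be significant under a one-tailed test in the matching direction. This is why the choice of tails has to be made before looking at the data.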
EXAMPLE 2 Predictors of stroke • Age • Hypertension • Gender • …
EXAMPLE 3 Genome-wide association studies • Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
PATTERN SEARCHING Hypothesis testing • One variable • Z-test or t-test? • What does p-value mean? • Family-wise error – Bonferroni correction • Compare two categorical variables • Chi square test • Fisher’s exact test • More than two variables • ANOVA
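A minimal sketch of the Bonferroni correction for family-wise error: with m tests, each p-value is compared against alpha / m instead of alpha. The five p-values are invented for illustration, and the uncorrected error rate assumes the tests are independent.

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold and which tests remain significant."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


p_values = [0.001, 0.012, 0.03, 0.04, 0.2]  # hypothetical results of 5 tests

threshold, significant = bonferroni(p_values)
print(f"uncorrected family-wise error: {family_wise_error(0.05, len(p_values)):.3f}")
print(f"per-test threshold after correction: {threshold:.3f}")
print("significant after correction:", significant)
```

In a genome-wide association study the number of tests is in the hundreds of thousands or more, so the corrected per-test threshold becomes very small.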
CHI SQUARE Punnett Square • A cross between two pea plants yields 880 plants, 639 green, 241 yellow • Hypothesis: The green allele is dominant and both parents are heterozygous. http://www2.lv.psu.edu/jxm57/irp/chisquar.html
CHI SQUARE
      G             g
G     GG (green)    Gg (green)
g     Gg (green)    gg (yellow)
• 75% green • 25% yellow
CHI SQUARE
                           Green    Yellow
Observed (o)               639      241
Expected (e)               660      220
Deviation (d = o - e)      -21      21
Deviation squared (d^2)    441      441
d^2/e                      0.668    2.005
Sum (chi-square)           2.673
Degrees of freedom: number of categories - 1 = 1
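The chi-square calculation for the pea-plant cross can be reproduced in a few lines. This sketch uses only the observed counts and the 3:1 expected ratio from the slides; the p-value uses the closed form erfc(sqrt(x / 2)), which holds only for 1 degree of freedom, and the 0.05 critical value of 3.841 is the standard table value for that case.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
expected_ratio = {"green": 0.75, "yellow": 0.25}  # 3:1 from the Punnett square

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

# Sum of (observed - expected)^2 / expected over all categories
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color] for color in observed
)
degrees_of_freedom = len(observed) - 1

# Survival function of the chi-square distribution, valid for 1 degree of freedom only
p_value = erfc(sqrt(chi_square / 2))

CRITICAL_VALUE_05 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

print(f"expected counts: {expected}")
print(f"chi-square = {chi_square:.3f}, df = {degrees_of_freedom}, p = {p_value:.3f}")
print("reject hypothesis" if chi_square > CRITICAL_VALUE_05 else "fail to reject hypothesis")
```

The statistic of about 2.67 is below 3.841, so the observed counts are consistent with the hypothesis that green is dominant and both parents are heterozygous.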
PREDICTION • Regression • Linear regression • Multiple linear regression • Accuracy vs. simplicity • Validation • leave-k-out http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAC08/DcHCtLWXvU/s1600/actn+1.png
EXAMPLE • Use brain structural measurements to predict a subject’s performance on picture vocabulary test • 144 total structural measurements • 521 subjects • First step: eliminate unnecessary variables • All zeros? • Highly correlated pairs • Variables that do not correlate well with performance score
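The three screening steps above can be sketched on synthetic data. Everything here is an assumption for illustration: the variable names, the generated values, and the cutoffs (0.95 for "highly correlated", 0.3 for "does not correlate well") are not from the study. The first step drops any constant column, which covers the all-zeros case.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_variables(features, score, pair_cutoff=0.95, score_cutoff=0.3):
    """Return the names of variables that survive all three screening steps."""
    # Step 1: drop constant columns (this includes all-zero columns)
    kept = [name for name, values in features.items() if len(set(values)) > 1]
    # Step 2: within each highly correlated pair, keep only the first variable
    unique = []
    for name in kept:
        if all(abs(pearson(features[name], features[u])) < pair_cutoff for u in unique):
            unique.append(name)
    # Step 3: drop variables that barely correlate with the performance score
    return [name for name in unique if abs(pearson(features[name], score)) >= score_cutoff]


random.seed(1)
n_subjects = 200
thickness = [random.gauss(0, 1) for _ in range(n_subjects)]
features = {
    "thickness": thickness,
    "thickness_copy": [2 * v + 0.01 * random.gauss(0, 1) for v in thickness],
    "empty": [0.0] * n_subjects,
    "noise": [random.gauss(0, 1) for _ in range(n_subjects)],
}
score = [3 * v + random.gauss(0, 1) for v in thickness]

selected = screen_variables(features, score)
print("variables kept:", selected)
```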
EXAMPLE • Run regression • Validation: leave 1 out and leave 10 out • Principal component analysis • …
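A minimal sketch of regression with leave-k-out validation, using a single predictor and synthetic data for brevity (the study itself has many predictors, which would call for multiple regression). Here "leave k out" is implemented by holding out consecutive blocks of k subjects, fitting on the rest, and averaging the squared prediction errors on the held-out points; that block scheme and the function names are assumptions.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def leave_k_out_mse(x, y, k):
    """Mean squared prediction error when blocks of k points are held out in turn."""
    errors = []
    for start in range(0, len(x), k):
        held_out = range(start, min(start + k, len(x)))
        train_x = [v for i, v in enumerate(x) if i not in held_out]
        train_y = [v for i, v in enumerate(y) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        errors.extend((y[i] - (slope * x[i] + intercept)) ** 2 for i in held_out)
    return sum(errors) / len(errors)


random.seed(2)
x = [random.uniform(0, 10) for _ in range(100)]     # synthetic predictor
y = [2.0 * v + 1.0 + random.gauss(0, 1) for v in x]  # synthetic score, noise SD 1

slope, intercept = fit_line(x, y)
mse_leave_1 = leave_k_out_mse(x, y, 1)
mse_leave_10 = leave_k_out_mse(x, y, 10)

print(f"fit on all data: y = {slope:.2f} * x + {intercept:.2f}")
print(f"leave-1-out MSE:  {mse_leave_1:.3f}")
print(f"leave-10-out MSE: {mse_leave_10:.3f}")
```

Because the errors are measured on points the model never saw, they give a more honest estimate of predictive accuracy than the fit on the full data.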
PREDICTION More complicated models: • Bayesian approach • Use prior knowledge to update prediction • Diffusion weights • Use local structure to predict neighboring values
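One simple way to see how prior knowledge updates a prediction is a Beta prior on a success probability, updated with observed successes and failures. The slide does not name a specific model, so the Beta-binomial setup, the prior Beta(2, 2), and the observed counts below are all assumptions chosen for illustration.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution: the predicted success probability."""
    return a / (a + b)


prior = (2, 2)              # hypothetical prior: weak belief that the rate is near 0.5
successes, failures = 8, 2  # hypothetical new observations

posterior = update_beta(*prior, successes, failures)

print(f"prior prediction:     {beta_mean(*prior):.3f}")
print(f"data alone:           {successes / (successes + failures):.3f}")
print(f"posterior prediction: {beta_mean(*posterior):.3f}")
```

The posterior prediction falls between the prior belief and the raw observed rate, and moves closer to the data as more observations accumulate.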
STATISTICAL TOOLS • Excel • MATLAB • R • Minitab • …
QUESTIONS?
MY OWN RESEARCH • Cost-effectiveness analysis • Mathematical modeling in medicine • Simulate iterations rather than actual patients
RECENT RESULTS
RESULTS
GROUP EXERCISE