Dr. Héctor Allende: Review of Probability and Statistics

A Review of Probability and Statistics
• Descriptive statistics
• Probability
• Random variables
• Sampling distributions
• Estimation and confidence intervals
• Tests of Hypotheses
  – For means, variances, and proportions
  – Goodness of fit

Key Concepts
• Population -- "parameters"
  – Finite
  – Infinite
• Sample -- "statistics"
• Random samples: your MOST important decision!

Data
• Deterministic vs. Probabilistic (Stochastic)
• Discrete or Continuous:
  – Whether a variable is continuous (measured) or discrete (counted) is a property of the data, not of the measuring device: weight is a continuous variable, even if your scale can only measure values to the nearest pound.
• Data description:
  – Category frequency
  – Category relative frequency

Data Types
• Qualitative (Categorical)
  – Nominal -- IE = 1; EE = 2; CE = 3
  – Ordinal -- poor = 1; fair = 2; good = 3; excellent = 4
• Quantitative (Numerical)
  – Interval -- temperature, viscosity
  – Ratio -- weight, height
The type of statistics you can calculate depends on the data type. Average and variance make no sense if the data are categorical (proportions do); a median requires at least ordinal data.

Data Presentation for Qualitative Data
• Rules:
  – Each observation MUST fall in one and only one category.
  – All observations must be accounted for.
• Table -- Provides greater detail
• Bar graphs -- Consider Pareto presentation!
• Pie charts (do not need to be round)

Data Presentation for Quantitative Data
• Consider a Stem-and-Leaf Display
• Use 5 to 20 classes (intervals, groups).
  – Cell width, boundaries, limits, and midpoint
• Histograms
  – Discrete
  – Continuous (frequency polygon: plot at class mark)
• Cumulative frequency distribution (Ogive: plot at upper boundary)

• Measures of Central Tendency
  – Arithmetic Mean
  – Median
  – Mode
  – Weighted mean
• Measures of Variation
  – Range
  – Variance
  – Standard Deviation
• Coefficient of Variation
• The Empirical Rule

Arithmetic Mean and Variance -- Raw Data (slide 8)
• Mean
• Variance

Arithmetic Mean and Variance -- Grouped Data (slide 9)
• Mean
• Variance
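
The formulas on the two "Arithmetic Mean and Variance" slides did not survive in this transcript, so the following is only a minimal sketch of the usual textbook versions. It assumes the sample variance uses the n - 1 divisor and that grouped data are summarized by class midpoints and frequencies; all data values and variable names are hypothetical.

```python
import statistics

# Hypothetical raw observations (illustrative only).
raw = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

n = len(raw)
mean_raw = sum(raw) / n
# Sample variance with the n - 1 divisor (assumed convention).
var_raw = sum((y - mean_raw) ** 2 for y in raw) / (n - 1)

# The standard library gives the same result for raw data.
var_check = statistics.variance(raw)

# Grouped data: hypothetical class midpoints and class frequencies.
midpoints = [5.0, 15.0, 25.0]
freqs = [2, 5, 3]

n_g = sum(freqs)
# Each midpoint stands in for every observation in its class.
mean_grouped = sum(f * m for f, m in zip(freqs, midpoints)) / n_g
var_grouped = sum(
    f * (m - mean_grouped) ** 2 for f, m in zip(freqs, midpoints)
) / (n_g - 1)

print(mean_raw, var_raw, mean_grouped, var_grouped)
```

The grouped results are approximations, because every observation in a class is treated as if it sat at the class midpoint.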

Percentiles and Box-Plots
• 100p-th percentile: value such that 100p% of the area under the relative frequency distribution lies below it.
  – Q1: lower quartile (25th percentile)
  – Q3: upper quartile (75th percentile)
• Box-Plots: the box is bounded by the lower and upper quartiles
  – Whiskers mark the lowest and highest values within 1.5*IQR of Q1 or Q3
  – Outliers: beyond 1.5*IQR from Q1 or Q3 (mark with *)
  – z-scores: deviation from the mean in units of standard deviation. Outlier: absolute value of z-score > 3
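
As a rough illustration of the box-plot and z-score rules above, the sketch below computes quartiles, the 1.5*IQR fences, the whisker ends, and z-scores for a hypothetical sample. Quartile conventions differ between packages; this uses the default ("exclusive") method of Python's `statistics.quantiles`, which is an assumption rather than the slide's own definition.

```python
import statistics

# Hypothetical sample (illustrative only); 30 is a suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles (default "exclusive" method; other software may differ slightly).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = [y for y in data if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
iqr_outliers = [y for y in data if y < lower_fence or y > upper_fence]

# z-scores: deviation from the mean in units of the sample standard deviation.
mean = statistics.mean(data)
sd = statistics.stdev(data)
z_scores = [(y - mean) / sd for y in data]
z_outliers = [y for y, z in zip(data, z_scores) if abs(z) > 3]

print(q1, q3, iqr_outliers, z_outliers)
```

In this hypothetical sample the 1.5*IQR rule flags the value 30, while its z-score stays below 3, so the two outlier rules need not agree.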

Probability: Basic Concepts
• Experiment: A process of OBSERVATION
• Simple event: An OUTCOME of an experiment that cannot be decomposed
  – “Mutually exclusive”
  – “Equally likely”
• Sample Space: The set of all possible outcomes
• Event “A”: The set of all possible simple events that result in the outcome “A”

Probability
• A measure of uncertainty of an estimate
  – The reliability of an inference
• Theoretical approach: “A Priori”
  – Pr(Ai) = n/N
    • n = number of possible ways “Ai” can be observed
    • N = total number of possible outcomes
• Historical (empirical) approach: “A Posteriori”
  – Pr(Ai) = n/N
    • n = number of times “Ai” was observed
    • N = total number of observations
• Subjective approach
  – An “Expert Opinion”

Probability Rules
• Multiplication Rule:
  – Number of ways to draw one element from set 1, which contains n_1 elements, then an element from set 2, ..., and finally an element from set k (ORDER IS IMPORTANT!): n_1 * n_2 * ... * n_k

Permutations and Combinations
• Permutations:
  – Number of ways to draw r out of n elements WHEN ORDER IS IMPORTANT: n! / (n - r)!
• Combinations:
  – Number of ways to select r out of n items when order is NOT important: n! / (r! (n - r)!)
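
The counting rules on the last two slides can be checked numerically. This is a minimal sketch using Python's standard `math` module (Python 3.8 or later); the set sizes and the values of n and r are hypothetical.

```python
import math

# Multiplication rule: one element from each of k sets, order matters.
set_sizes = [3, 4, 2]          # hypothetical n_1, n_2, n_3
ways_multiplication = math.prod(set_sizes)

# Permutations: ordered draws of r out of n elements, n! / (n - r)!.
n, r = 5, 2
ways_permutations = math.perm(n, r)

# Combinations: unordered selections of r out of n, n! / (r! (n - r)!).
ways_combinations = math.comb(n, r)

# Each unordered selection corresponds to r! orderings.
assert ways_permutations == ways_combinations * math.factorial(r)

print(ways_multiplication, ways_permutations, ways_combinations)
```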

Compound Events

Conditional Probability

Other Probability Rules
• Mutually Exclusive Events: P(A ∩ B) = 0
• Independence:
  – A and B are said to be statistically INDEPENDENT if and only if: P(A ∩ B) = P(A) P(B)

Bayes’ Rule
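
The formula on the Bayes' Rule slide is missing from this transcript. As a hedged illustration of the standard rule, P(A_i | B) = P(B | A_i) P(A_i) / sum over j of P(B | A_j) P(A_j), the sketch below works a small example with exact fractions. The two-machine scenario and every probability in it are hypothetical.

```python
from fractions import Fraction

# Hypothetical priors: share of output produced by each machine.
prior = {"machine_1": Fraction(3, 5), "machine_2": Fraction(2, 5)}

# Hypothetical conditional probabilities of a defect given the machine.
p_defect_given = {"machine_1": Fraction(1, 50), "machine_2": Fraction(1, 20)}

# Total probability of a defect (denominator of Bayes' rule).
p_defect = sum(prior[m] * p_defect_given[m] for m in prior)

# Posterior probability of each machine given that a defect was observed.
posterior = {m: prior[m] * p_defect_given[m] / p_defect for m in prior}

print(p_defect, posterior)
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to confirm that the posteriors sum to one.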

Random Variables
• Random variable: A function that maps every possible outcome of an experiment into a numerical value.
• Discrete random variable: The function can assume a finite (or countably infinite) number of values.
• Continuous random variable: The function can assume any value between two limits.

Probability Distribution for a Discrete Random Variable
• A function that assigns a probability p(y) to each possible value y of the random variable.

Poisson Process
• Events occur over time (or in a given area, volume, weight, distance, ...)
• The probability of observing an event in a given unit of time is constant
• We can define a unit of time small enough that two or more events cannot be observed simultaneously.
• Tables usually give CUMULATIVE values!

The Poisson Distribution

Poisson Approximation to the Binomial
• In a binomial situation where n is very large (n > 25) and p is very small (p < 0.30, and np < 15), we can approximate b(x, n, p) by a Poisson distribution with parameter lambda = np
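
To get a feel for how close the approximation is, the sketch below compares the binomial and Poisson probabilities for one hypothetical case (n = 100, p = 0.02, so lambda = np = 2). It also sums the Poisson probabilities to obtain a cumulative value, the quantity that printed tables usually report. The function names are illustrative, not taken from the slides.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Hypothetical case that satisfies the rule of thumb on the slide.
n, p = 100, 0.02
lam = n * p

# Largest pointwise gap between the two distributions for x = 0..10.
max_diff = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11)
)

# Cumulative Poisson probability P(X <= 3), as a table would list it.
poisson_cdf_3 = sum(poisson_pmf(x, lam) for x in range(4))

print(max_diff, poisson_cdf_3)
```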

Probability Distribution for a Continuous Random Variable
• F(y_0) is the cumulative distribution function: it gives the probability of observing a value less than or equal to y_0

Probability Calculations

Expectations
Properties of Expectations

The Uniform Distribution
A frequently used model when no data are available.

The Triangular Distribution
A good model to use when no data are available. Just ask an expert to estimate the minimum, maximum, and most likely values.

The Normal Distribution

The Lognormal Distribution
Consider this model when 80% of the data values lie in the first 20% of the variable’s range.

The Gamma Distribution

The Erlang Distribution
A special case of the Gamma Distribution when the shape parameter is a positive integer k. It arises in a Poisson process where we are interested in the time to observe k events.

The Exponential Distribution
A special case of the Gamma Distribution when the shape parameter equals 1.

The Weibull Distribution
A good model for failure time distributions of manufactured items. It has a closed-form expression for F(y).
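
Because F(y) has a closed form, it can be inverted to generate Weibull failure times from uniform random numbers. The sketch below assumes the common shape/scale parametrization F(y) = 1 - exp(-(y/beta)^alpha); the slide's own parametrization is not visible in this transcript and may differ. Parameter values and function names are hypothetical.

```python
import math
import random

def weibull_cdf(y, alpha, beta):
    """F(y) = 1 - exp(-(y / beta) ** alpha), assumed shape alpha, scale beta."""
    if y <= 0:
        return 0.0
    return 1.0 - math.exp(-((y / beta) ** alpha))

def weibull_inverse_cdf(u, alpha, beta):
    """Solve F(y) = u for y, valid for 0 <= u < 1."""
    return beta * (-math.log(1.0 - u)) ** (1.0 / alpha)

# Inverse-transform sampling with hypothetical parameters.
rng = random.Random(42)
alpha, beta = 1.5, 100.0
sample = [weibull_inverse_cdf(rng.random(), alpha, beta) for _ in range(5)]

print(sample)
```

Under this parametrization, F(beta) = 1 - exp(-1), about 0.632, whatever the shape: roughly 63% of items fail by the scale value.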

The Beta Distribution
A good model for proportions. You can fit almost any data. However, the data set MUST be bounded!

Bivariate Data (Pairs of Random Variables)
• Covariance: measures strength of linear relationship
• Correlation: a standardized version of the covariance
• Autocorrelation: for a single time series, the relationship between an observation and those immediately preceding it. Does the current value (X_t) relate to itself lagged one period (X_{t-1})?
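
The three measures above can be sketched in a few lines. The version below assumes the sample covariance uses the n - 1 divisor and uses one common estimator of the lag-1 autocorrelation (lagged cross-products divided by the total sum of squares); other textbooks use slightly different divisors. The data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance with the n - 1 divisor (assumed convention)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def lag1_autocorrelation(series):
    """Relates X_t to X_{t-1} within a single series."""
    m = mean(series)
    num = sum(
        (series[t] - m) * (series[t - 1] - m) for t in range(1, len(series))
    )
    den = sum((v - m) ** 2 for v in series)
    return num / den

# Hypothetical paired data with an exact linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
r1 = lag1_autocorrelation(x)

print(cov_xy, corr_xy, r1)
```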

Sampling Distributions
See the formulas on slides 8 and 9 to calculate sample means and variances (raw data and grouped data, respectively).

The Sampling Distribution of the Mean (Central Limit Theorem)

The Sampling Distribution of Sums

Distributions Related to Variances

The t Distribution

Estimation
• Point and Interval Estimators
• Properties of Point Estimators
  – Unbiased: E(estimator) = estimated parameter
    Note: S^2 is unbiased if the divisor is n - 1
  – MVUE: Minimum Variance Unbiased Estimators
• Most frequently used method to estimate parameters: MLE, Maximum Likelihood Estimators.

Interval Estimators -- Large sample CI for mean
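
The interval formula on this slide is not present in the transcript. A minimal sketch of the usual large-sample interval, ybar ± z_{alpha/2} * s / sqrt(n), is shown below; it assumes n is large enough for the normal approximation and that s can stand in for the population standard deviation. The summary statistics and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def large_sample_ci(ybar, s, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean."""
    alpha = 1.0 - confidence
    # Upper alpha/2 point of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * s / math.sqrt(n)
    return ybar - half_width, ybar + half_width

# Hypothetical summary statistics.
lower, upper = large_sample_ci(ybar=50.0, s=6.0, n=36)

print(lower, upper)
```

For small samples the normal quantile would be replaced by a t quantile with n - 1 degrees of freedom, which the Python standard library does not provide.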

Interval Estimators -- Small sample CI for mean

Sample Size

CI for proportions (large samples)

Sample Size (proportions)
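
The sample-size formulas on the two "Sample Size" slides are missing from the transcript. The sketch below uses the standard large-sample results, n = (z * sigma / E)^2 for a mean and n = z^2 * p * (1 - p) / E^2 for a proportion, where E is the desired half-width of the interval. Taking p = 0.5 when nothing is known about the proportion is a conservative assumption; the function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Upper alpha/2 point of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

def sample_size_mean(sigma, error, confidence=0.95):
    """Smallest n giving a half-width of at most `error` for a mean."""
    z = z_value(confidence)
    return math.ceil((z * sigma / error) ** 2)

def sample_size_proportion(error, p=0.5, confidence=0.95):
    """Smallest n giving a half-width of at most `error` for a proportion."""
    z = z_value(confidence)
    return math.ceil(z**2 * p * (1.0 - p) / error**2)

# Hypothetical planning values.
n_mean = sample_size_mean(sigma=6.0, error=1.0)
n_prop = sample_size_proportion(error=0.03)

print(n_mean, n_prop)
```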

CI for the variance

CI for the Difference of Two Means -- large samples --

CI for (p1 - p2) -- large samples --

CI for the Difference of Two Means -- small samples, same variance --

CI for the Difference of Two Means -- small samples, different variances --

CI for the Difference of Two Means -- matched pairs --

CI for two variances

Prediction Intervals

Hypothesis Testing
• Elements of a Statistical Test. Focus on decisions made when comparing the observed sample to a claim (hypotheses). How do we decide whether the sample disagrees with the hypothesis?
• Null Hypothesis, H_0: A claim about one or more population parameters. What we want to REJECT.
• Alternative Hypothesis, H_a: What we test against. Provides criteria for rejection of H_0.
• Test Statistic: computed from sample data.
• Rejection (Critical) Region: indicates values of the test statistic for which we will reject H_0.

Errors in Decision Making

                   True State of Nature
Decision           H_0: Dishonest client     H_a: Honest client
Do not lend        Correct decision          Type II error
Lend               Type I error              Correct decision

Statistical Errors

Statistical Tests

The Critical Value

The observed significance level for a test

Testing proportions (large samples)
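
The test statistic for this slide is not in the transcript. As a hedged sketch of the usual large-sample test of H_0: p = p_0, the code below computes z = (p_hat - p_0) / sqrt(p_0 (1 - p_0) / n) and the two-sided observed significance level (p-value). The counts, the function name, and the choice of a two-sided alternative are assumptions.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Large-sample z test of H0: p = p0 against a two-sided alternative."""
    p_hat = successes / n
    # The standard error is computed under the null hypothesis.
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
reject_at_5_percent = p_value < 0.05

print(z, p_value, reject_at_5_percent)
```

Comparing the p-value with alpha is equivalent to checking whether z falls in the rejection region beyond the critical value.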

Testing a Normal Mean

Testing a variance

Testing Differences of Two Means -- large samples --

Testing Differences of Two Means -- small samples, same variance --

Testing Differences of Two Means -- small samples, different variances --

Testing Difference of Two Means -- matched pairs --

Testing a ratio of two variances

Testing (p1 - p2) -- large samples --

Categorical Data

One-way Tables (Cont.)

Categorical Data Analysis

Example of a Contingency Table

Testing for Independence

Distributions: Model Fitting Steps
1 Collect data. Make sure you have a random sample. You will need at least 30 valid cases
2 Plot data. Look for familiar patterns
3 Hypothesize several models for the distribution
4 Using part of the data, estimate model parameters
5 Using the rest of the data, analyze the model’s accuracy
6 Select the “best” model and implement it
7 Keep track of model accuracy over time. If warranted, go back to 6 (or to 3, if data (population?) behavior keeps changing)

Chi-Square Test of Goodness of Fit
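
The slide body is missing here, so this is only a sketch of the standard statistic, the sum over categories of (observed - expected)^2 / expected, applied to a hypothetical fair-die example. The critical value 11.070 (5 degrees of freedom, alpha = 0.05) is copied from a standard chi-square table because the Python standard library has no chi-square quantile function.

```python
# Hypothetical observed counts for the six faces of a die (60 rolls).
observed = [8, 12, 9, 11, 10, 10]

# Expected counts under H0: all faces equally likely.
n = sum(observed)
expected = [n / len(observed)] * len(observed)

# Chi-square goodness-of-fit statistic.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: categories - 1 (no parameters estimated from the data).
df = len(observed) - 1

# Tabled critical value for df = 5, alpha = 0.05 (assumed from a table).
critical_value = 11.070
reject_h0 = chi_square > critical_value

print(chi_square, df, reject_h0)
```

When model parameters are estimated from the same data, one degree of freedom is usually subtracted for each estimated parameter.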

Kolmogorov-Smirnov Test of Goodness of Fit
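
As with the chi-square slide, only the title survives, so the following is a hedged sketch of the usual one-sample statistic: D is the largest vertical distance between the empirical distribution function and the hypothesized F(y), checked just before and at each ordered observation. The data and the hypothesized Uniform(0, 1) model are hypothetical, and the critical value would still have to be read from a K-S table.

```python
def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a hypothesized CDF."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, y in enumerate(ordered, start=1):
        f = cdf(y)
        # The empirical CDF jumps from (i - 1)/n to i/n at y.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def uniform_cdf(y):
    """Hypothesized model: Uniform(0, 1)."""
    return min(max(y, 0.0), 1.0)

# Hypothetical observations.
data = [0.7, 0.1, 0.4]
d_stat = ks_statistic(data, uniform_cdf)

print(d_stat)
```

Unlike the chi-square test, this statistic does not require grouping the data into classes.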
