Lecture 10. Random Sampling and Sampling Distributions
David R. Merrell
90-786 Intermediate Empirical Methods for Public Policy and Management
Agenda
- Normal Approximation to Binomial
- Poisson Process
- Random sampling
- Sampling statistics and sampling distributions
- Expected values and standard errors of sample sums and sample means
Binomial Random Variable
Binomial random variable X is the number of “successes” in n trials, where
- Probability of success remains the same from trial to trial
- Trials are independent
Binomial Probability Distribution
Discrete distribution with:
- $P(X = x) = \dfrac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$
- n is the number of trials
- x is the number of successes in n trials (x = 0, 1, 2, ..., n)
- p is the probability of success on a single trial
- q is the probability of failure on a single trial (q = 1 - p)
Properties of the Binomial RV
- Mean: $\mu = np$
- Variance: $\sigma^2 = npq$
- Standard Deviation: $\sigma = \sqrt{npq}$
Binomial(n = 10, p = 0.4)
   x    P(X = x)
   0    0.006047
   1    0.040311
   2    0.120932
   3    0.214991
   4    0.250823
   5    0.200658
   6    0.111477
   7    0.042467
   8    0.010617
   9    0.001573
  10    0.000105
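The Binomial(n = 10, p = 0.4) probabilities can be reproduced directly from the formula P(X = x) = (n!/(x!(n-x)!)) p^x q^(n-x). The following is a minimal sketch in Python; the function name `binomial_pmf` is a label chosen for this illustration and is not part of the lecture.

```python
from math import comb, sqrt


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Return P(X = x) for n independent trials with success probability p."""
    q = 1.0 - p
    return comb(n, x) * p**x * q ** (n - x)


n, p = 10, 0.4

# Build the probability table for x = 0, 1, ..., n.
table = [(x, binomial_pmf(x, n, p)) for x in range(n + 1)]
for x, prob in table:
    print(f"{x:2d}  {prob:.6f}")

# Mean, variance, and standard deviation of the binomial random variable.
mean = n * p
variance = n * p * (1 - p)
std_dev = sqrt(variance)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.4f}")
```

The printed probabilities should agree with the tabulated values to six decimal places, and they sum to 1 because x = 0, ..., n covers every possible outcome.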
Approximation to Binomial Distribution
- Use normal distribution when:
  - n is large
  - np > 10
  - n(1 - p) > 10
- Parameters of the approximating normal distribution are the mean and standard deviation from the binomial distribution
Approximation of Binomial Distribution: n = 80, p = 0.4
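How Good is the Approximation? P(X < 29)
Binomial with n = 80 and p = 0.4
       x    P(X <= x)
   28.00    0.2131
Normal with mean = 32.00 and standard deviation = 4.38
       x    P(X <= x)
   28.00    0.1806
   28.50    0.2121
For the integer-valued binomial, P(X < 29) = P(X <= 28). Evaluating the normal curve at 28.5 rather than 28 (the continuity correction) brings the approximation much closer to the exact binomial value.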
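The comparison on this slide can be checked without Minitab. The sketch below, in Python, computes the exact binomial probability and the two normal approximations (evaluated at 28 and at 28.5). The helper names `binomial_cdf` and `normal_cdf` are illustrative choices, and the standard deviation is computed as sqrt(npq) rather than the rounded 4.38, so the last digit may differ slightly from the slide.

```python
from math import comb, erf, sqrt


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) by summing the binomial probabilities for 0..k."""
    q = 1.0 - p
    return sum(comb(n, x) * p**x * q ** (n - x) for x in range(k + 1))


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Return P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


n, p = 80, 0.4
mean = n * p                  # 32
sd = sqrt(n * p * (1 - p))    # about 4.38

exact = binomial_cdf(28, n, p)              # P(X < 29) = P(X <= 28)
approx_plain = normal_cdf(28.0, mean, sd)   # no continuity correction
approx_corrected = normal_cdf(28.5, mean, sd)  # with continuity correction

print(f"exact binomial      : {exact:.4f}")
print(f"normal at 28        : {approx_plain:.4f}")
print(f"normal at 28.5      : {approx_corrected:.4f}")
```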
Application 1
The Chicago Equal Employment Commission believes that the Chicago Transit Authority (CTA) discriminates against Republicans. The records show that 37.5% of the individuals listed as passing the CTA exam were Republicans; the remainder were Democrats (no one registers as an independent in Illinois). CTA hired 30 people last year, of whom 25 were Democrats. What is the probability that this situation could exist if CTA did not discriminate?
Application 1 (cont.)
- Success: a Republican is hired
- The probability of success, p = 0.375
- The number of trials, n = 30
- The number of successes, x = 5
- $P(X \le 5)$ = ???
Application 1 (cont.)
- Mean: $\mu = np = 30 \times 0.375 = 11.25$
- Variance: $\sigma^2 = npq = 30 \times 0.375 \times 0.625 = 7.03$
- Standard Deviation: $\sigma = 2.65$
Normal with mean = 11.25 and standard deviation = 2.65
       x    P(X <= x)
  5.5000    0.0150
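As a rough check on Application 1, the following Python sketch recomputes the mean, variance, and the continuity-corrected normal approximation to P(X <= 5). It also sums the binomial probabilities directly for comparison; that exact figure is not on the slide and is included only as an illustration. The variable names are assumptions made for this example.

```python
from math import comb, erf, sqrt

n, p, k = 30, 0.375, 5   # 30 hires, 37.5% Republicans, at most 5 Republicans hired
q = 1.0 - p

mean = n * p             # 11.25
variance = n * p * q     # about 7.03
sd = sqrt(variance)      # about 2.65

# Normal approximation with continuity correction: evaluate at k + 0.5 = 5.5.
z = (k + 0.5 - mean) / sd
normal_approx = 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Exact binomial probability P(X <= 5), summed term by term.
exact = sum(comb(n, x) * p**x * q ** (n - x) for x in range(k + 1))

print(f"mean = {mean}, variance = {variance:.2f}, sd = {sd:.2f}")
print(f"normal approximation P(X <= 5.5): {normal_approx:.4f}")
print(f"exact binomial       P(X <= 5)  : {exact:.4f}")
```

Either way the probability is small, which is the point of the example: hiring so few Republicans would be unlikely if hiring were independent of party.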
Poisson Process
[Figure: timeline starting at 0, with events (x) occurring over time at rate $\lambda$]
Assumptions
- time homogeneity
- independence
- no clumping
Poisson Process
- Earthquakes strike randomly over time with a rate of $\lambda = 4$ per year.
- Model the time of earthquake strikes as a Poisson process
- Count: How many earthquakes will strike in the next six months?
- Duration: How long will it take before the next earthquake hits?
Count: Poisson Distribution
- What is the probability that 3 earthquakes will strike during the next six months?
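Poisson Distribution
Count in time period t:
$P(X = x) = \dfrac{e^{-\lambda t} (\lambda t)^{x}}{x!}, \quad x = 0, 1, 2, \ldots$
where $\lambda$ is the rate per unit time, so the mean count in the period is $\lambda t$.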
Minitab Probability Calculation
- Click: Calc > Probability Distributions > Poisson
- Enter: For mean 2, input constant 3
- Output:
Probability Density Function
Poisson with mu = 2.00000
       x    P(X = x)
    3.00    0.1804
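The Minitab result can be reproduced from the Poisson formula. In the earthquake example the rate is 4 per year and the period is six months, so the mean count is 4 x 0.5 = 2. The Python sketch below is one way to do the arithmetic; `poisson_pmf` is a name chosen for this illustration.

```python
from math import exp, factorial


def poisson_pmf(x: int, mean: float) -> float:
    """Return P(X = x) for a Poisson count with the given mean."""
    return exp(-mean) * mean**x / factorial(x)


rate_per_year = 4.0      # earthquakes per year (from the example)
period_years = 0.5       # six months
mean_count = rate_per_year * period_years   # expected count in the period

prob_three = poisson_pmf(3, mean_count)
print(f"P(X = 3) with mean {mean_count:g}: {prob_three:.4f}")
```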
Duration: Exponential Distribution
- Time between occurrences in a Poisson process
- Continuous probability distribution
- Mean = $1/(\lambda t)$
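Exponential Probability Problem
- What is the probability that 9 months will pass with no earthquake?
- $t = 1/12$ (one month, in years), $\lambda t = 1/3$ (earthquakes per month)
- $1/(\lambda t) = 3$ (mean of 3 months between earthquakes)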
Minitab Probability Calculation
- Click: Calc > Probability Distributions > Exponential
- Enter: For mean 3, input constant 9
- Output:
Cumulative Distribution Function
Exponential with mean = 3.00000
       x    P(X <= x)
  9.0000    0.9502
The probability that 9 months pass with no earthquake is the complement: P(X > 9) = 1 - 0.9502 = 0.0498.
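The exponential calculation can also be done by hand, since the cumulative distribution function is P(X <= x) = 1 - e^(-x/mean). The Python sketch below works in months, with the rate of 4 per year converted to 1/3 per month; `exponential_cdf` is an illustrative name.

```python
from math import exp


def exponential_cdf(x: float, mean: float) -> float:
    """Return P(X <= x) for an exponential waiting time with the given mean."""
    if x < 0:
        return 0.0
    return 1.0 - exp(-x / mean)


rate_per_month = 4.0 / 12.0           # 4 earthquakes per year
mean_months = 1.0 / rate_per_month    # mean time between earthquakes

p_within_9 = exponential_cdf(9.0, mean_months)   # next earthquake within 9 months
p_none_in_9 = 1.0 - p_within_9                   # 9 months pass with no earthquake

print(f"mean waiting time : {mean_months:.1f} months")
print(f"P(X <= 9)         : {p_within_9:.4f}")
print(f"P(X > 9)          : {p_none_in_9:.4f}")
```

The second probability equals e^(-3), which is also the Poisson probability of zero earthquakes in a 9-month period with mean count 3.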
Exponential Probability Density Function
MTB > set c1
DATA> 0:12000
DATA> end
Let c1 = c1/1000
Click: Calc > Probability Distributions > Exponential > Probability density > Input column
Enter: Input column c1 > Optional storage c2
Click: OK > Graph > Plot
Enter: Y c2 > X c1
Click: Display > Connect > OK
Exponential Probability Density Function
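The Minitab steps for plotting the density build a grid from 0 to 12 in steps of 0.001 and evaluate the exponential density at each point. A rough Python equivalent of that grid is sketched below. The mean of 3 is an assumption carried over from the earthquake example, since the plotting slide does not state which mean was used, and `exponential_pdf` is an illustrative name.

```python
from math import exp


def exponential_pdf(x: float, mean: float) -> float:
    """Return the exponential density f(x) = (1/mean) * exp(-x/mean) for x >= 0."""
    if x < 0:
        return 0.0
    return exp(-x / mean) / mean


mean = 3.0   # assumption: same mean as the earthquake example

# Same grid as the Minitab commands: 0, 1, ..., 12000 divided by 1000.
xs = [i / 1000 for i in range(0, 12001)]
ys = [exponential_pdf(x, mean) for x in xs]

# Print a few points; the (xs, ys) pairs can be passed to any plotting tool.
for i in range(0, 12001, 3000):
    print(f"x = {xs[i]:5.1f}   f(x) = {ys[i]:.4f}")
```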
Sampling
- Population: the entire set of objects that we are interested in studying
- Sample: a chosen subset of a population
Some Samples Are...
- random: each item in the population has an equal chance of being selected to be part of the sample
- representative: has the same characteristics as the population under study; a microcosm of the population
Population Parameters and Sample Statistics
- Population Parameter
  - Numerical descriptor of a population
  - Values usually uncertain
  - e.g., population mean ($\mu$), population standard deviation ($\sigma$)
- Sample Statistics
  - Numerical descriptor of a sample
  - Calculated from observations in the sample
  - e.g., sample mean $\bar{X}$, sample standard deviation $S$
What is a sampling distribution?
- Sample statistics are random variables
- Sample statistics have probability distributions
- “Sampling distribution” is the probability distribution of a sample statistic
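One way to see a sampling distribution is to simulate it: draw many random samples from the same population, compute the sample mean of each, and look at how those means vary. The Python sketch below does this with a made-up population (normal with mean 5.2 and standard deviation 1.1, values chosen only for illustration) and samples of size 50 drawn with replacement. All names and numbers here are assumptions, not the lecture's data.

```python
import random
import statistics

random.seed(10)  # fixed seed so the illustration is repeatable

# Hypothetical population; the values are invented for this sketch.
population = [random.gauss(5.2, 1.1) for _ in range(5000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

sample_size = 50
n_samples = 2000

# Each sample mean is one draw from the sampling distribution of the mean.
sample_means = [
    statistics.fmean(random.choices(population, k=sample_size))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = pop_sd / sample_size**0.5   # sigma / sqrt(n)

print(f"population mean          : {pop_mean:.3f}")
print(f"mean of sample means     : {mean_of_means:.3f}")
print(f"sd of sample means       : {sd_of_means:.3f}")
print(f"sigma / sqrt(n)          : {theoretical_se:.3f}")
```

The sample means center on the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size, which is the standard error of the sample mean.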
MTB > Retrieve 'C:\MTBWIN\DATA\RESTRNT.MTW'.
Retrieving worksheet from file: C:\MTBWIN\DATA\RESTRNT.MTW
Worksheet was saved on 5/31/1994
MTB > info
Information on the Worksheet
Count: 279
Column  Name      Missing
C1      ID              0
C2      OUTLOOK         1
C3      SALES          25
C4      NEWCAP         55
C5      VALUE          39
C6      COSTGOOD       42
C7      WAGES          44
C8      ADS            44
C9      TYPEFOOD       12
C10     SEATS          11
C11     OWNER          10
C12     FT.EMPL        14
C13     PT.EMPL        13
C14     SIZE           16
MTB > desc 'sales'
Descriptive Statistics
Variable    N   N*    Mean  Median  TrMean  StDev  SEMean
SALES     254   25   332.6   200.0   248.9  650.5    40.8
Variable   Min     Max    Q1     Q3
SALES      0.0  8064.0  83.7  382.7
MTB > boxp 'sales'
* NOTE * N missing = 25
MTB > hist 'sales'
* NOTE * N missing = 25
MTB > let c15 = loge('sales')
J
*** Values out of bounds during operation at J
Missing returned 1 times
MTB > let c15 = loge('sales' + 1)
MTB > name c15 'logsales'
MTB > desc 'logsales'
Descriptive Statistics
Variable    N   N*    Mean  Median  TrMean   StDev  SEMean
logsales  254   25  5.1830  5.3033  5.2134  1.1387  0.0715
Variable     Min     Max      Q1      Q3
logsales  0.0000  8.9953  4.4394  5.9500
MTB > boxp 'logsales'
* NOTE * N missing = 25
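The first `loge` command fails on one observation because the natural log of zero is undefined, and the minimum of SALES is 0. Adding 1 before taking the log avoids this and maps a sale of 0 to a log value of 0. The Python sketch below illustrates the same idea. The list reuses the five summary values reported for SALES (minimum, Q1, median, Q3, maximum) purely as example inputs; it is not the restaurant data set.

```python
from math import log

# Example inputs: the Min, Q1, Median, Q3, and Max reported for SALES.
sales = [0.0, 83.7, 200.0, 382.7, 8064.0]

# log(0) is undefined; Python raises ValueError where Minitab returns a missing value.
log_of_zero_failed = False
try:
    log(sales[0])
except ValueError:
    log_of_zero_failed = True

# Shift by 1 before taking logs so that zero sales map to zero.
logsales = [log(s + 1.0) for s in sales]

print(f"log(0) failed: {log_of_zero_failed}")
for s, ls in zip(sales, logsales):
    print(f"sales = {s:7.1f}   log(sales + 1) = {ls:.4f}")
```

The transformed minimum and maximum should match the Min and Max reported for logsales (0.0000 and 8.9953).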
Four Samples of Size 50 From Restaurant “Logsales” Data--Histograms
Random Samples from Restaurant “Logsales” Data: Summary
MTB > Desc c16-c19
Descriptive Statistics
Variable   N   N*   Mean  Median  TrMean  StDev  SEMean
C16       43    7  5.246   5.375   5.280  0.867   0.132
C17       43    7  5.351   5.352   5.383  1.223   0.186
C18       48    2  5.366   5.461   5.388  0.888   0.128
C19       43    7  5.244   5.198   5.253  0.937   0.143
Variable    Min    Max     Q1     Q3
C16       2.773  6.621  4.625  5.787
C17       1.099  8.456  4.710  6.176
C18       2.485  7.091  4.961  5.994
C19       3.434  6.868  4.595  6.089
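The SEMean column in this summary is the standard error of the sample mean, which is the sample standard deviation divided by the square root of the number of non-missing observations. The Python sketch below recomputes it from the N and StDev values reported for the four samples. Because StDev is itself rounded to three decimals, the recomputed values are expected to agree with the reported ones only to within rounding.

```python
from math import sqrt

# (N, StDev, reported SEMean) for each sample column, as reported in the summary.
samples = {
    "C16": (43, 0.867, 0.132),
    "C17": (43, 1.223, 0.186),
    "C18": (48, 0.888, 0.128),
    "C19": (43, 0.937, 0.143),
}

standard_errors = {}
for name, (n, st_dev, se_reported) in samples.items():
    # Standard error of the mean: s / sqrt(n)
    se = st_dev / sqrt(n)
    standard_errors[name] = se
    print(f"{name}: s/sqrt(n) = {se:.4f}   reported SEMean = {se_reported:.3f}")
```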
Next Time...
- Central Limit Theorem: “Sample averages are approximately normally distributed”