Advanced Quantitative Techniques Lab 2: Normality, Graphing Distributions, Confidence Intervals

Normal distribution

What are the characteristics of a normal distribution?
• Unimodal
• Bell-shaped
• Symmetric
• Mean = Mode = Median
• Skewness = 0
• Kurtosis = 3
• 68–95–99.7 rule

If a population has a normal distribution:
• 68.2% of the data lie within 1 standard deviation of the mean
• 95.4% of the data lie within 2 standard deviations of the mean
• 99.7% of the data lie within 3 standard deviations of the mean
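As a quick check, these percentages can be reproduced with Stata's built-in normal() function, which returns the cumulative probability of the standard normal distribution. This is a minimal sketch that needs no dataset; the results should come out at roughly 0.683, 0.954, and 0.997.

```stata
* Area under the standard normal curve within k standard deviations of the mean
* Within 1 standard deviation
display normal(1) - normal(-1)
* Within 2 standard deviations
display normal(2) - normal(-2)
* Within 3 standard deviations
display normal(3) - normal(-3)
```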

More about the normal distribution
• The probability of an event is the area under the density curve over the corresponding range of values.
• The total area under the curve = 1 (the outcomes are collectively exhaustive).
• Normal distributions are an idealized description of data.
• The curve never touches the x-axis: the tails extend indefinitely, so any finite range of values covers slightly less than the full area of 1.

Is population normally distributed?
use calls_311.dta
histogram POP2010, width(600) frequency normal

Is population normally distributed?
sum POP2010, detail

Variance vs. Standard Deviation
• Variance (σ²): average of the squared differences from the mean
• Standard deviation (σ): square root of the variance
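The relationship between the two can be seen in the results that summarize leaves behind. The sketch below uses Stata's built-in auto dataset as a stand-in (an assumption; it is not the lab's data), and the scalar names s_var and s_sd are made up for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the lab's data
sysuse auto, clear
summarize price

* summarize stores the sample variance in r(Var) and the standard deviation in r(sd)
scalar s_var = r(Var)
scalar s_sd  = r(sd)

display "Variance           = " s_var
display "sqrt(Variance)     = " sqrt(s_var)
display "Standard deviation = " s_sd
```

The last two lines should display the same number.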

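Skewness is a measure of symmetry. Where is the tail?
• Mean > Median: Skewness > 0 (tail to the right)
• Mean = Median: Skewness = 0 (symmetric)
• Mean < Median: Skewness < 0 (tail to the left)
In Stata, skewness is reported by sum, detail.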

Kurtosis
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
• Kurtosis > 3: more peaked than a normal distribution
• Kurtosis = 3: normal distribution
• Kurtosis < 3: flatter than a normal distribution
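To see how these measures behave, one option is to simulate a symmetric variable and a right-skewed one and compare what sum, detail reports. This is only a sketch on simulated data: the variable names sym and skew, the seed, and the choice of a chi-squared distribution with 3 degrees of freedom are all assumptions made for illustration.

```stata
* Simulated data, for illustration only
clear
set seed 2024
set obs 5000

* Symmetric: draws from a standard normal
generate sym = rnormal()
* Right-skewed: draws from a chi-squared distribution with 3 degrees of freedom
generate skew = rchi2(3)

* Expect skewness near 0 and kurtosis near 3
summarize sym, detail
display "sym:  skewness = " r(skewness) "  kurtosis = " r(kurtosis)

* Expect skewness > 0 and mean > median
summarize skew, detail
display "skew: skewness = " r(skewness) "  mean = " r(mean) "  median = " r(p50)
```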

Example of a normal distribution
• use Lab_2_Data.16.dta
• histogram bwt, width(400) frequency normal

Example of a normal distribution
• sum bwt, detail

Sampling
• Population: a group that includes all the cases (individuals, objects, or groups) in which the researcher is interested.
• Sample: a relatively small subset of a population.

Sampling
• Random sample: every case in the population has an equal chance of being chosen
• Stratified sample: divide the population into groups and draw a random sample from each group
• Cluster sample: group the population into small clusters, draw a simple random sample of clusters, and sample everything in the chosen clusters
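The first two designs can be sketched with Stata's sample command. The example uses the built-in auto dataset as a stand-in (an assumption, not the lab's data) and treats foreign as the stratifying variable; the sample sizes and seed are arbitrary. Cluster sampling takes a few more steps and is not shown.

```stata
* Stand-in data: Stata's built-in auto dataset, not the lab's data
sysuse auto, clear
set seed 1

* Simple random sample: keep exactly 20 randomly chosen cars
preserve
sample 20, count
count
restore

* Stratified sample: 5 randomly chosen cars from each stratum (domestic, foreign)
preserve
sample 5, count by(foreign)
tabulate foreign
restore
```

Because sample drops the observations it does not select, preserve and restore are used here to bring the full dataset back after each draw.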

Sampling
• Parameter: a measure used to describe a population distribution.
• Statistic: a measure used to describe a sample distribution.
• Estimation: a process whereby we select a random sample from a population and use a sample statistic to estimate a population parameter.

Inference

Inferential Statistics
• We generally don't know anything about the population distribution.
• We have a sample of data from the population.
• We assume that the mean is the most appropriate description of the population (we no longer use the median, because under a normal distribution the mean and median coincide).
• The sample must be random and representative ("large enough").

Inferential Statistics
What can we infer about the population based on a sample?
• From now on, we're estimating the population mean (μ) with the sample mean (x̄).
• We are no longer talking about individual behavior; we're talking about average behavior.

Distribution of Means
• Take a random sample over and over again (random means each data point has an equal chance of being chosen).
• You get many sample means.
• Plot the sampling distribution of these means: you get a distribution of averages (not raw data points!).

Distribution of Means
• Sampling distribution of means: the frequency distribution (histogram) of all possible sample means, not of the data themselves. **This is not the distribution of x**
• If we draw large enough random samples from a population, the distribution of the sample means (not of the population data!) is approximately a bell curve (normal distribution).
• This is the case regardless of what the population distribution looks like.
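A small simulation can make this concrete. The sketch below repeatedly draws samples from a clearly non-normal (exponential) population and collects the sample means. Everything in it is an illustrative assumption: the program name onesample, the sample size of 50, the 2,000 repetitions, and the seed.

```stata
clear
set seed 12345

* Hypothetical helper: draw one sample of 50 and return its mean
capture program drop onesample
program define onesample, rclass
    drop _all
    set obs 50
    * Exponential draws with mean 1: a strongly right-skewed population
    generate x = -ln(runiform())
    summarize x
    return scalar xbar = r(mean)
end

* Repeat the draw 2,000 times; the resulting dataset holds one sample mean per row
simulate xbar = r(xbar), reps(2000) nodots: onesample

* The sample means should centre near 1, with skewness much closer to 0
* than the skewness of the underlying exponential population
summarize xbar, detail
histogram xbar, normal
```

The histogram is of the 2,000 sample means, not of the raw draws, which is why it looks close to a bell curve even though the population does not.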

Confidence Intervals
• The goal of calculating a confidence interval is to express how precisely the sample mean estimates the true population mean, μ: the interval gives a range of plausible values for μ around the sample mean.

Confidence Intervals
• Confidence level: the likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.
• 95% confidence level: there is a .95 probability that a specified interval DOES contain the population mean.
• 99% confidence level: there is 1 chance out of 100 that the interval DOES NOT contain the population mean.

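Stata: ci command
• Open Stata and load calls_311.dta.
ci means calls_per_thousand, level(90)
The output reports:
• Sample size (Obs)
• Sample mean
• Standard error = s / √n (the sample standard deviation divided by the square root of the sample size)
• Lower bound and upper bound of the CI
level(90) sets the confidence level to 90%, which corresponds to a significance level of 10%.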
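To see where the bounds come from, the interval can be rebuilt by hand as mean ± t × standard error and compared with what ci means reports. This sketch uses the built-in auto dataset as a stand-in (an assumption, not the lab's data), and the scalar names starting with ci_ are made up for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the lab's data
sysuse auto, clear
quietly summarize price

scalar ci_n    = r(N)
scalar ci_mean = r(mean)
* Standard error = s / sqrt(n)
scalar ci_se   = r(sd) / sqrt(ci_n)
* Critical t value for a 90% interval: 5% in each tail, n - 1 degrees of freedom
scalar ci_t    = invttail(ci_n - 1, 0.05)

scalar ci_lo = ci_mean - ci_t * ci_se
scalar ci_hi = ci_mean + ci_t * ci_se
display "By hand: [" ci_lo ", " ci_hi "]"

* The built-in command should report the same bounds
ci means price, level(90)
```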

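Build a 95% CI for 311 calls per thousand people. The default confidence level for the ci command in Stata is 95%, so no level() option is needed. There is a trade-off between being precise and being confident: a narrower interval is more precise, while a wider interval gives more confidence.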
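The trade-off can be illustrated by computing the interval at several confidence levels and comparing widths. As a sketch only, this uses the built-in auto dataset as a stand-in (an assumption, not the lab's data); the levels and the scalar names w80 to w99 are chosen for illustration.

```stata
* Stand-in data: Stata's built-in auto dataset, not the lab's data
sysuse auto, clear

* Higher confidence level -> wider (less precise) interval
foreach lvl in 80 90 95 99 {
    quietly ci means price, level(`lvl')
    scalar w`lvl' = r(ub) - r(lb)
    display "level `lvl': width = " w`lvl'
}
```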

Build a CI for Bronx calls per 1,000 people that leaves a 10% chance of overestimation error. A 10% chance in one tail means 20% across both tails, hence an 80% interval:
ci means calls_per_thousand if county=="005", level(80)

Build a CI for Manhattan calls per 1,000 people that leaves a 20% chance that the population mean is not captured by the interval:
ci means calls_per_thousand if county=="061", level(80)

Are they significantly different?

Confidence intervals in a Normal distribution