
Linear Regression/Correlation • Quantitative Explanatory and Response Variables • Goal: Test whether the level of the response variable is associated with (depends on) the level of the explanatory variable • Goal: Measure the strength of the association between the two variables • Goal: Use the level of the explanatory variable to predict the level of the response variable

Linear Relationships • Notation: – Y: Response (dependent, outcome) variable – X: Explanatory (independent, predictor) variable • Linear Function (Straight-Line Relation): Y = a + b X (Plot Y on vertical axis, X horizontal) – Slope (b): The amount Y changes when X increases by 1 • b > 0 Line slopes upward (Positive Relation) • b = 0 Line is flat (No linear Relation) • b < 0 Line slopes downward (Negative Relation) – Y-intercept (a): Y level when X=0

Example: Service Pricing • Internet History Resources (New South Wales Family History Document Service) • Membership fee: $20A • 20¢ ($0.20A) per image viewed • Y = Total cost of service • X = Number of images viewed • a = Cost when no images viewed • b = Incremental cost per image viewed • Y = a + b X = 20 + 0.20 X
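
The pricing rule above can be written as a small function. This is only a minimal sketch: the name `total_cost` and its default arguments are illustrative choices, not part of the original example.

```python
def total_cost(images_viewed, membership_fee=20.0, cost_per_image=0.20):
    """Total cost Y = a + b X for the pricing example (Australian dollars)."""
    if images_viewed < 0:
        raise ValueError("number of images viewed cannot be negative")
    # a = membership_fee (cost when X = 0), b = cost_per_image (slope)
    return membership_fee + cost_per_image * images_viewed


# Each additional image raises the total cost by the slope b.
for x in (0, 10, 50):
    print(x, total_cost(x))
```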

Example: Service Pricing

Probabilistic Models • In practice, the relationship between Y and X is not “perfect”; other sources of variation exist. We decompose Y into 2 components: – Systematic relationship with X: $\alpha + \beta X$, where $\alpha$ and $\beta$ are the intercept and slope of the population line – Random error: $\varepsilon$ • A random response can be written as the sum of the systematic component (also thought of as the mean) and the random component: $Y = \alpha + \beta X + \varepsilon$ • The mean response (conditional on X) is: $E(Y) = \alpha + \beta X$
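
To make the decomposition concrete, the sketch below simulates responses as a systematic part plus a random error. The parameter values, the function name `simulate_responses`, and the choice of normally distributed errors are all assumptions made for illustration.

```python
import random


def simulate_responses(x_values, alpha, beta, sigma, seed=0):
    """Return one simulated Y = alpha + beta * X + error for each X.

    Errors are drawn independently from Normal(0, sigma); normality is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical parameter values, chosen only for illustration.
alpha, beta, sigma = 10.0, 2.0, 1.5
x_values = [1, 2, 3, 4, 5]
y_values = simulate_responses(x_values, alpha, beta, sigma)

for x, y in zip(x_values, y_values):
    mean_y = alpha + beta * x  # systematic component E(Y) at this X
    print(f"X={x}  E(Y)={mean_y:.1f}  Y={y:.2f}  error={y - mean_y:+.2f}")
```

Setting `sigma=0` removes the random component, so every simulated response falls exactly on the line.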

Least Squares Estimation • Problem: the intercept $\alpha$ and slope $\beta$ of the population line are unknown parameters, and must be estimated and tested based on sample data. • Procedure: – Sample n individuals, observing X and Y on each one – Plot the pairs Y (vertical axis) versus X (horizontal) – Choose the line that “best fits” the data. – Criterion: Choose the line that minimizes the sum of squared vertical distances from the observed data points to the line. • Least Squares Prediction Equation: $\hat{Y} = a + bX$, where $b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$ and $a = \bar{Y} - b\bar{X}$
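
The least squares procedure can be sketched in a few lines of plain Python. The function name `least_squares` and the data below are hypothetical and serve only to show the calculation.

```python
def least_squares(x, y):
    """Return (a, b), the least squares intercept and slope for paired data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least 2 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("X must take at least two distinct values")
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(x, y)
print(f"predicted Y = {a:.2f} + {b:.2f} X")
```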

Example - Pharmacodynamics of LSD • Response (Y) - Math score (mean among 5 volunteers) • Predictor (X) - LSD tissue concentration (mean of 5 volunteers) • Raw Data and scatterplot of Score vs LSD concentration: Source: Wagner, et al (1968)

Example - Pharmacodynamics of LSD (Column totals given in bottom row of table)

SPSS Output and Plot of Equation

Example - Retail Sales • U.S. SMSAs • Y = Per Capita Retail Sales • X = Females per 100 Males

Residuals • Residuals (aka Errors): Difference between observed values and predicted values: $Y - \hat{Y}$ • Error sum of squares: $SSE = \sum (Y - \hat{Y})^2$ • Estimate of (conditional) standard deviation of Y: $\hat{\sigma} = \sqrt{SSE / (n - 2)}$
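
As a rough illustration of these three quantities, the sketch below takes a fitted intercept and slope as inputs. The function name `residual_summary` and the data are hypothetical; the values a = 0.05 and b = 1.99 are the least squares estimates for this particular made-up data set.

```python
import math


def residual_summary(x, y, a, b):
    """Residuals, SSE and estimated conditional SD for the fitted line a + b*x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    # Residual = observed value minus predicted value
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    # Divide by n - 2 because two parameters (intercept, slope) were estimated
    sigma_hat = math.sqrt(sse / (n - 2))
    return residuals, sse, sigma_hat


# Hypothetical data with its least squares line (a = 0.05, b = 1.99).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
residuals, sse, sigma_hat = residual_summary(x, y, a=0.05, b=1.99)
print([round(e, 2) for e in residuals])
print(round(sse, 4), round(sigma_hat, 4))
```

Residuals from a least squares line sum to zero (up to rounding), which is a quick check on the fit.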

Linear Regression Model • Data: $Y = \alpha + \beta X + \varepsilon$ • Mean: $E(Y) = \alpha + \beta X$ • Conditional Standard Deviation: $\sigma$ • Error terms ($\varepsilon$) are assumed to be independent and normally distributed

Example - Pharmacodynamics of LSD

Correlation Coefficient • Slope of the regression describes the direction of association (if any) between the explanatory (X) and response (Y) variables. Problems: – The magnitude of the slope depends on the units of the variables – The slope is unbounded, so it doesn’t measure the strength of association – Some situations arise where interest is in the association between variables, but there is no clear definition of X and Y • Population Correlation Coefficient: $\rho$ • Sample Correlation Coefficient: $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$

Correlation Coefficient • Pearson Correlation: Measure of strength of linear association: – Does not distinguish between explanatory and response variables – Is invariant to linear transformations of Y and X (up to a change of sign when a variable is multiplied by a negative constant) – Is bounded between -1 and 1 (higher values in absolute value imply stronger relation) – Same sign (positive/negative) as slope
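
The properties listed above can be checked numerically. The sketch below uses a hypothetical helper `pearson_r` and made-up data; it is meant as an illustration, not as a reference implementation.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between paired lists x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least 2 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.5, 8.1, 6.4, 4.9, 3.2]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)  # roles of X and Y exchanged
r_rescaled = pearson_r([100 * xi + 7 for xi in x], [0.5 * yi - 2 for yi in y])
r_flipped = pearson_r([-xi for xi in x], y)  # negative multiplier flips the sign
print(r, r_swapped, r_rescaled, r_flipped)
```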

Example - Pharmacodynamics of LSD • Using formulas for standard deviation from beginning of course: $s_X = 1.935$ and $s_Y = 18.611$ • From previous calculations: $b = -9.01$ • $r = b\,(s_X / s_Y) = -9.01\,(1.935 / 18.611) = -0.94$ • This represents a strong negative association between math scores and LSD tissue concentration
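
The correlation for this example follows directly from the slope and the two standard deviations quoted above. A minimal sketch, with `correlation_from_slope` as a hypothetical helper name:

```python
def correlation_from_slope(b, s_x, s_y):
    """Pearson r from the least squares slope and the sample SDs of X and Y."""
    if s_x <= 0 or s_y <= 0:
        raise ValueError("standard deviations must be positive")
    return b * s_x / s_y


# Summary values quoted in the LSD example.
r = correlation_from_slope(b=-9.01, s_x=1.935, s_y=18.611)
print(round(r, 2))
```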

Coefficient of Determination • Measure of the proportion of variation in Y that is “explained” by X – Step 1: Ignoring X, measure the total variation in Y (around its mean): $TSS = \sum (Y - \bar{Y})^2$ – Step 2: Fit regression relating Y to X and measure the unexplained variation in Y (around its predicted values): $SSE = \sum (Y - \hat{Y})^2$ – Step 3: Take the difference (variation in Y “explained” by X), and divide by total: $r^2 = \frac{TSS - SSE}{TSS}$
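
The three steps can be followed literally in code. The function name `coefficient_of_determination` and the data are hypothetical; this is a sketch of the calculation rather than a definitive implementation.

```python
def coefficient_of_determination(x, y):
    """Proportion of the variation in y explained by a straight-line fit on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least 2 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("X must take at least two distinct values")

    # Step 1: total variation in Y around its mean
    tss = sum((yi - y_bar) ** 2 for yi in y)
    if tss == 0:
        raise ValueError("Y is constant, so there is no variation to explain")

    # Step 2: fit the least squares line and measure unexplained variation
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    # Step 3: explained variation as a fraction of the total
    return (tss - sse) / tss


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(coefficient_of_determination(x, y), 4))
```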

Example - Pharmacodynamics of LSD (figure: TSS and SSE)

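Inference Concerning the Slope ($\beta$) • Parameter: Slope in the population model ($\beta$) • Estimator: Least squares estimate: $b$ • Estimated standard error: $\hat{\sigma}_b = \hat{\sigma} / \sqrt{\sum (X - \bar{X})^2}$, where $\hat{\sigma} = \sqrt{SSE/(n-2)}$ is the estimated conditional standard deviation of Y • Methods of making inference regarding population: – Hypothesis tests (2-sided or 1-sided) – Confidence Intervals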

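Significance Test for $\beta$ • 2-Sided Test – $H_0: \beta = 0$ – $H_A: \beta \neq 0$ • 1-Sided Test – $H_0: \beta = 0$ – $H_A^+: \beta > 0$ or – $H_A^-: \beta < 0$ • Test statistic: $t_{obs} = b / \hat{\sigma}_b$, the least squares slope divided by its estimated standard error, referred to the t-distribution with n-2 degrees of freedom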

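$(1-\alpha)100\%$ Confidence Interval for $\beta$: $b \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}_b$, where $\hat{\sigma}_b$ is the estimated standard error of $b$ • Conclude positive association if entire interval is above 0 • Conclude negative association if entire interval is below 0 • Cannot conclude an association if interval contains 0 • Conclusion based on interval is same as 2-sided hypothesis test at significance level $\alpha$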
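
The test statistic and confidence interval for the slope can be sketched with the standard library alone, provided the critical value is looked up in a t table and passed in. The function name `slope_inference`, its return format, and the data are hypothetical; the value 2.776 is the tabled $t_{.025, 4}$ for the 6 made-up observations.

```python
import math


def slope_inference(x, y, t_crit):
    """Slope estimate, standard error, t statistic and confidence interval.

    t_crit is the critical value t_{alpha/2, n-2}, supplied by the caller
    (for example from a t table).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("X must take at least two distinct values")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))  # estimated conditional SD of Y
    se_b = sigma_hat / math.sqrt(s_xx)    # estimated standard error of b
    if se_b == 0:
        raise ValueError("perfect fit: the standard error of the slope is zero")
    t_obs = b / se_b
    ci = (b - t_crit * se_b, b + t_crit * se_b)
    return {"b": b, "se_b": se_b, "t_obs": t_obs, "ci": ci}


# Hypothetical data (n = 6, so df = 4 and t_{.025, 4} = 2.776).
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2]
result = slope_inference(x, y, t_crit=2.776)
print(result)
# An interval entirely above 0 points to a positive association.
print(result["ci"][0] > 0)
```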

Example - Pharmacodynamics of LSD • Testing $H_0: \beta = 0$ vs $H_A: \beta \neq 0$ • 95% Confidence Interval for $\beta$, using the critical value $t_{.025,\,5}$ (n - 2 = 5 degrees of freedom)
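
The test and interval for this example can be approximated from the summary statistics quoted earlier ($s_X = 1.935$, $s_Y = 18.611$, $b = -9.01$). The sketch below rests on two assumptions that are not stated in the text: n = 7, inferred from the 5 degrees of freedom, and the usual tabled value 2.571 for $t_{.025, 5}$. Because the inputs are rounded, the results may differ slightly from values computed from the raw data.

```python
import math

# Summary statistics quoted in the LSD example.
s_x = 1.935
s_y = 18.611
b = -9.01

n = 7           # assumption: inferred from df = n - 2 = 5
t_crit = 2.571  # assumption: tabled value of t_{.025, 5}

s_xx = (n - 1) * s_x ** 2             # sum of squared deviations of X
tss = (n - 1) * s_y ** 2              # total sum of squares of Y
ssr = b ** 2 * s_xx                   # model sum of squares for a least squares fit
sse = tss - ssr                       # error sum of squares
sigma_hat = math.sqrt(sse / (n - 2))  # estimated conditional SD of Y
se_b = sigma_hat / math.sqrt(s_xx)    # estimated standard error of b

t_obs = b / se_b
ci = (b - t_crit * se_b, b + t_crit * se_b)

print(f"t_obs = {t_obs:.2f}")
print(f"95% CI for the slope: ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0" if abs(t_obs) >= t_crit else "do not reject H0")
```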

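Analysis of Variance in Regression • Goal: Partition the total variation in Y into variation “explained” by X and random variation: $\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$ • These three sums of squares and their degrees of freedom are: – Total: $TSS = \sum (Y - \bar{Y})^2$, $df_{Total} = n-1$ – Error: $SSE = \sum (Y - \hat{Y})^2$, $df_{Error} = n-2$ – Model: $SSR = \sum (\hat{Y} - \bar{Y})^2$, $df_{Model} = 1$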

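Analysis of Variance in Regression • Analysis of Variance - F-test • $H_0: \beta = 0$ vs $H_A: \beta \neq 0$ • Test statistic: $F_{obs} = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$ • P-value: $P(F \geq F_{obs})$, where F represents the F-distribution with 1 numerator and n-2 denominator degrees of freedom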
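
The partition of the total sum of squares and the resulting F statistic can be sketched as follows. The function name `regression_anova` and the data are hypothetical, and the P-value is left out because it would need an F table or a statistics library.

```python
def regression_anova(x, y):
    """Sums of squares, mean squares and F statistic for a straight-line fit."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("X must take at least two distinct values")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]

    tss = sum((yi - y_bar) ** 2 for yi in y)                  # df = n - 1
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # df = n - 2
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)             # df = 1
    if sse == 0:
        raise ValueError("perfect fit: the F statistic is undefined")

    msr = ssr / 1
    mse = sse / (n - 2)
    return {"TSS": tss, "SSE": sse, "SSR": ssr,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2]
table = regression_anova(x, y)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, so the two tests lead to the same conclusion.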

Example - Pharmacodynamics of LSD • Total Sum of squares: • Error Sum of squares: • Model Sum of Squares:

Example - Pharmacodynamics of LSD • Analysis of Variance - F-test • $H_0: \beta = 0$ vs $H_A: \beta \neq 0$

Example - SPSS Output

Significance Test for Pearson Correlation • Test identical (mathematically) to t-test for $\beta$, but more appropriate when there is no clear explanatory and response variable • $H_0: \rho = 0$ vs $H_A: \rho \neq 0$ (Can do 1-sided test) • Test Statistic: $t_{obs} = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}$ • P-value: $2P(t \geq |t_{obs}|)$
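
The test statistic depends only on r and n, so it is easy to sketch. The function name `correlation_t_statistic` is hypothetical; the usage line reuses the LSD summary values and assumes n = 7, which is inferred from the 5 degrees of freedom rather than stated in the text.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r / math.sqrt((1.0 - r ** 2) / (n - 2))


# LSD example: r recovered from the quoted slope and standard deviations.
r_lsd = -9.01 * 1.935 / 18.611
t_lsd = correlation_t_statistic(r_lsd, n=7)  # n = 7 is an assumption (df = 5)
print(f"r = {r_lsd:.3f}, t_obs = {t_lsd:.2f}")
```

The P-value still has to come from a t table or a statistics library; only the statistic is computed here.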

Model Assumptions & Problems • Linearity: Many relations are not perfectly linear, but can be well approximated by a straight line over a range of X values • Extrapolation: While we can check the validity of a straight-line relation within the observed X levels, we cannot assume the relationship continues outside this range • Influential Observations: Some data points (particularly ones with extreme X levels) can exert a large influence on the fitted equation.
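
The effect of an influential observation can be seen by fitting the line with and without a single point at an extreme X level. The data and the helper name `fit_line` below are made up purely for this illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: five points lying exactly on Y = 2X ...
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# ... plus one observation at an extreme X level, far from that line.
x_with = x + [20]
y_with = y + [0]

_, slope_without = fit_line(x, y)
_, slope_with = fit_line(x_with, y_with)
print(f"slope without the extreme point: {slope_without:.2f}")
print(f"slope with the extreme point:    {slope_with:.2f}")
```

A single added point reverses the sign of the fitted slope here, which is why points with extreme X levels deserve a separate look before the fitted equation is interpreted.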