CSC 323, Quarter: Winter 02/03, Daniela Stan Raicu

CSC 323
Quarter: Winter 02/03
Daniela Stan Raicu
School of CTI, DePaul University
1/2/2022 Daniela Stan - CSC 323

Outline
Chapter 2: Looking at Data – Relationships between two or more variables
• Linear regression
• Least-squares regression line
• Residual analysis
• Cautions about regression and correlation
• SAS procedures for scatterplots, correlation and regression

Linear Regression
Objective: To quantify the linear relationship between an explanatory variable and a response variable by fitting a line to the data (that is, drawing a line that comes as close as possible to the points).
[Figure: example of a regression line]

Linear Regression
• A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
Linear regression equation: $\hat{y} = a + b x$
  b = slope (the rate of change of $\hat{y}$ per unit change in x)
  a = intercept (the value of $\hat{y}$ when x = 0)
Example: Height = a + b × age

Prediction
• Use of regression: to predict the value of y for any value of x by substituting this x into the equation of the regression line.
Example: Prediction via the regression line (ages of husbands and wives)
• The regression equation is $\hat{y} = 3.6 + 0.97x$, where $\hat{y}$ is the average age of all husbands who have wives of age x.
• For all women aged 30, we predict the average husband age to be 32.7 years: 3.6 + (0.97)(30) = 32.7 years.
• Suppose we know that an individual wife's age is 30. What would we predict her husband's age to be?
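The prediction step above is plain substitution into the fitted equation. A minimal Python sketch, assuming the coefficients 3.6 and 0.97 from the example; the function name `predict_husband_age` is invented here for illustration:

```python
def predict_husband_age(wife_age, intercept=3.6, slope=0.97):
    """Predicted average husband age (years) for wives of the given age.

    The default coefficients are the ones quoted in the example above.
    """
    return intercept + slope * wife_age


# Wives aged 30: 3.6 + 0.97 * 30
print(round(predict_husband_age(30), 1))  # 32.7
```

The same number serves as the prediction for an individual wife aged 30, since the regression line gives no better single guess than the average for her age group.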

Least-Squares Regression
• Used to determine the "best" line.
• We want the line to be as close as possible to the data points in the vertical (y) direction, since y is what we are trying to predict.
• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
• A residual is the difference between an observed value of the response variable y and the value predicted by the regression line: residual = $y - \hat{y}$.
[Figure: scatterplot with axes x and y, marking an observed value of y, the predicted value on the line, and the error (vertical distance) between them]
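The least-squares criterion can be checked numerically by comparing the sum of squared residuals for different candidate lines. The sketch below is illustrative only: the five data points are invented, and the helper names `residuals` and `sum_squared_residuals` are not from the course material.

```python
def residuals(xs, ys, a, b):
    """Observed y minus predicted y (a + b*x) for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, a, b):
    """The quantity that the least-squares line makes as small as possible."""
    return sum(e ** 2 for e in residuals(xs, ys, a, b))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# An arbitrary candidate line, y-hat = 1 + 1x.
print(sum_squared_residuals(xs, ys, a=1.0, b=1.0))
# The least-squares line for this data, y-hat = 2.2 + 0.6x, gives a smaller sum.
print(sum_squared_residuals(xs, ys, a=2.2, b=0.6))
```

Any other choice of a and b gives a sum at least as large as the one for the least-squares line.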

Least-Squares Regression
The regression line makes the prediction errors as small as possible, in the sense that it minimizes the sum of their squares.

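Least-Squares Regression (cont.)
• How is the least-squares regression line calculated?
$$\hat{y} = a + b x, \qquad b = r \, \frac{S_y}{S_x}, \qquad a = \bar{y} - b \, \bar{x}$$
Where: $\hat{y}$ = predicted value; $r$ = correlation; $S_x$, $S_y$ = standard deviations of x and y; $\bar{x}$, $\bar{y}$ = means of x and y.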
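As a rough sketch of the calculation, the standard least-squares formulas b = r·Sy/Sx and a = ȳ − b·x̄ can be coded directly from the quantities listed above. The data are invented, and `least_squares_line` is a hypothetical helper name; only Python's standard `statistics` module is used.

```python
import statistics


def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*Sy/Sx and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    # Correlation r: sum of products of standardized values, divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f} x")  # y-hat = 2.20 + 0.60 x
```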

Coefficient of Determination ($R^2$)
• Measures the usefulness of the regression prediction.
• $R^2$ (or $r^2$, the square of the correlation) measures how much of the variation in the values of the response variable (y) is explained by the regression line.
• Examples:
  • r = 1: $R^2$ = 1: the regression line explains all (100%) of the variation in y.
  • r = 0.7: $R^2$ = 0.49: the regression line explains almost half (49%) of the variation in y.
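One way to see what "variation explained" means is to compute it as 1 minus the ratio of residual variation to total variation in y, which for a least-squares line equals the squared correlation. The sketch below assumes invented data and hypothetical helper names (`fit_line`, `r_squared`); the slope is computed with the equivalent form Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def r_squared(xs, ys):
    """Fraction of the variation in y explained by the least-squares line."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(f"R^2 = {r_squared(xs, ys):.2f}")  # R^2 = 0.60

# The slide's second example: r = 0.7 gives R^2 = 0.49.
print(f"{0.7 ** 2:.2f}")  # 0.49
```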

A Caution: Beware of Extrapolation
• Extrapolation is the use of a regression line for prediction outside the range of values of the explanatory variable x that you used to obtain the line.
• Such predictions are often not accurate.
• Sarah's height was plotted against her age.
• Can you predict her height at age 42 months?
• Can you predict her height at age 30 years (360 months)?

A Caution: Beware of Extrapolation
• Regression line: $\hat{y} = 71.95 + 0.383x$ (x = age in months)
• Height at age 42 months? $\hat{y}$ = 88 cm
• Height at age 30 years (x = 360 months)? $\hat{y}$ = 209.8 cm
• She is predicted to be 6' 10.5" tall at age 30, which is not a plausible height.
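A simple guard against extrapolation is to record the range of x used to fit the line and warn when a prediction falls outside it. The sketch below uses the coefficients from the slide; the observed age range of 36 to 60 months is an assumption made for illustration (the slides do not state it), and `predict_height` is a hypothetical function name.

```python
import warnings

# ASSUMPTION: the ages (in months) used to fit the line; not given in the slides.
OBSERVED_AGES = (36, 60)


def predict_height(age_months, observed_range, intercept=71.95, slope=0.383):
    """Predict height from age, warning when age lies outside the fitted range."""
    low, high = observed_range
    if not low <= age_months <= high:
        warnings.warn(
            f"age {age_months} is outside the observed range {low}-{high}; "
            "this is an extrapolation and may be unreliable"
        )
    return intercept + slope * age_months


print(round(predict_height(42, OBSERVED_AGES), 1))   # 88.0
print(round(predict_height(360, OBSERVED_AGES), 1))  # 209.8, plus a warning
```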

Accuracy of the Predictions
One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error). The r.m.s. error is defined as the square root of the average of the squared residuals:
$$\text{r.m.s. error} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
In large data sets, the r.m.s. error is approximately equal to
$$\sqrt{1 - r^2} \; S_y$$
where $r$ is the correlation and $S_y$ is the standard deviation of y.
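The definition can be computed directly from the residuals. The sketch below also compares it with the commonly quoted shortcut sqrt(1 − r²)·Sy; treating that shortcut as the approximation intended on the slide is an assumption, since the original formula image did not survive. The data and helper names are invented for illustration.

```python
import math
import statistics


def rms_error(xs, ys, a, b):
    """Square root of the average squared residual for the line y-hat = a + b*x."""
    squared = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(xs, ys):
    """Pearson correlation r."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data and its least-squares line y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

r = correlation(xs, ys)
exact = rms_error(xs, ys, a, b)
shortcut = math.sqrt(1 - r ** 2) * statistics.stdev(ys)  # uses the n - 1 divisor
print(f"r.m.s. error = {exact:.3f}, shortcut = {shortcut:.3f}")
```

With only five points the two values differ noticeably. The gap comes from the n − 1 divisor in the sample standard deviation and shrinks as n grows; using `statistics.pstdev` (divisor n) makes the two agree exactly.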

Confounding Factor
A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study.
Example: The mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students, and the number y of students enrolled in elementary math courses. The fitted regression line has equation $\hat{y} = 2491.69 + 1.0663x$, with $R^2 = 0.694$.

Influential Point
An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards itself.
[Figure: scatterplot marking an influential point/outlier and the regression line obtained when that point is omitted]
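The effect of an influential point can be seen by fitting the line twice, once with all observations and once with the suspect observation removed. A minimal sketch with invented data, in which the last point lies far to the right of the others; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: the last point (15, 3) is far from the rest in x.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 5, 4, 5, 3]

a_all, b_all = fit_line(xs, ys)
a_omit, b_omit = fit_line(xs[:-1], ys[:-1])
print(f"all points:    y-hat = {a_all:.2f} + ({b_all:.2f}) x")
print(f"point omitted: y-hat = {a_omit:.2f} + ({b_omit:.2f}) x")
```

In this made-up example the slope flips from clearly positive to slightly negative when the single extreme point is included, which is the kind of change that marks a point as influential.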

Summary - Warnings
1. Correlation measures linear association; the regression line should be used only when the association is linear.
2. Extrapolation: do not use the regression line to predict values outside the observed range, because such predictions are not reliable.
3. Correlation and the regression line are sensitive to influential / extreme points.

Data Mining
• Exploring really large databases in the hope of finding useful patterns is called data mining.
[Diagram labels: Domain Understanding; Data Selection; Cleaning & Preprocessing; Knowledge Evaluation & Interpretation; Discovering Patterns]
The entire process is iterative and interactive.