Correlation and Simple Linear Regression
- Slides: 41
Correlation Analysis

Correlation analysis is used to describe the degree to which one variable is linearly related to another. There are two measures for describing correlation:
1. The Coefficient of Correlation
2. The Coefficient of Determination
Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from -1 to 1.

ρ = -1 indicates a perfect negative linear relationship
-1 < ρ < 0 indicates a negative linear relationship
ρ = 0 indicates no linear relationship
0 < ρ < 1 indicates a positive linear relationship
ρ = 1 indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation

(Figure omitted: scatterplots illustrating ρ = -1, ρ = -0.8, ρ = 0, ρ = 0.8, and ρ = 1.)
The Coefficient of Correlation and the Sample Coefficient of Determination

Sample coefficient of determination:

r² = SSR / SST = 1 − SSE / SST

The coefficient of correlation:

r = ±√r², with the sign matching the sign of the slope b
The Coefficient of Correlation, or Karl Pearson's Coefficient of Correlation

The coefficient of correlation is the square root of the coefficient of determination. The sign of r indicates the direction of the relationship between the two variables X and Y.
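As an illustration, Pearson's r can be computed directly from its definition. This is a minimal Python sketch, not part of the original slides; the data are the five advertising/sales pairs used in the worked example later in this deck.

```python
import math

# Five-month advertising (X, $100s) vs. sales (Y, $1000s) data from
# the worked example in this deck.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)          # about 0.904
r_squared = r ** 2           # coefficient of determination, about 0.817
```

Squaring r recovers the coefficient of determination, and taking the square root of r² (with the sign of the slope) recovers r, as the slide states.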
Simple Linear Regression

• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable; the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
Using Statistics

Scatterplot of Advertising Expenditures (X) and Sales (Y). This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
Examples of Other Scatterplots

(Figure omitted: six scatterplot panels showing other possible patterns of association between X and Y.)
Simple Linear Regression Model

The equation that describes how y is related to x and an error term is called the regression model. The simple linear regression model is:

y = a + bx + e

where a and b are called parameters of the model: a is the intercept and b is the slope. e is a random variable called the error term.
Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The errors εᵢ are normally distributed with mean 0 and variance σ²; that is, ε ~ N(0, σ²).
• The errors are uncorrelated (not related) in successive observations.

E[Y] = β₀ + β₁X: identical normal distributions of errors, all centered on the regression line.
Errors in Regression

(Figure omitted: the error for an observation is the vertical distance between the observed point and the regression line at Xᵢ.)
SIMPLE REGRESSION AND CORRELATION

Estimating Using the Regression Line

First, let's look at the equation of a straight line:

Ŷ = a + bX

where Ŷ is the dependent variable, X is the independent variable, a is the Y-intercept, and b is the slope of the line.
SIMPLE REGRESSION AND CORRELATION

The Method of Least Squares

To estimate the straight line we use the least squares method. This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points. The sign of r will be the same as the sign of the coefficient b in the regression equation Y = a + bX.

Alternate formula (using regression coefficients):

r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
SIMPLE REGRESSION AND CORRELATION

The estimating line: Ŷ = a + bX

Slope of the best-fitting regression line:

b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)

Y-intercept of the best-fitting regression line:

a = Ȳ − bX̄
SIMPLE REGRESSION - EXAMPLE

Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below. (File: PPT_Regr_example.sav)

Advertising Exp. ($100s)   Sales Rev. ($1000s)
1                          1
2                          1
3                          2
4                          2
5                          4
SIMPLE REGRESSION - EXAMPLE

X    Y    X²    XY
1    1    1     1
2    1    4     2
3    2    9     6
4    2    16    8
5    4    25    20

Totals: ΣX = 15, ΣY = 10, ΣX² = 55, ΣXY = 37
SIMPLE REGRESSION - EXAMPLE

b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²) = (5(37) − (15)(10)) / (5(55) − 15²) = 35/50 = 0.7

a = Ȳ − bX̄ = 2 − 0.7(3) = −0.1
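The least-squares estimates can be reproduced in a few lines of Python using the summation formulas from the slides; a minimal sketch, with the data from the example table:

```python
# Least-squares slope and intercept for the appliance-store example
# (X = advertising in $100s, Y = sales revenue in $1000s).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                    # 15, 10
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 37
sum_x2 = sum(xi ** 2 for xi in x)                # 55

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * sum_x / n                                  # intercept
print(b, a)  # about 0.7 and -0.1, matching the slide
```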
Sample Coefficient of Determination

r² = SSR / SST = 4.9 / 6.0 = 0.8167

Interpretation: We can conclude that 81.67% of the variation in sales revenue is explained by the variation in advertising expenditure. r² is the percentage of total variation explained by the regression.
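The decomposition of total variation behind r² (SST = SSR + SSE) can be checked numerically; a sketch using the fitted line from the example:

```python
# Variation decomposition for the appliance-store example:
# SST = SSR + SSE, and r² = SSR / SST.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7                 # least-squares estimates from this example
y_bar = sum(y) / len(y)

y_hat = [a + b * xi for xi in x]                        # fitted values
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (residual)

r_squared = ssr / sst            # about 0.8167
```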
SIMPLE REGRESSION AND CORRELATION

If the slope of the estimating line is positive, r is the positive square root of r², and the relationship between the two variables is direct. If the slope of the estimating line is negative, r is the negative square root of r², and the relationship is inverse.
Steps in Hypothesis Testing using SPSS

• State the null and alternative hypotheses
• Define the level of significance (α)
• Calculate the actual significance: the p-value
• Make a decision: reject the null hypothesis if p ≤ α for a 2-tail test, and if p* ≤ α for a 1-tail test (p* is p/2 when p is obtained from a 2-tail test)
• State the conclusion
Hypothesis Tests for the Correlation Coefficient

H₀: ρ = 0 (no significant linear relationship)
H₁: ρ ≠ 0 (the linear relationship is significant)

Test statistic: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom. Use the p-value for decision making.
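Plugging the example's r = .904 and n = 5 into this test statistic reproduces the t value that SPSS reports for this data; a quick sketch:

```python
import math

# t statistic for H0: rho = 0, using r and n from the appliance-store
# example: t = r * sqrt(n - 2) / sqrt(1 - r**2), with n - 2 = 3 df.
r, n = 0.904, 5
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(t)  # about 3.66, consistent with the SPSS 2-tailed p-value of .035
```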
Correlations (SPSS output)

                                                  Advertising       Sales
                                                  expenses ($00)    revenue ($000)
Advertising expenses ($00)   Pearson Correlation  1                 .904*
                             Sig. (2-tailed)                        .035
                             N                    5                 5
Sales revenue ($000)         Pearson Correlation  .904*             1
                             Sig. (2-tailed)      .035
                             N                    5                 5

*. Correlation is significant at the 0.05 level (2-tailed).
Standard Error of Estimate

The standard error of estimate is used to measure the reliability of the estimating equation. It measures the variability or scatter of the observed values around the regression line.
Standard Error of Estimate

sₑ = √( Σ(Y − Ŷ)² / (n − 2) )

Alternately:

sₑ = √( (ΣY² − aΣY − bΣXY) / (n − 2) )
Standard Error of Estimate

Y²: 1, 1, 4, 4, 16, so ΣY² = 26

sₑ = √( (26 − (−0.1)(10) − (0.7)(37)) / (5 − 2) ) = √(1.1 / 3) = 0.606
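The shortcut formula can be verified numerically against the SPSS "Std. Error of the Estimate"; a sketch for the same data:

```python
import math

# Standard error of estimate for the appliance-store example, via the
# shortcut formula s_e = sqrt((ΣY² − aΣY − bΣXY) / (n − 2)).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(y)
a, b = -0.1, 0.7                                  # estimates from this example

sum_y = sum(y)                                    # 10
sum_y2 = sum(yi ** 2 for yi in y)                 # 26
sum_xy = sum(xi * yi for xi, yi in zip(x, y))     # 37

se = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(se)  # about 0.606, matching the SPSS model summary
```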
Model Summary (SPSS output)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .904ᵃ   .817       .756                .606

a. Predictors: (Constant), Advertising expenses ($00)

ANOVAᵇ

Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   4.900            1    4.900         13.364   .035ᵃ
Residual     1.100            3    .367
Total        6.000            4

a. Predictors: (Constant), Advertising expenses ($00)
b. Dependent Variable: Sales revenue ($000)
Analysis-of-Variance Table and an F Test of the Regression Model

H₀: The regression model is not significant
H₁: The regression model is significant
Test Statistic

Value of the test statistic: F = MSR / MSE = 4.900 / 0.367 = 13.364. The p-value is 0.035.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. b is not equal to zero; thus, the independent variable is linearly related to y.
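The F statistic follows directly from the ANOVA sums of squares; a sketch reproducing the value above (note that with one predictor, F equals t² for the slope test):

```python
# F test of the overall regression for the appliance-store example:
# F = MSR / MSE = (SSR / 1) / (SSE / (n - 2)).
ssr, sse, n = 4.9, 1.1, 5        # sums of squares from the ANOVA table

msr = ssr / 1                    # regression mean square (1 df, one predictor)
mse = sse / (n - 2)              # residual mean square (3 df)
f_stat = msr / mse
print(f_stat)  # about 13.36; with one predictor, F = t² (3.656² ≈ 13.37)
```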
Testing for the existence of a linear relationship

We test the hypothesis:

H₀: b = 0 (the independent variable is not a significant predictor of the dependent variable)
H₁: b ≠ 0 (the independent variable is a significant predictor of the dependent variable)

If the null hypothesis is rejected (b is not equal to zero), we can conclude that the independent variable contributes significantly to predicting the dependent variable.

Test statistic, with n − 2 degrees of freedom:

t = b / s_b, where s_b = sₑ / √(Σ(X − X̄)²)
Coefficientsᵃ (SPSS output)

                               Unstandardized      Standardized
                               Coefficients        Coefficients
Model                          B       Std. Error  Beta      t       Sig.
1  (Constant)                  -.100   .635                  -.157   .885
   Advertising expenses ($00)  .700    .191        .904      3.656   .035

a. Dependent Variable: Sales revenue ($000)
Test statistic, with n − 2 degrees of freedom: t = b / s_b = 0.700 / 0.191 = 3.656

Rejection region: |t| > t₀.₀₂₅,₃ = 3.182

Conclusion: The calculated test statistic is 3.66, which is outside the acceptance region. Alternately, the actual significance is 0.035. Therefore we reject the null hypothesis: advertising expense is a significant explanatory variable.
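The standard error of the slope and the resulting t statistic can be checked against the SPSS coefficients table; a sketch for this example:

```python
import math

# t statistic for the slope: t = b / s_b, with s_b = s_e / sqrt(Σ(X - X̄)²).
b, se = 0.7, 0.6055              # slope and standard error of estimate above
x = [1, 2, 3, 4, 5]
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)   # 10

s_b = se / math.sqrt(sxx)        # about .191, matching the coefficients table
t = b / s_b
print(t)  # about 3.66
```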
Example

The sales and advertising data for brass door hinges, for the past five months, are given by the marketing manager in the table below. The marketing manager says that next month the company will spend $1,750 on advertising for the product. Use linear regression to develop an equation and a forecast for this product.
Causal Methods: Linear Regression

Month   Sales, Y (000 units)   Advertising, X (000 $)
1       264                    2.5
2       116                    1.3
3       165                    1.4
4       101                    1.0
5       209                    2.0

a = Ȳ − bX̄
Causal Methods: Linear Regression

Month   Sales, Y      Advertising, X   XY       X²      Y²
        (000 units)   (000 $)
1       264           2.5              660.0    6.25    69,696
2       116           1.3              150.8    1.69    13,456
3       165           1.4              231.0    1.96    27,225
4       101           1.0              101.0    1.00    10,201
5       209           2.0              418.0    4.00    43,681
Total   855           8.2              1560.8   14.90   164,259

Ȳ = 171    X̄ = 1.64

a = −8.136    b = 109.229

Y = −8.136 + 109.229X
Causal Methods: Linear Regression

Using the totals from the table above (n = 5):

r = (nΣXY − ΣXΣY) / √( [nΣX² − (ΣX)²][nΣY² − (ΣY)²] ) = 0.98
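The fitted equation, the correlation, and the requested forecast for next month's $1,750 of advertising (X = 1.75 in 000 $) can all be computed from the raw data; a sketch reproducing the slide's a, b, and r (the forecast value itself is computed here, not taken from the slides):

```python
import math

# Brass-hinge example: fit Y = a + bX and forecast at X = 1.75 ($1,750).
x = [2.5, 1.3, 1.4, 1.0, 2.0]    # advertising (000 $)
y = [264, 116, 165, 101, 209]    # sales (000 units)
n = len(x)

sx, sy = sum(x), sum(y)                        # 8.2, 855
sxy = sum(xi * yi for xi, yi in zip(x, y))     # 1560.8
sx2 = sum(xi ** 2 for xi in x)                 # 14.90
sy2 = sum(yi ** 2 for yi in y)                 # 164,259

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)  # slope, about 109.229
a = sy / n - b * sx / n                        # intercept, about -8.136
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

forecast = a + b * 1.75                        # about 183 (000 units)
```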
Linear Trend – Using the Least Squares Method: An Example

The sales of Jensen Foods, a small grocery chain located in southwest Texas, since 2005 are:

Year   t   Sales ($ mil.)
2005   1   7
2006   2   10
2007   3   9
2008   4   11
2009   5   13
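A linear trend is just a least-squares regression of sales on the time code t. The slide gives only the data, so the coefficients below are computed here as a sketch, not taken from the source:

```python
# Least-squares linear trend for the Jensen Foods data: sales on time code t.
t = [1, 2, 3, 4, 5]
sales = [7, 10, 9, 11, 13]
n = len(t)

st, ss = sum(t), sum(sales)                       # 15, 50
sts = sum(ti * yi for ti, yi in zip(t, sales))    # 163
st2 = sum(ti ** 2 for ti in t)                    # 55

b = (n * sts - st * ss) / (n * st2 - st ** 2)     # trend slope
a = ss / n - b * st / n                           # trend intercept
print(a, b)  # trend line: sales ≈ 6.1 + 1.3 t ($ mil. per coded year)
```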
Nonlinear Trends (File: PPT_Log_Regr)

• A linear trend equation is used when the data are increasing (or decreasing) by equal amounts.
• A nonlinear trend equation is used when the data are increasing (or decreasing) by increasing amounts over time.
• When data increase (or decrease) by equal percents or proportions, the plot will show a curvilinear pattern.
• Consider the sales data for Gulf Shores Importers, as shown on the next slide. The top graph is the original data.
• Taking log(sales), the log base 10 of the original data, yields a series that is linear.
• Using SPSS or Data Analysis in Excel, generate the linear equation.
• The regression output is shown on a subsequent slide.
Log Trend Equation – Gulf Shores Importers Example

Year   Sales   Code   Log(Sales)
1995   124     1      2.09
1996   176     2      2.24
1997   307     3      2.49
1998   524     4      2.72
1999   714     5      2.85
2000   1052    6      3.02
2001   1638    7      3.21
2002   2403    8      3.38
2003   3358    9      3.53
2004   4181    10     3.62
2005   5389    11     3.73
2006   8027    12     3.90
2007   10587   13     4.02
2008   13537   14     4.13
2009   17516   15     4.24
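The log trend equation can be fitted by regressing log10(sales) on the time code. The coefficients below are computed from the table's raw sales figures as a sketch; they are not taken from the (omitted) SPSS output:

```python
import math

# Log trend for Gulf Shores Importers: regress log10(sales) on time code.
sales = [124, 176, 307, 524, 714, 1052, 1638, 2403, 3358, 4181,
         5389, 8027, 10587, 13537, 17516]
code = list(range(1, 16))
logs = [math.log10(s) for s in sales]
n = len(sales)

c_bar = sum(code) / n
l_bar = sum(logs) / n
b = (sum((c - c_bar) * (l - l_bar) for c, l in zip(code, logs))
     / sum((c - c_bar) ** 2 for c in code))      # slope on the log scale
a = l_bar - b * c_bar                            # intercept on the log scale

growth = 10 ** b - 1   # implied annual growth rate of sales (antilog of slope)
```

Because the model is linear in log10(sales), the antilog of the slope gives the constant annual growth factor that makes the original series curvilinear.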
Log Trend Equation – Gulf Shores Importers Example – SPSS output (regression output omitted)
Log Trend Equation – Gulf Shores Importers Example (chart omitted)