
11/9/2020

REGRESSION ANALYSIS
• Regression analysis attempts to establish the nature of the relationship between variables.
• It measures the average relationship between two or more variables.
• It is the most frequently used technique in economics and business research.

Historical Origin of Regression
• Regression analysis was first developed by Sir Francis Galton, who studied the relationship between the heights of fathers and their sons.
• The heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.

REGRESSION ANALYSIS
• A statistical tool for estimating the unknown values of one variable from the known values of another variable.
• The variables are the independent variable (X) and the dependent variable (Y).
• Simple linear regression analysis uses only one predictor and a straight line.
• “Dependent” and “independent” are used in their mathematical or functional sense.
• Values of Y depend on values of X; X may or may not be causing the change in Y.

USES
• Provides estimates of values of the dependent variable from values of the independent variable, using regression lines.
• Gives a measure of the error involved in using the regression line as the basis for estimation.
• The correlation coefficient can be calculated from the regression coefficients.

DIFFERENCES WITH CORRELATION
• Correlation: measures the degree of relationship, that is, the degree of covariability.
• Regression: studies the nature of the relationship.
• Correlation: cannot tell which variable is the cause and which is the effect.
• Regression: one variable is dependent, the other independent.

REGRESSION LINES
• The two regression lines (Y on X and X on Y) cut each other at the point of the averages of X and Y.
• They are drawn on the least squares assumption.

REGRESSION EQUATIONS
• The regression equation of ‘Y’ on ‘X’ is expressed as: Y = a + bX
• ‘Y’ is the dependent variable, ‘X’ is the independent variable.
• ‘a’ is the Y-intercept, ‘b’ is the slope (change in Y for a unit change in X).
• The values of ‘a’ and ‘b’ are found by the method of least squares.

REGRESSION EQUATIONS
• Least squares method: the line should be drawn through the plotted points in such a manner that the sum of squares of the deviations of the actual ‘y’ values from the computed ‘y’ values ($y_e$) is the least.
• $\sum (y - y_e)^2$ should be a minimum to obtain the best-fitting line.

CHARACTERISTICS OF STRAIGHT LINE (BEST FIT)
• Gives the best fit to the data.
• $\sum (y - y_e)^2$ is a minimum, and the deviations above the line add up to the same total as those below it, so $\sum (y - y_e) = 0$.
• The straight line goes through the overall mean of the data, $(\bar{x}, \bar{y})$.
• For data representing a sample from a population, the least squares line is the ‘best’ estimate of the population regression line.

REGRESSION EQUATIONS
• Similarly, the regression equation of ‘X’ on ‘Y’ is expressed as: X = a + bY
• ‘X’ is the dependent variable, ‘Y’ is the independent variable.
• ‘a’ is the X-intercept, ‘b’ is the slope (change in ‘X’ for a unit change in ‘Y’).
• Find the values of ‘a’ and ‘b’ by the method of least squares.
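The use listed earlier, calculating the correlation coefficient from the regression coefficients, follows from these two equations. The short derivation below is a sketch under assumed notation that the slides do not use: $b_{yx}$ and $b_{xy}$ are the slopes of Y on X and of X on Y, and $SS_y$ is the sum of squared deviations of y from its mean.

```latex
SP_{xy} = \sum (x - \bar{x})(y - \bar{y}), \qquad
SS_x = \sum (x - \bar{x})^2, \qquad
SS_y = \sum (y - \bar{y})^2

b_{yx} = \frac{SP_{xy}}{SS_x}, \qquad
b_{xy} = \frac{SP_{xy}}{SS_y}
\quad\Longrightarrow\quad
b_{yx}\, b_{xy} = \frac{SP_{xy}^{2}}{SS_x\, SS_y} = r^{2}, \qquad
r = \pm\sqrt{b_{yx}\, b_{xy}}
```

Both slopes share the sign of $SP_{xy}$, and r takes that same sign.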

EXPRESSION FOR A LINE
[Figure: the line y = 4 + 0.3x, with the X axis running from 0 to 18 and the y axis from 0 to 9, and points P and Q marked. Slope b = y′/x′; a = intercept.]

REGRESSION ANALYSIS: LIMITATIONS
• It assumes that the relationship has not changed since the regression equation was computed.
• The relationship shown by the scatter diagram may not hold if the equation is extended beyond the range of values used in computing it.

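LINE OF BEST FIT
The regression equation is given by $\hat{y} = a + bx$, where
$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{SP_{xy}}{SS_x}$ and $a = \bar{y} - b\,\bar{x}$
• The numerator of the equation for b is called the sum of products, $SP_{xy}$.
• The denominator is the sum of squared deviations from the mean, $SS_x$.
• The denominator is always positive, so the sign of the slope is determined by the sign of the numerator.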
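The slope and intercept calculation can be sketched in a few lines of Python. This is a minimal illustration, not code from the deck: the function name `fit_line` is hypothetical, and the data are the ten (hours without sleep, errors) pairs from the air traffic controller illustration later in the deck, assuming the two rows pair up in the order listed.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products SPxy and sum of squared deviations SSx
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp_xy / ss_x          # slope: sign follows SPxy because SSx > 0
    a = y_bar - b * x_bar     # forces the line through (x_bar, y_bar)
    return a, b


hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]   # hours without sleep
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]    # number of errors

a, b = fit_line(hours, errors)
residuals = [y - (a + b * x) for x, y in zip(hours, errors)]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"largest |sum of residuals| rounding error: {abs(sum(residuals)):.1e}")
```

For these data the fit is approximately y = 3.0 + 0.475x, and the residuals sum to zero up to floating-point rounding, which is the "deviations above equal deviations below" property of the least squares line.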

REGRESSION EQUATION FOR POINT ESTIMATE
• If the number of hours of study is 4, what is the estimate of marks in the exam?
• ‘Point estimate’ of y using the regression equation: y = a + bx = 1.0277 + 5.1389 × 4 = 21.58
• The value of ‘x’ for which you wish to estimate y should lie within the range of the given data (i.e. 3 to 10).
• Reliability of the point estimate depends on:
  • Sample size.
  • Amount of variation within the sample.
  • Value of ‘x’.
• Therefore, an ‘interval estimate’ is always better.
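The point estimate and its range restriction can be expressed as a small helper. This is a sketch only: the function name `point_estimate` and its parameters are hypothetical, while the coefficients (1.0277, 5.1389) and the data range (3 to 10) are the ones quoted on the slide.

```python
def point_estimate(x, a=1.0277, b=5.1389, x_min=3, x_max=10):
    """Return a + b*x, refusing to extrapolate outside the observed x range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} lies outside the data range [{x_min}, {x_max}]"
        )
    return a + b * x


marks = point_estimate(4)
print(round(marks, 2))  # 21.58
```

Raising an error for x outside 3 to 10 is one way to enforce the slide's warning; a softer design would return the estimate together with a flag.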

STD ERROR OF ESTIMATE (Measure of Goodness of Fit) (Std Error of Regression)
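For simple linear regression, the standard error of estimate is usually defined as below. This is the standard textbook form, stated here as an assumption about what the slide showed, with $\hat{y} = a + bx$ and n the number of observations; the divisor n − 2 reflects the two estimated coefficients a and b.

```latex
s_{y,x} = \sqrt{\frac{\sum (y - \hat{y})^{2}}{n - 2}}
```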

ASSUMPTIONS: LINE
1. All actual values of y for a given value of x are normally distributed around the estimated value $\hat{y}$ (half of the errors negative and half positive).
2. The mean of each error component is zero (the mean of all y’s for a given x is equal to the estimate $\hat{y}$).
3. The variances of the error components (the variances of the y’s at the various x’s) are the same: homoscedasticity.
4. The errors are independent of each other.

Assumptions of the Simple Linear Regression Model
LINE assumptions of the simple linear regression model: Linear, Independent, Normal and Equal variance.
[Figure: identical normal distributions of errors, $N(\mu_{y|x}, \sigma_{y|x}^2)$, all centered on the regression line $\mu_{y|x} = a + bx$.]

Pictorial Presentation of Linear Regression Model

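REPRESENTING STANDARD ERROR OF ESTIMATE
[Figure: the regression line y = a + bx, with the independent variable on the X axis and the dependent variable on the y axis, and distances of 1, 2 and 3 $s_{y,x}$ marked from the line.]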

STANDARD ERROR OF ESTIMATE
• In the hours-of-study example, the standard error of estimate is $s_{y,x} = \sqrt{2.884} = 1.698$ marks.
• What does it mean?

INTERPRETING STD ERROR OF ESTIMATE
• We can expect to find, around the estimated y ($\hat{y}$):
  • 68.26% of the points (y values) within ±1 $s_{y,x}$
  • 95.45% of the points (y values) within ±2 $s_{y,x}$
  • 99.7% of the points (y values) within ±3 $s_{y,x}$
• The larger the standard error of estimate, the greater the scattering of points around the regression line.
• Conversely, if $s_{y,x} = 0$, the estimating equation would be a perfect estimator of the dependent variable.

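INTERVAL ESTIMATION
• Interval estimate of y for an x value (for a given level of significance and sample size): $\hat{y} - t\,s_{y,x}$ to $\hat{y} + t\,s_{y,x}$, where t is the table value for that level of significance and n − 2 degrees of freedom.
• The accuracy of this interval estimate depends on the distance of x from its mean ($\bar{x}$).
• The closer x is to $\bar{x}$, the more reliable the estimate.
• Hence, for x values other than $\bar{x}$, a correction factor is used.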

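CONFIDENCE INTERVAL FOR ESTIMATION OF MEAN
• The confidence interval for the mean value of y (using the correction factor for a given x) is given by:
$\hat{y} - t\,s_{y,x}\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_x}}$ to $\hat{y} + t\,s_{y,x}\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_x}}$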

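PREDICTION INTERVAL FOR ESTIMATION OF INDIVIDUAL Y VALUE
• The interval for an individual value of y (and not the mean value of y) is given by:
$\hat{y} - t\,s_{y,x}\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_x}}$ to $\hat{y} + t\,s_{y,x}\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SS_x}}$
• Therefore, the interval for an individual y is wider than the interval for mean y.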

Confidence Interval for the Average Value of Y
[Figure: brain IL-6 (Y, 0 to 250) plotted against serum IL-6 (X, 20 to 100), showing the actual observations, mean Y, and the upper and lower 95% confidence limits for the mean.]

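Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y
[Figure: confidence interval for mean Y and prediction interval for individual Y, plotted around the fitted line.]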

AN ILLUSTRATION: LRCA
Qn. A study was conducted by the Air Force on the effect of sleep deprivation on air traffic controllers’ performance whilst on watch. The sample data are as follows:

No. of hrs w/o sleep:   8   8  12  12  16  16  20  20  24  24
No. of errors:          8   6   6  10   8  14  14  12  16  12

Estimate the number of errors if the number of hours without sleep were 10, at the 95% confidence level.
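One way to work this question is sketched below in Python, using only the standard library. Several things here are assumptions rather than the deck's own solution: the two data rows are paired in the order listed, the intervals use the usual simple-regression forms with the correction factor 1/n + (x − x̄)²/SSx, and the critical value 2.306 is the two-sided 95% t value for n − 2 = 8 degrees of freedom, hard-coded from a t table. Variable names are illustrative.

```python
import math

hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]   # hours without sleep (x)
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]    # number of errors (y)

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(errors) / n
ss_x = sum((x - x_bar) ** 2 for x in hours)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, errors))

b = sp_xy / ss_x            # slope
a = y_bar - b * x_bar       # intercept

# Standard error of estimate: sqrt(SSE / (n - 2))
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, errors))
s_yx = math.sqrt(sse / (n - 2))

T_CRIT = 2.306              # assumed: t table value, 95% two-sided, 8 df

x0 = 10                     # lies inside the observed range 8 to 24
y_hat = a + b * x0
correction = 1 / n + (x0 - x_bar) ** 2 / ss_x

half_mean = T_CRIT * s_yx * math.sqrt(correction)        # mean of y at x0
half_indiv = T_CRIT * s_yx * math.sqrt(1 + correction)   # individual y at x0

print(f"point estimate: {y_hat:.2f}")
print(f"95% CI for mean y: {y_hat - half_mean:.2f} to {y_hat + half_mean:.2f}")
print(f"95% interval for individual y: "
      f"{y_hat - half_indiv:.2f} to {y_hat + half_indiv:.2f}")
```

Under these assumptions the point estimate is 7.75 errors, the interval for the mean number of errors is about 5.4 to 10.1, and the interval for an individual controller is about 2.1 to 13.4, which shows how much wider the individual interval is.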

CORRELATION ANALYSIS
• How strong is the relationship between the dependent and independent variables?
• How are the variables correlated?
• A statistical tool to describe the degree to which one variable is linearly related to another.
• Measures for describing the correlation between two variables:
  - Coefficient of determination, $r^2$
  - Coefficient of correlation, $r$

COEFFICIENT OF DETERMINATION
• Measures the extent or strength of association.
• It is the percentage of explained variation in the dependent variable (y).
• Coefficient of determination = (Total variation − Unexplained variation) / Total variation = (SST − SSE) / SST, e.g. (968 − 17.3) / 968 ≈ 0.98
• For the ATC case (number of errors and hours without sleep), $r^2 = 0.64$. What does it mean? It means 64% of the variation in errors is explained by lack of sleep; the balance could be due to poor training, etc.

COEFFICIENT OF DETERMINATION
• Measures the extent or strength of association.
• It is the percentage of explained variation in the dependent variable (y).
• Coefficient of determination:
  • $r^2 = 0$ if $\hat{y} = \bar{y}$ for all values of x, showing no correlation.
  • $r^2 = 1$ if $y = \hat{y}$ for all values of x, showing perfect correlation.
[Figure: two scatter plots of y against x illustrating the two cases.]

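CORRELATION ANALYSIS
Interpreting $r^2$ another way: interpret the coefficient of determination by looking at the amount of the variation in y that can be explained by the regression line.
[Figure: for one observed point, the total variation $(y - \bar{y})$ split into the explained variation $(\hat{y} - \bar{y})$ and the unexplained variation $(y - \hat{y})$.]
Total variation = Explained variation + Unexplained variation, i.e. $\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$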

CORRELATION ANALYSIS: the Coefficient of Correlation, r
$r = \pm\sqrt{r^2}$
• Measures the strength of the relationship, i.e. how strongly the variables are related.
• Multiple r = 0.8 in the ATC case (errors and hours without sleep) means a very strong relationship between the two variables.
• The sign of ‘r’ follows the sign of the slope (b) of the regression line.
• A negative sign indicates an inverse relationship between the two variables.
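The ATC figures quoted above (r² = 0.64, r = 0.8) can be reproduced from the sample data in the illustration. The sketch below is illustrative only: it assumes the two data rows pair up in the order listed, and the variable names are not from the deck.

```python
import math

hours = [8, 8, 12, 12, 16, 16, 20, 20, 24, 24]   # hours without sleep (x)
errors = [8, 6, 6, 10, 8, 14, 14, 12, 16, 12]    # number of errors (y)

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(errors) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, errors))
     / sum((x - x_bar) ** 2 for x in hours))
a = y_bar - b * x_bar

sst = sum((y - y_bar) ** 2 for y in errors)                        # total
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, errors))   # unexplained

r_squared = (sst - sse) / sst
r = math.copysign(math.sqrt(r_squared), b)   # r takes the sign of the slope

print(f"SST = {sst:.1f}, SSE = {sse:.1f}, r^2 = {r_squared:.2f}, r = {r:.2f}")
# SST = 112.4, SSE = 40.2, r^2 = 0.64, r = 0.80
```

Using `math.copysign` makes the sign rule explicit: the square root alone is always non-negative, so the sign has to be taken from the slope b.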

PROPERTIES OF SAMPLE CORRELATION COEFFICIENT (r)
• Ranges from −1 to +1.
• The sign of r tells whether the relationship is positive or negative.
• A larger absolute value of r indicates a stronger relationship.
• An r value near zero indicates ‘no or poor’ linear relationship between x and y.
• r = +1 or −1 indicates a perfect linear relationship.
• r values of exactly 0, 1 or −1 are rare in practice.
