Class 7 Thurs Sep 30 Outliers and Influential

Class 7: Thurs., Sep. 30

Outliers and Influential Observations
Outlier: Any really unusual observation.
• Outlier in the X direction (called a high leverage point): Has the potential to influence the regression line.
• Outlier in the direction of the scatterplot: An observation that deviates from the overall pattern of relationship between Y and X. Typically has a residual that is large in absolute value.
• Influential observation: A point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

Housing Prices and Crime Rates
• A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenues from higher property values.
• The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. Data is in philacrimerate.JMP. House price = Average house price for sales during most recent year, Crime Rate = Rate of crimes per 1000

Which points are influential?
Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.
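One way to see what "influential" means is to fit the line twice, once with and once without a suspect point, and compare the slopes. The sketch below does this in Python with NumPy on made-up numbers. The arrays and the helper `fit_line` are illustrative assumptions, not the Philadelphia data and not anything JMP produces.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Hypothetical data: ten ordinary points near the line y = 200 - 2x,
# plus one point far out in the x direction (a high leverage point).
x_bulk = np.array([10., 12., 15., 18., 20., 22., 25., 28., 30., 35.])
noise = np.array([3., -2., 1., -4., 2., -1., 4., -3., 2., -2.])
y_bulk = 200.0 - 2.0 * x_bulk + noise

x_all = np.append(x_bulk, 350.0)
y_all = np.append(y_bulk, 100.0)

b0_all, b1_all = fit_line(x_all, y_all)     # fit with the extreme point
b0_drop, b1_drop = fit_line(x_bulk, y_bulk)  # fit with it excluded

print(f"slope with the point:    {b1_all:.3f}")
print(f"slope without the point: {b1_drop:.3f}")
```

If the two slopes differ enough to change the story told by the regression, the point is influential in the sense used on these slides.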

Excluding Observations from Analysis in JMP
• To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation.
• To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.

Formal measures of leverage and influence
• Leverage: “Hat values” (JMP calls them hats).
• Influence: Cook’s Distance (JMP calls them Cook’s D Influence).
• To obtain them in JMP, click Analyze, Fit Model, put the Y variable in Y and the X variable in the Model Effects box. Click the Run Model box. After the model is fit, click the red triangle next to Response. Click Save Columns, then click Hats for leverages and Cook’s D Influences for Cook’s Distances.
• To sort observations in terms of Cook’s Distance or leverage, click Tables, Sort and then put the variable you want to sort by in the By box.
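For readers working outside JMP, both quantities can be computed directly for simple linear regression. The sketch below uses the standard textbook formulas: hat value = 1/n + (x_i - mean of x)^2 / Sxx, and Cook's Distance = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, with p = 2 coefficients and s^2 the residual mean square. The function name `leverage_and_cooks` and the demo arrays are hypothetical, and the output has not been checked against JMP's saved columns.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Hat values and Cook's distances for the simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    p = 2  # intercept and slope

    dx = x - x.mean()
    sxx = np.sum(dx ** 2)
    hats = 1.0 / n + dx ** 2 / sxx

    slope = np.sum(dx * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - p)  # residual mean square

    cooks = resid ** 2 / (p * s2) * hats / (1.0 - hats) ** 2
    return hats, cooks

# Hypothetical data: the last point is far out in the x direction.
x_demo = np.array([10., 12., 15., 18., 20., 22., 25., 28., 30., 35., 350.])
y_demo = np.array([183., 174., 171., 160., 162., 155., 154., 141., 142., 128., 100.])

hats, cooks = leverage_and_cooks(x_demo, y_demo)
order = np.argsort(cooks)[::-1]  # most influential first, like Tables > Sort
for i in order[:3]:
    print(f"row {i}: hat = {hats[i]:.3f}, Cook's D = {cooks[i]:.3f}")
```

A quick sanity check on any implementation: the hat values always add up to the number of coefficients, here 2.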

Center City Philadelphia has both high influence (Cook’s Distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.

Rules of Thumb for High Leverage and High Influence
• High Leverage: Any observation with a leverage (hat value) > (3 * # of coefficients in regression model)/n has high leverage, where # of coefficients in regression model = 2 for simple linear regression and n = number of observations.
• High Influence: Any observation with a Cook’s Distance greater than 1 has high influence.
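The two rules of thumb are easy to apply mechanically once hat values and Cook's Distances have been saved. The following sketch is one possible way to do so in Python. The function names and the ten example values are invented for illustration, and the cutoffs are only the rules of thumb stated above, not formal tests.

```python
import numpy as np

def leverage_cutoff(n_obs, n_coef=2):
    """Rule-of-thumb cutoff: 3 * (# of coefficients) / n."""
    return 3.0 * n_coef / n_obs

def flag_observations(hats, cooks_d, n_coef=2):
    """Return boolean arrays marking high leverage and high influence."""
    hats = np.asarray(hats, dtype=float)
    cooks_d = np.asarray(cooks_d, dtype=float)
    high_leverage = hats > leverage_cutoff(hats.size, n_coef)
    high_influence = cooks_d > 1.0
    return high_leverage, high_influence

# Cutoff for simple linear regression with n = 99, as on the earlier slide.
cutoff_99 = leverage_cutoff(99)
print(f"leverage cutoff for n = 99: {cutoff_99:.4f}")

# Hypothetical saved columns for a ten-observation regression.
example_hats = [0.10, 0.15, 0.12, 0.10, 0.11, 0.13, 0.09, 0.10, 0.20, 0.90]
example_cooks = [0.01, 0.02, 0.00, 0.05, 0.01, 0.03, 0.02, 0.01, 0.10, 5.00]

high_lev, high_inf = flag_observations(example_hats, example_cooks)
print("high leverage rows: ", np.flatnonzero(high_lev).tolist())
print("high influence rows:", np.flatnonzero(high_inf).tolist())
```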

What to Do About Suspected Influential Observations?
See flowchart handout. Does removing the observation change the substantive conclusions?
• If not, we can say something like “Observation x has high influence relative to all other observations but we tried refitting the regression without Observation x and our main conclusions didn’t change.”

• If removing the observation does change substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
  – If yes, omit the observation and proceed.
  – If no, does the observation have high leverage (outlier in explanatory variable)?
    • If yes, omit the observation and proceed. Report that conclusions only apply to a limited range of the explanatory variable.
    • If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.
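The decision sequence above can be written out as a small function, which makes the order of the questions explicit. This is only a sketch of the steps as listed on these two slides. The function name, argument names, and returned wording are assumptions for illustration, and the three yes/no answers still have to come from the analyst's judgment.

```python
def advice_for_influential_point(changes_conclusions,
                                 from_other_population,
                                 high_leverage):
    """Walk through the flowchart questions in order and return the advice."""
    if not changes_conclusions:
        # Refit without the point gave the same substantive conclusions.
        return "keep: report that refitting without it did not change the main conclusions"
    if from_other_population:
        return "omit: the observation belongs to a different population"
    if high_leverage:
        return "omit: report that conclusions apply only to a limited range of X"
    return "unresolved: more data or clarification of the observation is needed"

# Example: a point that changes the conclusions, is not known to come from
# another population, but is an outlier in the explanatory variable.
print(advice_for_influential_point(True, False, True))
```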

General Principles for Dealing with Influential Observations
• General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

The Question of Causation
• The community that ran this regression would like to increase property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenue from higher property values.
• The regression without Center City Philadelphia is Linear Fit: House.Price = 225233.55 - 2288.6894 Crime.Rate
• The community concludes that if it can cut its crime rate from 30 down to 20 incidents per 1000 population, it will increase its average house price by $2288.6894*10 = $22,887.
• Is the community’s conclusion justified?
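The community's figure is simply the difference between two points on the fitted line. The short sketch below reproduces that arithmetic from the coefficients quoted above. The function name is hypothetical, and the calculation describes how the estimated mean house price differs between communities with crime rates of 20 and 30; by itself it says nothing about what would happen if one community changed its own crime rate.

```python
# Coefficients from the fit without Center City Philadelphia (quoted above).
INTERCEPT = 225233.55
SLOPE = -2288.6894

def fitted_mean_price(crime_rate):
    """Estimated mean house price at a given crime rate (per 1000)."""
    return INTERCEPT + SLOPE * crime_rate

price_at_30 = fitted_mean_price(30)
price_at_20 = fitted_mean_price(20)
difference = price_at_20 - price_at_30

print(f"fitted mean at crime rate 30: {price_at_30:,.0f}")
print(f"fitted mean at crime rate 20: {price_at_20:,.0f}")
print(f"difference: {difference:,.0f}")
```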

Potential Outcomes Model
• Let $Y_i^{30}$ denote what the house price for community i would be if its crime rate was 30 and $Y_i^{20}$ denote what the house price for community i would be if its crime rate was 20.
• X (crime rate) causes a change in Y (house price) for community i if $Y_i^{30} \neq Y_i^{20}$. A decrease in crime rate causes an increase in house price for community i if $Y_i^{20} > Y_i^{30}$.

Association is Not Causation
• A regression model tells us about how the mean of Y|X is associated with changes in X. A regression model does not tell us what would happen if we actually changed X.
• Possible Explanations for an Observed Association Between Y and X
1. Y causes X
2. X causes Y
3. There is a confounding variable Z that is associated with changes in both X and Y.
Any combination of the three explanations may apply to an observed association.

Y Causes X
Perhaps it is changes in house price (Y) that cause changes in crime rate (X). When house prices increase, the residents of a community have more to lose by engaging in criminal activities; this is called the economic theory of crime.

Confounding Variables
• Confounding variable for the causal relationship between X and Y: A variable Z that is associated with both X and Y.
• Example of confounding variable in Philadelphia crime rate data: Level of education may be associated with both house prices and crime rate.
• The effect of crime rate on house price is confounded with the effect of education on house price. If we just look at data on house price and crime rate, we can’t distinguish between the effect of crime rate on house price and the effect of education on house price.
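A small simulation can make the last bullet concrete. In the sketch below, a variable Z drives both X and Y, and X has no effect on Y at all, yet the regression of Y on X still shows a clear slope. Everything here is invented for illustration: the variables are in arbitrary standardized units, the coefficients -2 and 5 are assumptions, and the labels (education, crime rate, house price) only mirror the example above rather than any real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Z (think "education level") influences both other variables.
z = rng.normal(size=n)
x = -2.0 * z + rng.normal(size=n)  # X (think "crime rate") depends on Z
y = 5.0 * z + rng.normal(size=n)   # Y (think "house price") depends on Z only

# Regress Y on X alone, ignoring Z.
naive_slope, naive_intercept = np.polyfit(x, y, 1)
correlation = np.corrcoef(x, y)[0, 1]

print(f"slope of Y on X, ignoring Z: {naive_slope:.2f}")
print(f"correlation of X and Y:      {correlation:.2f}")
```

The slope is clearly negative even though, by construction, changing X would not change Y. Looking at X and Y alone cannot reveal this.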

Note on Confounding Variables and Lurking Variables
• The book’s distinction between lurking variable and confounding variable is confusing and the term “lurking variable” is not standard in statistics, whereas “confounding variable” is. So I will just use the term confounding variable in the rest of the course.

Examples of Confounding Variables
• Many studies have found that people who are active in their religion live longer than nonreligious people. Potential confounding variables?

Weekly Wages (Y) and Education (X) in March 1988 CPS
Will getting an extra year of education cause an increase of $50.41 on average in your weekly wage? What are some potential confounding variables?

Math enrollment data: The residual plot vs. time indicates that there is a confounding variable associated with time. It turns out that one of the schools (say the engineering school) in the university changed its program to require that entering students take another mathematics course. The variable of whether the engineering school requires its students to take another mathematics course is a confounding variable.

Establishing Causation
• Best method is an experiment, but many times that is not ethically or practically possible (e.g., smoking and cancer, education and earnings).

• Main strategy for learning about causation when we can’t do an experiment: Consider all confounding variables you can think of. Try to take them into account (we’ll see how to do this when we study multiple regression in Chapter 11) and see if association between Y and X remains once the known confounding variables have been accounted for.
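As a preview of what "taking a confounder into account" can look like, the sketch below fits two least-squares models to simulated data: Y on X alone, and Y on X and Z together. The data-generating numbers, variable roles, and the helper `ols` are all assumptions made for illustration, with X given no direct effect on Y by construction. It is a toy version of the idea, not the multiple regression procedure the course will present.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

z = rng.normal(size=n)              # known confounder
x = -2.0 * z + rng.normal(size=n)   # X is associated with Z
y = 5.0 * z + rng.normal(size=n)    # Y depends on Z, not on X

def ols(predictors, response):
    """Least-squares coefficients: intercept first, then one per predictor."""
    design = np.column_stack([np.ones(len(response))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

naive = ols([x], y)        # Y on X only
adjusted = ols([x, z], y)  # Y on X and Z

print(f"coefficient on X, ignoring Z:    {naive[1]:.2f}")
print(f"coefficient on X, accounting Z:  {adjusted[1]:.2f}")
print(f"coefficient on Z:                {adjusted[2]:.2f}")
```

Here the association between Y and X essentially disappears once Z is in the model. In real data the interesting case is when it does not, though unmeasured confounders can still be at work.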

Other Criteria for Establishing Causation When We Can’t Do An Experiment
1. The association is strong.
2. The association is consistent.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.