Chapter 7: Scatterplots and Correlation


• Scatterplots: graphical display of bivariate data
• Correlation: a numerical summary of bivariate data


Objectives: Scatterplots
• Explanatory and response variables
• Interpreting scatterplots
• Outliers
• Categorical variables in scatterplots


Basic Terminology
• Univariate data: 1 variable is measured on each sample unit or population unit, e.g. the height of each student in a sample.
• Bivariate data: 2 variables are measured on each sample unit or population unit, e.g. the height and GPA of each student in a sample. (Caution: data from 2 separate univariate samples are not bivariate data.)


Basic Terminology (cont.)
• Multivariate data: several variables are measured on each unit in a sample or population.
    - For each student in a sample of NCSU students, measure height, GPA, and distance between NCSU and hometown.
• Focus on bivariate data in Chapter 7.


Same goals with bivariate data that we had with univariate data
• Graphical displays and numerical summaries
• Seek overall patterns and deviations from those patterns
• Descriptive measures of specific aspects of the data


Here, we have two quantitative variables for each of 16 students:
1) How many beers they drank, and
2) Their blood alcohol level (BAC)
Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
4        7      0.095
5        3      0.07
6        3      0.02
7        4      0.07
8        5      0.085
9        8      0.12
10       3      0.04
11       5      0.06
12       5      0.05
13       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
We are interested in the relationship between the two variables: how is one affected by changes in the other one?


Scatterplots
• Useful method to graphically describe the relationship between 2 quantitative variables


Scatterplot: Blood Alcohol Content vs Number of Beers
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
[Figure: scatterplot of the same 16 (Beers, BAC) pairs, one point per student.]


Focus on Three Features of a Scatterplot
Look for an overall pattern regarding …
1. Shape: approximately linear, curved, up-and-down?
2. Direction: positive, negative, none?
3. Strength: are the points tightly clustered in the particular shape, or are they spread out?
… and deviations from the overall pattern: outliers.
[Figure: Blood Alcohol as a function of Number of Beers. y-axis: Blood Alcohol Level (mg/ml), 0.00 to 0.20; x-axis: Number of Beers, 0 to 10.]


Scatterplot: Fuel Consumption vs Car Weight
x = car weight, y = fuel consumption
• (xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. y-axis: fuel consumption (gal/100 miles); x-axis: weight (1000 lbs).]


Explanatory and response variables
• Response variable: the variable of interest.
• Explanatory variable: explains changes in the response variable.
Typically, the explanatory (or independent) variable is plotted on the x axis, and the response (or dependent) variable is plotted on the y axis.
[Figure: Blood Alcohol as a function of Number of Beers. y-axis: Blood Alcohol Level (mg/ml), the response (dependent) variable; x-axis: Number of Beers, the explanatory (independent) variable.]


SAT Score vs Proportion of Seniors Taking SAT
[Figure: average SAT total score (y-axis, 950 to 1350) vs percent of seniors taking the SAT (x-axis, 0% to 100%). Labeled points: IW, IL, DC, and NC (74%, 1010).]


Correlation: a numerical summary of bivariate data when both variables are quantitative.
Correlation
• The correlation coefficient “r”
• r does not distinguish x and y
• r has no units
• r ranges from -1 to +1
• Influential points


The correlation coefficient "r"
The correlation coefficient is a measure of the direction and strength of the linear relationship between 2 quantitative variables.
It is calculated using the mean and the standard deviation of both the x and y variables.
Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.
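The slide text does not preserve a formula. The usual sample definition, given here on the assumption that the course follows the standard textbook form, standardizes each variable with its own mean and standard deviation and then averages the products:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Here n is the number of units, \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations (computed with divisor n - 1).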


Correlation: Fuel Consumption vs Car Weight
r = .9766
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. y-axis: fuel consumption (gal/100 miles); x-axis: weight (1000 lbs).]
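As a check on the value shown, the following minimal sketch recomputes r from the ten (weight, fuel consumption) pairs listed on the earlier fuel-consumption scatterplot slide. The helper name `pearson_r` is an illustrative choice, not something from the slides.

```python
import math

# (weight in 1000 lbs, fuel consumption in gal/100 miles), copied from the slide
cars = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]


def pearson_r(xs, ys):
    """Sample correlation coefficient r (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The (n - 1) divisors of the covariance and the standard deviations cancel
    return sxy / math.sqrt(sxx * syy)


weights, fuel = zip(*cars)
r_fuel = pearson_r(weights, fuel)
print(round(r_fuel, 4))  # 0.9766, agreeing with the slide
```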


Example: calculating correlation
• (x1, y1), (x2, y2), (x3, y3)
• (1, 3) (1.5, 6) (2.5, 8)
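The slide's own working is not preserved in the text, so the following is a reconstruction using the standard sample formula (written in its sum-of-products form, where the n - 1 divisors cancel):

```latex
\begin{align*}
\bar{x} &= \tfrac{1 + 1.5 + 2.5}{3} = \tfrac{5}{3},
  \qquad \bar{y} = \tfrac{3 + 6 + 8}{3} = \tfrac{17}{3} \\
\sum (x_i - \bar{x})(y_i - \bar{y}) &= \tfrac{11}{3} \approx 3.667 \\
\sum (x_i - \bar{x})^2 &= \tfrac{7}{6} \approx 1.167,
  \qquad \sum (y_i - \bar{y})^2 = \tfrac{38}{3} \approx 12.667 \\
r &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}
   = \frac{3.667}{\sqrt{1.167 \times 12.667}} \approx 0.954
\end{align*}
```

The three points rise together and lie close to a line, so r is positive and near 1.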


Properties of Correlation
• r is a measure of the strength of the linear relationship between x and y.
• No units [like demand elasticity in economics, which ranges over (-infinity, 0)]
• -1 ≤ r ≤ 1
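Two of the properties claimed for r (it does not distinguish x from y, and it has no units) can be checked numerically. The sketch below uses the chapter's 16 (beers, BAC) pairs; the helper name `pearson_r` and the factor of 1000 used to mimic a change of units are illustrative assumptions.

```python
import math

# (beers, BAC) for the 16 students in the chapter's example
students = [(5, 0.10), (2, 0.03), (9, 0.19), (7, 0.095), (3, 0.07), (3, 0.02),
            (4, 0.07), (5, 0.085), (8, 0.12), (3, 0.04), (5, 0.06), (5, 0.05),
            (6, 0.10), (7, 0.09), (1, 0.01), (4, 0.05)]


def pearson_r(xs, ys):
    """Sample correlation coefficient r (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


beers = [b for b, _ in students]
bac = [a for _, a in students]

r_xy = pearson_r(beers, bac)
r_yx = pearson_r(bac, beers)                            # swap the roles of x and y
r_rescaled = pearson_r(beers, [1000 * a for a in bac])  # arbitrary change of units

# All three values agree (up to floating-point rounding)
print(round(r_xy, 3), round(r_yx, 3), round(r_rescaled, 3))
```

Swapping the variables or rescaling one of them changes the scatterplot's axes but leaves r the same.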


Properties (cont.)
r ranges from -1 to +1.
"r" quantifies the strength and direction of a linear relationship between 2 quantitative variables.
• Strength: how closely the points follow a straight line.
• Direction: r is positive when individuals with higher X values tend to have higher values of Y, and negative when they tend to have lower values of Y.


Properties of Correlation (cont.)
• r = -1 only if y = a + bx with slope b < 0
• r = +1 only if y = a + bx with slope b > 0
[Figure: two scatterplots of points lying exactly on a line: y = 1 + 2x with r = 1, and y = 11 - x with r = -1.]
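The two lines from the figure make a quick numerical check. This sketch assumes x takes the integer values 0 through 10 (the plotted points themselves are not recoverable from the text) and uses an illustrative helper named `pearson_r`.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(11))                            # assumed x values: 0, 1, ..., 10
r_up = pearson_r(xs, [1 + 2 * x for x in xs])   # y = 1 + 2x, slope b > 0
r_down = pearson_r(xs, [11 - x for x in xs])    # y = 11 - x, slope b < 0

print(r_up, r_down)  # 1.0 and -1.0, up to floating-point rounding
```

The size of the slope does not matter here: only its sign determines whether r is +1 or -1 when the points fall exactly on a line.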


Properties (cont.)
High correlation does not imply cause and effect
CARROTS: Hidden terror in the produce department at your neighborhood grocery
• Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!!!
• Everyone who ate carrots in 1865 is now dead!!!
• 45 of 50 17-year-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!!!


Properties (cont.) Cause and Effect
• There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)
• Improper training? Will no firemen present result in the least amount of damage?


Properties (cont.) Cause and Effect
x = fouls committed by player; y = points scored by same player
(x, y) = (fouls, points): (1, 2) (24, 75) (1, 0) (18, 59) (9, 9) (3, 7) (5, 35) (20, 46) (1, 0) (3, 2) (22, 57)
correlation r = .935
• r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.
• The correlation is due to a third “lurking” variable: playing time.
[Figure: scatterplot of Points (y-axis, 0 to 80) vs Fouls (x-axis, 0 to 30).]
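The reported correlation can be reproduced from the eleven (fouls, points) pairs with a short sketch; `pearson_r` is again an illustrative helper name. Note that the calculation uses only fouls and points, so nothing in it can reveal the lurking variable (playing time); that has to come from knowledge of the setting.

```python
import math

# (fouls, points) for the 11 players, copied from the slide
players = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7),
           (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)]


def pearson_r(xs, ys):
    """Sample correlation coefficient r (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


fouls, points = zip(*players)
r_fouls = pearson_r(fouls, points)
print(round(r_fouls, 3))  # 0.935, agreeing with the slide
```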