Chapter 2 Graphical Methods for Describing Data Distributions

Created by Kathy Fritz

Variable • Any characteristic whose value may change from one individual to another. Examples: political affiliation; number of textbooks purchased; distance from home to college.

Data • The values for a variable from individual observations. Examples: Political affiliation: Democrat, Republican, etc. Number of textbooks purchased: 1, 2, 3, 4, . . . Distance from home to college: 25 miles, 53.5 miles, 347.2 miles, etc.

Suppose that a PE coach records the height of each student in his class. This is an example of univariate data. Univariate: data that consist of observations on a single variable made on individuals in a sample or population.

Suppose that the PE coach records the height and weight of each student in his class. This is an example of bivariate data. Bivariate: data that consist of pairs of numbers from two variables for each individual in a sample or population.

Suppose that the PE coach records the height, weight, number of sit-ups, and number of push-ups for each student in his class. This is an example of multivariate data. Multivariate: data that consist of observations on two or more variables.

Two types of variables: categorical and numerical

Categorical variables • Qualitative • Consist of categorical responses. 1. Car model 2. Birth year 3. Type of cell phone 4. Your zip code 5. Which club you have joined. Which of these variables are NOT categorical variables? They are all categorical variables!

Numerical variables • Quantitative • Observations or measurements take on numerical values. It makes sense to perform math operations on these values. There are two types of numerical variables: discrete and continuous. 1. GPAs 2. Height of students 3. Codes to combination locks 4. Number of text messages per day 5. Weight of textbooks. Which of these variables are NOT numerical? Codes to combination locks: does it make sense to find an average code to combination locks?

Two types of variables: categorical and numerical. Numerical variables are either discrete or continuous.

Discrete (numerical) • Isolated points along a number line • usually counts of items • Example: number of textbooks purchased

Continuous (numerical) • Variable that can be any value in a given interval • usually measurements of something • Example: GPAs

Identify the following variables: 1. the color of cars in the teacher’s lot: Categorical 2. the number of calculators owned by students at your college: Discrete numerical 3. the zip code of an individual: Categorical 4. the amount of time it takes students to drive to school: Continuous numerical 5. the appraised value of homes in your city: Discrete numerical (Is money a measurement or a count?)

Use the following table to determine an appropriate graphical display of a data set.
Graphical Display | Variable Type | Data Type | Purpose
Bar Chart | Univariate | Categorical | Display data distribution
Comparative Bar Chart | Univariate for 2 or more groups | Categorical | Compare 2 or more groups
Dotplot | Univariate | Numerical | Display data distribution
Comparative dotplot | Univariate for 2 or more groups | Numerical | Compare 2 or more groups
Stem-and-leaf display | Univariate | Numerical | Display data distribution
Comparative stem-and-leaf | Univariate for 2 groups | Numerical | Compare 2 or more groups
Histogram | Univariate | Numerical | Display data distribution
Scatterplot | Bivariate | Numerical | Investigate relationship between 2 variables
Time series plot | Univariate, collected over time | Numerical | Investigate trend over time
What types of graphs can be used with categorical data? In section 2.3, we will see how the various graphical displays for univariate, numerical data compare.

Displaying Categorical Data: Bar Charts, Comparative Bar Charts

Bar Chart When to Use: Univariate, Categorical data. A bar chart is a graphical display for categorical data. To comply with new standards from the U.S. Department of Transportation, helmets should reach the bottom of the motorcyclist’s ears. The report “Motorcycle Helmet Use in 2005 – Overall Results” (National Highway Traffic Safety Administration, August 2005) summarized data collected by observing 1700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet (N), a noncompliant helmet (NC), or a compliant helmet (C). The data are summarized in this table:
Helmet Use | Frequency
N | 731
NC | 153
C | 816
Total | 1700
This is called a frequency distribution. A frequency distribution is a table that displays the possible categories along with the associated frequencies or relative frequencies. The frequency for a particular category is the number of times that the category appears in the data set. The total of the frequencies should equal the total number of observations.

Bar Chart (motorcycle helmet data, continued) The same 1700 observations (N = no helmet, NC = noncompliant helmet, C = compliant helmet), now with relative frequencies. Each relative frequency is the frequency divided by the total number of observations, 1700.
Helmet Use | Frequency | Relative Frequency
N | 731 | 0.430
NC | 153 | 0.090
C | 816 | 0.480
Total | 1700 | 1.000
The relative frequencies should sum to 1 (allowing for rounding).

Bar Chart How to construct 1. Draw a horizontal line; write the categories or labels below the line at regularly spaced intervals 2. Draw a vertical line; label the scale using frequency or relative frequency 3. Place a rectangular bar above each category label with a height determined by its frequency or relative frequency. All bars should have the same width so that both the height and the area of the bar are proportional to the frequency or relative frequency of the corresponding categories.
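As a rough illustration of how the relative frequency column is obtained, here is a minimal Python sketch. The counts come from the helmet table; the names `helmet_counts` and `relative_frequencies` are made up for this example.

```python
# Frequencies from the helmet-use table (N, NC, C).
helmet_counts = {"N": 731, "NC": 153, "C": 816}


def relative_frequencies(counts):
    """Return each category's share of the total number of observations."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


helmet_rel = relative_frequencies(helmet_counts)
for category, rel in helmet_rel.items():
    print(f"{category:>5}  {helmet_counts[category]:>4}  {rel:.3f}")
# The relative frequencies should add up to 1 (allowing for rounding).
print(f"Total  {sum(helmet_counts.values())}  {sum(helmet_rel.values()):.3f}")
```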

Bar Chart What to Look For Frequently or infrequently occurring categories Here is the completed bar chart for the motorcycle helmet data. Describe this graph.

Comparative Bar Charts When to Use: Univariate, Categorical data for two or more groups. Bar charts can also be used to provide a visual comparison of two or more groups. How to construct • Constructed by using the same horizontal and vertical axes for the bar charts of two or more groups • Usually color-coded to indicate which bars correspond to each group • Should use relative frequencies on the vertical axis. Why? You use relative frequency rather than frequency on the vertical axis so that you can make meaningful comparisons even if the sample sizes are not the same.

Each year the Princeton Review conducts a survey of students applying to college and of parents of college applicants. In 2009, 12,715 high school students responded to the question “Ideally how far from home would you like the college you attend to be?” Also, 3007 parents of students applying to college responded to the question “how far from home would you like the college your child attends to be?” The data are displayed in the frequency table below.
Ideal Distance | Students (frequency) | Parents (frequency)
Less than 250 miles | 4450 | 1594
250 to 500 miles | 3942 | 902
500 to 1000 miles | 2416 | 331
More than 1000 miles | 1907 | 180
Create a comparative bar chart with these data. What should you do first?

Ideal Distance | Students (relative frequency) | Parents (relative frequency)
Less than 250 miles | .35 | .53
250 to 500 miles | .31 | .30
500 to 1000 miles | .19 | .11
More than 1000 miles | .15 | .06
The student column is found by dividing each frequency by the total number of students; the parent column is found by dividing each frequency by the total number of parents. What does this graph show about the ideal distance college should be from home?
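The conversion from frequencies to relative frequencies for the two groups can be sketched as follows. This is only an illustration; the variable names are assumptions, and the counts are the ones in the Princeton Review frequency table.

```python
# (students, parents) frequencies for each ideal-distance category.
distance_counts = {
    "Less than 250 miles": (4450, 1594),
    "250 to 500 miles": (3942, 902),
    "500 to 1000 miles": (2416, 331),
    "More than 1000 miles": (1907, 180),
}

# Each group is divided by its own total, so groups of different
# sizes can be compared on the same vertical scale.
student_total = sum(s for s, _ in distance_counts.values())
parent_total = sum(p for _, p in distance_counts.values())

rel_table = {
    label: (round(s / student_total, 2), round(p / parent_total, 2))
    for label, (s, p) in distance_counts.items()
}

for label, (s_rel, p_rel) in rel_table.items():
    print(f"{label:<22} {s_rel:.2f}  {p_rel:.2f}")
```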

Displaying Numerical Data: Dotplots, Stem-and-leaf Displays, Histograms

Dotplot When to Use: Univariate, Numerical data. How to construct 1. Draw a horizontal line and mark it with an appropriate numerical scale 2. Locate each value in the data set along the scale and represent it by a dot. If there are two or more observations with the same value, stack the dots vertically

Dotplot What to Look For • A representative or typical value (center) in the data set • The extent to which the data values spread out • The nature of the distribution (shape) along the number line • The presence of unusual values (gaps and outliers). An outlier is an unusually large or small data value. A precise rule for deciding when an observation is an outlier is given in Chapter 3. What we look for with univariate, numerical data sets is similar for dotplots, stem-and-leaf displays, and histograms.

Professor Norm gave a 10-question quiz last week in his introductory statistics class. The number of correct answers for each student is recorded below.
6 8 8 5 6 7 6 5 4 7 9 4 6 6 6 5 5 9 5 4 6 7 7 3 8 7
First draw a horizontal line with an appropriate scale. The first three observations are plotted; note that you stack the points if values are repeated. This is the completed dotplot. Write a few sentences describing this distribution.

What to Look For • The representative or typical value (center) in the data set • The extent to which the data values spread out • The nature of the distribution (shape) along the number line • The presence of unusual values. A symmetrical distribution is one that has a vertical line of symmetry where the left half is a mirror image of the right half. If we draw a curve, smoothing out this dotplot, we will see that there is ONLY one peak. Distributions with a single peak are said to be unimodal. Distributions with two peaks are bimodal, and with more than two peaks are multimodal. For Professor Norm’s quiz: The center for the distribution of the number of correct answers is about 6. There is not a lot of variability in the observations. The distribution is approximately symmetrical with no unusual observations.
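The stacking step can be sketched in a few lines of Python. This is a plain-text approximation of a dotplot, not a plotting-library call; it assumes single-digit integer values so that the columns line up, and the function name `text_dotplot` is invented for this example. The scores are the 26 values listed on the slide.

```python
from collections import Counter

quiz_scores = [6, 8, 8, 5, 6, 7, 6, 5, 4, 7, 9, 4, 6,
               6, 6, 5, 5, 9, 5, 4, 6, 7, 7, 3, 8, 7]


def text_dotplot(values):
    """Return a dotplot as text: one 'o' per observation, stacked over its value."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)
    rows = []
    # Build the tallest level first so the plot prints top-down.
    for level in range(max(counts.values()), 0, -1):
        marks = ["o" if counts[v] >= level else " " for v in scale]
        rows.append(" ".join(marks).rstrip())
    rows.append(" ".join(str(v) for v in scale))  # the numerical scale
    return "\n".join(rows)


print(text_dotplot(quiz_scores))
```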

Comparative Dotplots When to Use Univariate, numerical data with observations from 2 or more groups How to construct • Constructed using the same numerical scale for two or more dotplots • Be sure to include group labels for the dotplots in the display What to Look For Comment on the same four attributes, but comparing the dotplots displayed.

In another introductory statistics class, Professor Skew also gave a 10-question quiz. The number of correct answers for each student is recorded below.
6 8 10 8 8 7 9 8 10 8 7 8 9 7 7 3 8 7 8 7 6 6 6 5 5 9 8
Create a comparative dotplot with the data sets from the two statistics classes, Professors Norm and Skew. Is the distribution for Prof. Skew’s class symmetric? Why or why not? Notice that the left side (or lower tail) of the distribution is longer than the right side (or upper tail). This distribution is said to be negatively skewed (or skewed to the left). Distributions where the right tail is longer than the left are said to be positively skewed (or skewed to the right). The direction of skewness is always in the direction of the longer tail. Write a few sentences comparing these distributions. The center of the distribution for the number of correct answers in Prof. Skew’s class is larger than the center of Prof. Norm’s class. There is also more variability in Prof. Skew’s distribution. Prof. Skew’s distribution appears to have an unusual observation where one student had only 2 answers correct, while there were no unusual observations in Prof. Norm’s class. The distribution for Prof. Skew is negatively skewed while Prof. Norm’s distribution is more symmetrical.

Stem-and-Leaf Displays When to Use: Univariate, Numerical data. Stem-and-leaf displays are an effective way to summarize univariate numerical data when the data set is not too large. Each observation is split into two parts: Stem, which consists of the first digit(s); Leaf, which consists of the final digit(s). How to construct • Select one or more of the leading digits for the stem • List the possible stem values in a vertical column (be sure to list every stem from the smallest to the largest value) • Record the leaf for each observation beside the corresponding stem value • Indicate the units for stems and leaves someplace in the display

Stem-and-Leaf Displays What to Look For • A representative or typical value (center) in the data set • The extent to which the data values spread out • The presence of unusual values (gaps and outliers) • The extent of symmetry in the data distribution • The number and location of peaks

The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated percentage of households with only wireless phone service (no landline) for the 50 U.S. states and the District of Columbia. Data for the 19 Eastern states are given here.
5.6 5.7 20.0 16.8 16.5 13.4 10.8 9.3 11.6 8.0 11.4 16.3 14.0 10.8 7.8 20.6 10.8 5.1 11.6
What is the variable of interest? Wireless percent. What is an appropriate display to summarize these data? A stem-and-leaf display (a dotplot would also be a reasonable choice). Let 5.6% be represented as 05.6% so that all the numbers have two digits in front of the decimal. If we use the 2 digits, we would have stems from 05 to 20; that’s way too many stems! So let’s just use the first digit (tens) as our stems. So the leaf will be the last two digits. With 05.6%, the leaf is 5.6 and it will be written behind the stem 0. For the second number, 5.7 is also written in the leaf behind the stem 0 (with a comma between). What is the leaf for 20.0% and where should that leaf be written? The completed stem-and-leaf display is shown below.
0 | 5.6, 5.7, 9.3, 8.0, 7.8, 5.1
1 | 6.8, 6.5, 3.4, 0.8, 1.6, 1.4, 6.3, 4.0, 0.8, 0.8, 1.6
2 | 0.0, 0.6
However, it is somewhat difficult to read due to the 2-digit leaves. A common practice is to drop all but the first digit in the leaf. This makes the display easier to read, but DOES NOT change the overall distribution of the data set.
0 | 5 5 9 8 7 5
1 | 6 6 3 0 1 1 6 4 0 0 1
2 | 0 0

“Going Wireless” (continued): estimated percentage of households with only wireless phone service for the 19 Eastern states, with the leaves written in order.
0 | 5 5 5 7 8 9
1 | 0 0 0 1 1 1 3 4 6 6 6
2 | 0 0
Stem: tens Leaf: ones
While it is not necessary to write the leaves in order from smallest to largest, by doing so, the center of the distribution is more easily seen. Write a few sentences describing this distribution. The center of the distribution for the estimated percentage of households with only wireless phone service is approximately 11%. There does not appear to be much variability. This display appears to be a unimodal, symmetric distribution with no outliers.

Comparative Stem-and-Leaf Displays When to Use: Univariate, numerical data with observations from 2 or more groups. How to construct • List the leaves for one data set to the right of the stems • List the leaves for the second data set to the left of the stems • Be sure to include group labels to identify which group is on the left and which is on the right

“Going Wireless” (continued): Data for the 13 Western states are given here.
11.7 21.1 18.9 9.0 16.7 8.0 17.7 25.5 16.3 11.4 22.1 9.2 10.8
Create a comparative stem-and-leaf display comparing the distributions of the Eastern and Western states.
Western States | Stem | Eastern States
998 | 0 | 555789
8766110 | 1 | 00011134666
521 | 2 | 00
Stem: tens Leaf: ones
Write a few sentences comparing these distributions. The center of the distribution of the estimated percentage of households with only wireless phone service for the Western states is a little larger than the center for the Eastern states. Both distributions are symmetrical with approximately the same amount of variability.
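A minimal Python sketch of the construction, using the 19 Eastern-state percentages. It assumes the convention described on the slides (stem = tens digit, leaf = ones digit, with the tenths digit dropped rather than rounded); the function name `stem_and_leaf` is hypothetical.

```python
wireless_pct = [5.6, 5.7, 20.0, 16.8, 16.5, 13.4, 10.8, 9.3, 11.6, 8.0,
                11.4, 16.3, 14.0, 10.8, 7.8, 20.6, 10.8, 5.1, 11.6]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    display = {}
    for value in sorted(values):
        whole = int(value)              # drop the tenths digit (truncate)
        stem, leaf = divmod(whole, 10)  # tens digit, ones digit
        display.setdefault(stem, []).append(leaf)
    # List every stem from the smallest to the largest, even if it has no leaves.
    return {stem: display.get(stem, [])
            for stem in range(min(display), max(display) + 1)}


for stem, leaves in stem_and_leaf(wireless_pct).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Stem: tens  Leaf: ones")
```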
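The back-to-back layout can be sketched the same way. This is an illustrative sketch, not a standard library routine: `leaves_by_stem` and `back_to_back` are invented names, and the tenths digit is again dropped. Leaves on the left are written outward from the stem, so they read largest to smallest from left to right.

```python
eastern = [5.6, 5.7, 20.0, 16.8, 16.5, 13.4, 10.8, 9.3, 11.6, 8.0,
           11.4, 16.3, 14.0, 10.8, 7.8, 20.6, 10.8, 5.1, 11.6]
western = [11.7, 21.1, 18.9, 9.0, 16.7, 8.0, 17.7, 25.5, 16.3, 11.4,
           22.1, 9.2, 10.8]


def leaves_by_stem(values):
    """Group ones-digit leaves (as strings) under their tens-digit stem."""
    table = {}
    for value in sorted(values):
        stem, leaf = divmod(int(value), 10)
        table.setdefault(stem, []).append(str(leaf))
    return table


def back_to_back(left, right):
    """Return display rows: left leaves | stem | right leaves."""
    left_t, right_t = leaves_by_stem(left), leaves_by_stem(right)
    low = min(min(left_t), min(right_t))
    high = max(max(left_t), max(right_t))
    width = max(len(leaves) for leaves in left_t.values())
    rows = []
    for stem in range(low, high + 1):
        left_leaves = "".join(reversed(left_t.get(stem, [])))
        right_leaves = "".join(right_t.get(stem, []))
        rows.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
    return rows


print("Western | Stem | Eastern")
print("\n".join(back_to_back(western, eastern)))
```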

Histograms When to Use: Univariate numerical data. Dotplots and stem-and-leaf displays are not effective ways to summarize numerical data when the data set contains a large number of data values. Histograms are displays that don’t work well for small data sets but do work well for larger numerical data sets. Histograms are constructed differently for discrete versus continuous data. How to construct (discrete data): Discrete numerical data almost always result from counting. In such cases, each observation is a whole number. • Draw a horizontal scale and mark it with the possible values for the variable • Draw a vertical scale and mark it with frequency or relative frequency • Above each possible value, draw a rectangle centered at that value with a height corresponding to its frequency or relative frequency. What to look for: Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers

Queen honey bees mate shortly after they become adults. During a mating flight, the queen usually takes multiple partners, collecting sperm that she will store and use throughout the rest of her life. A paper, “The Curious Promiscuity of Queen Honey Bees” (Annals of Zoology [2001]: 255-265), provided the following data on the number of partners for 30 queen bees.
12 8 9 2 3 7 4 5 5 6 6 4 6 7 7 7 10 4 8 1 6 7 9 7 8 11 6 10
Here is a dotplot of these data.

Queen honey bees continued. The variable, number of partners, is discrete. To create a histogram: we already have a horizontal axis (number of partners); we need to add a vertical axis for frequency. The bars should be centered over the discrete data values and have heights corresponding to the frequency of each data value. We built the histogram on top of the dotplot to show that the bars are centered over the discrete data values and that the heights of the bars are the frequency of each data value. In practice, histograms for discrete data ONLY show the rectangular bars. The distribution for the number of partners of queen honey bees is approximately symmetric with a center at 7 partners and a somewhat large amount of variability. There don’t appear to be any outliers.

Here are two histograms showing the “queen bee data set”. One uses frequency on the vertical axis, while the other uses relative frequency. What do you notice about the shapes of these two histograms?

Histograms with equal width intervals When to Use Univariate numerical data How to construct Continuous data • Mark the boundaries of the class intervals on the horizontal axis • Use either frequency or relative frequency on the vertical axis • Draw a rectangle for each class interval directly above that interval. The height of each rectangle is the frequency or relative frequency of the corresponding interval What to look for Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers

Consider the following data on carry-on luggage weight for 25 airline passengers.
27.8 25.0 17.9 10.1 27.6 30.0 18.0 28.7 28.2 28.0 19.9 28.5 31.4 20.9 33.8 27.6 21.9 20.8 24.9 26.4 22.0 34.5 22.7 25.3 22.4
Here is a dotplot of this data set. This is a continuous numerical data set. Looking at this dotplot, it is easy to see that we could use intervals with a width of 5. The first interval includes 10 and all values up to but not including 15. The next interval will include 15 and all values up to but not including 20, and so on. The top dotplot shows all the data values in each interval stacked in the middle of the interval. With continuous data, the rectangular bars cover an interval of data (not just one value).
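To make the interval rule concrete, here is a small Python sketch that counts how many carry-on weights fall in each class interval. The interval start (10), width (5), and number of intervals (5) follow the slide; the function name `bin_counts` is an assumption made for this example.

```python
carry_on = [27.8, 25.0, 17.9, 10.1, 27.6, 30.0, 18.0, 28.7, 28.2, 28.0,
            19.9, 28.5, 31.4, 20.9, 33.8, 27.6, 21.9, 20.8, 24.9, 26.4,
            22.0, 34.5, 22.7, 25.3, 22.4]


def bin_counts(values, start, width, n_bins):
    """Count values per interval; each interval includes its lower
    endpoint but not its upper endpoint."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        counts[index] += 1
    return counts


counts = bin_counts(carry_on, start=10, width=5, n_bins=5)
for i, count in enumerate(counts):
    lower = 10 + 5 * i
    print(f"{lower} to < {lower + 5}: frequency {count}, "
          f"relative frequency {count / len(carry_on):.2f}")
```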

From the dotplot, it is easy to see how the continuous histogram is created.

Comparative Histograms • Must use two separate histograms with the same horizontal axis and relative frequency on the vertical axis. The article “Early Television Exposure and Subsequent Attention Problems in Children” (Pediatrics, April 2004) investigated the television viewing habits of U.S. children. These graphs show the viewing habits of 1-year-old and 3-year-old children. The biggest difference between the two histograms is at the low end, with a much higher proportion of 3-year-old children falling in the 0-2 TV hours interval than 1-year-old children.

Histograms with unequal width intervals When to use: when you have a concentration of data in the middle with some extreme values. How to construct: similar to histograms with continuous data, but with density on the vertical axis

When people are asked for values such as age or weight, they sometimes shade the truth in their responses. The article “Self-Report of Academic Performance” (Social Methods and Research [November 1981]: 165-185) focused on SAT scores and grade point average (GPA). For each student in the sample, the difference between reported GPA and actual GPA was determined. Positive differences resulted from individuals reporting GPAs larger than the correct value.
Class Interval | Relative Frequency
-2.0 to < -0.4 | 0.023
-0.4 to < -0.2 | 0.055
-0.2 to < -0.1 | 0.097
-0.1 to < 0 | 0.210
0 to < 0.1 | 0.189
0.1 to < 0.2 | 0.139
0.2 to < 0.4 | 0.116
0.4 to 2.0 | 0.171
When using relative frequency on the vertical axis, the proportional area principle is violated. Notice that the relative frequency for the interval 0.4 to 2.0 is smaller than the relative frequency for the interval -0.1 to < 0, but the area of the bar is MUCH larger.

GPAs continued. To fix this problem, we need to find the density of each interval: density = relative frequency / interval width.
Class Interval | Relative Frequency | Width | Density
-2.0 to < -0.4 | 0.023 | 1.6 | 0.014
-0.4 to < -0.2 | 0.055 | 0.2 | 0.275
-0.2 to < -0.1 | 0.097 | 0.1 | 0.970
-0.1 to < 0 | 0.210 | 0.1 | 2.100
0 to < 0.1 | 0.189 | 0.1 | 1.890
0.1 to < 0.2 | 0.139 | 0.1 | 1.390
0.2 to < 0.4 | 0.116 | 0.2 | 0.580
0.4 to 2.0 | 0.171 | 1.6 | 0.107
This is a correct histogram with unequal widths.
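The density column can be checked with a short Python sketch. It assumes density is the relative frequency divided by the interval width, which is what the tabulated values imply, and that the third and fifth intervals are -0.2 to < -0.1 and 0 to < 0.1 (the widths of 0.1 in the table support this reading). The names are illustrative only.

```python
# (lower endpoint, upper endpoint, relative frequency) for each class interval.
gpa_intervals = [
    (-2.0, -0.4, 0.023),
    (-0.4, -0.2, 0.055),
    (-0.2, -0.1, 0.097),
    (-0.1, 0.0, 0.210),
    (0.0, 0.1, 0.189),
    (0.1, 0.2, 0.139),
    (0.2, 0.4, 0.116),
    (0.4, 2.0, 0.171),
]


def densities(intervals):
    """Density = relative frequency / interval width."""
    return [rel / (upper - lower) for lower, upper, rel in intervals]


gpa_density = densities(gpa_intervals)
for (lower, upper, rel), density in zip(gpa_intervals, gpa_density):
    print(f"{lower:>5} to {upper:>4}  rel. freq. {rel:.3f}  density {density:.3f}")

# With density as the bar height, bar area = density * width = relative
# frequency, so the areas add up to 1 and the proportional area principle holds.
total_area = sum(d * (u - l) for d, (l, u, _) in zip(gpa_density, gpa_intervals))
```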

Cumulative Relative Frequency Plots When to use: when you want to show the approximate proportion of data at or below any given value. How to construct 1. Mark the boundaries of the class intervals on a horizontal axis 2. Add a vertical axis with a scale that goes from 0 to 1 3. For each class interval, plot the point that is represented by (upper endpoint of interval, cumulative relative frequency) 4. Add the point represented by (lower endpoint of first interval, 0) 5. Connect consecutive points in the display with line segments

Cumulative Relative Frequency Plots What to Look For Proportion of data falling at or below any given value along the x axis The cumulative relative frequency of a given interval is the sum of the current relative frequency and all the previous relative frequencies.

The National Climatic Data Center has been collecting weather data for many years. A frequency distribution for annual rainfall totals for Albuquerque, New Mexico, from 1950 to 2008 is shown in the table below. Relative frequency = frequency/58. Cumulative relative frequency = current relative frequency + all previous relative frequencies (for example, 0.052 + 0.103 = 0.155, and 0.155 + 0.086 = 0.241).
Annual Rainfall (inches) | Frequency | Relative Frequency | Cumulative Relative Frequency
4 to < 5 | 3 | 0.052 | 0.052
5 to < 6 | 6 | 0.103 | 0.155
6 to < 7 | 5 | 0.086 | 0.241
7 to < 8 | 6 | 0.103 | 0.344
8 to < 9 | 10 | 0.172 | 0.516
9 to < 10 | 4 | 0.069 | 0.585
10 to < 11 | 12 | 0.207 | 0.792
11 to < 12 | 6 | 0.103 | 0.895
12 to < 13 | 3 | 0.052 | 0.947
13 to < 14 | 3 | 0.052 | 0.999

To create the cumulative relative frequency plot for the Albuquerque annual rainfall totals (1950 to 2008): Plot the point (smallest value of the first interval, 0). Then, for each interval, plot the point (upper value of the interval, cumulative relative frequency of the interval).
Annual Rainfall (inches) | Frequency | Relative Frequency | Cumulative Relative Frequency
4 to < 5 | 3 | 0.052 | 0.052
5 to < 6 | 6 | 0.103 | 0.155
6 to < 7 | 5 | 0.086 | 0.241
7 to < 8 | 6 | 0.103 | 0.344
8 to < 9 | 10 | 0.172 | 0.516
9 to < 10 | 4 | 0.069 | 0.585
10 to < 11 | 12 | 0.207 | 0.792
11 to < 12 | 6 | 0.103 | 0.895
12 to < 13 | 3 | 0.052 | 0.947
13 to < 14 | 3 | 0.052 | 0.999

The National Climatic Data Center has been collecting weather data for many years. The annual rainfall data for Albuquerque, New Mexico, from 1950 to 2008, was used to construct the cumulative relative frequency plot below. What percent of the years had rainfall 7.5 inches or less? About 30%. Which interval has the most observations in it, 9 to < 10 or 10 to < 11? Why? 10 to < 11, because it has a steeper slope
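The points to plot can be generated with a short Python sketch. It assumes, as the table appears to do, that each relative frequency is rounded to three decimals before the running total is taken (which is why the last value is 0.999 rather than 1). Variable names are made up for this example.

```python
from itertools import accumulate

# (lower endpoint, upper endpoint, frequency) for each rainfall interval.
rainfall_bins = [(4, 5, 3), (5, 6, 6), (6, 7, 5), (7, 8, 6), (8, 9, 10),
                 (9, 10, 4), (10, 11, 12), (11, 12, 6), (12, 13, 3), (13, 14, 3)]

total = sum(freq for _, _, freq in rainfall_bins)            # 58 observations
rel_freq = [round(freq / total, 3) for _, _, freq in rainfall_bins]
cum_rel_freq = [round(c, 3) for c in accumulate(rel_freq)]   # running total

# Start at (lower endpoint of first interval, 0), then one point per
# interval at (upper endpoint, cumulative relative frequency).
plot_points = [(rainfall_bins[0][0], 0.0)] + [
    (upper, cum) for (_, upper, _), cum in zip(rainfall_bins, cum_rel_freq)
]
for x, y in plot_points:
    print(f"({x}, {y:.3f})")
```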
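Reading a value off the plot amounts to linear interpolation along the line segments. The sketch below is one way to reproduce the two answers; the points are taken from the rainfall table, and `proportion_at_or_below` is a hypothetical helper, not part of any library.

```python
# (rainfall in inches, cumulative relative frequency) points from the table.
points = [(4, 0.0), (5, 0.052), (6, 0.155), (7, 0.241), (8, 0.344), (9, 0.516),
          (10, 0.585), (11, 0.792), (12, 0.895), (13, 0.947), (14, 0.999)]


def proportion_at_or_below(x, pts):
    """Linearly interpolate the cumulative relative frequency at x."""
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range covered by the plot")


at_7_5 = proportion_at_or_below(7.5, points)
print(f"Proportion of years with 7.5 inches or less: {at_7_5:.2f}")

# The rise of a segment is the relative frequency of its interval,
# so the steeper segment holds more observations.
rise_9_to_10 = points[6][1] - points[5][1]
rise_10_to_11 = points[7][1] - points[6][1]
print(f"Rise 9 to < 10: {rise_9_to_10:.3f}; rise 10 to < 11: {rise_10_to_11:.3f}")
```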

Displaying Bivariate Numerical Data: Scatterplots, Time Series Plots

Scatterplots When to Use Bivariate Numerical data How to construct 1. Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable. Label the vertical axis and include an appropriate scale for the y-variable. 2. For each (x, y) pair in the data set, add a dot in the appropriate location in the display. What to look for Relationship between x and y

The accompanying table gives the cost (in dollars) and an overall quality rating for 10 different brands of men’s athletic shoes (www.consumerreports.org). Cost 65 45 45 80 110 30 80 110 70 Rating 71 70 62 59 58 57 56 52 51 51 Is there a relationship between x = cost and y = quality rating? A scatterplot can help answer this question.

Cost 65 45 45 80 110 30 80 110 70 Rating 71 70 62 59 58 57 56 52 51 51 Is there a relationship between x = cost and y = quality rating? First, draw and label appropriate horizontal (Cost) and vertical (Rating) axes. Next, plot each (x, y) pair. Here is the completed scatterplot. There appears to be a negative relationship between cost of athletic shoes and their quality rating – does that surprise you?

Time Series Plots When to Use Bivariate data with time and another variable How to construct 1. Draw horizontal and vertical axes. Label the horizontal axis and include an appropriate scale for the x-variable. Label the vertical axis and include an appropriate scale for the y-variable. 2. For each (x, y) pair in the data set, add a dot in the appropriate location in the display. 3. Connect each dot in order What to look for trends or patterns over time

The Christmas Price Index is computed each year by PNC Advisors. It is a humorous look at the cost of giving all the gifts described in the popular Christmas song “The Twelve Days of Christmas” (www.pncchristmaspriceindex.com). Describe any trends or patterns that you see. Why is there a downward trend between 1993 and 1995?

Graphical Displays in the Media: Pie Charts, Segmented Bar Charts

Pie (Circle) Chart When to Use Categorical data How to construct • A circle is used to represent the whole data set. • “Slices” of the pie represent the categories • The size of a particular category’s slice is proportional to its frequency or relative frequency. • Most effective for summarizing data sets when there are not too many categories

Pie (Circle) Chart The article “Fred Flintstone, Check Your Policy” (The Washington Post, October 2, 2005) summarized a survey of 1014 adults conducted by the Life and Health Insurance Foundation for Education. Each person surveyed was asked to select which of five fictional characters had the greatest need for life insurance: Spider-Man, Batman, Fred Flintstone, Harry Potter, and Marge Simpson. The data are summarized in the pie chart. The survey results were quite different from the assessment of an insurance expert. The insurance expert felt that Batman, a wealthy bachelor, and Spider-Man did not need life insurance as much as Fred Flintstone, a married man with dependents!

Segmented (or Stacked) Bar Charts. A pie chart can be difficult to construct by hand. The circular shape sometimes makes it difficult to compare areas for different categories, particularly when the relative frequencies are similar. So, we could use a segmented bar chart. When to Use: Categorical data. How to construct • Use a rectangular bar rather than a circle to represent the entire data set. • The bar is divided into segments, with different segments representing different categories. • The area of the segment is proportional to the relative frequency for the particular category.

Segmented (or Stacked) Bar Charts Each year, the Higher Education Research Institute conducts a survey of college seniors. In 2008, approximately 23,000 seniors participated in the survey (“Findings from the 2008 Administration of the College Senior Survey,” Higher Education Research Institute, June 2009). This segmented bar chart summarizes student responses to the question: “During the past year, how much time did you spend studying and doing homework in a typical week?”

Common Mistakes

Avoid these Common Mistakes 1. Areas should be proportional to frequency, relative frequency, or magnitude of the number being represented. The eye is naturally drawn to large areas in graphical displays. Sometimes, in an effort to make the graphical displays more interesting, designers lose sight of this important principle. Consider this graph (USA Today, October 3, 2002). By replacing the bars of a bar chart with milk buckets, areas are distorted. The two buckets for 1980 represent 32 cows, whereas the one bucket for 1970 represents 19 cows.

Avoid these Common Mistakes 1. Areas should be proportional to frequency, relative frequency, or magnitude of the number being represented. Another common distortion occurs when a third dimension is added to bar charts or pie charts. This distorts the areas and makes it much more difficult to interpret.

Avoid these Common Mistakes 2. Be cautious of graphs with broken axes (axes that don’t start at 0). • The use of broken axes in a scatterplot does not result in a misleading picture of the relationship of bivariate data. • In time series plots, broken axes can sometimes exaggerate the magnitude of change over time. • In bar charts and histograms, the vertical axis should NEVER be broken. This violates the “proportional area” principle.

Avoid these Common Mistakes 2. Be cautious of graphs with broken axes (axes that don’t start at 0). This bar chart is similar to one in an advertisement for a software product designed to raise student test scores. Areas of the bars are not proportional to the magnitude of the numbers represented: the area of the rectangle representing 68 is more than three times the area of the rectangle representing 55!

Avoid these Common Mistakes 3. Watch out for unequal time spacing in time series plots. If observations over time are not made at regular time intervals, special care must be taken in constructing the time series plot. Notice that the intervals between observations are irregular, yet the points in the plot are equally spaced along the time axis. This makes it difficult to assess the rate of change over time. Here is a correct time series plot.

Avoid these Common Mistakes 4. Be careful how you interpret patterns in scatterplots. Consider the following scatterplot showing the relationship between the number of Methodist ministers in New England and the amount of Cuban rum imported into Boston from 1860 to 1940 (Education.com). r = .999973. Does an increase in the number of Methodist ministers CAUSE the increase in imported rum? A strong pattern in a scatterplot means that the two variables tend to vary together in a predictable way, BUT it does not mean that there is a cause-and-effect relationship.
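A small Python sketch shows how a broken axis inflates the apparent difference. The slide does not say where the advertisement’s vertical axis starts, so the value 50 used here is purely a hypothetical choice for illustration; with equal bar widths, the ratio of heights equals the ratio of areas.

```python
def bar_height_ratio(value_a, value_b, axis_start=0):
    """Ratio of drawn bar heights when the vertical axis begins at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)


# Axis starting at 0: drawn heights are proportional to the numbers.
honest_ratio = bar_height_ratio(68, 55)

# Hypothetical broken axis starting at 50 (an assumption, not from the ad).
broken_ratio = bar_height_ratio(68, 55, axis_start=50)

print(f"Axis from 0:  68 vs 55 looks {honest_ratio:.2f} times as large")
print(f"Axis from 50: 68 vs 55 looks {broken_ratio:.2f} times as large")
```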

Avoid these Common Mistakes 5. Make sure that a graphical display creates the right first impression. Consider the following graph from USA Today (June 25, 2001). Although this graph does not violate the proportional area principle, the way the “bar” for the none category is displayed makes this graph difficult to read. A quick glance at this graph may leave the reader with an incorrect impression.