 # Constructing box and whisker plots

A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set. Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared. (See the section on five-number summaries for more information.)

Box and whisker plots are ideal for comparing distributions because the centre, spread and overall range are immediately apparent. A box and whisker plot summarizes a set of data measured on an interval scale and is often used in exploratory data analysis. It shows the shape of the distribution, its central value, and its variability.

In a box and whisker plot:

• the ends of the box are the upper and lower quartiles, so the box spans the interquartile range
• the median is marked by a vertical line inside the box
• the whiskers are the two lines outside the box that extend to the highest and lowest observations.

## Example 1 – Box and whisker plots

Like Angela, Carl works at a computer store. He also recorded the number of sales he made each month. In the past 12 months, he sold the following numbers of computers:

51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.

1. Give a five-number summary of Carl's and Angela's sales.
2. Make two box and whisker plots, one for Angela's sales and one for Carl's.
3. Briefly describe the comparisons between their sales.

1. First, put the data in ascending order. Then find the median.

6, 7, 13, 17, 20, 25, 39, 41, 43, 49, 51, 62.
Median = (12 + 1) ÷ 2 = 6.5th value
= (sixth + seventh observations) ÷ 2
= (25 + 39) ÷ 2
= 32

There are six numbers below the median, namely: 6, 7, 13, 17, 20, 25.
Q1 = the median of these six items
= (6 + 1) ÷ 2 = 3.5th value
= (third + fourth observations) ÷ 2
= (13 + 17) ÷ 2
= 15

There are six numbers above the median, namely: 39, 41, 43, 49, 51, 62.
Q3 = the median of these six items
= (6 + 1) ÷ 2 = 3.5th value
= (third + fourth observations) ÷ 2
= (43 + 49) ÷ 2
= 46

The five-number summary for Carl's sales is 6, 15, 32, 46, 62.
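
The steps above can be sketched in Python. This is a minimal illustration rather than part of the original lesson: the function name `five_number_summary` is made up here, and leaving the middle value out of both halves when the number of observations is odd is an assumption, since the example only covers an even count.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of each half."""
    data = sorted(values)              # step 1: put the data in ascending order
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]                # observations below the median position
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

carl = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]
print(five_number_summary(carl))  # (6, 15.0, 32.0, 46.0, 62)
```

Other software may use a different quartile convention, so Q1 and Q3 computed elsewhere can differ slightly from the values found by hand here.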

Using the same calculations, we can determine that the five-number summary for Angela is 1, 17, 26, 42, 57.

2. Please note that box and whisker plots can be drawn either vertically or horizontally.

3. Carl's highest and lowest sales are both higher than Angela's corresponding sales, and Carl's median sales figure is higher than Angela's. Also, Carl's interquartile range is larger than Angela's.

These results suggest that Carl consistently sells more computers than Angela does.
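
As a rough check on parts 2 and 3, the two five-number summaries can be compared and drawn in plain text. The sketch below assumes whole-number summary values that fit on an axis from 0 to 70; the helper names `spread` and `text_boxplot` are hypothetical, and the character plot is only a stand-in for a properly drawn graph.

```python
# Five-number summaries as given in the example: (min, Q1, median, Q3, max)
carl = (6, 15, 32, 46, 62)
angela = (1, 17, 26, 42, 57)

def spread(summary):
    """Return the median, interquartile range and overall range of a summary."""
    low, q1, med, q3, high = summary
    return {"median": med, "iqr": q3 - q1, "range": high - low}

def text_boxplot(summary, axis_min=0, axis_max=70):
    """Draw a rough horizontal box plot, one character per unit on the axis."""
    low, q1, med, q3, high = summary
    row = [" "] * (axis_max - axis_min + 1)
    for x in range(low, high + 1):     # whiskers span minimum to maximum
        row[x - axis_min] = "-"
    for x in range(q1, q3 + 1):        # box spans Q1 to Q3
        row[x - axis_min] = "="
    row[low - axis_min] = "|"          # whisker ends
    row[high - axis_min] = "|"
    row[med - axis_min] = "M"          # median marker inside the box
    return "".join(row)

for name, summary in (("Carl", carl), ("Angela", angela)):
    print(f"{name:<7}{text_boxplot(summary)}  {spread(summary)}")
```

The printed figures give Carl an interquartile range of 31 against Angela's 25, while both overall ranges equal 56 and Carl's lower quartile (15) sits slightly below Angela's (17).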

## Summary

There are several ways to describe the centre and spread of a distribution. One way to present this information is with a five-number summary. It uses the median as its centre value and gives a brief picture of the other important distribution values. Another approach uses the mean to describe the centre and the standard deviation to describe the spread. This technique, however, is best used with symmetrical distributions with no outliers.

Despite this restriction, the mean and standard deviation measures are used more commonly than the five-number summary. The reason for this is that many natural phenomena can be approximately described by a normal distribution. And for normal distributions, the mean and standard deviation are the best measures of centre and spread respectively.

Standard deviation takes every value into account, has extremely useful properties when used with a normal distribution, and is mathematically manageable. But the standard deviation is not a good measure of spread in highly skewed distributions and, in these instances, should be supplemented by other measures such as the semi-quartile range.

The semi-quartile range is rarely used as a measure of spread, partly because it is not as manageable as others. Still, it is a useful statistic because it is less influenced by extreme values than the standard deviation, is less subject to sampling fluctuations in highly skewed distributions, and depends on only two values, Q1 and Q3. However, it cannot stand alone as a measure of spread.
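
To illustrate the point about extreme values, here is a small sketch that takes the semi-quartile range as half the interquartile range, (Q3 − Q1) ÷ 2, and compares it with the sample standard deviation. The data set `carl_extreme` is hypothetical: it is Carl's sales with the best month inflated tenfold purely to create an extreme value. The function name and the handling of odd counts are likewise assumptions.

```python
from statistics import median, stdev

def semi_quartile_range(values):
    """Half the interquartile range, with quartiles taken as medians of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: with an odd count, the middle value is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (median(upper) - median(lower)) / 2

carl = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]
# Hypothetical data: Carl's best month inflated tenfold to act as an extreme value.
carl_extreme = [620 if x == 62 else x for x in carl]

for label, data in (("original", carl), ("with extreme value", carl_extreme)):
    # stdev is the sample standard deviation from the standard library
    print(label, semi_quartile_range(data), round(stdev(data), 1))
```

In this sketch the semi-quartile range stays at 15.5 for both data sets, because Q1 and Q3 do not move, whereas the standard deviation grows sharply once the extreme value is introduced.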