4.5 Measures of dispersion
4.5.3 Calculating the variance and standard deviation


Unlike the range and the interquartile range, the variance is a measure of dispersion that takes into account the spread of all data points in a data set. It is the most often used measure of dispersion, along with the standard deviation, which is simply the square root of the variance. The variance is the mean squared difference between each data point and the centre of the distribution, as measured by the mean.

Example 1 – Calculation of variance and standard deviation

Let’s calculate the variance of the following data set: 2, 7, 3, 12, 9.

The first step is to calculate the mean. The sum is 33 and there are 5 data points. Therefore, the mean is 33 ÷ 5 = 6.6. Then you take each value in the data set, subtract the mean and square the difference. For instance, for the first value:

(2 - 6.6)² = 21.16

The squared differences for all values are added:

21.16 + 0.16 + 12.96 + 29.16 + 5.76 = 69.20

The sum is then divided by the number of data points:

69.20 ÷ 5 = 13.84

The variance is 13.84. To get the standard deviation, you calculate the square root of the variance, which is approximately 3.72.
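The steps of Example 1 can be reproduced in a few lines of Python. This is a minimal sketch that follows the text by dividing by the number of data points; the variable names are illustrative choices, not part of the original example.

```python
import math
import statistics

data = [2, 7, 3, 12, 9]

# Step 1: the mean is the sum divided by the number of data points.
mean = sum(data) / len(data)

# Step 2: subtract the mean from each value and square the difference.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: divide the sum of squared differences by the number of data points.
variance = sum(squared_diffs) / len(data)

# Step 4: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))  # 6.6 13.84 3.72

# Cross-check with the standard library, which divides by n in pvariance.
library_variance = statistics.pvariance(data)
```

Note that Python's statistics.variance and statistics.stdev divide by n - 1 instead of n, so they would return different values for the same data; statistics.pvariance and statistics.pstdev match the calculation shown here.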

Standard deviation is useful when comparing the spread of two separate data sets that have approximately the same mean. The data set with the smaller standard deviation has a narrower spread of measurements around the mean and therefore usually has comparatively fewer high or low values. An item selected at random from a data set whose standard deviation is low has a better chance of being close to the mean than an item from a data set whose standard deviation is higher. However, standard deviation is affected by extreme values. A single extreme value can have a big impact on the standard deviation.

Standard deviation can be difficult to interpret, because it is not obvious how large it must be before the data are considered widely dispersed. The magnitude of the mean of the data set affects the interpretation of its standard deviation. When you are measuring something on the scale of millions, values that are close to the mean do not have the same meaning as when you are measuring something on the scale of hundreds. For example, two large companies whose annual revenues differ by $10,000 are considered pretty close, while two individuals whose weights differ by 30 kilograms are considered far apart. This is why, in most situations, it is helpful to assess the size of the standard deviation relative to the mean.
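One way to assess the standard deviation relative to the mean is to divide one by the other, a ratio often called the coefficient of variation. The sketch below is only an illustration: the function name relative_spread and the revenue and weight figures are hypothetical, chosen to mirror the two situations described above.

```python
import statistics

def relative_spread(values):
    """Return the standard deviation (dividing by n) divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical figures: two companies whose revenues differ by $10,000,
# and two individuals whose weights differ by 30 kilograms.
revenues = [5_000_000, 5_010_000]
weights_kg = [60, 90]

print(statistics.pstdev(revenues), relative_spread(revenues))
print(statistics.pstdev(weights_kg), relative_spread(weights_kg))
```

With these figures, the revenues have a much larger standard deviation than the weights in absolute terms, but a far smaller one relative to the mean. This ratio is only meaningful when the mean is not close to zero.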

Remember the following properties when you are using the standard deviation:

  • Standard deviation is sensitive to extreme values. A single very extreme value can increase the standard deviation and misrepresent the dispersion.
  • For two data sets with the same mean, the one with the larger standard deviation is the one in which the data is more spread out from the centre.
  • Standard deviation is equal to 0 if all values are equal (because all values are then equal to the mean).
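These three properties can be checked on small, made-up data sets. In the sketch below, all values other than the data set from Example 1 are hypothetical, and the standard deviation is computed by dividing by the number of data points, as in the example.

```python
import statistics

# Property 1: one extreme value inflates the standard deviation.
base = [2, 7, 3, 12, 9]        # data set from Example 1
with_extreme = base + [100]    # hypothetical extreme value added
sd_base = statistics.pstdev(base)
sd_extreme = statistics.pstdev(with_extreme)

# Property 2: same mean, different spread.
narrow = [5, 6, 7]
wide = [2, 6, 10]
sd_narrow = statistics.pstdev(narrow)
sd_wide = statistics.pstdev(wide)

# Property 3: all values equal, so every value equals the mean.
equal_values = [4, 4, 4, 4]
sd_equal = statistics.pstdev(equal_values)

print(sd_base, sd_extreme)
print(sd_narrow, sd_wide)
print(sd_equal)
```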

Standard deviation is popular as a measure of dispersion because of its relation to the normal distribution, which describes many natural phenomena and has useful mathematical properties for large data sets. When a variable follows a normal distribution, the histogram is bell-shaped and symmetric, and the best measures of central tendency and dispersion are the mean and the standard deviation. It’s a very useful probability distribution and relatively easy to use. Confidence intervals are often based on the standard normal distribution.

However, when:

  • the data set is small,
  • the distribution is asymmetric, or
  • the data set includes extreme values

it’s better to use the interquartile range.
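The effect of an extreme value on the two measures can be compared directly. The following sketch uses hypothetical data and Python's statistics.quantiles with its default method; other quartile conventions exist and can give slightly different interquartile ranges.

```python
import statistics

def interquartile_range(values):
    """Return Q3 - Q1 using the default method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data: the second set replaces the largest value with an extreme one.
base = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25]
with_extreme = base[:-1] + [250]

iqr_base = interquartile_range(base)
iqr_extreme = interquartile_range(with_extreme)
sd_base = statistics.pstdev(base)
sd_extreme = statistics.pstdev(with_extreme)

print(iqr_base, iqr_extreme)
print(sd_base, sd_extreme)
```

In this example the interquartile range is unchanged by the extreme value, because the quartiles depend only on the middle of the ordered data, while the standard deviation increases substantially.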
