Statistics Canada

Estimation

Archived Content

Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived.

Self-weighting designs
Adjusting the weights
Other estimation methods
Estimating the sampling error
Examples of estimation using a simple random sampling design
Estimation of the population mean
Estimation of the population total

As we now know, the goal of conducting surveys is to obtain information about a particular population. When the sample has been selected and the information collected (see the Data collection chapter) and processed (see the Data processing chapter), there still remains the task of linking the information gathered from the sample back to the overall population.

Estimation is the process of determining a likely value for a variable in the survey population, based on information collected from the sample. Researchers are usually interested in looking at estimates of many statistics—totals, averages and proportions being the most frequent—for different variables. For example, a sample survey could be used to produce any of the following statistics: estimates for the proportion of smokers among all people aged 15 to 24 in the population; the average earnings of men and women with a university degree; or the total number of cars possessed by the whole survey population.

Underpinning the estimation process is the sampling weight of a unit, which indicates the number of units in the population (including the sampled unit itself) that are represented by this sampled unit. The sampling weight is the inverse of the unit's probability of selection.

  • Example 1: Suppose that the city of Winnipeg has decided to reward bus travellers with free one-year bus passes as a way of promoting its services. A simple random sample of 10 people is selected from the 30 passengers on a city bus. Since simple random sampling gives equal probability of selection to every member of the population (in this case, all passengers on the bus), each passenger had one chance out of three of being selected. This translates into a sampling weight of three for every selected unit. This means that each person in the sample represents three persons in the population: himself or herself, plus two other persons.

    To estimate with this sampling weight, one could take the survey information for the 10 selected passengers and copy it three times to create an artificial population of 30. Totals, averages or proportions for the real population could then be estimated by the corresponding statistics computed from the artificial population. Instead of doing this, however, survey statisticians attach a sampling weight to each unit in the sample and take this weight into account when estimating.

    If one person in a sample (with a sampling weight of 18) had blue eyes and brown hair, then it is as if a total of 18 people in the population had blue eyes and brown hair.

  • Example 2: You are conducting a survey to determine the total number of people living on your street and the average number of cars owned by each household. You decide to select a systematic sample of 5 households from the 20 on your street and intend to use that sample to estimate the figures you are looking for. The following table summarizes the information that you gathered during your interviews with the sampled households:

    Table 1. Sample of households on Redwood Street

    Household number   Number of persons   Number of cars   Probability of selection   Sampling weight
    1                  1                   0                1/4                        4
    2                  4                   2                1/4                        4
    3                  2                   1                1/4                        4
    4                  2                   1                1/4                        4
    5                  3                   2                1/4                        4

    The selection probability of 1 in 4 comes from the fact that systematic sampling gives each household on your street an equal chance of being selected (5 households out of 20). The sampling weight of 4 is just the inverse of that probability. When estimating, you treat each sampled household as representing 4 households from the population of 20 on your street, all assumed to have the same characteristics as the sampled one.

    In order to estimate the total number of persons living on your street, you have to multiply the number of persons in each sampled household by that household's sampling weight, then add up the results. For example, there are 4 one-person households (represented by Household number 1), 4 four-person households, 8 two-person households (four households represented by Household number 3 and four households represented by Household number 4) and 4 three-person households. The estimate of the total number of persons would then be:

    Estimated number of persons living on your street
    = (4 x 1) + (4 x 4) + (8 x 2) + (4 x 3)
    = 48 people

    To estimate the average number of cars per household, you proceed in the same manner. Get an estimate of the total number of cars owned by households on your street, then divide that estimate by the actual number of households on the street. For example, there are 4 households without a car (represented by Household number 1), 8 households with two cars (represented by Household number 2 and Household number 5) and 8 households with one car each (represented by Household number 3 and Household number 4).

    Estimated number of cars
    = (4 x 0) + (8 x 2) + (8 x 1)
    = 24 cars

    Estimated average
    = 24 ÷ 20
    = 1.2 cars per household
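The arithmetic of Example 2 can be sketched in a few lines of Python. This is only an illustration of the weighting idea: the helper name `weighted_total` and the variable names are assumptions introduced here, not part of the original text, while the data come from Table 1.

```python
# Sampled households from Table 1: (persons, cars, sampling weight).
sample = [
    (1, 0, 4),
    (4, 2, 4),
    (2, 1, 4),
    (2, 1, 4),
    (3, 2, 4),
]

N_HOUSEHOLDS = 20  # actual number of households on the street


def weighted_total(values, weights):
    """Multiply each sampled value by its sampling weight and add the results."""
    return sum(v * w for v, w in zip(values, weights))


persons = [p for p, _, _ in sample]
cars = [c for _, c, _ in sample]
weights = [w for _, _, w in sample]

est_persons = weighted_total(persons, weights)      # (4 x 1) + (4 x 4) + ...
est_cars = weighted_total(cars, weights)            # (4 x 0) + (4 x 2) + ...
est_cars_per_household = est_cars / N_HOUSEHOLDS    # divide by the known count

print(est_persons, est_cars, est_cars_per_household)
```

Here the weights add up to 20, the actual number of households, so dividing by the sum of the weights would give the same average.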

Self-weighting designs

It is not always the case that all sampled units have the same sampling weight. Some designs give unequal probability of selection to units, resulting in units within the same sample having different sampling weights. Answers from one household or business could represent the answers for 200 units of the population, while the answers from another could represent only 50 units in the population.

When every unit in the sample has the same sampling weight, the sampling design is said to be self-weighting. This kind of design is time-saving and operationally convenient, particularly for large samples. Because every unit has the same weight, those weights can be ignored when estimating averages and proportions. The average for the sample gives an appropriate estimate of the average for the whole population.

Simple random sampling and systematic sampling are examples of self-weighting designs. In that sense, calculations could have been made easier in Example 2. For instance, to estimate the average number of cars per household in the population, we could simply have used the sample average. The 5 sampled households own a total of 6 cars, an average of 1.2 cars per household. This is the same result as that obtained using the sampling weight procedure.
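A short Python sketch can make the point concrete. The car counts come from Example 2; the unequal weights in the second case are hypothetical values chosen purely for illustration, and the helper `weighted_mean` (which divides the weighted total by the sum of the weights) is an assumption of this sketch.

```python
def weighted_mean(values, weights):
    """Weighted total divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


cars = [0, 2, 1, 1, 2]            # cars per sampled household (Example 2)

# Self-weighting design: every unit carries the same weight.
equal_weights = [4, 4, 4, 4, 4]
sample_mean = sum(cars) / len(cars)
self_weighting_mean = weighted_mean(cars, equal_weights)

# Hypothetical unequal weights (not from the text): the weights now matter.
unequal_weights = [200, 50, 50, 50, 50]
unequal_mean = weighted_mean(cars, unequal_weights)

print(sample_mean, self_weighting_mean, unequal_mean)
```

With equal weights the weighted and unweighted averages coincide; with the hypothetical unequal weights the household with no car counts for more, and the estimate moves away from the plain sample average.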

Adjusting the weights

Sometimes, the sampling weights are adjusted prior to estimation. There are basically two reasons for weight adjustment:

  • Adjusting for non-response: Using sampling weights for estimation works fine when you have been able to interview all selected units. In Example 2, if two of the five sampled households refused to answer or were unavailable at the time of the survey, you would only have answers for three households, thus representing only 12 of the 20 households on the street. The two non-responding units represent four households each. This means that you have no information on the number of persons or cars for 8 households on your street. In order to adjust for that, survey statisticians usually increase the weights of responding units to account for the loss of representativeness caused by non-response. The goal would be to use only the 3 units for which you have information, but still represent the 20 households on the street (for instance, by giving each respondent a weight of 20 ÷ 3, or about 6.7, instead of 4).
  • Adjusting for external information: Sometimes, we know the actual total for one or more variables measured in the sample. In Example 3 of the Probability sampling section, a population of the 1,000 best horror movies was equally divided into 500 classic movies and 500 modern movies. Even though you knew this prior to sampling, you decided to select a simple random sample of 100 movies and ended up with 77 classic movies and 23 modern movies. Each of these movies has a weight of 10 (because you selected 1 movie out of every 10 titles). Using the answers from the survey and the sampling weight, your sample would represent a population of 770 classic movies and 230 modern movies. This could lead to inaccurate estimates. One solution would be to decrease the weight of every sampled classic and increase the weight of every sampled modern movie so that your sample gives an estimate of 500 classics and 500 modern films in the population. This should reduce the distortion caused by a 'bad' sample.

    Of course, stratifying by release date prior to sampling would have solved this problem. However, in a lot of cases, we have totals at the population level, but we don't know the attribute of each unit on the sampling frame. For example, from the Census of Population, we know how many men and women there are in a specific city, but all we have for sampling is a list of households. Thus, stratifying our population by sex would not be possible. Demographic projections by age and sex for each province are often used in social surveys to adjust sampling weights.

The weights adjusted for non-response and/or external counts are used for estimation, in the same way as the sampling weight was used in Example 1.
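Both adjustments can be sketched in Python. This is a minimal illustration under simplifying assumptions that the text does not spell out: the non-response adjustment spreads the missing weight evenly over all respondents, and the external-information adjustment rescales weights within each group so that the weighted counts match the known totals. The function names are hypothetical.

```python
def adjust_for_nonresponse(design_weight, n_selected, n_responding):
    """Inflate the respondents' weight so they also cover the non-respondents.

    Assumes (for illustration) that one uniform factor is applied to everyone.
    """
    return design_weight * n_selected / n_responding


def adjust_to_known_totals(design_weight, sample_counts, known_totals):
    """Return one adjusted weight per group so weighted counts hit known totals."""
    adjusted = {}
    for group, count in sample_counts.items():
        estimated_total = design_weight * count          # e.g. 10 x 77 = 770
        factor = known_totals[group] / estimated_total   # e.g. 500 / 770
        adjusted[group] = design_weight * factor
    return adjusted


# Example 2 with non-response: 5 households selected, 3 responded, weight 4.
household_weight = adjust_for_nonresponse(4, n_selected=5, n_responding=3)

# Horror movies: weight 10, sample of 77 classic and 23 modern movies,
# known population totals of 500 each.
movie_weights = adjust_to_known_totals(
    10,
    {"classic": 77, "modern": 23},
    {"classic": 500, "modern": 500},
)

print(round(household_weight, 2), movie_weights)
```

After adjustment the three responding households again represent 20 households, and the sampled classic and modern movies each represent 500 titles: the classic weight drops below 10 and the modern weight rises above it.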

Other estimation methods

Using the weights to inflate the sample results is not the only estimation method that exists, but it is the simplest one and the only one that we will cover. Nevertheless, it is important to know that there exist some other methods that could lead to more precise estimates (e.g., using auxiliary information). The estimation process has to take into account the sampling design that was used. Otherwise, the resulting estimates could be severely biased.

Estimating the sampling error

As mentioned before, any estimates derived from samples are subject to what is called the sampling error. This comes from the fact that only a part of the population was observed, instead of the whole. A different sample could have produced different results. The amount of variation that exists among the estimates from the different possible samples is what constitutes the sampling error. (There are roughly 14 million different combinations of 6 numbers from 1 to 49, so imagine how many ways there are to select a sample of 25,000 Canadian households!) Of course, this sampling error is unknown, since we would need to know the answer for each unit of the population in order to calculate it. Nevertheless, it can be estimated by using the survey data. The extent of the sampling error depends on many things, including the sampling method, the estimation method, the sample size and the variability of the estimated characteristic. This is why each sample estimate has its own sampling error. This error should thus be approximated for each estimated total, average, proportion, etc. produced by the survey.
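The idea of variation among all possible samples can be illustrated with a toy case small enough to enumerate. In the Python sketch below, the population of six values is entirely hypothetical (it does not come from the text); in a real survey the population values are unknown, which is exactly why the sampling error has to be estimated from the sample instead.

```python
from itertools import combinations
from math import comb
from statistics import pstdev

# Number of ways to choose 6 numbers out of 49, as mentioned in the text.
lottery_combinations = comb(49, 6)

# Hypothetical population: number of cars in each of 6 households.
population = [0, 1, 1, 2, 2, 3]
n = 3  # sample size

# Sample mean for every possible sample of size n.
estimates = [sum(s) / n for s in combinations(population, n)]

true_mean = sum(population) / len(population)
average_estimate = sum(estimates) / len(estimates)

# Spread of the estimates across all possible samples.
spread = pstdev(estimates)

print(lottery_combinations, len(estimates), true_mean, average_estimate, spread)
```

Each of the 20 possible samples gives its own estimate. On average the estimates equal the true mean, but any single one may fall above or below it, and the spread of those estimates is what the sampling error measures.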

Examples of estimation using a simple random sampling design

Simple random sampling is the simplest of all sampling methods. Estimation using the simple random sampling method has been studied extensively. There are simple formulas to estimate the sampling error for many statistics when simple random sampling is used, especially since it is a self-weighting design. We present here the most common estimators for a population average (mean) and total, under simple random sampling.

Estimation of the population mean

In a simple random sample, the estimate of the population mean is identical to the mean of the sample:


$$\bar{x} = \frac{\sum x}{n}$$

where
 x = an observed value
 $\bar{x}$ = estimate of the population mean
 $\sum x$ = sum of all observed x values in the sample
 n = number of observations in the sample.

Note: Lowercase x and n should be used if you are referring to a sample survey and upper case X and N should be used when referring to a population.

If the sample results have been summarized in a frequency table, then the estimate for the population mean is again the sample mean, computed from the frequencies. Thus,

$$\bar{x} = \frac{\sum xf}{\sum f}$$

where

 x = an observed value
 f = the frequency of the value (the number of times that this value has been observed in the sample)
 $\bar{x}$ = estimate of the population mean
 $\sum xf$ = sum of all observed xf values (the product of each observed value and its frequency) in the sample
 $\sum f$ = sum of the frequencies in the sample.

Example 3: A farmer randomly selects 10 eggs from a gross (12 dozen, or 144 eggs) he finds in his hen house. He carefully weighs each egg.
The following weights were recorded in grams:

0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50

What is the mean weight of an egg in the gross?

Using the above formula, we can estimate the mean weight of all of the eggs:

$$\bar{x} = \frac{\sum x}{n} = \frac{6.40}{10} = 0.64 \text{ grams}$$
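The same calculation can be checked with a short Python sketch, both directly from the ten recorded weights and from a frequency table built from them. The variable names are illustrative; using `collections.Counter` to build the frequency table is a convenience chosen for this sketch.

```python
from collections import Counter

# Recorded egg weights from the example (in grams, as stated in the text).
egg_weights = [0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50]

# Direct formula: sum of observed values divided by the number of observations.
mean_direct = sum(egg_weights) / len(egg_weights)

# Frequency-table formula: sum of (value x frequency) divided by sum of frequencies.
frequency_table = Counter(egg_weights)   # maps each value x to its frequency f
mean_from_frequencies = (
    sum(x * f for x, f in frequency_table.items()) / sum(frequency_table.values())
)

# Both results are approximately 0.64 (up to floating-point rounding).
print(round(mean_direct, 2), round(mean_from_frequencies, 2))
```

The two formulas give the same estimate because the frequency table only groups identical observations together before adding them up.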

Estimation of the population total

For a simple random sample, the estimation formula of a total for the population is

$$\hat{X} = \frac{N}{n} \sum x$$

where
 x = an observed value
 $\hat{X}$ = estimated population total
 $\sum x$ = sum of all observed x values in the sample
 n = number of observations in the sample
 N = total number of observations in the population.

This is simply the estimate of the mean multiplied by the number of units in the population. In the previous example, the estimated mean weight of an egg is 0.64 grams, so the estimated total weight of the 144 eggs is 92.16 grams (144 x 0.64 = 92.16 grams).

If the sample results have been summarized in a frequency table, then the estimation formula for the population total is

$$\hat{X} = N \, \frac{\sum xf}{\sum f}$$

where
 x = an observed value
 $\hat{X}$ = estimated population total
 $\sum xf$ = sum of all observed xf values in the sample
 $\sum f$ = sum of frequencies in the sample
 N = total number of observations in the population.
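As a rough check of the total estimate for the egg example, the following Python sketch applies the formula for a simple random sample. The function name `estimate_total` is hypothetical and only valid under the simple random sampling assumption described in the text.

```python
def estimate_total(sample_values, population_size):
    """Estimated population total under simple random sampling: N x (sum of x / n)."""
    return population_size * sum(sample_values) / len(sample_values)


# Egg weights from the example and the size of the gross.
egg_weights = [0.75, 0.70, 0.55, 0.50, 0.60, 0.65, 0.75, 0.65, 0.75, 0.50]
N_EGGS = 144

estimated_total_weight = estimate_total(egg_weights, N_EGGS)

# Approximately 92.16, matching 144 x 0.64 in the text.
print(round(estimated_total_weight, 2))
```

Equivalently, each sampled egg could be given a sampling weight of 144 ÷ 10 = 14.4 and the weighted values added up, which ties this formula back to the sampling weights introduced at the start of the section.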