Methodology of the Canadian Labour Force Survey
Chapter 6 Weighting and estimation
6.0 Introduction
Estimation is the survey process by which estimates
of unknown population parameters are produced using data from a sample,
possibly in combination with auxiliary information from other sources. Examples of population parameters of interest
include population totals, means and ratios, as well as their averages over a
number of survey months.
Labour Force Survey (LFS) estimates are
produced using weights attached to each person for whom LFS data is available.
This chapter describes the steps involved in deriving final weights for
estimation. Section 6.1 describes the calculation of design weights that
reflect the sample design described in Chapter 2. Section 6.2 describes how the
design weights are adjusted for nonresponding households to become what are
called the subweights. Section 6.3 describes the composite calibration that is
applied to the subweights to ensure consistency with external estimates of
population, account for undercoverage and improve the efficiency of the
estimates. This section also describes the integrated method of weighting,
which ensures a common final weight for every person within a household.
Finally, Section 6.4 describes how the weights are used to calculate some of
the main population parameters estimated by the LFS.
6.1 Design weight
The design weight of a person $l$ is equal to the inverse of his or her
probability of being selected in the sample,
$\pi_l$. This can be denoted by
$w_l^{D} = 1/\pi_l$. The
design weight is often interpreted as the number of units in the target
population that the sampled unit represents. Since every person of a selected
household is included in the sample, computing the selection probability of a
given person is equivalent to computing the probability that the person’s
household is selected.
6.1.1 Basic weight
As described in Section 2.6.2, the overall
selection probability of household $k$ in PSU $j$ in rotation group $i$ of stratum $h$ is
$\pi_{hijk} = 1/R_h$, for all households in stratum $h$. Recall that
$R_h$ is
the rounded inverse sampling ratio (ISR) for stratum $h$, as established during the allocation of the sample.
In all provinces except Prince Edward Island
(PEI), the LFS uses a two-stage sampling design to select households. As such,
the derivation of the basic weights is different for PEI than for the rest of
the provinces; however, because the dwellings are selected systematically
according to the stratum ISR, the selection probability in PEI is
$1/R_h$, as in the other provinces.
Since the LFS data is collected for every
eligible person within a selected household, the basic selection probability of
a person $l$ in stratum $h$ of any province is
$\pi_l^{B} = 1/R_h$ and
his or her basic weight is
$$w_l^{B} = 1/\pi_l^{B} = R_h.$$
This sampling design is self-weighting within strata because it has a
constant basic weight within each stratum.
The design weights would be equal to the basic weights if the sampling
design and the population remained unchanged. However, the primary
sampling units (PSUs) grow over time while the systematic sampling
rate is fixed, which, left uncorrected, would lead to an ever-increasing sample size. To avoid
this, the sample size is controlled through the sampling procedures described
in Section 3.3.2: PSUs can be sub-sampled using the PSU sub-sampling method or
the sub-clustering method, or the stratum can be redesigned based on updated
information. These methods change the basic selection probability of households
(and people). It is thus necessary to adjust the basic weights to create
cluster-specific weights that compensate for these sampling procedures.
6.1.2 Cluster weight
Cluster weights are used for strata with a
two-stage design, i.e., the strata
for all provinces except PEI. A cluster corresponds to a PSU in these strata.
In population centres, construction can cause the number of dwellings in some
clusters to grow substantially over time. Interviewers are assigned clusters,
and if significant growth occurs in one or more of their clusters, their workload
would also grow substantially. This could affect the quality of the
interviewer’s work and his or her ability to complete the assignment. When the
number of dwellings in a cluster increases to more than double the initial
level, without becoming too extreme, the cluster may be randomly sub-sampled
using either the cluster / mechanical sub-sampling or sub-clustering method.
These methods of sub-sampling modify the selection probabilities of households.
As a result, the basic weight
$w_l^{B}$ is modified by a cluster adjustment factor
$a_l^{C}$ to give the cluster weight
$$w_l^{C} = w_l^{B}\, a_l^{C}.$$
Unfortunately, the self-weighting property is lost when either of these
methods is used. Additional details of these methods can be found in Kennedy
(1998). When growth is extreme, sub-sampling may not be practical, and the
stratum is updated as described below under Stratum updates.
Cluster sub-sampling
This method is the simplest and
most common of all sub-sampling methods. The sampling rate is modified to reduce
the number of households selected in the cluster. If the cluster was originally sampled at a rate of
$1/R_h$ and sub-sampling leads to a sampling rate of
$1/R_h'$, then the cluster adjustment factor is
$a_l^{C} = R_h'/R_h$. The basic weights of interviewed households
are multiplied by this factor. In order to use this method, the growth has to
be sufficient to warrant a factor of at least two. Due to outlier problems
encountered by special surveys that use the LFS frame, the maximum value of the
cluster adjustment factor is three.
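As a rough illustration of the cluster adjustment, the sketch below computes the factor as the ratio of the sub-sampled inverse sampling ratio to the original one and applies it to a basic weight. The function name, the example rates (1 in 100 reduced to 1 in 250) and the choice to reject factors outside the range of two to three are assumptions made for this example, not part of the LFS production system.

```python
from fractions import Fraction


def cluster_adjustment_factor(original_isr, subsampled_isr):
    """Ratio of the sub-sampled inverse sampling ratio to the original one."""
    factor = Fraction(subsampled_isr, original_isr)
    # Assumed reading of the text: the method applies only when the factor
    # is at least 2, and the factor is not allowed to exceed 3.
    if factor < 2 or factor > 3:
        raise ValueError("cluster adjustment factor must lie between 2 and 3")
    return factor


# Hypothetical cluster: originally sampled at 1 in 100, sub-sampled at 1 in 250.
basic_weight = 100
factor = cluster_adjustment_factor(100, 250)
cluster_weight = basic_weight * factor
print(factor, cluster_weight)
```

Exact fractions are used so that the weight is not affected by floating-point rounding before the later adjustment steps.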
Sub-clustering
When a cluster more than triples
in size and street patterns are well defined, the growth cluster is divided
into 4 or more sub-clusters. A sample of 2 of these smaller sub-clusters is
taken and then a sample of households is selected within each selected sub-cluster.
This procedure is equivalent to adding another stage of sampling within growth
clusters. It does not change the selection probability of clusters, but it does
change the selection probability of households within growth clusters. The cluster sub-weight represents this selection
process.
Stratum updates
When growth is so extreme that
the sub-sampling processes described above are insufficient, then a stratum
update is required, as described in Section 3.3.2. Updated counts of dwellings
for all clusters in the stratum are required and new clusters are formed by sub-clustering
existing clusters in the frame based on the new counts. An update to the stratum sample is
implemented, based on Keyfitz (1951), as modified by Drew, Choudhry, and Gray
(1978), retaining as many of the originally selected PSUs as possible. The new
sample is phased-in over six months. An interim weighting factor is applied to
all PSUs in the stratum until completion of the phase-in. This weighting factor
adjusts for the new knowledge derived from the latest count of dwellings that
is not otherwise reflected in the active sample.
6.1.3 Stabilization weight
The final stage of sampling is conducted using systematic sampling at a
fixed rate. As the same sampling rate is used consistently over time, growth in
the population, and hence in the number of households, would lead to an
ever-increasing sample size and escalating survey costs if sample stabilization
were not carried out. Sample stabilization consists of randomly sub-selecting
households from the sample in order to maintain the sample size at its planned
level. This random selection procedure is performed using systematic sampling
within each stabilization area and independently between stabilization areas. A
stabilization area is defined as containing all households belonging to the
same Employment Insurance Economic Region (EIER) and the same rotation group.
Sample stabilization modifies the selection
probability of households. As a result, the cluster weight
$w_l^{C}$ is modified by a stabilization adjustment
factor
$a_l^{S}$ to give the stabilization weight
$w_l^{S} = w_l^{C}\, a_l^{S}$. By definition, the design weight of a person $l$ in stratum $h$,
$w_l^{D}$, is
equal to his or her stabilization weight
$w_l^{S}$, i.e.,
$$w_l^{D} = w_l^{S} = w_l^{B}\, a_l^{C}\, a_l^{S}.$$
Calculating the stabilization adjustment
The stabilization adjustment factor
$a_l^{S}$ is
computed separately within sub-areas. A sub-area is defined as all strata
within a stabilization area that have a common sampling fraction. Stabilization
weighting departs slightly from the principle of weighting by the inverse of
the selection probability since it is performed within sub-areas and not within
stabilization areas. Such a weighting procedure is often called
poststratification, with the poststrata being the sub-areas in this case.
To give a simplified example, suppose that there is a stabilization area in which all households have a basic
selection probability of 1 in 200 at the time of design and a common cluster
adjustment factor of 1. In this simplified example, the stabilization area is
thus not partitioned into sub-areas. If the stabilization area has a planned
sample size of 300 households at the time of design, and if the sampling rates
used in fact yield 350 households, then 50 households must be dropped randomly
from the stabilization area. This changes the selection probability of
households from 1 in 200 to 3 in 700 (i.e.,
1/200 times 300/350). The basic weight of 200 is thus multiplied by the factor
350/300 to yield the stabilization weight 700/3 ≈ 233.33.
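The arithmetic of this simplified example can be checked with a short sketch. The function name is hypothetical, and returning a factor of 1 when no households need to be dropped is an assumption made for the example.

```python
from fractions import Fraction


def stabilization_factor(selected_households, planned_households):
    """Factor by which weights grow after households are randomly dropped."""
    if selected_households <= planned_households:
        # Assumption: nothing is dropped, so the weights are left unchanged.
        return Fraction(1)
    return Fraction(selected_households, planned_households)


basic_weight = 200      # selection probability of 1 in 200
cluster_factor = 1      # common cluster adjustment factor in the example
factor = stabilization_factor(350, 300)

stabilization_weight = basic_weight * cluster_factor * factor
selection_probability = 1 / stabilization_weight
print(stabilization_weight, selection_probability)
```

The resulting selection probability of 3 in 700 is the inverse of the stabilization weight, matching the figures quoted in the text.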
Households that have one of the following two characteristics are excluded
from sample stabilization and stabilization weighting:
- Households belonging
to a cluster that has been subsampled using cluster sub-sampling or
sub-clustering as described in Section 6.1.2;
- Households living in a
recently-built dwelling, which has been added to the cluster list and was thus
not eligible to be dropped (interviewer selected dwelling).
Since such households do not get a chance to be dropped from the sample,
they are excluded from stabilization weighting as well.
6.2 Subweight
While an attempt is made to interview all households in the selected
sample s, refusals and other factors
make it impossible to contact some households. Part of this household
nonresponse is first treated by using a longitudinal imputation method (see Section
5.3.3). Then, the remaining nonrespondent households are treated by removing them
from the file and adjusting the design weights of responding households,
including those that have been imputed, by a nonresponse adjustment factor. The
basic principle consists of determining an appropriate model for the unknown
response probabilities and then computing the nonresponse adjustment factors as
the inverse of the estimated response probabilities.
In the LFS, the nonresponse model used is the
uniform nonresponse model within classes. With this model, all households
within a given nonresponse class $c$ are assumed to have the same response probability
$p_c$. The estimated response probability
$\hat{p}_c$ is simply the design-weighted response rate of
households within class $c$. The
nonresponse adjustment factor for a person $l$ belonging to a responding household in class $c$ is
$a_l^{NR} = 1/\hat{p}_c$ and the nonresponse adjusted weight, or the
subweight, is
$$w_l^{SUB} = w_l^{D}\, a_l^{NR}.$$
Every person within a given responding household has the same
nonresponse adjustment factor and thus the same subweight.
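A minimal sketch of this adjustment, working at the household level, is given below. The record layout (keys `id`, `nr_class`, `design_weight`, `responded`) and the four example households are invented for illustration; `responded` is meant to cover both responding and imputed households.

```python
from collections import defaultdict


def compute_subweights(households):
    """Divide each responding household's design weight by the
    design-weighted response rate of its nonresponse class."""
    total = defaultdict(float)
    responding = defaultdict(float)
    for hh in households:
        total[hh["nr_class"]] += hh["design_weight"]
        if hh["responded"]:
            responding[hh["nr_class"]] += hh["design_weight"]

    subweights = {}
    for hh in households:
        if hh["responded"]:
            rate = responding[hh["nr_class"]] / total[hh["nr_class"]]
            subweights[hh["id"]] = hh["design_weight"] / rate
    return subweights


# Hypothetical class "A": four households, one nonrespondent.
households = [
    {"id": "h1", "nr_class": "A", "design_weight": 200.0, "responded": True},
    {"id": "h2", "nr_class": "A", "design_weight": 200.0, "responded": True},
    {"id": "h3", "nr_class": "A", "design_weight": 200.0, "responded": True},
    {"id": "h4", "nr_class": "A", "design_weight": 200.0, "responded": False},
]
subweights = compute_subweights(households)
print(subweights)
```

Within each class, the subweights of the responding households add up to the design-weighted total of all sampled households in that class, which is the purpose of the adjustment.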
6.2.1 Nonresponse classes
The key to reducing nonresponse bias is to determine nonresponse classes
that explain the unknown nonresponse mechanism well and that are constructed in
such a way that the assumption of constant response probability within classes
is reasonable. From an efficiency point of view, it is also desirable that
nonresponse classes be as homogeneous as possible with respect to the main
variables of interest, that is, classes should be formed in such a way that the
respondents within a given class are similar to nonrespondents in terms of the
main variables of interest. As a result, variables used to construct classes
should be explanatory for the nonresponse mechanism and also for the main
variables of interest.
In the LFS, every Aboriginal or high-income stratum forms a separate
nonresponse class. The remaining classes are defined by crossing the variables PROVINCE,
EIER, TYPE and ROTATION (excluding households belonging to an Aboriginal or
high-income class). The variable TYPE has four categories and indicates the
type of stratum to which a household belongs: Remote, Rural, Urban non-Census
Metropolitan Area (CMA) (including PEI one-stage strata) and Urban CMA. The
variable ROTATION corresponds to the six rotation groups. Note that the nonresponse classes do
not overlap and, collectively, they cover the entire population. Collapsing of
classes is performed when a nonresponse adjustment factor is greater than two
in a given class. This is done by
removing the last class variable (ROTATION) and recalculating the nonresponse
adjustment factors among the redefined classes (PROVINCE by EIER by TYPE). The problematic class and all other classes (i.e., rotation groups) within the same PROVINCE, EIER and TYPE then receive the new
adjustment factor. Nonresponse classes are collapsed
to avoid large nonresponse adjustment factors, which tend to increase
the variability of the estimates.
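The collapsing rule can be sketched as follows for the classes defined by crossing PROVINCE, EIER, TYPE and ROTATION (Aboriginal and high-income classes are left out of the sketch). All function names, the record layout and the example data are hypothetical.

```python
from collections import defaultdict


def class_factors(households, key):
    """Nonresponse adjustment factor (inverse weighted response rate) per class."""
    total, responding = defaultdict(float), defaultdict(float)
    for hh in households:
        total[key(hh)] += hh["design_weight"]
        if hh["responded"]:
            responding[key(hh)] += hh["design_weight"]
    return {k: total[k] / responding[k] if responding[k] else float("inf")
            for k in total}


def fine_key(hh):
    return (hh["province"], hh["eier"], hh["type"], hh["rotation"])


def coarse_key(hh):
    return fine_key(hh)[:3]  # ROTATION removed


def collapsed_factors(households, max_factor=2.0):
    fine = class_factors(households, fine_key)
    coarse = class_factors(households, coarse_key)
    # Any PROVINCE x EIER x TYPE group with an offending rotation class is collapsed.
    flagged = {k[:3] for k, f in fine.items() if f > max_factor}
    return {k: coarse[k[:3]] if k[:3] in flagged else f for k, f in fine.items()}


def make_households(rotation, n, n_resp, eier="E1"):
    """Hypothetical helper: n households of weight 100, the first n_resp responding."""
    return [{"province": "ON", "eier": eier, "type": "Rural", "rotation": rotation,
             "design_weight": 100.0, "responded": i < n_resp} for i in range(n)]


sample = (make_households(1, 3, 1) + make_households(2, 3, 3)
          + make_households(1, 4, 3, eier="E2"))
factors = collapsed_factors(sample)
print(factors)
```

In this example the first rotation group of "E1" would have a factor of 3, so both rotation groups of "E1" receive the collapsed factor of 1.5, while the "E2" class keeps its own factor.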
6.3 Final weight
The last step of the weighting process is to derive the final weights,
which are used to obtain official estimates. Composite calibration and the
integrated method of weighting are used to produce the final weights. The
integrated method of weighting is used to ensure a common final weight for
every person in the household.
6.3.1 Composite calibration
Calibration is used for the following three
reasons: to ensure consistency with Census projected estimates and with all
surveys using these Census estimates; to account for undercoverage; and to
improve the efficiency of the estimates. To account for undercoverage and
improve the efficiency of the estimates, auxiliary variables used in
calibration must be correlated with the main variables of interest. One way to
achieve this goal is to choose auxiliary variables by modelling the variables
of interest. For example, an appropriate model can show that being employed or
unemployed is related to the age and sex of a person.
The LFS uses composite calibration (or
regression composite estimation) to produce the final weights. Composite calibration
is essentially the same as calibration, except that some control totals are
estimates from the previous month’s survey data and the auxiliary variables
associated with these control totals are imputed for some units.
Composite calibration can lead to substantial
improvement in the efficiency of the estimates if there is a strong
month-to-month correlation in the information collected. Such improvement is due to the overlapping
nature of the LFS sample. On the one hand, gains in efficiency are obtained
because composite calibration uses information obtained in the previous month
from the exit rotation group. On the other hand, some efficiency is lost
because of missing values in the birth rotation group. Overall, it was
found empirically that composite calibration is beneficial in the LFS.
Like calibration, composite calibration is a
technique that finds weights
$w_l^{CAL}$, for
all people $l \in s_r$, where
$s_r$ is the subset of people from the sample $s$ who belong to a responding or imputed household, that are as close as possible to the subweights
$w_l^{SUB}$, subject to some constraints. More formally, composite
calibration weights,
$w_l^{CAL}$, are obtained in the LFS by minimizing the
distance function
$$\sum_{l \in s_r} \frac{\left(w_l^{CAL} - w_l^{SUB}\right)^2}{w_l^{SUB}} \qquad (6.1)$$
subject
to two sets of constraints: calibration constraints, and composite calibration
constraints.
The first set of constraints, the calibration
constraints, require that estimates based on the weights,
$w_l^{CAL}$, for a vector of auxiliary variables
$\mathbf{x}_l$,
$\sum_{l \in s_r} w_l^{CAL}\,\mathbf{x}_l$, are equal to the vector of known population
totals,
$\mathbf{T}_x$. In
other words, the calibration constraints can be given by
$\sum_{l \in s_r} w_l^{CAL}\,\mathbf{x}_l = \mathbf{T}_x$. In the LFS, these known population totals,
often called control totals, are Census estimates projected to the current
month for the number of people aged 15 and over in Economic Regions (ERs) and
CMAs/Census Agglomerations (CAs), and for the number of people in 24 age-sex
groups by province. Additional control totals are used to ensure that the
estimated number of people aged 15 and over is the same for each rotation
group. To perform calibration, the vector
$\mathbf{x}_l$ must be known for every person
$l \in s_r$. In the case of the LFS, this means that the
age-sex group, ER, and CMA/CA of each person
$l \in s_r$ must
be known.
The second set of constraints, the composite
calibration constraints, involve control totals that are estimates from the
previous month’s survey data, and auxiliary variables associated with these
estimated control totals. The auxiliary variables may not be known for all
people
$l \in s_r$ and are
thus imputed for some. These control totals and auxiliary variables are called
composite control totals and composite auxiliary variables respectively. There
are 28 composite auxiliary variables for each province and they are all defined
with respect to the previous month’s survey data (see Appendix G for a complete
list).
Imputation of auxiliary control variables
If the vector of composite auxiliary variables
for unit $l$, denoted by
$\mathbf{x}_{t-1,l}^{C}$, is defined for the previous month (month
$t-1$), the
corresponding vector of estimated control totals, denoted by
$\hat{\mathbf{T}}_{t-1}^{C}$, must also be computed using the previous
month’s data. The vector of composite auxiliary variables
$\mathbf{x}_{t-1,l}^{C}$ is not observed for people in the birth
rotation group since they were not interviewed in the previous month. Imputation
is used to fill in missing values for these units using a combination of two
imputation methods.
In the first method, mean imputation is used to obtain the modified
vector:
$$\mathbf{x}_l^{(1)} = \begin{cases} \mathbf{x}_{t-1,l}^{C}, & l \in s_r \setminus s_b, \\ \hat{\mathbf{T}}_{t-1}^{C} / N, & l \in s_b, \end{cases}$$
where
$s_b$ is the subset of people
$l \in s_r$ who
belong to the birth rotation group and
$N$ is
the provincial number of people aged 15 and over. In a previous empirical
study, it was found that this imputation method was efficient for estimating
population parameters defined at the current month $t$.
In the second imputation method, the modified vector
$\mathbf{x}_l^{(2)}$ is
defined as:
$$\mathbf{x}_l^{(2)} = \begin{cases} \mathbf{x}_{t,l}^{C} + \delta_l^{-1}\left(\mathbf{x}_{t-1,l}^{C} - \mathbf{x}_{t,l}^{C}\right), & l \in s_r \setminus s_b, \\ \mathbf{x}_{t,l}^{C}, & l \in s_b, \end{cases}$$
where
$\mathbf{x}_{t,l}^{C}$ is the vector
$\mathbf{x}_{t-1,l}^{C}$ defined
at the current month $t$ and
$\delta_l$ is
the probability that
$l \notin s_b$ given that
$l \in s_r$. In the LFS,
$\delta_l = 5/6$, for
$l \in s_r$, and is replaced in the previous equation by
the estimate
$\hat{\delta}_l$. Essentially, the idea is to perform
carry-backward imputation (imputation by current month’s values to fill in
previous month’s values) to impute
$\mathbf{x}_{t-1,l}^{C}$ for the
birth rotation group since it is known that there is a strong month-to-month
correlation for the composite auxiliary variables. However, the values of
$\mathbf{x}_{t-1,l}^{C}$ in the
non-birth rotation groups are modified due to the fact that carry-backward
imputation eliminates change for people in the birth rotation group. The
correction in the non-birth rotation group is determined so as to preserve the
property of asymptotic unbiasedness of the estimates. In a previous empirical
study, it was found that this imputation method (which determines
$\mathbf{x}_l^{(2)}$) was efficient for estimating population parameters defined as differences
between two successive months.
As stated, neither
$\mathbf{x}_l^{(1)}$ nor
$\mathbf{x}_l^{(2)}$ is actually used on its own in the survey. Instead, a
combination of the two methods is used.
The composite auxiliary variables are defined as
$$\mathbf{x}_l^{C} = (1-\alpha)\,\mathbf{x}_l^{(1)} + \alpha\,\mathbf{x}_l^{(2)},$$
where
$\alpha$ is
a tuning constant that equals 2/3. This leads to a compromise between the two
imputation methods. A study on the choice of
$\alpha$ can
be found in Chen and Liu (2002). Alternative imputation methods have also been
studied in Bocci and Beaumont (2005) using the idea of calibrated imputation.
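To make the combination of the two imputation methods concrete, the sketch below applies it to a single scalar composite auxiliary variable, such as an indicator of being employed in the previous month. Several details are assumptions made for illustration: the probability of belonging to a non-birth rotation group is taken as 5/6 (five of the six rotation groups), the non-birth correction adds the month-to-month change divided by that probability to the current value, and the tuning constant of 2/3 is applied to the carry-backward method. The function and parameter names are hypothetical.

```python
def composite_auxiliary(prev_value, curr_value, in_birth_group, mean_prev,
                        delta=5 / 6, alpha=2 / 3):
    """Combine mean imputation and corrected carry-backward imputation."""
    if in_birth_group:
        # Previous-month value is not observed for the birth rotation group.
        mean_imputed = mean_prev      # method 1: previous-month mean
        carry_back = curr_value       # method 2: current value carried backward
    else:
        mean_imputed = prev_value     # method 1: observed value is kept
        # Method 2: the change is inflated to offset the absence of change
        # among carry-backward imputed units in the birth rotation group.
        carry_back = curr_value + (prev_value - curr_value) / delta
    return (1 - alpha) * mean_imputed + alpha * carry_back


# Birth rotation group, employed this month, previous-month mean of 0.6.
print(composite_auxiliary(None, 1, True, 0.6))
# Non-birth rotation group, employed last month but not this month.
print(composite_auxiliary(1, 0, False, 0.6))
# Non-birth rotation group, no change in status: the observed value is returned.
print(composite_auxiliary(1, 1, False, 0.6))
```

A person in a non-birth rotation group whose status did not change keeps the observed previous-month value under both methods, so only movers and birth-group members are affected by the choice of tuning constant.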
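The LFS composite calibration
weights
$w_l^{CAL}$ are therefore
obtained by minimizing the distance function given by Equation (6.1), subject
to both sets of constraints:
$$\sum_{l \in s_r} w_l^{CAL}\,\mathbf{x}_l = \mathbf{T}_x \quad \text{and} \quad \sum_{l \in s_r} w_l^{CAL}\,\mathbf{x}_l^{C} = \hat{\mathbf{T}}_{t-1}^{C}.$$
The minimization leads to the composite calibration
weights
$$w_l^{CAL} = w_l^{SUB}\, g_l,$$
where
the composite calibration adjustment factor
$g_l$ is
given by
$$g_l = 1 + \left(\mathbf{T} - \sum_{k \in s_r} w_k^{SUB}\,\mathbf{z}_k\right)' \left(\sum_{k \in s_r} w_k^{SUB}\,\mathbf{z}_k \mathbf{z}_k'\right)^{-1} \mathbf{z}_l,$$
with $\mathbf{z}_l = (\mathbf{x}_l', \mathbf{x}_l^{C\prime})'$ the stacked vector of auxiliary and composite auxiliary variables and $\mathbf{T} = (\mathbf{T}_x', \hat{\mathbf{T}}_{t-1}^{C\prime})'$ the stacked vector of control totals.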
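The following sketch illustrates calibration of this kind in plain Python, assuming a chi-square distance between the calibrated weights and the starting weights, which is the usual choice for linear (regression) calibration and yields a closed-form adjustment factor. The function names, the three auxiliary variables (two sex indicators and an indicator of being employed in the previous month) and all numbers are invented for illustration and do not reproduce the LFS control totals.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: auxiliary variables are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                ratio = m[r][col] / m[col][col]
                m[r] = [v - ratio * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def calibrate(start_weights, aux, controls):
    """Linear calibration: weights closest to start_weights (chi-square
    distance) whose weighted totals of aux equal the control totals."""
    k = len(controls)
    pairs = list(zip(start_weights, aux))
    estimates = [sum(w * x[j] for w, x in pairs) for j in range(k)]
    gram = [[sum(w * x[i] * x[j] for w, x in pairs) for j in range(k)]
            for i in range(k)]
    lam = solve(gram, [t - e for t, e in zip(controls, estimates)])
    # Adjustment factor for each person: 1 + x' lambda.
    return [w * (1 + sum(l * xj for l, xj in zip(lam, x))) for w, x in pairs]


# Hypothetical sample of five people: [male, female, employed last month].
subweights = [100.0, 100.0, 120.0, 120.0, 110.0]
aux = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
controls = [210.0, 360.0, 340.0]

calibrated = calibrate(subweights, aux, controls)
print([round(w, 2) for w in calibrated])
```

Nothing in this closed form prevents a calibrated weight from being negative, which is why a separate treatment of negative weights is needed.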
Additional details about LFS composite
calibration can be found in Singh, Kennedy and Wu (2001), Fuller and Rao (2001)
and Gambino, Kennedy and Singh (2001). Gambino, Kennedy and Singh (2001) also
discuss issues related to missing and out-of-scope people at the previous month
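in the non-birth rotation groups. Missing values are imputed using random
hot-deck imputation and
$\mathbf{x}_{t-1,l}^{C} = \mathbf{0}$ is
assigned to out-of-scope people at the previous month. The idea is to determine
$\mathbf{x}_l^{C}$ so
that
$\sum_{l \in s_r} w_l^{SUB}\,\mathbf{x}_l^{C}$ remains, like
$\hat{\mathbf{T}}_{t-1}^{C}$, an estimate of the unknown vector of control
totals
$\mathbf{T}_{t-1}^{C}$, which is defined for the previous month.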
Missing values and out-of-scope people at the current month are dealt with in
the usual way.
6.3.2 Integrated method of weighting
Since some auxiliary variables and all
composite auxiliary variables are defined at the person level, the composite
calibration weights
$w_l^{CAL}$ are not
constant within a household, unlike the subweights
$w_l^{SUB}$. This does not pose a problem as long as the
interest is in estimating person-related population parameters, such as the
total number of people employed in the population. However, in the LFS, there
is also sometimes interest in estimating household-related population
parameters. For example, there may be interest in estimating the total number
of households having a certain characteristic, such as having at least one
member employed. There is more than one weighting alternative for such
population parameters.
In order to avoid producing two sets of final
weights, the integrated method of weighting was introduced in the LFS to obtain
a unique set of final weights that can be used for both person-related and
household-related population parameters; see Lemaître and Dufour (1987). With
this method, the final composite calibration weight is constant for all the
people within a household. This is achieved by replacing
$\mathbf{x}_l$ and
$\mathbf{x}_l^{C}$ for a
given person $l$ by the average of
$\mathbf{x}_l$ and
$\mathbf{x}_l^{C}$ over all members of his or her household and
then computing the composite calibration weights as in Section 6.3.1. This
ensures a common final weight for all people within the same household. This
additional constraint on the final weights is expected to reduce the efficiency
of the estimates. However, Pandey, Alavi and Beaumont (2003) have found
empirically that the reduction in efficiency is small in the context of the
LFS.
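The household averaging step can be sketched as below. The function name and the tuple layout (person, household, auxiliary vector) are assumptions made for the example.

```python
from collections import defaultdict


def household_mean_vectors(persons):
    """Replace each person's auxiliary vector by the mean over the household.

    persons: list of (person_id, household_id, aux_vector) tuples.
    """
    members = defaultdict(list)
    for _, household_id, vector in persons:
        members[household_id].append(vector)
    means = {h: [sum(col) / len(vectors) for col in zip(*vectors)]
             for h, vectors in members.items()}
    return {person_id: means[household_id]
            for person_id, household_id, _ in persons}


# Hypothetical two-person household H1 and one-person household H2.
persons = [("p1", "H1", [1, 0]), ("p2", "H1", [0, 1]), ("p3", "H2", [1, 0])]
averaged = household_mean_vectors(persons)
print(averaged)
```

Because the calibration adjustment factor depends on a person only through his or her auxiliary vector, and the subweight is already common within a household, identical averaged vectors lead to identical calibrated weights for all household members, while household totals of the auxiliary variables are unchanged.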
6.3.3 Treatment of negative weights and rounding
Sometimes calibration results in negative weights. In this situation,
composite calibration is performed again on the post-calibration weights, with
the negative weights reset to their subweights. If after this second round of
composite calibration there are still negative weights, then these negative
weights are set equal to 1 and it is accepted that the composite calibration
constraint will not be perfectly satisfied. This rarely occurs. After both
rounds of composite calibration the weight is rounded to the nearest integer,
producing the final weight.
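The control flow of this treatment can be sketched as follows. The calibration step itself is passed in as a function, and the stub used here simply returns pre-set values so that the flow can be followed; the names, the stub and the rounding of ties upward are assumptions made for the example.

```python
import math


def finalize_weights(subweights, calibrate):
    """Apply up to two rounds of calibration, then round to integers."""
    weights = calibrate(subweights)
    if any(w < 0 for w in weights):
        # Second round: start from the post-calibration weights, with
        # negative weights reset to their subweights.
        restart = [s if w < 0 else w for w, s in zip(weights, subweights)]
        weights = calibrate(restart)
        # Any weight that is still negative is set to 1.
        weights = [1.0 if w < 0 else w for w in weights]
    # Round to the nearest integer (ties rounded up, by assumption).
    return [math.floor(w + 0.5) for w in weights]


# Stub standing in for composite calibration: negative weights in both rounds.
rounds = iter([[-3.0, 250.6], [-0.5, 249.2]])
received = []


def stub_calibrate(start_weights):
    received.append(list(start_weights))
    return next(rounds)


final = finalize_weights([100.0, 200.0], stub_calibrate)
print(final)
```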
6.4 Estimation
Once the final weights have been calculated, they are used to estimate
several types of population parameters, including the following examples of
totals, rates and moving averages.
Each month, the LFS calculates the number of
employed people in the population. If
$y_l$ is a binary variable indicating whether a given
person $l$ of the population is
employed ($y_l = 1$)
or not ($y_l = 0$), the population total $Y$ represents the number of employed people in the population $P$. The population total is calculated as
$$Y = \sum_{l \in P} y_l.$$
Using the final weights, this population total
can be estimated by
$$\hat{Y} = \sum_{l \in s_r} w_l^{F}\, y_l,$$
where
$s_r$ is the
subset of all the people from $s$ who
belong to a responding or imputed household and
$w_l^{F}$ is the
composite calibration weight, or final weight, attached to person $l$.
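The LFS also calculates the unemployment rate
each month. If
$y_l$ is
a binary variable indicating whether a given person $l$ of the population is unemployed ($y_l = 1$)
or not ($y_l = 0$)
and
$z_l$ is
a binary variable indicating whether person $l$ is in the labour force ($z_l = 1$)
or not ($z_l = 0$), then the population rate
$$R = \frac{\sum_{l \in P} y_l}{\sum_{l \in P} z_l} = \frac{Y}{Z}$$
represents the unemployment rate in the
population.
It can be estimated using the
final weights
$w_l^{F}$ by
$$\hat{R} = \frac{\sum_{l \in s_r} w_l^{F}\, y_l}{\sum_{l \in s_r} w_l^{F}\, z_l} = \frac{\hat{Y}}{\hat{Z}}.$$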
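As a small numerical illustration of weighted totals and rates, the sketch below uses four hypothetical respondents with invented final weights and labour force indicators; the function names are also assumptions.

```python
def estimate_total(final_weights, indicator):
    """Weighted total of a 0/1 indicator."""
    return sum(w * y for w, y in zip(final_weights, indicator))


def estimate_rate(final_weights, numerator, denominator):
    """Ratio of two weighted totals, e.g. unemployed over labour force."""
    return (estimate_total(final_weights, numerator)
            / estimate_total(final_weights, denominator))


# Hypothetical respondents: two employed, one unemployed, one not in the labour force.
final_weights = [250, 300, 200, 250]
employed = [1, 1, 0, 0]
unemployed = [0, 0, 1, 0]
in_labour_force = [1, 1, 1, 0]

employed_total = estimate_total(final_weights, employed)
unemployment_rate = estimate_rate(final_weights, unemployed, in_labour_force)
print(employed_total, round(unemployment_rate, 4))
```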
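As
well, every month, the LFS produces three-month moving average estimates of the
unemployment rates for each EIER using data from the three most recent months. If the $T$-month moving average of a total
at time $t$ is
$$\bar{Y}_t = \frac{1}{T} \sum_{\tau = t-T+1}^{t} Y_\tau$$
and it is estimated using the final weights by
$$\hat{\bar{Y}}_t = \frac{1}{T} \sum_{\tau = t-T+1}^{t} \hat{Y}_\tau,$$
then the estimated three-month moving average for the unemployment rate
can be calculated as
$$\hat{\bar{R}}_t = \frac{\hat{\bar{Y}}_t}{\hat{\bar{Z}}_t}, \quad \text{with } T = 3.$$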
Moving average estimates are used because they are more stable than monthly estimates; however, their interpretation is different since they estimate a different population parameter.
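A minimal sketch of the moving average rate is given below, assuming it is computed as the ratio of the moving averages of the two estimated totals (unemployed and labour force). The function name and the monthly figures are invented for illustration.

```python
def moving_average_rate(unemployed_totals, labour_force_totals, months=3):
    """Ratio of the moving averages of two series of monthly estimated totals,
    using the most recent `months` values of each series."""
    if len(unemployed_totals) < months or len(labour_force_totals) < months:
        raise ValueError("not enough months of data")
    unemployed_avg = sum(unemployed_totals[-months:]) / months
    labour_force_avg = sum(labour_force_totals[-months:]) / months
    return unemployed_avg / labour_force_avg


# Hypothetical monthly estimates for one region (oldest month first).
unemployed = [95.0, 90.0, 100.0, 110.0]
labour_force = [990.0, 1000.0, 1000.0, 1000.0]
rate = moving_average_rate(unemployed, labour_force)
print(rate)
```

Only the last three months enter the calculation, so the oldest month in each list is ignored.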