Methodology

Survey universe
The sample
Weighting
Cross-sectional representation
Data quality
Non-sampling errors (includes response rates and imputation)
Sampling errors
Standard error and coefficient of variation
Quality indicators
Suppression rules

Survey universe

SLID is a household survey that covers all individuals in Canada, excluding residents of the Yukon, the Northwest Territories and Nunavut, residents of institutions and persons living on Indian reserves or in military barracks. Overall, these exclusions amount to less than three percent of the population.

The sample

The samples for SLID are selected from the monthly Labour Force Survey (LFS) and thus share the latter's sample design. The LFS sample is drawn from an area frame and is based on a stratified, multi-stage design that uses probability sampling. The total sample is composed of six independent samples, called rotation groups because each month one sixth of the sample (or one rotation group) is replaced. For more information on the LFS design, refer to the Statistics Canada publication Methodology of the Canadian Labour Force Survey.

The SLID sample is composed of two panels. Each panel consists of two LFS rotation groups and includes roughly 17,000 households. A panel is surveyed for a period of six consecutive years. A new panel is introduced every three years, so two panels always overlap. With the 2008 reference year, a new SLID panel (panel 6) was selected from the LFS. This is the first SLID panel to be selected from the new LFS design introduced at the end of 2004. SLID panels 3 to 5 were selected from the 1994 LFS design and SLID panels 1 and 2 were selected from the 1984 LFS design.

For the reference years 1993 to 1997, the SLID cross-sectional sample was combined with the sample of the Survey of Consumer Finances (SCF).  The SCF samples were also selected from the LFS.  Each year, the SCF sample consisted of four LFS rotation groups.

Weighting

The estimation of population characteristics from a survey is based on the premise that each sampled unit represents, in addition to itself, a certain number of unsampled units in the population. A basic survey weight is attached to each record to indicate the number of units in the population that are represented by that unit in the sample.
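The role of the survey weight can be illustrated with a minimal sketch. The incomes and weights below are invented for illustration and the function names are hypothetical; they are not SLID values or SLID software.

```python
def weighted_total(values, weights):
    """Estimate a population total: each sampled value counts `weight` times."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Estimate a population mean as the weighted total over the sum of weights."""
    return weighted_total(values, weights) / sum(weights)


# Hypothetical sample of three persons (illustrative figures only).
incomes = [30000.0, 50000.0, 80000.0]
weights = [400.0, 250.0, 350.0]  # population units represented by each person

population = sum(weights)                     # estimated population size
mean_income = weighted_mean(incomes, weights)  # estimated average income
```

In this sketch the three sampled persons stand for an estimated population of 1,000, and the person with the largest weight contributes most to the estimated average.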

For each reference year, SLID produces two sets of weights: one is representative of the initial population (the longitudinal weights) while the other is representative of the current population (the cross-sectional weights).

For the production of longitudinal weights, three types of adjustments are applied to the basic survey weights in order to improve the reliability of the estimates. The basic weights are first inflated to compensate for non-response and then adjusted for influential values. These adjusted weights are then further adjusted to ensure that estimates on relevant population characteristics would respect population totals from sources other than the survey.

The first set of population totals used for SLID is based on Statistics Canada's Demography Division population counts for different age/sex groups as well as counts by household and family size at the provincial level. These annual population totals are based in large part on totals from the Census of population. The second set of totals is derived from Canada Revenue Agency (CRA) administrative data (T4 file) and is intended to ensure that the weighted distribution of income (based on wages and salaries) in the data set matches that of the Canadian population.
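The idea of adjusting weights to respect external population totals can be sketched with simple post-stratification on a single grouping variable. This is an assumption-laden simplification: SLID calibrates to several sets of totals at once, which requires a more general calibration method than the one shown. The group labels, weights and control totals are invented.

```python
from collections import defaultdict


def poststratify(records, control_totals):
    """Scale weights so each group's weighted count equals its control total.

    records: list of (group, weight) pairs.
    control_totals: dict mapping group -> external population total.
    Returns the adjusted weights in the same order as `records`.
    """
    weighted_counts = defaultdict(float)
    for group, weight in records:
        weighted_counts[group] += weight
    return [
        weight * control_totals[group] / weighted_counts[group]
        for group, weight in records
    ]


# Hypothetical age/sex groups and totals (illustrative only).
records = [
    ("F 25-44", 300.0),
    ("F 25-44", 500.0),
    ("M 25-44", 450.0),
    ("M 25-44", 450.0),
]
control_totals = {"F 25-44": 1000.0, "M 25-44": 990.0}
adjusted = poststratify(records, control_totals)
```

After the adjustment, the weights in each group sum to that group's control total, so any estimate of the group's size reproduces the external figure.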

The switch from 1996 to 2001 Census-based population totals for recent years and the use of T4 information from CRA were introduced with the release of data for 2003. SCF estimates from 1990 to 1995 and SLID estimates from 1996 to 2002 were revised back to 1990 at the same time.

For the production of the cross-sectional weights, SLID combines the two panels and assigns a probability of selection to individuals who joined the sample after the panel was selected.  As with the longitudinal weights, the cross-sectional weights are adjusted for non-response and influential values.  The cross-sectional weights are also adjusted to ensure that estimates on specific population characteristics respect totals of the cross-sectional target population.  The types of population totals are the same as those used for the longitudinal weights but correspond to the cross-sectional population.

Since 2002, a third set of weights has been produced that combines two overlapping panels to form a new longitudinal sample. These weights, referred to as the combined longitudinal weights, allow SLID data users to conduct longitudinal analyses using both panels. The analyses, however, are limited to the period of up to three years where the panels overlap and refer to the population at the time of selection of the most recent panel.

For a detailed description of the weighting process, refer to the publication Longitudinal and Cross-sectional Weighting of the Survey of Labour and Income Dynamics. For a description of the combined panel weighting, refer to the publication Combined-panel Longitudinal Weighting, Survey of Labour and Income Dynamics.

Cross-sectional representation

Each longitudinal sample, or "panel", in SLID initially constitutes a representative cross-sectional sample of the population. However, because the real population changes each year, whereas by design the longitudinal sample does not, the sample must be modified to properly reflect these changes to the composition of the population. This is done by adding to the sample all new people in the population who are found to be living with the initial respondents (and likewise dropping them from the sample if they leave at later time-points).

Any original respondents who leave the target population (by moving abroad, into institutions, etc.) are given a zero weight for cross-sectional purposes. In this way, the cross-sectional sample, composed of the original respondents minus those who left the target population plus those who have entered it, is virtually fully representative of the population at each subsequent time-point. The missing group is composed of persons who have newly entered the target population and are not living with anyone who was in the target population when the most recent panel was selected. However, since SLID introduces a new panel every three years, this group is quite small.

Data quality

There are two types of errors inherent in sample survey data, namely, non-sampling errors and sampling errors. The reliability of survey estimates depends on the combined impact of non-sampling and sampling errors. For more detailed information on data quality indicators see the research paper Data quality for the Survey of Labour and Income Dynamics (SLID).

Non-sampling errors

Non-sampling errors generally result from human errors such as simple mistakes, misunderstanding or misinterpretation. The impact of randomly occurring errors over a large number of observations will be minimal. Errors occurring systematically can, on the other hand, have a major impact on the reliability of estimates. Considerable time and effort are invested in reducing non-sampling errors in SLID.

Non-sampling errors may arise from a variety of sources such as coverage, response, non-response and processing errors.

Coverage error arises when sampling frame units do not exactly represent the target population. Units may have been omitted from the sampling frame (undercoverage), units not in the target population may have been included (overcoverage), or units may have been included more than once (duplicates). Undercoverage is the most common coverage problem.

Slippage is a measure of survey coverage error. It is defined as the percentage difference between control totals (Census population projections) and weighted sample counts. Slippage rates for household surveys are generally positive because some people who should be enumerated are missed. In Table A below, slippage rates from 1997 to 2005 are calculated using the 2001 Census population projections, while slippage rates from 2006 to 2010 are based on the 2006 Census population projections. According to Table A, in 2010 SLID covered 86.5% of its target population (100% minus the 13.5% slippage rate). SLID estimation procedures use Census population projections to compensate for this slippage.

Rates are also available upon request for sex, province and age groupings.

Table A
Person level slippage rates in SLID

Year        1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010
Canada (%)   8.4   9.0   8.4   9.5  10.6  12.4  13.4  14.2  14.5  16.0  16.3  13.3  13.0  13.5
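The slippage calculation can be sketched as follows. The sketch assumes that the percentage difference is taken relative to the control total, which the text does not state explicitly, and the population figures are invented; only the 13.5% rate for 2010 comes from Table A.

```python
def slippage_rate(control_total, weighted_count):
    """Percentage shortfall of the weighted sample count from the control total.

    Assumption: the control total is the denominator of the percentage.
    """
    return 100.0 * (control_total - weighted_count) / control_total


def coverage_rate(slippage_pct):
    """Share of the target population covered, given a slippage rate in percent."""
    return 100.0 - slippage_pct


# Hypothetical counts chosen to reproduce the 2010 rate shown in Table A.
example_slippage = slippage_rate(control_total=1_000_000, weighted_count=865_000)
example_coverage = coverage_rate(example_slippage)
```

A positive rate means the weighted sample falls short of the projected population, which is the usual situation for household surveys.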

Response errors may be due to many factors, such as faulty questionnaire design, interviewers' or respondents' misinterpretation of questions, or respondents' faulty reporting. Great effort is invested in SLID to reduce the occurrence of response error. Measures undertaken to minimize response errors include the use of highly-skilled and well-trained interviewers, and supervision of interviewers to detect misinterpretation of instructions or problems with the questionnaire design. Response error can also be brought about by respondents who, willingly or not, provide inaccurate responses.

Income data are especially prone to misreporting, as income is a sensitive issue and includes many items with which respondents are not always familiar. To minimize response burden and data errors, respondents are given the choice of granting Statistics Canada permission to access their tax files. The majority of respondents grant permission, allowing SLID to collect income data directly from administrative files.

Non-response errors occur in sample surveys because not all potential respondents cooperate fully. The extent of non-response varies from partial non-response to total non-response.

Total non-response occurs when the interviewer is unable to contact the respondent, no member of the household is able to provide information, or the respondent refuses to participate in the survey.

Response is calculated at the household level. A household is considered to be "respondent" if at least one of its members responds to the interview. There is the additional stipulation that the information on the household's composition cannot be missing for more than one year.

Total household non-response is handled by adjusting the basic survey weight for individuals within responding households to compensate for individuals in non-responding households.
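A minimal sketch of this kind of weight adjustment is given below. It treats all units as a single adjustment class, which is an assumption made for brevity; in practice such adjustments are usually made within groups of units that share similar characteristics. The weights are invented.

```python
def nonresponse_adjust(units):
    """Inflate respondents' weights so they also represent non-respondents.

    units: list of dicts with keys 'weight' (float) and 'responded' (bool).
    Returns adjusted weights in the same order; non-respondents get 0.0.
    """
    total_weight = sum(u["weight"] for u in units)
    respondent_weight = sum(u["weight"] for u in units if u["responded"])
    factor = total_weight / respondent_weight
    return [u["weight"] * factor if u["responded"] else 0.0 for u in units]


# Hypothetical adjustment class with one non-responding unit.
units = [
    {"weight": 100.0, "responded": True},
    {"weight": 100.0, "responded": True},
    {"weight": 200.0, "responded": True},
    {"weight": 100.0, "responded": False},
]
adjusted_weights = nonresponse_adjust(units)
```

The adjusted weights of the respondents sum to the total weight of all sampled units, so the class still represents the same number of people.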

Non-responding members (if any) within responding households will have final data that are either shown as "missing" on the final database or imputed, depending on the variable (see partial non-response section for details on imputation).

The importance of the non-response error is unknown but in general this error is significant when a group of people with particular characteristics in common refuse to cooperate and where those characteristics are important determinants of survey results. The bias introduced by non-response increases with the differences between respondent and non-respondent characteristics. Methods employed to compensate for non-response make use of information available for both respondents and non-respondents in an attempt to minimize this bias.

High response rates are essential for the data quality of any survey and thus considerable effort is invested to encourage effective participation from SLID respondents.

Cross-sectional households' response rates, given in Table B, range between 67.3% (2010) and 85.9% (1996).

Table B
Response rates in SCF (1990-1992), combined SCF-SLID (1993-1997) and SLID (1998-2010)

Year Response rate (%)
1990 79.0
1991 80.0
1992 80.7
1993 84.2
1994 82.6
1995 83.3
1996 85.9
1997 83.9
1998 82.7
1999 82.7
2000 79.2
2001 79.1
2002 79.0
2003 78.3
2004 74.7
2005 76.1
2006 74.9
2007 71.8
2008 70.6
2009 70.1
2010 67.3

Partial non-response occurs when the respondent does not understand or misinterprets a question, refuses to answer a question, or is unable to recall the requested information. Imputing missing values compensates for this partial non-response.

Income data are imputed using previous years' data updated for any changes in circumstances. In the absence of previous years' data, values are imputed using the "nearest neighbour" technique, in which a respondent with certain similar characteristics becomes the "donor" for the imputed value.
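The nearest-neighbour idea can be sketched as below. The matching variables (age and years of schooling), the squared-distance measure and all values are hypothetical; the text does not specify which characteristics or distance SLID uses, and a production method would normally scale the variables before comparing them.

```python
def nearest_neighbour_impute(recipient, donors, keys, target="income"):
    """Return the donor's value of `target` for the donor closest to `recipient`.

    Closeness is the sum of squared differences over the matching `keys`
    (an illustrative choice, not the SLID specification).
    """
    def distance(donor):
        return sum((recipient[k] - donor[k]) ** 2 for k in keys)

    closest = min(donors, key=distance)
    return closest[target]


# Hypothetical respondent with missing income and two potential donors.
recipient = {"age": 40, "years_school": 12}
donors = [
    {"age": 38, "years_school": 12, "income": 41000.0},
    {"age": 60, "years_school": 16, "income": 72000.0},
]
imputed_income = nearest_neighbour_impute(recipient, donors, ["age", "years_school"])
```

Here the first donor is much closer on both characteristics, so that donor's income is used as the imputed value.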

Amounts received through certain government programs, such as child tax benefits, the Goods and Services Tax/Harmonized Sales Tax Credit, and the Guaranteed Income Supplement, are also derived from other information.

Processing errors can occur at various stages in the survey: data capture, editing, coding, weighting or tabulation. The computer-assisted collection method used for SLID reduces the chance of introducing capture errors because checks for consistency and completeness of the data are built into the computer application. To minimize coding, weighting or tabulation errors, diagnostic tests are carried out periodically. These tests include comparisons of results with other data sources.

Sampling errors

Sampling errors occur because inferences about the entire population are based on information obtained from only a sample of the population. The results are usually different from those that would be obtained if information were collected from the whole population. Errors due to the extension of conclusions based on the sample to the entire population are known as sampling errors. The sample design, the variability of the population characteristics measured by the survey, and the sample size determine the magnitude of the sampling error. In addition, for a given sample design, different methods of estimation will result in sampling errors of different sizes.

Standard error and coefficient of variation

A common measure of sampling error is the standard error (SE). The standard error measures the degree of variation introduced in estimates by selecting one particular sample rather than another of the same size and design. The standard error may also be used to calculate confidence intervals associated with an estimate (Y). Confidence intervals are used to express the precision of the estimate. If the sampling were repeated many times, the true population value would lie within the confidence interval Y ± 2SE about 95 times out of 100, and within the narrower confidence interval Y ± SE about 68 times out of 100. Another important measure of sampling error is the coefficient of variation (CV), which is the estimated standard error expressed as a percentage of the estimate Y (i.e., 100 × SE / Y).

To illustrate the relationship between the standard error, the confidence intervals and the coefficient of variation, consider the following example. Suppose that the estimated average income from a given source is $10,000, and that its corresponding standard error is $200. The coefficient of variation is therefore equal to 2% (100 × 200 / 10,000). The 95% confidence interval estimated from this sample ranges from $9,600 to $10,400, i.e. $10,000 ± $400. Thus one can state with 95% confidence that the average income of the target population is between $9,600 and $10,400.
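The same arithmetic can be written as a short sketch. The function names are hypothetical; the formulas are those given in the text (CV = 100 × SE / Y, and Y ± 2SE for the 95% interval).

```python
def coefficient_of_variation(estimate, standard_error):
    """Standard error expressed as a percentage of the estimate."""
    return 100.0 * standard_error / estimate


def confidence_interval(estimate, standard_error, multiplier=2.0):
    """Interval estimate ± multiplier × SE (multiplier 2 for roughly 95%)."""
    half_width = multiplier * standard_error
    return (estimate - half_width, estimate + half_width)


# Figures from the example in the text.
cv = coefficient_of_variation(10000.0, 200.0)
ci_95 = confidence_interval(10000.0, 200.0)
ci_68 = confidence_interval(10000.0, 200.0, multiplier=1.0)
```

With a multiplier of 1, the same function gives the narrower interval that is expected to contain the true value about 68 times out of 100.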

The bootstrap approach is used for the calculation of the standard errors of the estimates. For more information on the bootstrap technique and examples of software that can be used to produce bootstrap variances see the document Using bootstrap weights with WesVar and SUDAAN.
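As a rough illustration of how bootstrap weights yield a standard error, the sketch below recomputes a weighted mean with each set of replicate weights and measures the spread of the replicate estimates around the full-sample estimate. The variance formula used (average squared deviation from the full-sample estimate) is an assumption; the exact formula depends on the bootstrap variant and on the software used. All values and weights are invented.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of `values` using survey `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_standard_error(values, full_weights, replicate_weights):
    """Standard error from replicate estimates around the full-sample estimate.

    replicate_weights: list of weight vectors, one per bootstrap replicate.
    """
    full_estimate = weighted_mean(values, full_weights)
    replicate_estimates = [weighted_mean(values, w) for w in replicate_weights]
    variance = sum(
        (estimate - full_estimate) ** 2 for estimate in replicate_estimates
    ) / len(replicate_estimates)
    return math.sqrt(variance)


# Hypothetical data with three bootstrap replicates (real files carry many more).
values = [10.0, 20.0, 30.0]
full_weights = [1.0, 1.0, 1.0]
replicate_weights = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
se = bootstrap_standard_error(values, full_weights, replicate_weights)
```

The same pattern applies to any estimator: replace `weighted_mean` with the statistic of interest and recompute it once per set of bootstrap weights.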

Quality indicators

Quality indicators (QIs) are based on the estimate's coefficient of variation (CV) and suppression rules. The following symbols are used:

Table C
Quality rules

QI Code Description
A Excellent (0% <= CV < 2%)
B Very good (2% <= CV < 4%)
C Good (4% <= CV < 8%)
D Acceptable (8% <= CV < 16%)
E Use with caution (CV >= 16%)
F Too unreliable to be published
. Not available for a complete reference period
.. Not available for a specific reference period
... Not applicable
p Preliminary
r Revised
x Suppressed to meet the confidentiality requirements of the Statistics Act
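The CV ranges in Table C can be expressed as a small lookup. The sketch assumes that code F is assigned when an estimate fails the suppression rules, which the table implies but does not state outright; the function and parameter names are hypothetical.

```python
def quality_indicator(cv_percent, fails_suppression=False):
    """Map a coefficient of variation (in percent) to a Table C letter code.

    Assumption: an estimate that fails the suppression rules is coded "F"
    regardless of its CV.
    """
    if fails_suppression:
        return "F"
    if cv_percent < 0:
        raise ValueError("CV cannot be negative")
    # Upper bounds are exclusive, matching the ranges in Table C.
    for upper_bound, code in ((2.0, "A"), (4.0, "B"), (8.0, "C"), (16.0, "D")):
        if cv_percent < upper_bound:
            return code
    return "E"
```

For instance, the estimate with a CV of 2% from the earlier income example would fall at the lower edge of the "Very good" range.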

Suppression rules

Suppression rules, or data reliability cutoffs, are currently established based on the sample size that underlies the estimate. In general, a sample size of 25 observations is required for the estimate to be published. Depending on the type of estimate, this rule can vary slightly. These rules help protect the confidentiality of survey respondents and ensure the reliability of estimates.

Table D
Suppression rules

Estimate: Percentage, Distribution, Proportion/Shares
  • % under the low-income cutoff (LICO)
  • Income distribution
  • Proportion of families with income=0
Suppress if: Denominator sample size* < 25, or denominator sample size* < 100 and numerator sample size < 5

Estimate: Ratios
  • Female/male earnings
Suppress if: Numerator sample size < 25, or denominator sample size < 25

Estimate: Quintiles (shares, means and upper income limits)
  • Shares of income by quintile
  • Average income by quintile
  • Upper income limits
Suppress if: Sample / 5 < 25, or upper income limit for upper income quintile or total of quintiles

Estimate: Other estimates
  • Counts
  • Means
  • Medians
  • Gini coefficients
Suppress if: Sample < 25

*The denominator sample size refers to the sample size of the total estimate from which the distribution, percentage, proportion or share is derived.
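The sample-size conditions in Table D can be sketched as simple checks. The function names are hypothetical, and the sketch covers only the numeric thresholds; the additional rule concerning the upper income limit of the top quintile or the total of quintiles is not encoded.

```python
MIN_SAMPLE = 25  # general minimum number of observations from the text


def suppress_percentage(denominator_n, numerator_n):
    """Percentages, distributions, proportions and shares."""
    return denominator_n < MIN_SAMPLE or (denominator_n < 100 and numerator_n < 5)


def suppress_ratio(numerator_n, denominator_n):
    """Ratios such as female/male earnings."""
    return numerator_n < MIN_SAMPLE or denominator_n < MIN_SAMPLE


def suppress_quintile(sample_n):
    """Quintile estimates: each fifth of the sample needs enough observations."""
    return sample_n / 5 < MIN_SAMPLE


def suppress_other(sample_n):
    """Counts, means, medians and Gini coefficients."""
    return sample_n < MIN_SAMPLE
```

For example, a percentage based on a denominator of 80 observations is suppressed when its numerator has 4 observations but not when it has 5.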
