Methodology of the Canadian Labour Force Survey
Chapter 5 Processing and imputation

5.0 Introduction

After collection, the Labour Force Survey (LFS) data go through several steps of processing before estimates are produced. To facilitate production of estimates from a complete and consistent microdata file, editing, imputation, and weight adjustments are used to identify and compensate for invalid, inconsistent, and missing data. Data processing can be divided into the following four steps:

Receipt of data from the regional offices and Phase I editing
Phase II editing
Hot-deck imputation
Post-imputation processing

Invalid and inconsistent data are identified and replaced with valid, consistent data using edits and various imputation methods, depending on the type of nonresponse. Item nonresponse, where one or more questionnaire items are unknown, is treated by carry-forward imputation, imputation by deduction, or hot-deck imputation, depending on the response history of the respondent and what survey data were collected for the record. Person nonresponse, where it is not possible to obtain any survey information for a person, is treated by hot-deck imputation. Household nonresponse, where it is not possible to obtain survey information for the entire household, is treated by hot-deck imputation or a nonresponse weight adjustment (see Chapter 6), depending on the response history of the household.

The following sections will explain the processing steps in more detail, with much of the focus on the hot-deck imputation system.

5.1 Receipt of data and Phase I editing

During the collection period, cases with completed interviews are transmitted from the regional offices to the head office on a daily basis. Data are then processed at the head office. The LFS collects socio-demographic (e.g., age, sex, education, immigration status, and aboriginal status) and labour force (e.g., labour force status, class of worker, industry, and earnings) data. Item/block editing and consistency editing are performed in several stages. Phase I editing includes four stages: record acceptance, demographic editing, Labour Force Information (LFI) item acceptance, and industry/occupation editing.

In the record acceptance stage, each record is checked to ensure that all necessary components were completed during the interview. This involves checking that there are demographic data for each household member and labour force data for those who should have them based on their final response code, age, household membership, etc. Missing and inconsistent values of age and household membership are imputed at this stage by either carry-forward imputation or imputation by deduction.

Demographic editing involves detailed editing of the demographic information and is the final stage of editing at the household level. In this stage, all of the demographic data for all individuals in the household are edited at both the individual and within-family levels. A series of validity and consistency edits check for consistency of responses across questions for each individual and between household members. Both automated and manual corrections may be made at this stage.

The next stage of editing is the LFI item acceptance stage. In this stage, each record is run through a pre-edit process to check the validity of the labour force data received. The flow of the questions is checked to determine whether the responses for the labour force data follow a single, consistent and correct path. This process also checks the range and validity of the responses on the path.

The last stage in Phase I editing is industry and occupation coding. Records requiring coding are identified and coded using an automated system, or a manual system when the automated system cannot assign a complete code. The codes are validated and checked for consistency. Industry is coded to the North American Industry Classification System (NAICS) standard and occupation is coded to the National Occupational Classification – Statistics (NOC-S) standard. NAICS 2012 and NOC 2011 are currently used in the LFS processing system.

5.2 Phase II editing

Phase II editing includes resolution of ‘Don’t know’ and ‘Refusal’ responses and detailed consistency editing. During this phase of editing, each record is checked to determine if it contains any entries of ‘Don’t know’ or ‘Refusal’. These responses are considered item-level nonresponse. All item-level nonresponse is identified and treated with imputation by deduction or carry-forward imputation where possible. Consistency edits are applied to ensure that each record is internally consistent. If this process does not succeed then the missing or inconsistent items are flagged for hot-deck imputation.

5.3 Hot-deck imputation

In hot-deck imputation for the LFS, the missing values of a recipient are replaced by the corresponding values of a randomly selected donor within the same imputation class. Imputation classes are defined based on variables available for both recipients and potential donors. Two separate sets of imputation classes are formed: one set for item nonresponse, and another set for person and household nonresponse and for item nonresponse that cannot be resolved through item imputation.

In January 2005, a longitudinal hot-deck imputation strategy was implemented based on research by Bocci and Beaumont (2004). This strategy is primarily used to treat person and household nonresponse. The strategy uses the previous month’s values (possibly imputed) of some variables together with some socio-demographic variables from the current month as matching variables to form the imputation classes for both donors and recipients. Recently, the effectiveness of these matching variables for the treatment of person and household nonresponse was reviewed by White and Benhin (2013). The study resulted in an improved set of matching variables that was implemented in January 2015.

The following subsections describe the steps of LFS hot-deck imputation in more detail. Figure 5.1, at the end of this chapter, shows a general picture of the hot-deck imputation system (HDIS) process. More detailed specifications of the HDIS strategy can be found in Lorenz (1996).

5.3.1 Imputation pre-processing

Before the actual hot-deck imputation of missing values can be performed, some pre-processing steps must be completed.

First, responses are identified and each response record is assigned a preliminary imputation type. The data extracted from the head office processing system (HOPS) are divided into response and nonresponse files. The nonresponse records are accounted for by a nonresponse weighting adjustment, which is discussed in Chapter 6, and the response records undergo hot-deck imputation. Records in the ‘response’ file at this stage did not necessarily respond in the current month: a person who responded in the previous month is also treated as a response in this step. All response records are initially divided into three groups: A, B and C. Group A contains potential donors. These are all persons whose reported data contain no missing values and are internally consistent. Group B consists of all persons whose data have no missing values and are internally consistent after the first phase of editing, but who do not belong in Group A because one or more of their items were imputed during editing. The remaining persons form Group C and require imputation.

The second pre-processing step derives imputation matching variables. Some variables are not initially in a form which can be used directly in imputation. For example, the two occupation group variables OCC4 and OCC10 are derived from the NOC variable. Also in this step, earnings data are converted to an hourly basis by dividing total earnings for the time period by hours worked. This ensures that there is a uniform measure of earnings and that the value imputed for earnings is consistent with the value for the number of hours reported by a recipient.
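The hourly conversion can be illustrated with a minimal sketch. The function name and the handling of missing or zero hours are assumptions made for illustration only; they are not taken from the LFS processing system.

```python
def hourly_earnings(total_earnings, hours_worked):
    """Convert earnings reported for a pay period to an hourly rate.

    Assumption: when earnings or hours are missing, or hours are not
    positive, the rate is left as None (missing) rather than computed.
    """
    if total_earnings is None or hours_worked is None or hours_worked <= 0:
        return None
    return total_earnings / hours_worked


# A donor reports 800.00 for a 40-hour week.
donor_rate = hourly_earnings(800.0, 40.0)

# A recipient who reported 30 hours but no earnings receives the donor's
# hourly rate, so the imputed earnings stay consistent with the
# recipient's own hours.
recipient_weekly = donor_rate * 30.0
```

Imputing the hourly rate instead of the donor's total earnings avoids giving a part-time recipient the weekly earnings of a full-time donor.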

The third pre-processing step is the identification of outlier earnings and the finalization of Groups A, B, and C. Earnings values that are extremely high or extremely low are deemed suspicious, so they are set to missing and imputed. Individuals who reported earnings that are very high or very low without being extreme keep those earnings values, but are excluded from the pool of potential donors by being assigned to Group B. Outlier detection classes are formed by crossing province, sex, age group, and occupation group. Different threshold values based on the quartile method are set in each class. The quartile method for outlier detection is described in Survey Methods and Practice, published by Statistics Canada (2003).
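The two-tier treatment of outliers can be sketched as follows. The sketch assumes one common form of the quartile method, in which the acceptance interval is the median minus or plus a multiple of the distance to the first or third quartile. The multipliers `c_mild` and `c_extreme`, the function names, and the labels returned are hypothetical; the actual LFS thresholds differ by class and are not given in this chapter.

```python
import statistics


def quartile_bounds(values, c_lower, c_upper):
    """Acceptance interval (q2 - c_lower*(q2 - q1), q2 + c_upper*(q3 - q2))."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2 - c_lower * (q2 - q1), q2 + c_upper * (q3 - q2)


def classify_earnings(value, class_values, c_mild=3.0, c_extreme=6.0):
    """Classify one hourly earnings value within its outlier detection class.

    "extreme": set to missing and imputed (record becomes a recipient)
    "suspect": value kept, but record moved to Group B (not a donor)
    "ok":      no action
    The multipliers are illustrative assumptions, not LFS parameters.
    """
    mild_lo, mild_hi = quartile_bounds(class_values, c_mild, c_mild)
    ext_lo, ext_hi = quartile_bounds(class_values, c_extreme, c_extreme)
    if value < ext_lo or value > ext_hi:
        return "extreme"
    if value < mild_lo or value > mild_hi:
        return "suspect"
    return "ok"


# Hourly earnings reported in one detection class (toy data).
reference = [18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0]
labels = [classify_earnings(v, reference) for v in (22.0, 30.0, 50.0)]
```

Using quartiles instead of the mean and standard deviation keeps the thresholds themselves from being distorted by the outliers they are meant to detect.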

After outlier detection, records in Group A form the potential donor pool. Records in Group C are the recipients and will be imputed by hot-deck imputation. The records in Group B do not need to be imputed and are also not eligible as donors.

The last step of pre-processing is to assign a temporary path (TPATH) to each record, where possible. This variable TPATH will be used as an important matching variable in the imputation for item nonresponse. The use of TPATH will be explained in detail in the next section.

5.3.2 Imputation for item nonresponse

Once all of the pre-processing steps have been completed, missing values can be imputed. Random hot-deck imputation within classes is used to fill in missing values. This procedure is applied in such a way that the recipient data satisfy consistency edit rules and validity edit rules after imputation. For example, variables requiring non-blank values for a given recipient must be imputed using non-blank values. In a given imputation class, each recipient is imputed by selecting a series of donors using simple random sampling without replacement until a donor that satisfies all the edit rules is found. Once a suitable donor has been found, all of the recipient’s missing items are imputed with data from that donor.

The initial imputation classes are formed by crossing the following eighteen categorical variables:

TPATH (12 categories)
LMLFS3 (3 categories)
COW (3 categories)
OCC4 (4 categories)
PROV (10 categories)
AGEGP3 (3 categories)
ABQ1 (2 categories)
IMM (3 categories)
LMLFS7 (7 categories)
LMINDG (20 categories)
MULTJOB (2 categories)
AGEGP1 (5 categories)
SEX (2 categories)
OCC10 (10 categories)
AGEGP2 (8 categories)
STUD (2 categories)
EDUC (2 categories)
DWELRENT (2 categories).

The order of these variables reflects their importance in explaining the labour force variables as determined by the empirical studies in White and Benhin (2013). A detailed description of the values of the categories for each variable is given in Appendix F.

Note that the variables LMLFS3, LMLFS7 and LMINDG refer to values from the previous month.

The variable TPATH has an important role in the imputation system. The first seven possible values of TPATH correspond to the seven possible values of the labour force status variable, LFSSTAT. Each donor is assigned a value for LFSSTAT based on reported data. For the recipients, the value of LFSSTAT may not be known; however, there may be enough information to exclude one or more of the seven possible values. The variable TPATH is used to ensure that only valid values for LFSSTAT are imputed to recipients by replicating each donor by its number of valid TPATH values and assigning only one value of TPATH to each recipient. At the end of the imputation step, the replicated donors are removed. For example, assume that a donor has LFSSTAT = 2. This donor then has three valid TPATH values: 2, 8, and 10 (see Appendix F for the definition of all possible TPATH values). The donor is therefore replicated three times with each replicate given one of the three valid TPATH values. When imputation classes are formed, each of the donor replicates will be in a separate imputation class.
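The donor replication can be sketched as follows. Only the mapping for LFSSTAT = 2 (TPATH values 2, 8 and 10) comes from the example above; the full mapping is defined in Appendix F and is not reproduced here. The record layout and the function name are assumptions made for illustration.

```python
def replicate_donors(donors, valid_tpaths):
    """Create one copy of each donor per valid TPATH value.

    `valid_tpaths` maps an LFSSTAT value to its list of valid TPATH values.
    Each replica keeps the donor's data and differs only in TPATH, so the
    replicas fall into separate imputation classes once TPATH is crossed
    with the other matching variables.
    """
    replicas = []
    for donor in donors:
        for tpath in valid_tpaths[donor["LFSSTAT"]]:
            replica = dict(donor)
            replica["TPATH"] = tpath
            replicas.append(replica)
    return replicas


# Only this entry is taken from the text; other LFSSTAT values would need
# the definitions in Appendix F.
valid_tpaths = {2: [2, 8, 10]}

donors = [{"id": "d1", "LFSSTAT": 2}]
replicas = replicate_donors(donors, valid_tpaths)

# After imputation the replicas are dropped, leaving one record per donor.
original_ids = sorted({r["id"] for r in replicas})
```

A recipient whose known information narrows LFSSTAT to a subset of values carries a single TPATH value, so it can only be matched to replicas of donors whose LFSSTAT is valid for that subset.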

Imputation is performed in each class that contains enough donors to pass the following two constraints:

The number of donors must be larger than the number of recipients in that class;
Each class must contain at least three donors.

If either of these constraints is not satisfied, then the least important variable (DWELRENT) is removed from the list of imputation class variables and the imputation process is attempted again for the remaining recipients. If after this second pass of imputation there are still some recipients that have not been imputed due to classes that do not satisfy the above two constraints, then a third pass of imputation is performed by removing the second-least important variable (EDUC). This process of removing one variable followed by imputation continues until all recipients have been imputed or until only the first five variables – TPATH, LMLFS3, COW, OCC4 and PROV – remain. Any recipients not yet imputed at that point are sent for whole record imputation, in which all labour force variables of the recipient, including those that were reported, are replaced by those of a randomly selected donor using a different set of matching variables – see Section 5.3.3.
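The successive passes can be sketched as follows. This is a simplified illustration, not the HDIS implementation: edit checks are omitted, a donor is drawn with a plain random choice, and the record layout, function name and toy variable list are assumptions. Variables are listed from most to least important, and `min_variables` plays the role of the five variables that are never removed.

```python
import random
from collections import defaultdict


def impute_by_passes(donors, recipients, variables, min_variables=5, rng=None):
    """Return (assignments, leftover) after successive collapsing passes.

    assignments maps recipient id -> donor id; leftover lists recipients
    that would be sent to whole record imputation.
    """
    rng = rng or random.Random(0)
    assignments = {}
    remaining = list(recipients)
    keys = list(variables)
    while remaining and len(keys) >= min_variables:
        # Form classes by crossing the variables still in use.
        donor_classes = defaultdict(list)
        for d in donors:
            donor_classes[tuple(d[k] for k in keys)].append(d)
        recipient_classes = defaultdict(list)
        for r in remaining:
            recipient_classes[tuple(r[k] for k in keys)].append(r)
        remaining = []
        for cls, members in recipient_classes.items():
            pool = donor_classes.get(cls, [])
            # Constraints: more donors than recipients, and at least three donors.
            if len(pool) > len(members) and len(pool) >= 3:
                for r in members:
                    assignments[r["id"]] = rng.choice(pool)["id"]
            else:
                remaining.extend(members)
        keys.pop()  # remove the least important remaining variable
    return assignments, remaining


# Toy example with three variables, of which the first two are never removed.
donors = [
    {"id": "d1", "PROV": 35, "SEX": 1, "EDUC": 1},
    {"id": "d2", "PROV": 35, "SEX": 1, "EDUC": 1},
    {"id": "d3", "PROV": 35, "SEX": 1, "EDUC": 2},
]
recipients = [
    {"id": "r1", "PROV": 35, "SEX": 1, "EDUC": 1},
    {"id": "r2", "PROV": 24, "SEX": 2, "EDUC": 1},
]
assignments, leftover = impute_by_passes(
    donors, recipients, ["PROV", "SEX", "EDUC"], min_variables=2
)
```

In this toy run, recipient r1 fails the first pass because its class holds only two donors, and is imputed in the second pass once EDUC is removed. Recipient r2 never finds a valid class and is left for whole record imputation.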

In a given imputation class satisfying the above two constraints, each recipient is imputed by first selecting a donor such that the validity edit rules are satisfied. If no such donor can be found, the record is sent for whole record imputation as described in Section 5.3.3. If a suitable donor can be found (i.e., one that satisfies the validity edit rules after imputing the missing values of the recipient), the missing values of the recipient are replaced by the corresponding values from the donor and the consistency edit rules are checked. If all edit rules are satisfied, the imputation process for this recipient is complete; otherwise, a second suitable donor (i.e., one satisfying the validity edits) is tried and the consistency edit rules are checked again. If all edit rules are satisfied after this second attempt, the imputation process for this recipient is complete; otherwise, the entire record is imputed using the values of the last attempted donor. This imputation of the entire record differs slightly from the whole record imputation described in Section 5.3.3 in that it uses a different and more precise set of matching variables.
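The donor selection logic for a single recipient can be sketched as follows. The edit rules are passed in as functions because the actual LFS validity and consistency edits are not listed in this chapter; the toy rules, item names and return labels are hypothetical. The sketch also assumes that, when only one suitable donor exists and it fails the consistency edits, that donor is the last attempted donor.

```python
import random


def impute_recipient(recipient, donors, missing_items, labour_force_items,
                     is_valid, is_consistent, rng=None):
    """Return (record, outcome) for one recipient in a valid imputation class."""
    rng = rng or random.Random(0)
    # A random ordering of the donors is equivalent to drawing them by
    # simple random sampling without replacement.
    candidates = rng.sample(donors, len(donors))
    last_donor, attempts = None, 0
    for donor in candidates:
        trial = dict(recipient)
        trial.update({item: donor[item] for item in missing_items})
        if not is_valid(trial):
            continue  # not a suitable donor; draw another
        last_donor, attempts = donor, attempts + 1
        if is_consistent(trial):
            return trial, "item_imputation"
        if attempts == 2:
            break  # two suitable donors tried without success
    if last_donor is None:
        return None, "send_to_whole_record_imputation"
    # Replace all labour force items, including reported ones.
    full = dict(recipient)
    full.update({item: last_donor[item] for item in labour_force_items})
    return full, "whole_record_from_last_donor"


# Hypothetical edit rules, for illustration only.
def is_valid(rec):
    return rec["LFSSTAT"] is not None


def is_consistent(rec):
    return not (rec["LFSSTAT"] == "unemployed" and rec["HOURS"] > 0)


recipient = {"id": "r1", "LFSSTAT": None, "HOURS": 35}
employed = [{"id": "d1", "LFSSTAT": "employed", "HOURS": 40}]
unemployed = [{"id": "d2", "LFSSTAT": "unemployed", "HOURS": 0},
              {"id": "d3", "LFSSTAT": "unemployed", "HOURS": 0}]

items = ["LFSSTAT", "HOURS"]
ok = impute_recipient(recipient, employed, ["LFSSTAT"], items,
                      is_valid, is_consistent)
fallback = impute_recipient(recipient, unemployed, ["LFSSTAT"], items,
                            is_valid, is_consistent)
```

In the first call only the missing item is filled and the reported hours are kept. In the second call both donors produce an inconsistent record, so all labour force items, including the reported hours, are taken from the last attempted donor.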

5.3.3 Imputation for person nonresponse

Person or household nonresponse where previous month data are available, and item nonresponse that could not be treated with item nonresponse imputation, are treated by whole record (longitudinal) imputation. In this strategy, data from the previous month (possibly imputed) for some variables and data from the current month for other variables are used to form the imputation classes. This strategy is also used for person and household nonresponse where there is no response in the previous month but there was a response in the past. Donors and recipients for the whole record imputation are both person-records, even when dealing with entire household nonresponse. For households where this imputation is not possible, a nonresponse weight adjustment is performed instead.

The variables currently crossed to form initial imputation classes for whole record imputation are given below in order of importance:

PROV (10 categories)
LMLFS3 (3 categories)
AGEGP1 (5 categories)
SEX (2 categories)
LMINDG (20 categories)
LMLFS7 (7 categories)
EIER (56 categories)
EDUC (2 categories)
ABQ1 (2 categories)

The categories for these variables are detailed in Appendix F.

As with item nonresponse imputation, the same two imputation constraints apply to the imputation classes:

The number of donors must be larger than the number of recipients in that class;
Each class must contain at least three donors.

If one of these constraints is not satisfied, then classes are collapsed by removing the least important variable. The process of removing one variable and reforming the imputation classes continues until all recipients are imputed.

Recipient records are imputed using donor values from the current month, even though imputation classes are based on values from both the current month and the previous month. Also, validity and consistency edit checks are not needed when whole record imputation is performed because the donors must already satisfy all validity and consistency rules.

5.4 Post-imputation process

The post-imputation process includes eliminating the replicates of donors that were created during the derivation of TPATH, setting all of the group level output flags to indicate that imputation has taken place, and post-processing the earnings data by calculating both hourly and weekly earnings for all employees based on either reported or imputed hours and wages. The labour force status and some other variables are also derived.

Figure 5.1 Simplified Flowchart of the LFS Hot-Deck Imputation System

Description for Figure 5.1

This diagram shows the general structure of the flow of records through the LFS hot-deck imputation system. The flowchart starts with the records being returned from the regional offices to the head office. The records flow to a decision where the initial response status is determined. Records that are nonresponse with no previous response flow to the nonresponse file. Records that have a complete or partial current response, or that have a previous month response, flow to the response file. Records on the response file then go to another decision called consistency editing. Records that are complete and consistent with no edits required and no outliers flow to Group A, potential donors. Records that are complete and consistent after edits, or that have outliers, flow to Group B, ineligible as donors. The remaining incomplete records flow to Group C, recipients. The recipient records flow to a decision where it is determined whether the variable TPATH can be assigned. Records for which TPATH cannot be assigned go to the process "whole record imputation using 9 variables". Records for which TPATH can be assigned are treated as item nonresponse and flow to another decision where there is an attempt to find a valid class and donor using the 18 variables used to form classes for item imputation. Records for which either a valid class or a valid donor cannot be found are sent to the process "whole record imputation using 9 variables". Records for which both a valid class and a valid donor are found are sent to a consistency check decision. Records that pass the consistency check are sent to the process "item imputation using 18 variables". Records that fail the consistency check are sent to the process "whole record imputation using the last attempted donor and 18 variables".

Date modified: 2017-12-21
