Section 5
Processing errors


Errors can arise in all types of data handling. The main stages of data processing are response coding, data entry, editing, imputation of partial nonresponse and weighting. In the Survey of Household Spending (SHS), different procedures are applied at each stage to minimize processing errors, and the survey estimates are compared with other data sources prior to release. Errors related to the adjustments made at the weighting stage have been described in sections 2 and 3. The other types of processing errors are covered in this section.

Coding is necessary for only a few questions. It is done by the interviewer and subsequently verified by a senior interviewer. Before 2001, data entry was done with the help of an automated edit system that grouped the questionnaires into batches and selected some questionnaires from each batch to be entered a second time. Any errors found were corrected. If the number of errors in a batch exceeded a certain threshold, the entire batch was submitted for re-entry. Because a new data capture system (BLAISE) was introduced, this batch edit procedure has not been used since 2001. However, some edits were implemented in the new data capture system to ensure the consistency of the captured data. The results of a preliminary data capture study suggest that data capture error rates in the new system are similar to those in the old system.
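The pre-2001 batch verification described above can be sketched as follows. This is an illustration only: the function name, the field-level error count and the threshold value are assumptions, since the report does not give the actual sampling scheme or threshold.

```python
def verify_batch(first_pass, second_pass, threshold):
    """Compare a sample of re-keyed questionnaires against the first keying.

    first_pass:  dict of questionnaire id -> list of keyed values (whole batch)
    second_pass: dict of questionnaire id -> list of keyed values (sample only)
    threshold:   hypothetical maximum number of field mismatches tolerated
    """
    mismatches = [
        (qid, position)
        for qid, rekeyed in second_pass.items()
        for position, (a, b) in enumerate(zip(first_pass[qid], rekeyed))
        if a != b
    ]
    # Too many errors in the sample: the whole batch goes back for re-entry.
    action = "re-enter batch" if len(mismatches) > threshold else "correct errors"
    return action, mismatches


# Hypothetical batch of three questionnaires, two of which were keyed twice.
batch = {"Q1": [120, 45, 0], "Q2": [300, 10, 75], "Q3": [80, 80, 15]}
sample = {"Q2": [300, 10, 57], "Q3": [80, 80, 15]}  # third field of Q2 differs
batch_action, batch_errors = verify_batch(batch, sample, threshold=2)
```

With one mismatch and an assumed threshold of two, only the individual error would be corrected; a lower threshold would send the whole batch back for re-entry.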

An initial automated edit is carried out after each questionnaire has been verified manually by both the interviewer and the senior interviewer. This ensures that the respondent's answers follow some essential consistency rules. Unusual situations that may justify corrections are also identified. This automated edit stage takes place in Statistics Canada's regional offices so that respondents can be recontacted if supplementary information is needed to resolve inconsistencies in the answers provided. Specially trained members of the edit teams resolve any problems identified. Thereafter, other edit checks are done at head office and invalid responses are corrected.

The processing of SHS data also involves imputation for partial nonresponse. Partial nonresponse occurs when the respondent refuses to answer or does not know the answer to certain questions. The imputation approach differs depending on whether the data are categorical or continuous. Categorical data take on only specific values (as in yes/no questions or type of dwelling questions), while continuous data can take any numerical value (as for income and expenditure data).

Income and expenditure data are imputed using the nearest neighbour technique. The imputation is done on one group of variables at a time, with the groups chosen by taking the relationships among the variables into account. A group generally corresponds to a section of the questionnaire. For each group, the missing values for a recipient (a household that has missing data for at least one of these variables) are imputed from the most similar record among all donors (households that have no missing values for these variables). For each recipient, the closest donor is the one that minimizes a particular distance function. This function is based on matching variables that are chosen because they are correlated with the variables to be imputed. For example, the total income of a household is chosen as a matching variable for all sections pertaining to expenditures. After receiving the donor values, the recipient household must also satisfy certain consistency rules. In general, the imputation is done at the household level, but in some groups (e.g., income and clothing expenditures), the imputation is done at the person level since the original data are collected at that level for these variables.
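The donor selection described above can be illustrated with a minimal sketch. The report does not specify the distance function, so Euclidean distance on the matching variables is an assumption here, as are the function names and the sample records; the consistency rules applied after donation are omitted.

```python
import math


def nearest_donor(recipient, donors, matching_vars):
    """Return the donor that minimizes the distance on the matching variables.

    Euclidean distance is an assumption; the survey's actual function is not given.
    """
    def distance(donor):
        return math.sqrt(sum((recipient[v] - donor[v]) ** 2 for v in matching_vars))

    return min(donors, key=distance)


def impute_group(recipient, donors, matching_vars, group_vars):
    """Fill the recipient's missing values (None) in one group from its nearest donor."""
    donor = nearest_donor(recipient, donors, matching_vars)
    filled = dict(recipient)
    for var in group_vars:
        if filled.get(var) is None:
            filled[var] = donor[var]
    return filled


# Hypothetical households; donors have no missing values in the group.
nn_donors = [
    {"total_income": 40000, "food": 6000, "transport": 4000},
    {"total_income": 82000, "food": 9500, "transport": 7000},
]
nn_recipient = {"total_income": 78000, "food": 9000, "transport": None}
nn_result = impute_group(nn_recipient, nn_donors, ["total_income"], ["food", "transport"])
```

Only the missing value is taken from the donor; the value the recipient reported for food is left unchanged.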

Note that since 2001, the imputation of all expenditure and income data has been done using the Canadian Census Edit and Imputation System (CANCEIS) of Statistics Canada. This system is based on a methodology that differs slightly from that of the system used previously. It allows better use of categorical variables as matching fields when selecting a donor. Moreover, it lends itself to the imputation of both continuous and categorical data. The new system was tested prior to its implementation and gave results similar to those of the old system. Starting with 2003, categorical data, which are found mainly in the dwelling characteristics and facilities sections of the questionnaire, are imputed with CANCEIS. The categorical data were previously imputed with a "hot deck" imputation technique that randomly chooses a donor from a group of respondent households with similar characteristics.
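The earlier "hot deck" approach for categorical data can be sketched as follows. This is a generic illustration of random donor selection within a group of similar households; the grouping variables, the variable names and the sample data are hypothetical and are not taken from the survey.

```python
import random


def hot_deck_impute(recipient, respondents, class_vars, target, rng):
    """Pick the target value of a random donor sharing the recipient's class.

    class_vars are the (hypothetical) characteristics that define "similar"
    households. Returns None when no donor is available in the class.
    """
    pool = [
        r[target]
        for r in respondents
        if r.get(target) is not None
        and all(r[v] == recipient[v] for v in class_vars)
    ]
    return rng.choice(pool) if pool else None


hd_respondents = [
    {"province": "NS", "tenure": "owner", "has_dishwasher": "yes"},
    {"province": "NS", "tenure": "owner", "has_dishwasher": "no"},
    {"province": "NS", "tenure": "renter", "has_dishwasher": "no"},
    {"province": "ON", "tenure": "owner", "has_dishwasher": "yes"},
]
hd_recipient = {"province": "NS", "tenure": "renter", "has_dishwasher": None}

# A seeded generator keeps the illustration reproducible.
hd_value = hot_deck_impute(
    hd_recipient, hd_respondents, ["province", "tenure"], "has_dishwasher",
    random.Random(0),
)
```

Unlike the nearest neighbour technique, no distance is computed: every donor in the class is equally likely to be chosen.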

The bias caused by imputation of partial nonresponse is difficult to evaluate. It depends on the differences between respondents and nonrespondents as well as the ability of the imputation method to produce unbiased estimates. However, the imputation rates give an indication of the importance of partial nonresponse. They are presented in the following section.

5.1 Proportion of households or individuals requiring imputation, at the national, provincial and territorial levels

A first indication of the magnitude of partial nonresponse is the proportion of households requiring imputation and the number of variables imputed per household. The questionnaire can be divided into two major groups of variables: those collected at the household level and those collected at the individual level (such as income and clothing expenditure). For the latter, it is important to note that the respondent may provide only the total income or total clothing expenditure if he or she is unable to provide the breakdown by source of income or type of expenditure. The level of imputation for the components of income and clothing expenditure is then higher, but this does not affect total income, total clothing expenditure or total expenditure.

The percentage of households requiring imputation for household expenditure (excluding clothing expenditures and expenditures in the section on Personal Taxes, Security and Money Gifts) is presented in the next sub-section. The subsequent sub-section presents the percentage of persons requiring imputation for a clothing expenditure variable, the percentage of persons requiring imputation for an income variable and the percentage of persons requiring imputation for a variable in the section on Personal Taxes, Security and Money Gifts. Finally, the last sub-section presents the percentage of households requiring imputation for at least one categorical variable. After imputation by the system, some corrections may have been needed on both imputed and non-imputed variables to ensure data consistency. In practice, these changes represent only a very small percentage. The results are provided at the national, provincial and territorial levels, which gives an indication of which provinces or territories are most affected by imputation.

5.1.1 Household expenditure imputation by province or territory

Table 5.1-1 shows the percentage of usable households requiring imputation of at least one expenditure variable. Usable households are all households living in eligible dwellings, excluding those that could not be contacted, refused to participate in the survey, provided incomplete data or were out of balance (see definitions in Section 2.1). The table is broken down by the number of imputed variables (out of 242) for a household.
 
Note that regular mortgage payments and mortgage insurance premiums are included under shelter costs and thus under total expenditure. Starting with 2002, these two variables were added to the calculation of imputation rates shown in Table 5.1-1. The impact of this change is a higher overall imputation rate.

Starting in 2005, a change was made to the questionnaire regarding expenditures on communications services in the home (telephone, cell phone and Internet access), cable television services and satellite distribution services. Because of the growing use of packages (bundled services), a household may be billed for combined services and therefore be unable to report expenditures for the individual services. In such a case, the respondent household may provide only the total expenditure for these services while indicating which services are included in the package. Expenditures for individual services are then imputed in two stages. First, households for which only a few services are missing are imputed, followed by households for which only the total expenditure for the package is available. For the latter households, the imputed expenditures for the services included in the package are adjusted proportionally so that their sum equals the total expenditure on the package as provided by the respondent household. Since this change has had a major impact on the overall imputation rate for expenditures, the imputation rates in Table 5.1-1 are shown separately with and without the costs of communications services in the home, rental of cable television services and rental of satellite distribution services. Also, since this change has had an impact on the level of imputation of expenditures for these five services, Table 5.1-2 is provided, showing the imputation rate and a measure of the impact of imputation for each of these services.
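The proportional adjustment applied to households that report only a package total can be sketched as follows. The function name, service names and dollar amounts are hypothetical; the sketch only shows the rescaling step, after the individual service amounts have already been imputed from a donor.

```python
def prorate_to_package_total(imputed, package_total):
    """Rescale imputed service amounts so that they sum to the reported total.

    imputed:       dict of service name -> amount imputed from a donor
    package_total: total package expenditure reported by the household
    """
    donor_sum = sum(imputed.values())
    if donor_sum <= 0:
        raise ValueError("imputed service amounts must sum to a positive value")
    return {
        service: amount * package_total / donor_sum
        for service, amount in imputed.items()
    }


# The donor's amounts sum to 100, but the household reported 80 for its package.
bundle = prorate_to_package_total(
    {"internet": 40.0, "cable_tv": 60.0}, package_total=80.0
)
```

Each service keeps the donor's share of the package (40% and 60% here), while the sum is forced to match the amount the household actually reported. This is why imputation can strongly affect the individual components without changing the combined total.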

Table 5.1-1 Households requiring expenditure imputation by province and territory

Table 5.1-1 shows that it was necessary to impute expenditures for 48.2% of households nationally. This higher rate in 2005 (compared with editions of the survey prior to 2004) is due to the increased use of packages for communications services in the home (telephone, cell phone, Internet access), cable television services and satellite distribution services. Approximately 41% (data not shown) of usable households required imputation of at least one of these five services. In almost all of these cases, the household had reported paying for a package (bundled services) and the expenditures associated with the services included in the package were imputed. The higher imputation rates when these five variables are taken into account, as shown in the columns "2 variables imputed" and "3 or more variables imputed," are due to the fact that a package usually includes two or more services. Excluding expenditures related to communications services in the home, cable television services and satellite distribution services, the overall imputation rate is 13.4% at the national level, which is comparable to the rates obtained in previous years. For the mortgage insurance premium variable alone, imputation is required for 5.7% of usable households, or 15.7% of households that reported mortgages on dwellings that they owned and occupied (data not shown).

When expenditures related to communications services in the home (telephone, cell phone and Internet access), cable television services and satellite distribution services are excluded, nearly 76% of the usable households requiring imputation had only a single variable imputed. Also, very few households had more than one variable imputed (3.2%). The provinces with the lowest proportions of households requiring imputation of at least one expenditure variable are Prince Edward Island (9.0%) and New Brunswick (9.4%). The highest rates are in the Northwest Territories (19.8%), Nova Scotia (17.2%) and Ontario (17.0%). Ontario and British Columbia have the highest percentages of households that required imputation for more than one expenditure variable. In those two provinces, more than 30% of the households that required imputation had two or more expenditure variables imputed.
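The national figures above use two different denominators: all usable households, and only those usable households that required imputation. The short check below relates the two. It assumes that the 3.2% figure is expressed as a share of all usable households, which the text does not state explicitly; the variable names are illustrative.

```python
overall_rate = 13.4  # % of usable households with at least one imputed variable
multi_rate = 3.2     # % with more than one imputed variable (assumed denominator:
                     # all usable households)

# Share of all usable households with exactly one imputed variable (about 10.2%).
single_rate = overall_rate - multi_rate

# The same households expressed as a share of those requiring imputation
# (about 76%, consistent with the figure quoted in the text).
single_share = 100 * single_rate / overall_rate
```

Under that assumption the two statements agree: roughly three quarters of the households requiring imputation had a single variable imputed.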

If we exclude regular mortgage payments, mortgage insurance premiums, expenditures related to communications services in the home, cable television services and satellite distribution services, then the low percentage of households for which variables had to be imputed, combined with a generally low number of variables to be imputed when imputation is required, suggests that the impact of imputed values on the estimates should not be too high.

Since there is a higher level of imputation for expenditures related to communications services in the home, cable television services and satellite distribution services, it is important to measure the effect of imputation on the estimates of totals for these five variables. This measure, along with the imputation rate, can be used to see how the amount of imputation done for these variables changes over time. Owing to the growing popularity of packages (bundled services) within the population, the imputation level is expected to increase over time. To measure the impact of imputation, the weighted total of the imputed data is divided by the total estimate (sum of weighted values). This measure represents the proportion of the total value of the estimate that is obtained from imputed data.
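The impact measure defined above can be written as a short calculation. The record layout (weight, value, imputation flag) and the numbers are hypothetical; only the ratio itself comes from the text.

```python
def imputation_impact(records):
    """Proportion of a weighted total that comes from imputed values.

    records: iterable of (weight, value, was_imputed) tuples, one per household.
    """
    records = list(records)
    total_estimate = sum(weight * value for weight, value, _ in records)
    imputed_total = sum(weight * value for weight, value, imputed in records if imputed)
    return imputed_total / total_estimate


# Three hypothetical households; the second one had its value imputed.
impact = imputation_impact([
    (100, 30.0, False),
    (150, 40.0, True),
    (50, 20.0, False),
])
```

Here the imputed household contributes 6,000 of a weighted total of 10,000, so 60% of the estimate is obtained from imputed data even though only one of the three records was imputed. The impact and the imputation rate can therefore differ considerably.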

Table 5.1-2 Impact of imputation of communications services in the home, cable television services and satellite distribution services at the national level

According to Table 5.1-2, the imputation rate and the impact of imputation are greater for expenditures related to Internet access services and the rental of cable television services. This is mainly due to the fact that among households that reported paying for a package, a large proportion of packages included these two services. The high level of imputation performed on the components in Table 5.1-2 suggests that the estimates of these components might be greatly affected by imputation, while the effect on the estimate of the total of these five services combined will be negligible, since households must provide the total expenditure associated with the package. While the imputation rate and the impact are high for expenditures on Internet access services, the increase that occurred in 2005 for average Internet access expenditures was consistent with the trends observed from other independent sources of information. Internet access services accounted for 17.2% of all household expenditures on communications. Total expenditures on the five services in Table 5.1-2 combined represent only 2.6% of total household expenditure.

5.1.2 Person expenditure and income imputation by province or territory

Since some respondents provide only totals for clothing expenditure and income variables, a two-step procedure is used to impute these variables (at the individual level). Individuals who require imputation of only certain components are imputed first, followed by those for whom only totals are available but imputation on all components is required. (See reference [1] for a more detailed description of this process.)

The percentage of usable individuals (persons who are members of usable households) requiring imputation for an income variable is presented by province or territory in Table 5.2. The table shows the percentage of persons who had exactly one variable imputed, the percentage who had two or more variables (but not all) imputed and the percentage of persons for whom only total income was available (and who therefore required imputation of all income components). The total percentage of persons requiring some form of income imputation is also provided. The second-to-last column of Table 5.2 indicates the total percentage of persons requiring some form of imputation for clothing expenditure variables. The last column of Table 5.2 indicates the total percentage of persons requiring some form of imputation for the Personal Taxes, Security and Money Gifts section of the questionnaire.

Note that questions related to personal income, personal taxes, security and money gifts are asked for each household member aged 15 or over on December 31 of the reference year. Thus, since the 2003 reference year, the percentage of persons requiring some form of imputation for income variables and for the Personal Taxes, Security and Money Gifts section has been calculated using only persons aged 15 or over, rather than all persons as in previous years. This modification resulted in a slightly higher imputation rate for those variables. As in previous years, the percentage of persons requiring imputation for clothing expenditure variables is based on all persons, since those expenditure questions are asked for each household member.
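The effect of the change of denominator can be seen in a small sketch. The rate below is unweighted and the persons are hypothetical; whether the published rates are weighted is not stated in the text, so this only illustrates why restricting the denominator raises the rate.

```python
def imputation_rate(persons, eligible=lambda person: True):
    """Percentage of eligible persons with at least one imputed income variable."""
    pool = [person for person in persons if eligible(person)]
    imputed = sum(1 for person in pool if person["income_imputed"])
    return 100 * imputed / len(pool)


# A hypothetical household: two adults and two children under 15.
members = [
    {"age": 41, "income_imputed": True},
    {"age": 39, "income_imputed": False},
    {"age": 12, "income_imputed": False},
    {"age": 8, "income_imputed": False},
]

rate_all_persons = imputation_rate(members)
rate_aged_15_plus = imputation_rate(members, eligible=lambda person: person["age"] >= 15)
```

Children under 15 are not asked the income questions, so they can never require income imputation. Removing them from the denominator leaves the numerator unchanged and raises the rate (from 25% to 50% in this example).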

Table 5.2 Persons requiring income imputation, persons requiring clothing expenditure imputation and persons requiring imputation for variables in personal taxes, security and money gifts section by province and territory

These results show that 4.5% of persons from usable households had imputation performed on at least one income variable. For 80% of these persons, total income was reported but all of its components had to be imputed. For a very large proportion of the remaining persons requiring imputation, only one component of income (one variable) had to be imputed. Provincially, the percentages of persons requiring imputation on at least one income variable range from a low of 2.4% for Quebec to a high of 7.7% for Prince Edward Island.

From the second-to-last column of the table, it can be seen that 20.9% of persons required imputation for at least one of the clothing expenditure variables. The provincial rates range from 11.5% for New Brunswick to 25.8% for British Columbia. Almost all of these persons provided their total expenditure on clothing but required imputation of the components. The higher level of imputation required on clothing expenditure components suggests that the estimates for these components could be greatly affected by imputation, while the effect on the estimates for total clothing expenditure will be negligible.

From the last column of the table, 3.3% of persons had imputation performed on at least one variable in the Personal Taxes, Security and Money Gifts section. Provincially, these percentages are also low, ranging from 1.6% for Newfoundland and Labrador to 5.8% for Ontario. Only Nova Scotia and Ontario have a rate exceeding 5%. In both of those provinces, the higher imputation rate is due to the variable corresponding to personal income tax paid on 2005 income, which required proportionally more imputation than in the other provinces.

5.1.3 Imputation of categorical variables by province and territory

Table 5.3 shows the percentage of usable households requiring imputation of at least one categorical variable. The table is broken down by the number of imputed variables (out of 58) for a household. Categorical variables that required imputation can be found in the following sections of the questionnaire: Dwelling Characteristics (with the exception of the dwelling type variable); Facilities Associated with the Dwelling; Tenure (with the exception of variables related to a tenure change during the reference year); and Tobacco and Miscellaneous, for the yes/no variables pertaining to purchases through direct sales. Note that other categorical variables from the questionnaire, such as the household composition variables or questionnaire skips, are edited and validated by subject matter experts from the Income Statistics Division. Therefore, those variables are not imputed using the nearest neighbour technique.

Table 5.3 Households requiring imputation of categorical variables by province and territory

Table 5.3 indicates that at the national level, 7.4% of households required some categorical imputation for dwelling characteristics, facilities associated with the dwelling, tenure and purchases through direct sales, but approximately 78% of those households had only one variable imputed. The lowest proportions of households requiring imputation for at least one categorical variable are observed for the Atlantic Provinces.
