Canadian Survey on Disability, 2017: Concepts and Methods Guide
7. Data quality
7.1 Overview of data quality evaluation
The objective of the Canadian Survey on Disability is to produce quality estimates on the type and severity of disabilities of Canadians aged 15 years and over (as of May 10, 2016) as well as on a variety of other important indicators of the experiences and challenges of persons with disabilities. This chapter reviews the quality of the data for this survey.
Sections 7.2 and 7.3 below explain the two types of errors that occur in surveys—sampling and non-sampling errors. Each type of error is evaluated in the context of the CSD. Sampling error is the difference between the data obtained from the survey sample and the data that would have resulted from a complete census of the entire population taken under similar conditions. Thus, sampling error can be described as differences arising from sample-to-sample variability. Non-sampling errors refer to all other errors that are unrelated to sampling. Non-sampling errors can occur at any stage of the survey process, and include non-response for the survey as well as errors introduced before or during data collection or during data processing.
This chapter describes the various measures adopted to prevent errors from occurring wherever possible and to adjust for any errors found throughout the different stages of the CSD. Areas of caution for interpreting CSD data are noted. Readers may also refer to the Guide to the Census of Population, 2016 for related information on data quality.
7.2 Sampling errors and bootstrap method
The estimates that can be produced with this survey are based on a sample of individuals. Somewhat different estimates might have been obtained if we had conducted a complete census with the same questionnaires, interviewers, supervisors, processing methods and so on, as those actually used. The difference between an estimate derived from the sample and an estimate based on a comprehensive enumeration under similar conditions is known as the estimate’s “sampling error”.
To produce estimates of the sampling error for statistics produced from the CSD, we used a particular type of bootstrap method. Several bootstrap methods exist in the literature, but none was appropriate for the CSD’s complex sample design. The following characteristics of the sample design make it difficult to estimate the sampling errors:
- A two-phase design in which households (or dwellings) are selected in the first phase and individuals in the second phase. In the first phase, a random sample of approximately one in four households, stratified by collection unit (CU), was selected to respond to the long-form census. In the second phase, a sample of some 50,000 individuals who reported a difficulty in activities of daily living on the census was selected for the CSD.
- The sampling fraction of the first-phase sample (census long-form) is non-negligible (about 1/4 in the non-remote regions), and the sampling fraction of the CSD is rather high in some strata.
- The CSD strata (combinations of province/territory, age group, remote or non-remote region, and mild, moderate or high severity level) are not nested within the census strata (CUs or groups of CUs).
- The method used has to be flexible enough to produce standard statistics such as proportions, totals, averages and ratios, as well as more sophisticated statistics, including percentiles and logistic regression coefficients.
In 2006, a general bootstrap method for two-phase sampling was developed and applied to the Aboriginal Peoples Survey (APS) (Langlet, Beaumont and Lavallée, 2008). The underlying idea of the general bootstrap method is that the initial bootstrap weights can be seen as the product of the initial sampling weights and a random adjustment factor. In the case of a two-phase sample, the variance can be split into two components, each associated with one sampling phase. The two-phase general bootstrap method generates a random adjustment factor for each phase of sampling, so the initial bootstrap weight of a given unit is the product of its initial sampling weight and the two random adjustment factors. Once the initial bootstrap weights had been calculated, all weight adjustments applied to the initial sampling weights were also applied to the initial bootstrap weights to obtain the final bootstrap weights. The final bootstrap weights therefore capture the variance associated not only with the particular sample design but also with all weight adjustments applied to the full sample to derive the final weights.
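As a rough sketch of this idea, the lines below build a few bootstrap replicates in Python from made-up weights. The simple random perturbations around 1 stand in for the adjustment factors actually generated by the general bootstrap method; they, and the variable names, are illustrative only.

    import numpy as np

    rng = np.random.default_rng(2017)

    n = 8                                  # respondents in this toy example
    w_phase1 = rng.uniform(3.0, 5.0, n)    # census long-form design weights (toy values)
    w_phase2 = rng.uniform(5.0, 12.0, n)   # CSD subsampling weights (toy values)
    w_initial = w_phase1 * w_phase2        # initial CSD sampling weights

    B = 4                                  # bootstrap replicates (1,000 were used for the CSD)
    boot_weights = np.empty((B, n))
    for b in range(B):
        # One random adjustment factor per sampling phase. Simple perturbations
        # around 1 stand in for the factors generated by the general bootstrap method.
        a_phase1 = 1.0 + rng.normal(0.0, 0.1, n)
        a_phase2 = 1.0 + rng.normal(0.0, 0.1, n)
        boot_weights[b] = w_initial * a_phase1 * a_phase2

    # Every adjustment applied to the initial sampling weights (non-response,
    # post-stratification, etc.) is also applied to each replicate, so the final
    # bootstrap weights reflect the variance added by those adjustments.
    full_sample_adjustment = rng.uniform(0.9, 1.2, n)   # placeholder adjustment factors
    final_weights = w_initial * full_sample_adjustment
    final_boot_weights = boot_weights * full_sample_adjustment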
For the 2012 CSD, we were able to adapt the method developed for the 2006 APS to reflect the complexities of the design of the National Household Survey (NHS), which replaced the long-form census questionnaire. For variance calculation purposes, the 2011 NHS sample design was treated as a three-phase design: the first phase involved the initial selection of approximately one in three households; the second involved the selection of a household sample among all non-respondent households for non-response follow-up; and the third involved a sample of respondents following non-response follow-up. In order to use the generalized two-phase method, the three NHS phases were combined into a single phase, while the CSD sample made up the second phase.
Given the return of the long-form census questionnaire in 2016, a slightly modified version of the 2012 method was used for the 2017 CSD. For variance calculation purposes, the 2016 Census sample design is treated as a two-phase design: the first phase is the initial selection of approximately one in four households, while the second is the census respondent sample. Although the 2016 Census had a very high response rate (97.8% for the long form), the second phase of the census accounts for non-response in the variance calculation. Therefore, in order to use the generalized two-phase method, the two census phases were combined into a single phase, while the 2017 CSD sample made up the second phase.
There is a major advantage in having two sets of random adjustment factors. The first set of adjustment factors can be used for estimates based on the first phase only, i.e., estimates based on the census long-form sample. These estimates are used when the weights are adjusted to the census totals during post-stratification (Section 6.1). This produces variable census totals for each bootstrap sample, which reflects the fact that the census totals used are based on a sample and not on known fixed totals.
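The sketch below illustrates that point with hypothetical values for a single post-stratum: the census control total is recomputed from the first-phase bootstrap weights of each replicate, and each CSD replicate is post-stratified to its own total rather than to a fixed known total.

    import numpy as np

    rng = np.random.default_rng(7)

    # First-phase (census long-form) bootstrap weights for one post-stratum:
    # 4 replicates of 200 census records (toy values).
    census_boot = rng.uniform(3.5, 4.5, size=(4, 200))
    census_totals = census_boot.sum(axis=1)      # control total varies by replicate

    # CSD bootstrap weights for the respondents in the same post-stratum.
    csd_boot = rng.uniform(20.0, 60.0, size=(4, 30))

    # Post-stratify each replicate to its own census total, so the sampling
    # variability of the census totals flows into the CSD variance estimates.
    factors = census_totals / csd_boot.sum(axis=1)
    csd_boot_poststrat = csd_boot * factors[:, None]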
For the CSD, 1,000 sets of bootstrap weights were generated using the general bootstrap method. The method is slightly biased in that it somewhat overestimates the variance, but the extent of the overestimation is considered negligible for the CSD. The method can also produce negative bootstrap weights. To overcome this problem, the bootstrap weights were transformed to reduce their variability. Consequently, the variance calculated with these transformed bootstrap weights has to be multiplied by a factor that is a function of a parameter known as phi, whose value is chosen as the smallest integer that makes all bootstrap weights positive. For the CSD, phi is 4. The variances calculated from the transformed bootstrap weights must therefore be multiplied by 4² = 16. Similarly, the coefficients of variation (square root of the variance divided by the estimate itself) must be multiplied by 4. However, most software applications that produce sampling error estimates from bootstrap weights have an option to specify this adjustment factor, so that the correct variance estimate is obtained without the extra step of multiplying by the constant.
It is extremely important to use the appropriate multiplicative factor for any estimate of sampling error such as a variance, standard error or CV. Omission of this multiplicative factor will lead to erroneous results and conclusions. This factor is often specified as the “Fay adjustment factor” in software applications that produce sampling error estimates from bootstrap weights.
For examples of procedures using the Fay adjustment factor, see the 2017 Canadian Survey on Disability User Guide, Guide to the Analytical Data Files.
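To make the adjustment concrete, here is a minimal Python sketch (hypothetical variable names) of a bootstrap variance and CV calculation for a weighted total, with the factors of 16 and 4 applied explicitly. In practice, the same correction is usually obtained by supplying the factor through the software's Fay adjustment option.

    import numpy as np

    def total_with_cv(y, final_weights, boot_weights, phi=4.0):
        """Weighted total with bootstrap variance and CV.

        boot_weights has one row per bootstrap replicate. Because the bootstrap
        weights were transformed to avoid negative values, the naive bootstrap
        variance is rescaled by phi**2 (16) and the CV, equivalently, by phi (4).
        """
        estimate = np.sum(final_weights * y)
        replicate_estimates = boot_weights @ y           # one estimate per replicate
        naive_variance = np.mean((replicate_estimates - estimate) ** 2)
        variance = (phi ** 2) * naive_variance           # multiply the variance by 16
        cv = np.sqrt(variance) / estimate                # same as 4 times the naive CV
        return estimate, variance, cv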
The measure of sampling error used for the CSD is the coefficient of variation (CV) of the estimate, which is the standard error of the estimate divided by the estimate itself. For this survey, when the CV of an estimate is greater than 16.5% but less than or equal to 33.3%, the estimate is accompanied by the letter “E” to indicate that the data should be used with caution. When the CV of an estimate is greater than 33.3%, or when an estimate is based on 10 units or fewer, the cell estimate is replaced by the letter “F” to indicate that the entry is suppressed for reliability reasons.
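Written as a simple rule, the thresholds above amount to the following (a sketch; the count of contributing units is represented here by a single illustrative argument):

    def release_indicator(cv, n_units):
        """Return the CSD quality indicator for an estimate.

        cv is the coefficient of variation as a proportion (0.20 means 20%);
        n_units is the number of sampled units contributing to the estimate.
        """
        if cv > 0.333 or n_units <= 10:
            return "F"   # suppressed for reliability reasons
        if cv > 0.165:
            return "E"   # use with caution
        return ""        # released without a quality flag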
7.3 Non-sampling errors
Besides sampling errors, non-sampling errors can occur at almost every step of a survey. Respondents may misunderstand the questions and answer them inaccurately, responses may be inadvertently entered incorrectly during data capture and errors may be introduced in the processing of data. These are all examples of non-sampling errors.
Over a large number of observations, randomly occurring errors will have little effect on estimates drawn from the survey. However, errors occurring systematically may contribute to biases in the survey estimates. Thus, much time and effort was devoted to reducing non-sampling errors in the survey. At the content development stage, extensive activities were undertaken to develop questions and response categories that would be well understood by respondents. The new questionnaire was tested thoroughly during several rounds of qualitative testing. In addition, many initiatives were taken in the field to encourage participation and reduce the number of non-response cases. Also important were the numerous quality assurance measures applied at the data collection, coding and processing stages to verify and correct errors in the data. Weighting adjustments were made by taking into account the different characteristics of non-respondents compared to respondents and thus minimizing any potential bias that may have been introduced.
The following paragraphs discuss the different types of non-sampling errors and the various measures used to minimize and correct these errors in the CSD.
Coverage errors
Coverage errors occur when the sampled population excludes people intended to be in the target population. Because the CSD is an extension of the 2016 Census long form, it inherits the coverage problems of that sample, which in turn inherits the coverage problems of the 2016 Census as a whole. For more information about coverage errors in the census, please see the 2016 Census Coverage Technical Report, to be released on the Statistics Canada website in 2019. For more information about the quality of census data, please consult Chapter 10 of the Guide to the Census of Population, 2016.
Non-response errors
Non-response errors result from not being able to collect complete information on all units in the selected sample. Non-response produces errors in the survey estimates in two ways. First, non-respondents often have different characteristics from respondents, which can result in biased survey estimates if non-response is not corrected properly. In this case, the larger the non-response rate, the larger the bias may be. Secondly, if non-response is higher than expected, it reduces the effective size of the sample. As a result, the precision of the estimates decreases (the sampling error on the estimates will increase). This second aspect can be overcome by selecting a larger sample size initially. However, this will not reduce the potential bias in the estimates.
The scope of non-response varies. One level of non-response is item non-response, where the respondent does not answer one or more questions but has completed a significant pre-defined portion of the overall questionnaire. Generally, the extent of partial non-response was small in the CSD as a result of extensive qualitative review and testing of the questionnaire items. Total non-response occurs when the person selected to participate in the survey could not be contacted or did not participate once contacted. The weights of respondents were increased to compensate for those who did not respond, as described in Section 6.1.
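A common form of such an adjustment (shown here only as a generic sketch, not necessarily the exact CSD procedure, which is described in Section 6.1) inflates the weights of respondents within adjustment classes so that they also account for the non-respondents in their class:

    import pandas as pd

    def nonresponse_adjustment(sample: pd.DataFrame) -> pd.DataFrame:
        """Inflate respondent weights within adjustment classes.

        The frame needs a design weight column 'weight', a 0/1 'respondent'
        flag and an adjustment class column 'adj_class' (for example, a
        response homogeneity group). Column names are illustrative.
        """
        out = sample.copy()
        class_totals = out.groupby("adj_class")["weight"].transform("sum")
        resp_totals = (
            (out["weight"] * out["respondent"]).groupby(out["adj_class"]).transform("sum")
        )
        out["adj_weight"] = out["weight"] * class_totals / resp_totals
        out.loc[out["respondent"] == 0, "adj_weight"] = 0.0   # non-respondents drop out
        return out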
To reduce the number of non-response cases, many initiatives were also undertaken prior to and during data collection (as mentioned in Chapter 4). The Statistics Canada website included a CSD web page which provided a series of questions and answers for respondents, as well as general information about the survey. At the outset of collection, each selected respondent received an introductory letter providing an overview of the survey and explaining the importance of participating. This was accompanied by a coloured infographic showing results of the last disability survey as well as a small leaflet in Braille. During data collection, tweets and messages containing graphics and information were regularly posted on Statistics Canada’s Twitter account and Facebook page to promote the CSD.
In addition, in-depth interviewer training was conducted by experienced Statistics Canada staff. In conjunction with the training, detailed interviewer manuals were provided as a reference. All interviewers were under the direction of senior interviewers, who oversaw activities in the regional offices. Rigorous efforts to reach non-respondents through call-backs and follow-ups were also made by interviewers. Whenever possible, more than one phone number was provided for each selected respondent to maximize the chance of reaching the person during the collection period. These phone numbers were obtained through a record linkage with Statistics Canada’s most recent residential telephone file.
During the collection period, several reminder letters were sent to respondents assigned to the online collection mode encouraging them to respond. Additionally, emails containing a link to the questionnaire and a personal secure access code were sent to respondents who preferred to respond online rather than on the phone and who gave the interviewer an email address when contacted. A table of final response rates obtained for the 2017 CSD is provided in Section 4.8 of this guide. The overall response rate for the survey was 69.5%. Response rates were highest in the older age groups, who were easier to reach by telephone. Approximately 40% of responses were obtained through self-reporting, compared with 60% through telephone interview.
Measurement errors
Measurement errors occur when the response provided differs from the real value. Such errors may be attributable to the respondent, the interviewer, the questionnaire, the collection method or the data processing system. Extensive efforts were made for the 2017 CSD to develop questions which would be understood, relevant and sensitive to respondents’ needs.
Several rounds of qualitative testing were done for the CSD, in particular to test the new electronic questionnaire format and certain questions that were modified from 2012. Qualitative testing was carried out by Statistics Canada's Questionnaire Design Resource Centre (QDRC). To minimize measurement error, adjustments were made to question wording, categories of response, help text and question flows.
Many other measures were also taken to specifically reduce measurement error, including the use of skilled interviewers, extensive training of interviewers with respect to the survey procedures and content, and observation and monitoring of interviewers to detect problems due to questionnaire design or misunderstanding of instructions.
Processing errors
Processing errors may occur at various stages, including programming of the electronic questionnaire, data capture by the interviewer or the respondent, coding and data editing. Quality control procedures were applied to every stage of data processing to minimize this type of error. The CSD was conducted through an electronic questionnaire, either interviewer-led or via online self-reporting. A number of edits were built into the system to warn the respondent or the interviewer in the event of inconsistencies or unusual values, making it possible to correct them immediately (see Section 5.7).
At the data processing stage, a detailed set of procedures and edit rules was used to identify and correct any inconsistencies between the responses provided. For every step of data cleaning, a set of thorough, systematized procedures was developed to assess the quality of every variable on the file and correct every error found. A snapshot of the output files was taken at each step, and the files from the current step were verified against those from the previous step. The programming of all edit rules was tested before being applied to the data. Examples of data processing verification included: 1) the review of all question flows, including very complex sequences, to ensure skip values were accurately assigned and distinguished from different types of missing values; 2) an in-depth qualitative review of open-ended and ‘other-specify’ responses for accurate and rigorous coding; 3) experienced supervision of coding to standardized classifications; and 4) a review of all derived variables against their component variables to ensure accurate programming of derivation logic, including very complex derivations. For additional information on data processing, please consult Chapter 5 of this guide.
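For instance, the derived-variable check described in point 4 amounts to recomputing each derived variable from its components and flagging any disagreement. The sketch below uses hypothetical variable names and cut-points:

    import pandas as pd

    def check_derived_class(df: pd.DataFrame) -> pd.DataFrame:
        """Return records where a derived class disagrees with its components.

        Hypothetical example: 'severity_class' is supposed to be derived from
        a continuous 'severity_score' using fixed cut-points.
        """
        cuts = [-float("inf"), 0.25, 0.50, 0.75, float("inf")]
        labels = ["class 1", "class 2", "class 3", "class 4"]
        recomputed = pd.cut(df["severity_score"], bins=cuts, labels=labels)
        return df[recomputed.astype(str) != df["severity_class"].astype(str)]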