Canadian Survey on Disability, 2022: Concepts and Methods Guide
5. Processing
5.1 Pre-processing: data capture
All responses to the 2022 Canadian Survey on Disability (CSD) questions were captured directly in the electronic questionnaire (EQ) application, both for the interviewer-led (iEQ) component and the respondent self-reporting (rEQ) component. Additional case management information for the iEQ was captured and transmitted to head office. Data from the rEQ were transmitted directly to head office. Paradata were also collected in the form of audit trail files, which provided information on aspects such as survey length and time spent on difficult questions. These electronic systems create many efficiencies in both the time and the costs associated with data capture and transmission. All survey responses were kept highly secure through industry-standard encryption protocols, firewalls and encryption layers.
For some CSD questions, data underwent a preliminary verification process when respondents were completing the survey. This was accomplished by means of a series of soft edits programmed into the EQ. That is, where a particular response appeared to be inconsistent with previous answers or outside of expected values, the interviewer or the self-reporting respondent was notified with an on-screen warning message, providing them with an opportunity to modify the response provided. The response data were subjected to more in-depth processing once they were transmitted to head office, as described in the sections below.
5.2 Survey processing steps
Once survey responses were transmitted to head office, more extensive data processing for the CSD began. This involved a series of steps to convert the questionnaire responses from their initial raw format to a high-quality, user-friendly database involving a comprehensive set of variables for analysis. A series of data operations were executed to clean files of inadvertent errors, edit the data for consistency, code open-ended questions, create useful variables for data analysis, and finally to systematize and document the variables for ease of analytical usage.
The CSD uses a set of social survey processing tools developed at Statistics Canada called the “Social Survey Processing Environment” (SSPE). The SSPE involves statistical software programs (SAS-based), custom applications and manual processes for performing the following systematic processing steps:
- Receipt of raw data
- Clean up
- Recodes
- Flow edits
- Coding
- Consistency edits
- Variable conversion
- Derived variables
- Creation of final processing file
- Creation of dissemination files
Each step of processing, from the initial clean-up to the construction of derived variables, is described in more detail in the sections of this chapter below. Chapter 6 provides the details related to final database creation.
5.3 Record clean up: in-scope and complete records
Following the receipt of raw data from the electronic questionnaire applications, a number of preliminary cleaning procedures were implemented for the 2022 CSD at the individual records level. These included the removal of all personal identifier information from the files, such as names and addresses, as part of a rigorous set of ongoing mechanisms for protecting the confidentiality of respondents. In addition, only one copy was retained of any duplicates (i.e., two entries for a single respondent) found at this stage. Each pair was examined individually in order to ascertain the best record to keep. The only exceptions to keeping the first record were cases where it obviously contained errors.
Also part of clean-up procedures was the review of all respondent records to ensure each respondent was “in-scope” and had a sufficiently completed questionnaire. Specific criteria for respondents are outlined below.
- To be “in scope” for the 2022 CSD, respondents must be at least 15 years of age on Census Day, May 11, 2021, and reside in a private household in Canada at the time of the survey. Specific questions in the entry module were used to confirm these criteria before beginning the interview. In-scope respondents include two groups: 1) those who were screened in upon completing the Disability Screening Questions (DSQ) and were therefore part of the disability population and 2) those who were screened out by the DSQ and were thus considered non-disabled. Both groups remain in the final survey database.
- To have a “complete” questionnaire, respondents who met the criteria of the population of persons with a disability must have provided an answer to the last question of the Labour Force Discrimination (LFD) module. This ensures that responses are obtained for a number of essential questions: those required to produce data tables for persons with disabilities as required by the 1995 Employment Equity Act. See Appendix D for more information.
- To have a “complete” questionnaire, respondents who were assigned to the population of persons without a disability must have provided an answer to the last question in the DSQ. Respondents without a disability were not required to complete the rest of the CSD questionnaire.
During data collection, information was exchanged several times between headquarters and the regional office (RO) interviewers. Since this information could not always be adequately saved in the collection system, specific tickets were opened in a system called CTOC, through which information about respondents was sent so that the ROs could coordinate input.
Once the final status of each respondent was determined, cases considered out of scope or incomplete were removed from the database. The weights of respondents with complete questionnaires were adjusted upward to compensate for these losses (see section 6.1 for more information on weighting).
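The in-scope and completeness criteria above can be sketched as a simple record filter. This is a minimal illustration only, not the SSPE implementation: the `Record` class and all of its field names are hypothetical, and the age check assumes a date of birth is available on the record.

```python
from dataclasses import dataclass
from datetime import date

CENSUS_DAY = date(2021, 5, 11)  # reference date stated in the text


@dataclass
class Record:
    # All field names are hypothetical, chosen for illustration only.
    birth_date: date
    in_private_household: bool  # resides in a private household in Canada
    screened_in: bool           # True if the DSQ identified a disability
    answered_last_dsq: bool     # answered the last DSQ question
    answered_last_lfd: bool     # answered the last LFD module question


def age_on(day: date, birth: date) -> int:
    """Age in completed years on the given day."""
    had_birthday = (day.month, day.day) >= (birth.month, birth.day)
    return day.year - birth.year - (0 if had_birthday else 1)


def is_in_scope(r: Record) -> bool:
    return r.in_private_household and age_on(CENSUS_DAY, r.birth_date) >= 15


def is_complete(r: Record) -> bool:
    # Persons with a disability must reach the end of the LFD module;
    # persons without a disability only need to finish the DSQ.
    return r.answered_last_lfd if r.screened_in else r.answered_last_dsq


def keep_record(r: Record) -> bool:
    """True if the record would stay on the file under the stated criteria."""
    return is_in_scope(r) and is_complete(r)
```

Records for which `keep_record` returns `False` correspond to the out-of-scope or incomplete cases that were removed; the weight adjustment that compensates for them is not shown here.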
At this stage, a few edits were also applied to ensure the validity of the data. For example:
- Year and age variables were checked against minimum and maximum values (e.g., the answer to “In which year did you start working for your current employer?” was required to be between 1950 and 2022).
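A range edit of this kind could be expressed as a small table of rules. The sketch below is illustrative only: the variable name `job_start_year` is hypothetical, and the function merely reports failures, since the text does not say how out-of-range values were resolved.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical variable name; the 1950 to 2022 bounds come from the text.
RANGE_RULES: Dict[str, Tuple[int, int]] = {
    "job_start_year": (1950, 2022),
}


def range_failures(record: Dict[str, Optional[int]]) -> List[str]:
    """Return the names of variables whose values fall outside their bounds."""
    failures = []
    for name, (low, high) in RANGE_RULES.items():
        value = record.get(name)
        # Missing values are left for the non-response handling steps.
        if value is not None and not (low <= value <= high):
            failures.append(name)
    return failures
```

For instance, `range_failures({"job_start_year": 1949})` returns `["job_start_year"]`, while a value of 2022 passes.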
5.4 Recodes: variable changes and multiple-response questions
This stage of processing involved changes at the level of individual variables. Variables could be dropped, recoded, re-sized or left as is. Formatting changes were intended to facilitate processing as well as analysis of the data by end-users. One such change at the variable level was the conversion of multiple-response questions (“Select-all-that-apply” questions) to corresponding sets of single-response variables which are easier to use. For each response category associated with the original question, a new variable was created with “yes/no” response values. An example is provided below. This process is called “destringing” the variables.
Start of text box
Original multiple-response question:
AADH_Q20 Why do you not have a hearing aid?
ON-SCREEN HELP: Select all that apply.
- Cost
- Do not want to or not willing to upgrade from current aid or assistive device
- Not available
- Available aids cannot be adapted
- Other reasons
End of text box
Start of text box
Final variables in single-response “yes/no” format:
AADH_20A Why do you not have a hearing aid?
- Cost
  - Yes
  - No
AADH_20B Why do you not have a hearing aid?
- Do not want to or not willing to upgrade from current aid or assistive device
  - Yes
  - No
AADH_20C Why do you not have a hearing aid?
- Not available
  - Yes
  - No
AADH_20D Why do you not have a hearing aid?
- Available aids cannot be adapted
  - Yes
  - No
AADH_20E Why do you not have a hearing aid?
- Other reasons
  - Yes
  - No
End of text box
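The destringing of AADH_Q20 shown above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the numeric values chosen for “yes” and “no” and the representation of the selected categories as a set of letters are both assumed, and non-response handling is left out.

```python
from typing import Dict, Set

# Category letters and labels follow the AADH_20A to AADH_20E example.
CATEGORIES = {
    "A": "Cost",
    "B": "Do not want to or not willing to upgrade from current aid or assistive device",
    "C": "Not available",
    "D": "Available aids cannot be adapted",
    "E": "Other reasons",
}

YES, NO = 1, 2  # assumed numeric values for the yes/no categories


def destring(selected: Set[str], stem: str = "AADH_20") -> Dict[str, int]:
    """Turn one select-all-that-apply answer into one yes/no variable per category."""
    unknown = selected - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {f"{stem}{letter}": (YES if letter in selected else NO)
            for letter in CATEGORIES}
```

A respondent who selected “Cost” and “Other reasons” would be passed in as `destring({"A", "E"})`, giving “yes” for AADH_20A and AADH_20E and “no” for the other three variables.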
5.5 Flow edits: response paths, valid skips and question non-response
Another set of data processing procedures applied to the 2022 CSD was the verification of questionnaire flows or skip patterns. All response paths and skip patterns that were built into the questionnaire were verified to ensure that the universe (or coverage) for each question was accurately captured during processing.
Different category types for question response and non-response are explained below in order to assist users to better understand question universes as well as statistical outputs for CSD survey variables.
Question response and non-response categories
The electronic questionnaire items were identical for both interviewers (iEQ) and self-completing survey respondents (rEQ). Respondents or interviewers were generally invited to select a response from among a set of answer categories provided on the screen. In some instances, survey questions were open-ended, requiring a write-in response. An optional response category of “Don’t know” was provided in a limited number of questions. In some situations, a respondent may have skipped past the question by hitting the Next button without having provided a response. For certain critical survey questions, a missed question would elicit an automated reminder to the respondent to complete the missed question. However, respondents always had the option to skip over a question.
Special numeric codes have been designated for each type of non-response in order to facilitate user recognition and data analysis.
Response
- An answer directly relevant to the content of the question that is captured by a list of pre-existing answer categories or that can be categorized through coding, as is the case with ‘other-specify’ items and open-ended questions.
Valid skip
- Indicates that the question was not asked of the respondent, based on their response to a previous question. Where there is a valid skip, the respondent is not considered to be part of the universe for that question.
- Code is set to “6” as the last digit, with any preceding digits set to “9”, such as 6, 96, or 996 (etc.), based on the length of the variable.
Don’t know
- In an EQ survey, it is not always possible to identify situations where a respondent doesn’t know the answer. This is because the respondent always has the choice to skip past a question by pressing the Next button, without specifying the reason why. These missed items are normally coded as Not Stated (see category below). However, for some CSD questions, it was important to distinguish whether respondents truly did not know the answer and so a Don’t Know category was included in the list of available responses.
- Code is set to “7” as the last digit, with any preceding digits set to “9”, such as 7, 97, or 997 (etc.), based on the length of the variable.
Not stated
- Indicates that the question was asked of the respondent but not answered, such as when a respondent skips a question by hitting the Next button without having provided an answer.
- Code is set to “9” as the last digit, with any preceding digits set to “9”, such as 9, 99, or 999 (etc.), based on the length of the variable.
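The coding rule for these three categories (a fixed last digit, padded with leading 9s to the length of the variable) can be sketched as a small helper. The function names and category keys are hypothetical; only the digits 6, 7 and 9 and the padding rule come from the text.

```python
# Last digit reserved for each non-response category, as described above.
LAST_DIGIT = {"valid_skip": "6", "dont_know": "7", "not_stated": "9"}


def reserved_code(kind: str, width: int) -> int:
    """Build the reserved code for a variable that is `width` digits long."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return int("9" * (width - 1) + LAST_DIGIT[kind])


def classify(value: int, width: int) -> str:
    """Label a stored value as a reserved category or as a response."""
    for kind in LAST_DIGIT:
        if value == reserved_code(kind, width):
            return kind
    return "response"
```

For a three-digit variable, `reserved_code("valid_skip", 3)` gives 996, and `classify(997, 3)` gives `"dont_know"`. When tabulating, an analyst would typically exclude values classified as `"valid_skip"` from the question universe.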
Non-response for derived variables (DVs)
The construction of derived variables (DVs) for the CSD database often involved combining or regrouping answers of more than one survey question. Among the component variables of a DV, it is possible that some may have had valid answers, while others may have had non-response values. Where components for a given DV included any non-response code of Don’t Know or Not Stated, DVs were coded to reflect the best possible understanding of the combination of responses involved.
Non-response for external census linked variables
In the case of external census variables linked to the CSD, it should be noted that these variables do not generally contain any missing data such as Don’t Know, Refusal or Not Stated responses, since census processing operations for most variables involved imputation of all missing responses before they were linked to the CSD. The only exception to this involves variables related to the Activities of Daily Living question on the census, where data were not imputed as these variables were intended only to provide a sampling frame for the post-censal CSD. As noted below, any missing values for these variables are coded to “Not stated”.
However, there were other categories of non-response for census variables as described below:
Not applicable
- Indicates that the question did not apply to the respondent’s situation, as determined by valid answers to a previous question. For some census variables, there may have been multiple Not Applicable categories available, each indicating that the question did not apply for a different reason. In any such cases, the respondent is not considered to be part of the universe for that question.
- New CSD codes were created to replace the census “not applicable” codes, which had negative values. The new codes were necessary in order for the CSD data to be compliant with the processing system used, which didn’t allow for negative values in categorical variables. However, the new CSD codes preserve the distinctions made by the census codes for identifying different reasons why the variable may not apply. Also, where needed, census variables were extended in length by one digit to accommodate these special codes.
- CSD code is set to “2”, “3”, “4” or “5” as the last digit (replacing census codes of -2, -3, -4, and -5), with any preceding digits set to “9”, such as 92, 992 or 9992 (etc.); 93, 993 or 9993 (etc.); 94, 994 or 9994 (etc.); or 95, 995 or 9995 (etc.), based on the length of the variable.
Suppressed
- Indicates that census data have not been linked to the CSD database based on the respondent’s request, as expressed at the time of the CSD interview.
- CSD code is set to “0” as the last digit, with any preceding digits set to “9”, such as 90, 990 or 9990 (etc.), based on the length of the variable.
Not stated
- Indicates that the census question was neither answered nor imputed. Refers only to the census Activities of Daily Living question variables, where data were not imputed, as these variables were intended only to provide a sampling frame for the post-censal CSD.
- CSD code is set to 92, replacing the census code of -2 (Not stated).
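The replacement of negative census codes with positive CSD codes follows the same padding pattern and can be sketched as below. The function names are hypothetical, and `width` is assumed to be the length of the CSD variable after any extension by one digit.

```python
NEGATIVE_CENSUS_CODES = (-2, -3, -4, -5)  # census codes listed in the text


def csd_code_for_census(census_code: int, width: int) -> int:
    """Map a negative census code to its CSD equivalent (e.g., -2 to 92)."""
    if census_code not in NEGATIVE_CENSUS_CODES:
        raise ValueError(f"unexpected census code: {census_code}")
    if width < 2:
        raise ValueError("width must be at least 2")
    # Keep the census digit as the last digit and pad with leading 9s.
    return int("9" * (width - 1) + str(-census_code))


def csd_suppressed_code(width: int) -> int:
    """Code for census data not linked at the respondent's request (e.g., 90)."""
    if width < 2:
        raise ValueError("width must be at least 2")
    return int("9" * (width - 1) + "0")
```

Because the last digit is preserved, the distinction between the different census reasons remains visible in the CSD codes.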
More information on derived variables and census variables is provided in sections 5.9 and 5.10 below.
5.6 Coding
The next step of data processing involved the review and classification of write-in responses to questionnaire items, wherever applicable—a process called coding. Two types of questions required the application of coding procedures: “Other-specify” items and questions that were completely open-ended. These are described in more detail below.
“Other-specify” items
For most questions on the CSD questionnaire, a list of answer category options was presented to respondents for their consideration. These often included on-screen help text with explanations and examples to assist with respondent selection of the most appropriate category for their situation. However, in the event that a respondent’s answer could not be easily assigned to an existing category, many questions also allowed respondents or interviewers to enter a long-answer text response in the “Other-specify” category.
All questions with “Other-specify” categories were examined and coded during processing. A total of 30 questions were coded for ‘other-specify’ responses. Twenty-five of these involved multiple response questions (“mark all that apply”) and five involved single response questions. Based on coding guidelines prepared by subject-matter specialists, many of the long answers provided by respondents for these questions were recoded back into one of the existing answer categories. Responses that were unique and qualitatively different from existing categories were kept as “Other”.
Open-ended questions and standard classifications
An additional 19 questions on the 2022 CSD questionnaire were recorded in a completely open-ended format. These included questions related to the following:
- The medical conditions that caused the respondent the most difficulty or limited their activities the most (up to two conditions may be reported) (ICD);
- Occupation and industry of work (NOC and NAICS);
- Major field of post-secondary study (CIP).
For most of these questions, responses were coded using a custom in-house tool called the Coding and Correction Environment (CCE). Standardized classification systems were used for all four fields: the International Classification of Diseases (ICD), the North American Industry Classification System (NAICS), the National Occupational Classification (NOC), and the Classification of Instructional Programs (CIP).
Coding for standardized classifications involved a team of experienced coders and quality control supervisors. Subject matter experts in data processing applied additional verification procedures, which were particularly rigorous for CSD 2022 given the need for comparability with 2017:
- A series of planned, targeted checks were automatically applied to data returning from the CCE system. These checks returned files containing potentially erroneous data, which were then reviewed by subject matter experts. Data were then corrected as needed and loaded back into the data files to be processed.
- A series of consistency checks to ensure that no erroneous data remained.
  - E.g., verifying that all codes were in fact in the code set and that no data entry errors had occurred.
  - E.g., verifying high-level findings against other contextual information published by Statistics Canada (e.g., Labour Force Survey).
- A series of validation checks which involved looking at 2017 and 2022 together to ensure that differences were as expected or, if not, that they were understood and validated for comparability purposes.
5.7 Consistency edits
A number of edits and imputations were required to ensure that survey data were consistent and complete. Consistency edits target inconsistencies between survey variables. Because the data had already gone through various edits built into the electronic questionnaire, the edits applied at this stage were narrowly targeted. To give a few examples, edits were programmed to:
- Ensure that no age variable (e.g., age of onset of one’s disability) is larger than the respondent’s actual age
- Ensure that the age at which a limitation began is not less than the age at which the disability or difficulty itself began
- Ensure that no respondents who report needing an aid or assistive device then identify ‘none’ in the follow-up question
- Ensure that for respondents who started at new jobs or businesses in 2022, the month they started does not occur after the month in which the interview took place
- Ensure that respondents who indicated using cannabis could not then indicate ‘never’ in the follow-up question on frequency.
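Several of the consistency edits listed above can be sketched as cross-variable checks on a single record. All field names here are hypothetical, the interview is assumed to take place in 2022, and the sketch only flags inconsistencies, since the text does not describe how each one was resolved.

```python
from typing import Any, Dict, List


def consistency_flags(r: Dict[str, Any]) -> List[str]:
    """Return a list of the consistency rules that a record violates."""
    flags = []
    # Age of onset cannot exceed the respondent's current age.
    if r["onset_age"] > r["age"]:
        flags.append("onset_age_exceeds_age")
    # A limitation cannot begin before the disability or difficulty itself.
    if r["limitation_age"] < r["onset_age"]:
        flags.append("limitation_before_onset")
    # A job started in 2022 cannot start after the interview month.
    if r["job_start_year"] == 2022 and r["job_start_month"] > r["interview_month"]:
        flags.append("job_start_after_interview")
    # Cannabis users cannot report a frequency of 'never'.
    if r["uses_cannabis"] and r["cannabis_frequency"] == "never":
        flags.append("cannabis_user_reports_never")
    return flags
```

An empty list means the record passed every check in the sketch; flagged records would then go to whatever correction or imputation rule applies to that edit.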
5.8 Variable conversion
At this stage, final variable names are established on the file. For example, the letter Q which appears in all question acronyms is removed from final variable names. All final variable names must respect an 8-character limit.
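The naming rule can be sketched as below, using the AADH_Q20 example from section 5.4. The function name is hypothetical, and the sketch assumes that the letter Q always follows an underscore and precedes the question number, which matches that example but is not confirmed for every question.

```python
import re

MAX_NAME_LENGTH = 8  # character limit stated in the text


def final_variable_name(question_name: str, suffix: str = "") -> str:
    """Drop the 'Q' from a question acronym and enforce the length limit."""
    # Assumption: the Q sits between an underscore and the question number.
    name = re.sub(r"_Q(?=\d)", "_", question_name, count=1) + suffix
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"{name!r} exceeds {MAX_NAME_LENGTH} characters")
    return name
```

With this rule, `final_variable_name("AADH_Q20", "A")` yields `AADH_20A`, the first of the destrung variables shown earlier.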
5.9 Derived variables
In order to facilitate more in-depth analysis of the rich CSD dataset, over 170 derived variables (DVs) were created by regrouping or combining answers from one or more questions on the questionnaire. This includes the creation of 39 new derived variables, specific to 2022. All DV names have a “D” in the first character position of the name for quick identification. The 2022 CSD Data Dictionaries identify all DVs.
5.10 External census-linked variables
A linkage between the CSD and the census was performed. This linkage not only supports the legal framework and basis for the existence of the CSD, but also provides tremendous benefit to the public by helping to inform disability and inclusion policies through sound analyses, increasing the accountability of the Government of Canada and the transparency of information for the Canadian public. The linkage allows for comparisons of the outcomes of persons with and without disabilities, specifically labour market status, income and education, often demonstrating socioeconomic gaps between the two groups. Without the ability to make these comparisons, it would not be possible to fulfil the related policy and programming commitments. Furthermore, the use of census data for CSD respondents increases the range of available information that can support policy and programming on topics not included within the CSD, such as housing or journey to work, while reducing response burden.
In addition to the CSD variables, approximately 350 census variables were added to the final CSD processing file for 2022 through record linkage. Respondents were informed of the plan to link the CSD data to administrative data sources through the addition of the generic record linkage statement, which was added to the ‘Getting started’ pages of the survey/interview. All linked information is kept confidential and used for statistical purposes only.
For all census variables, the census variable name was preserved as much as possible on the CSD database. Some exceptions applied since CSD variable names are restricted to eight characters whereas census variable names sometimes exceeded eight characters in length. Consistency with variable names used in CSD 2017 was also considered. The 2022 CSD Data Dictionaries provide a complete listing of census variables.
The final structure and content of the data files are described in Chapter 6.
5.11 Data validation and confrontation against CSD 2017
As noted, the 2022 CSD was designed to have as much comparability as possible with the previous cycle in 2017. Until the 2022 CSD, there had not been two comparable cycles of a disability survey since the Participation and Activity Limitation Survey (PALS) in 2001 and 2006.
A time series will be especially important for understanding the potential long-term impact of the COVID-19 pandemic on persons with disabilities, including rates of disability by type, labour market activities, income and unmet needs, among others.
Additional data validation and confrontation were performed on CSD 2022 using CSD 2017 as a benchmark.
Because there were not enough comparable historical data, z-scores could not be used. Instead, validation involved a scan of all Statistics Canada literature and releases on persons with disabilities published since 2017.
It also involved comparing historical data for key variables over time to ensure that changes were well documented and understood within their larger contexts and were not tied to, for example, discrepancies in processing methods.