Canadian Survey on Disability, 2017: Concepts and Methods Guide
5. Data processing

5.1 Pre-processing: Data capture

All responses to the 2017 Canadian Survey on Disability (CSD) questions were captured directly in the electronic questionnaire (EQ) application, both for the interviewer-led (iEQ) component and the respondent self-reporting (rEQ) component. Additional case management information for the iEQ was captured through a Blaise software system in Statistics Canada’s regional offices prior to the transmission of data to head office. Data from the rEQ were transmitted directly to head office. These electronic systems reduce both the time and the costs associated with data capture and transmission. All survey responses were kept secure through industry-standard safeguards, including encryption and firewalls.

For some CSD questions, data underwent a preliminary verification process while respondents were completing the survey. This was accomplished by means of a series of edits programmed into the EQ. Where a particular response appeared to be inconsistent with previous answers or outside the range of expected values, the interviewer or the self-reporting respondent was notified with an on-screen warning message, providing them with an opportunity to modify the response. Approximately 90% of respondents who received these triggered messages corrected their answers. The response data were subjected to more in-depth processing once they were transmitted to head office, as described in the sections below.
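The kind of soft edit described above can be illustrated with a minimal sketch. The function and field names here are hypothetical; the actual edits were programmed directly into the EQ application.

```python
# Sketch of an electronic-questionnaire "soft" edit check. The rule and
# field names are invented for illustration; real EQ edits were built
# into the application itself.
def check_response(field, value, expected_range):
    """Return a warning message if the value is out of range, else None.

    The respondent (or interviewer) can then correct the answer or keep
    it and move on -- the edit is a soft check, not a hard stop.
    """
    if value is None:
        return None  # skipped questions are handled later by flow edits
    low, high = expected_range
    if not (low <= value <= high):
        return (f"Warning: {field} = {value} is outside the expected "
                f"range {low}-{high}. Please verify your answer.")
    return None

# Example: usual hours worked per week cannot exceed 168
msg = check_response("hours_worked", 200, (0, 168))
```

A value inside the expected range produces no message, so the respondent proceeds without interruption.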

5.2 Survey processing steps

Once survey responses were transmitted to head office, more extensive data processing for the CSD began. This involved a series of steps to convert the questionnaire responses from their initial raw format to a high-quality, user-friendly database containing a comprehensive set of variables for analysis. A series of data operations were executed to clean the files of inadvertent errors, edit the data for consistency, code open-ended questions, create useful variables for data analysis, and finally to systematize and document the variables for ease of analytical use.

The CSD uses a set of social survey processing tools developed at Statistics Canada called the “Social Survey Processing Environment” (SSPE). The SSPE involves statistical software programs (SAS-based), custom applications and manual processes for performing the following systematic processing steps:

  1. Record clean-up: identifying in-scope and complete records
  2. Recodes: variable changes and multiple-response questions
  3. Flow edits: response paths, valid skips and question non-response
  4. Coding of “Other-specify” and open-ended questions
  5. Consistency edits
  6. Variable conversion
  7. Creation of derived variables
  8. Addition of external census-linked variables

Each step of processing, from the initial clean-up to the construction of derived variables, is described in more detail in the sections of this chapter below. Chapter 6 provides the details related to final database creation.

5.3 Record clean up: In-scope and complete records

Following the receipt of raw data from the electronic questionnaire applications, a number of preliminary cleaning procedures were implemented for the 2017 CSD at the individual record level. These included the removal of all personal identifier information from the files, such as names and addresses, as part of a rigorous set of ongoing mechanisms for protecting the confidentiality of respondents. In addition, only one copy was kept of any duplicate records (i.e., two entries for a single respondent) found at this stage. Each pair was examined individually, and the general rule was to keep the first entry received. The only exception to this rule was when the first record clearly contained errors.
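The keep-the-first-unless-erroneous rule can be sketched as follows. The column names and flag are hypothetical; in practice each duplicate pair was examined individually.

```python
import pandas as pd

# Illustrative duplicate removal following the rule in the text: keep the
# first entry received for each respondent, unless that entry clearly
# contains errors. Record IDs and the error flag are invented.
records = pd.DataFrame({
    "respondent_id": [101, 102, 102, 103],
    "received_order": [1, 1, 2, 1],
    "has_obvious_errors": [False, True, False, False],
})

# Sort so error-free entries come before erroneous ones, then by order
# of receipt; keep the first surviving entry per respondent.
deduped = (records
           .sort_values(["respondent_id", "has_obvious_errors", "received_order"])
           .drop_duplicates(subset="respondent_id", keep="first"))
```

For respondent 102 the first entry received is flagged as erroneous, so the second entry is kept instead.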

Also part of clean-up procedures was the review of all respondent records to ensure each respondent was “in-scope” and had a sufficiently completed questionnaire. Specific criteria for respondents are outlined below.

  1. To be “in scope” for the 2017 CSD, respondents must be at least 15 years of age on Census Day, May 10, 2016, and reside in a private household in Canada at the time of the survey. Specific questions in the entry module were used to confirm these criteria before beginning the interview. In-scope respondents include two groups: 1) those who were screened in upon completing the Disability Screening Questions (DSQ) and were therefore part of the disability population and 2) those who were screened out by the DSQ and were thus considered non-disabled. Both groups remain in the final survey database.
  2. To have a “complete” questionnaire, respondents who met the criteria of the population of persons with a disability must have provided an answer to the last question in the Labour Force Discrimination (LFD) module. This ensures responses to a number of essential questions: those needed to produce data tables for persons with disabilities as required by the 1995 Employment Equity Act. See Appendix E for more information.
  3. To have a “complete” questionnaire, respondents who were assigned to the population of persons without a disability must have provided an answer to the last question in the DSQ. Respondents without a disability were not required to complete the rest of the CSD questionnaire since it did not apply to them.
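The scope and completeness rules above can be sketched as a simple classification function. All field names are hypothetical; the actual determination also drew on the collection log described below.

```python
# Sketch of the in-scope / complete rules described above. Field names
# are invented for illustration.
def questionnaire_status(record):
    """Classify a record as 'out_of_scope', 'incomplete' or 'complete'."""
    # In scope: at least 15 on Census Day and living in a private household
    if record["age_on_census_day"] < 15 or not record["private_household"]:
        return "out_of_scope"
    if record["screened_in_by_dsq"]:
        # Persons with disabilities must reach the last question of the
        # Labour Force Discrimination (LFD) module.
        return "complete" if record["answered_last_lfd_question"] else "incomplete"
    # Persons without disabilities only need to finish the DSQ itself.
    return "complete" if record["answered_last_dsq_question"] else "incomplete"
```

Both screened-in and screened-out respondents can be “complete”; only the threshold question differs.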

During data collection, information was exchanged several times between headquarters and the regional office interviewers. Since this information could not always be saved adequately in the collection system, a log was created and maintained at headquarters for all useful information pertaining to the respondents. For example, a respondent might have contacted the regional office after the interview to provide additional information or to make a correction to the information they initially provided. Since the case had been finalized, the interviewer no longer had access to update it or add notes, so the log was used to record this information. All the notes entered in the application by both interviewers and respondents were reviewed at headquarters and all relevant information was entered in the log.

To determine the final status of a questionnaire (complete or incomplete) for a sample unit, the additional information saved in the log was considered and given precedence over contradictory information automatically generated by the system.

Once the final status of each respondent was determined, cases considered out of scope or incomplete were removed from the database. The weights of respondents with complete questionnaires were adjusted upward to compensate for these losses (see section 6.1 for more information on weighting).
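The upward weight adjustment mentioned above can be illustrated with a deliberately simplified sketch: the weight carried by dropped records is redistributed proportionally across the remaining complete records. The actual CSD weighting, described in section 6.1, is more elaborate (e.g., adjustments within weighting classes).

```python
# Simplified illustration of redistributing the weight of removed
# (incomplete) records to the remaining complete records. Not the actual
# CSD weighting methodology -- see section 6.1.
def adjust_weights(weights, is_complete):
    total = sum(weights)
    kept = [w for w, ok in zip(weights, is_complete) if ok]
    factor = total / sum(kept)   # > 1 whenever any record was dropped
    return [w * factor for w in kept]

new_w = adjust_weights([10.0, 10.0, 20.0, 20.0], [True, True, False, True])
```

The adjusted weights of the three remaining records sum to the original total of 60, so the represented population is preserved.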

5.4 Recodes: Variable changes and multiple-response questions

This stage of processing involved changes at the level of individual variables. Variables could be dropped, recoded, re-sized or left as is. Formatting changes were intended to facilitate processing as well as analysis of the data by end-users. One such change at the variable level was the conversion of multiple-response questions (“Select-all-that-apply” questions) to corresponding sets of single-response variables, which are easier to use. For each response category associated with the original question, a new variable was created with “yes/no” response values. This process is called “destringing” the variables. An example is provided below.

Start of text box

Original multiple-response question

IU_Q15  During the past 12 months, did you use the Internet from:

Select all that apply.

  1. home
  2. personal smart phone, tablet or other wireless handheld device
  3. another person's home
  4. work
  5. school or training institute
  6. some other location
    e.g., public Wi-Fi, library, community center, etc.

End of text box

Start of text box

Final variables in single-response “yes/no” format

IU_15A During the past 12 months, did you use the Internet from - home

  1. Yes
  2. No

IU_15B During the past 12 months, did you use the Internet from - personal smart phone, tablet or other wireless handheld device

  1. Yes
  2. No

IU_15C During the past 12 months, did you use the Internet from - another person's home

  1. Yes
  2. No

... additional “yes/no” questions for each response category, including IU_15D for work, IU_15E for school or training institute, and for the last category:

IU_15F During the past 12 months, did you use the Internet from - some other location, e.g., public Wi-Fi, library, community center, etc.

  1. Yes
  2. No

End of text box
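The destringing conversion shown in the text boxes above can be sketched as follows. The stored data format is invented for illustration; only the category labels follow the IU_Q15 example.

```python
import pandas as pd

# "Destringing" sketch: one multiple-response variable holding the set of
# selected categories becomes one yes/no variable per category. The raw
# storage format (lists of category letters) is an assumption.
categories = ["A", "B", "C", "D", "E", "F"]   # home, device, ..., other

raw = pd.DataFrame({"IU_Q15": [["A", "D"], ["B"], ["A", "B", "F"]]})

for letter in categories:
    # 1 = Yes, 2 = No, matching the single-response format shown above
    raw[f"IU_15{letter}"] = raw["IU_Q15"].apply(
        lambda selected, cat=letter: 1 if cat in selected else 2)
```

Each respondent row ends up with six single-response variables, one per original answer category.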

5.5 Flow edits: Response paths, valid skips and question non-response

Another set of data processing procedures applied to the 2017 CSD was the verification of questionnaire flows or skip patterns. All response paths and skip patterns that were built into the questionnaire were verified to ensure that the universe for each question was accurately captured during processing.

Different category types for question response and non-response are explained below to help users better understand question universes as well as statistical outputs for CSD survey variables.

Question response and non-response categories

The electronic questionnaire items were identical for both interviewers (iEQ) and self-completing survey respondents (rEQ). Respondents or interviewers were generally invited to select a response from among a set of answer categories provided on the screen. In some instances, survey questions were open-ended, requiring a write-in response. An optional response category of “Don’t know” was provided for a limited number of questions. In some situations, a respondent may have skipped past a question by hitting the Next button without providing a response. For certain critical survey questions, a missed question would trigger an automated reminder to the respondent to complete it. However, respondents always had the option to skip over a question.

Special numeric codes have been designated for each type of non-response in order to facilitate user recognition and data analysis.

  1. Response: a valid answer was provided to the question.
  2. Valid skip: the question did not apply to the respondent, given their response path, and was legitimately skipped.
  3. Don’t know: the respondent selected the optional “Don’t know” category, where available.
  4. Not stated: the question applied to the respondent, but no answer was provided.
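The assignment of these special codes during flow-edit processing can be sketched as below. The numeric code values used here are placeholders; the actual codes are documented in the 2017 CSD Data Dictionaries.

```python
# Sketch of assigning special non-response codes during flow edits.
# The numeric values (6, 7, 9) are placeholders, not the actual CSD codes.
VALID_SKIP, DONT_KNOW, NOT_STATED = 6, 7, 9

def coded_value(on_path, answer):
    """Return the stored value for a question given the respondent's path."""
    if not on_path:
        return VALID_SKIP          # question universe excluded the respondent
    if answer == "don't know":
        return DONT_KNOW           # optional category, where offered
    if answer is None:
        return NOT_STATED          # question applied but was skipped
    return answer                  # a valid response
```

Verifying the skip patterns amounts to checking that every off-path question carries the valid-skip code rather than a missing value.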

Non-response for derived variables (DVs)

The construction of derived variables (DVs) for the CSD database often involved combining or regrouping answers from more than one survey question. Among the component variables of a DV, some may have had valid answers while others may have had non-response values. Where the components of a given DV included a non-response code of “Don’t know” or “Not stated”, the DV was coded to reflect the best possible understanding of the combination of responses involved.

Non-response for external census linked variables

In the case of external census variables linked to the CSD, these variables do not generally contain any missing data such as “Don’t know”, “Refusal” or “Not stated” responses, since census processing operations for most variables involved imputation of all missing responses before they were linked to the CSD. The only exception involves variables related to the Activities of Daily Living question on the census, where data were not imputed because these variables were intended only to provide a sampling frame for the post-censal CSD. As noted below, any missing values for these variables are coded to “Not stated”.

However, there were other categories of non-response for census variables, as described below:

  1. Not applicable
  2. Suppressed
  3. Not stated

More information on derived variables and census variables is provided in sections 5.9 and 5.10 below.

5.6 Coding

The next step of data processing involved the review and classification of write-in responses to questionnaire items, wherever applicable—a process called coding. Two types of questions required the application of coding procedures: “Other-specify” items and questions that were completely open-ended. These are described in more detail below.

“Other-specify” items

For most questions on the CSD questionnaire, a list of answer category options was presented to respondents for their consideration. These often included on-screen help text with explanations and examples to assist respondents in selecting the most appropriate category for their situation. However, in the event that a respondent’s answer could not be easily assigned to an existing category, many questions also allowed respondents or interviewers to enter a long-answer text response in the “Other-specify” category.

All questions with “Other-specify” categories were examined and coded during processing. A total of 30 questions were coded for “Other-specify” responses: twenty-five of these involved multiple-response questions (“mark all that apply”) and five involved single-response questions. Based on coding guidelines prepared by subject-matter specialists, many of the long answers provided by respondents for these questions were recoded back into one of the existing answer categories. Responses that were unique and qualitatively different from existing categories were kept as “Other”. Where counts warranted, new categories were created to capture emerging themes in the data that were not reflected in existing categories. Appendix F presents the extra categories added for the 2017 CSD, all of which are noted in the data dictionaries. These new categories will also be taken into account when refining the answer categories for future cycles of the survey.
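A keyword-based recode of “Other-specify” write-ins can be sketched as below. The mapping is invented for illustration; the real guidelines were prepared by subject-matter specialists and applied with human review.

```python
# Sketch of recoding "Other-specify" write-ins back into existing answer
# categories. The keyword-to-category mapping is entirely invented for
# illustration; actual guidelines came from subject-matter specialists.
GUIDELINES = {
    "cell phone": "personal smart phone, tablet or other wireless handheld device",
    "mobile": "personal smart phone, tablet or other wireless handheld device",
    "office": "work",
}

def recode_other(write_in):
    text = write_in.lower()
    for keyword, category in GUIDELINES.items():
        if keyword in text:
            return category        # fold back into an existing category
    return "Other"                 # qualitatively different responses stay as Other
```

Write-ins that match no guideline remain in the “Other” category, and recurring themes among them can motivate new categories where counts warrant.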

Open-ended questions and standard classifications

An additional 27 questions on the 2017 CSD questionnaire were recorded in a completely open-ended format. These included questions related to the following:

  1. The respondent’s main medical condition which causes them the most difficulty or limits their activities the most (up to two conditions may be reported);
  2. Occupation and industry of work;
  3. Main reason for self-employment;
  4. Barriers in finding work;
  5. Major field of post-secondary study;
  6. Aspects of accessing government services that would be difficult because of the respondent’s condition.

For most of these questions, responses were coded using a combination of automated and interactive (manual) coding procedures. Where applicable, standardized classification systems were used. Coding for standardized classifications involved a team of experienced coders and quality control supervisors. Subject matter experts in data processing applied additional verification procedures. For other variables, the data were reviewed by subject matter specialists using systematic qualitative methods for identifying and coding relevant themes. See Appendix F for more detail on the classification systems used as well as the emergent qualitative codes generated for open-ended questions.
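The combination of automated and interactive coding can be sketched as a two-stage process: automated matching against a classification index first, with unmatched responses routed to coders. The index entries below are illustrative placeholders, not an actual classification extract.

```python
# Two-stage coding sketch: automated exact matching against a
# (hypothetical) classification index, with unmatched write-ins set
# aside for interactive (manual) coding by experienced coders.
CLASSIFICATION_INDEX = {        # placeholder entries for illustration
    "registered nurse": "3012",
    "truck driver": "7511",
}

def code_responses(write_ins):
    coded, needs_manual_review = {}, []
    for text in write_ins:
        key = text.strip().lower()
        if key in CLASSIFICATION_INDEX:
            coded[text] = CLASSIFICATION_INDEX[key]    # automated match
        else:
            needs_manual_review.append(text)           # route to a coder
    return coded, needs_manual_review
```

Quality control supervisors and subject-matter experts then verify samples from both streams, as described above.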

5.7 Consistency edits

A number of edits and imputations were required to ensure that the survey data are consistent and complete. We examined the CSD data to check for inconsistencies between some of the survey variables. The data had already gone through various edits built into the electronic questionnaire. For instance, we programmed edits to compare the age provided for certain questions with the respondent’s age at the time of the interview and to alert the respondent if a discrepancy was found. Other edits were used to avoid receiving impossible values for the number of hours usually worked in a week (which cannot exceed 168, the total number of hours in a week). Warning messages were also programmed to minimize the risk that certain key questions used to determine labour force status would inadvertently be left unanswered.

In addition to these integrated edits, we performed edits and imputations on the data we received to ensure the consistency of a number of important survey variables since identified inconsistencies may not have been corrected during the interview. The edits and imputations applied to the 2017 CSD are described below.

Age of onset of condition and age of limitation of daily activities

The first set of edits pertained to new questions in the DSQ module related to each disability type, which asked about both the age of onset of a difficulty or condition and the age when that difficulty or condition began to limit the respondent’s daily activities. A consistency edit ensured that the age of limitation was not younger than the age at which the difficulty began. The only exception was for developmental disabilities, where the survey asked for the age when the respondent was diagnosed with the condition, as opposed to the age of onset of the condition itself. In this case, the diagnosis could have been made at an older age than the age when the limitation in daily living began, and so no edit was performed. For all other disability types, where there was an inconsistency, the age of difficulty was changed to the age of limitation. This involved approximately 3,000 edited cases (9%).
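The rule above, including the developmental-disability exception, can be sketched as a small edit function (field names are illustrative):

```python
# Sketch of the age-of-onset consistency edit: for disability types other
# than developmental, the age of onset cannot exceed the age when the
# difficulty began to limit daily activities. Names are illustrative.
def edit_ages(age_of_difficulty, age_of_limitation, disability_type):
    """Return the (possibly edited) age of difficulty and an edit flag."""
    if disability_type == "developmental":
        return age_of_difficulty, False   # diagnosis may come later; no edit
    if age_of_difficulty > age_of_limitation:
        return age_of_limitation, True    # align onset with limitation
    return age_of_difficulty, False
```

The edit flag makes it easy to count affected cases, which is how totals such as the roughly 3,000 edits above could be tallied.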

Once these edits were completed, a special program was run to verify the consistency between these two age variables and the respondent’s current age as reported at the beginning of the CSD questionnaire. For example, the age of onset or the age at which the person became limited could not be greater than the respondent’s current age. Fewer than 10 cases showed an inconsistency. In some cases, the age of onset or the age of limitation was corrected and in the other cases, the current age of the respondent was changed. It was often clear that these were typographical errors.

Reference age versus age at interview

The CSD analytical file contains two age variables: the reference age (REF_AGE), which is the respondent’s age on May 10, 2016 (Census Day), and the age at the time of the interview (INT_AGE), which was derived on the day of the CSD interview (between March 1 and August 30, 2017). Both of these age variables were derived so as to be consistent: REF_AGE is always less than or equal to INT_AGE, and the difference between them is at most two years. In general, INT_AGE was considered valid as it was self-reported by the respondent during the interview. However, in cases where respondents did not report their age during the interview, the application defaulted to the respondent’s age as of March 1, 2017, based on the date of birth available on the survey frame.

When we examined the data post-collection, we compared the age reported on the day of the interview (INT_AGE) with the age on May 10, 2016 (REF_AGE) based on the survey frame data. If the gap between REF_AGE and INT_AGE was greater than two years, the date of the CSD interview and the day and month of birth reported in the census were used to deduce the birth year, which was then used to recalculate the respondent’s age as of May 10, 2016 (REF_AGE).
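Once a birth date has been deduced, recalculating the reference age is a completed-years computation as of Census Day. This sketch assumes a full birth date is available, which simplifies the deduction step described above.

```python
from datetime import date

# Sketch of recalculating the reference age as of Census Day
# (May 10, 2016) from a birth date. Assumes a full birth date, which
# simplifies the year-deduction step described in the text.
CENSUS_DAY = date(2016, 5, 10)

def reference_age(birth_date):
    """Age in completed years on Census Day."""
    had_birthday = (CENSUS_DAY.month, CENSUS_DAY.day) >= (birth_date.month,
                                                          birth_date.day)
    return CENSUS_DAY.year - birth_date.year - (0 if had_birthday else 1)
```

A respondent born May 11, 2000 had not yet had their 2016 birthday on Census Day, so their reference age is one year lower than the simple year difference.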

Age in the Census Dissemination Database

The CSD analytical file includes many variables from the 2016 Census long-form questionnaire. When we matched CSD respondents to the Census Dissemination Database, we found that the age of seven CSD respondents in the Census Dissemination Database had been imputed to a value of less than 15 years. This created a problem since several census values for these seven respondents were coded as “not applicable because of age,” even though an interview had been conducted with them and we knew that they were at least 15 years of age. To prevent this inconsistency in the analytical file, we decided to find donors for these seven respondents among the CSD respondents based on province, age group, and disability severity and type, where possible. We then replaced the values of all census variables for these seven respondents with those of their donors. The CSD variables were kept as is.
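Donor selection of this kind can be sketched as a hot-deck style search that relaxes the matching criteria when no exact match exists (suggested by the phrase “where possible”). All field names and the relaxation order are assumptions for illustration.

```python
# Hot-deck style donor search sketch for the seven records described
# above: look for a CSD respondent matching on province, age group and
# disability severity/type, relaxing the match one key at a time if no
# donor is found. Field names and relaxation order are assumptions.
def find_donor(recipient, candidates):
    keys = ["province", "age_group", "severity", "disability_type"]
    while keys:
        matches = [c for c in candidates
                   if all(c[k] == recipient[k] for k in keys)]
        if matches:
            return matches[0]
        keys.pop()      # drop the least important criterion and retry
    return None
```

Once a donor is found, its census variable values are copied onto the recipient record while the recipient’s own CSD variables are left untouched.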

Aids and assistive devices

An inconsistency was found for some respondents who reported that they needed an assistive aid or device that they did not presently have. When asked in a follow-up question which particular aid or device they needed but did not have, they responded “none”. For such cases, the response to the initial question was changed to “No”, indicating there was no aid or assistive device that they needed but did not have. This involved approximately 450 edited cases (1%).

Job tenure and Past job attachment

An inconsistency was found for a small number of respondents who indicated that they started their current job or business in 2017, but the month when they started came later than the month in which the survey was completed. In these few cases, it was assumed that the respondent started their current job or business in the month reported, but in 2016, not 2017; the corresponding values were thus changed to 2016. Similarly, in the module on past job attachment, if a respondent indicated that they last worked at a job or business in 2017, but the month when they last worked came later than the month in which the survey was completed, it was assumed that the respondent last worked in the month reported, but in 2016, not 2017; the corresponding variable was thus changed to 2016. Approximately 15 cases were edited in relation to these variables.
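The year-rollback rule applied in both scenarios can be sketched as a single function (month values as 1 to 12; names illustrative):

```python
# Sketch of the job-dates edit above: a reported 2017 month that falls
# after the month the survey was completed cannot be correct, so the
# year is rolled back to 2016. Applies to both job start and last-worked
# dates; parameter names are illustrative.
def edit_job_year(reported_year, reported_month, interview_month):
    if reported_year == 2017 and reported_month > interview_month:
        return 2016
    return reported_year
```

For example, a job reportedly started in November 2017 on a questionnaire completed in April 2017 is re-dated to November 2016.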

External edits

There are several indicators for labour market activity and level of education in both the CSD and the census whose concepts overlap significantly, so there may be some inconsistency in the data from these two sources. When inconsistencies were found in the 2012 CSD, the census data were suppressed. There was, however, a drawback to doing so: analysts comparing data for people with and without a disability would be faced with suppression rates for people with a disability that were much higher than for people without a disability, since very few of the latter had responded to the corresponding survey questions. To prevent this from occurring in the 2017 CSD, it was decided that inconsistent data would not be suppressed. These inconsistencies can be explained in a number of ways: imputation of the census data, response error caused by proxy responses to the census or survey, response error caused by memory problems or misunderstanding of the question, data being switched between respondents in the same household in the census, and so on. Consequently, we suggest that analysts not attempt to analyze differences between census data and CSD data. Census data remain an important source of information for analyzing the characteristics of people with and without a disability and should be used mainly for this purpose.

5.8 Variable conversion

At this stage, final variable names are established on the file. For example, the letter Q, which appears in all question acronyms, is removed from final variable names. All final variable names must respect an 8-character limit.
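The renaming convention can be sketched as follows. The truncation rule for over-long names is an assumption; only the removal of the Q and the 8-character limit come from the text above.

```python
# Sketch of the variable-name conversion: drop the "Q" from the question
# acronym and enforce the 8-character limit. Truncation as the way of
# meeting the limit is an assumption for illustration.
def final_name(question_name):
    name = question_name.replace("_Q", "_", 1)
    return name[:8]

# e.g., the question acronym IU_Q15 becomes the variable name IU_15
```

This matches the pattern visible in the destringing example in section 5.4, where question IU_Q15 yields variables IU_15A through IU_15F.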

5.9 Derived variables

In order to facilitate more in-depth analysis of the rich CSD dataset, over 130 derived variables (DVs) were created by regrouping or combining answers from one or more questions on the questionnaire. All DV names have a “D” in the first character position of the name for quick identification. The 2017 CSD Data Dictionaries identify all DVs. DVs are also listed by theme in Appendix D along with other survey indicators.

5.10 External census-linked variables

In addition to the CSD variables, approximately 300 census variables were added to the final CSD processing file for 2017 through record linkage. At the outset of the 2017 CSD interview, all respondents were told about the plans to link the CSD survey data with the information that they provided on the census. All linked information is kept confidential and used for statistical purposes only.

For all census variables, the census variable name was preserved as much as possible on the CSD database. Some exceptions applied since CSD variable names are restricted to eight characters whereas census variable names sometimes exceeded eight characters in length. The 2017 CSD Data Dictionaries provide a complete listing of census variables. Highlights of the census variables are provided in section 2.6 above.

The final structure and content of the data files are described in Chapter 6.

