Concepts and Methods Guide
5. Data processing


5.1 Data capture

Responses to survey questions were captured directly by the interviewer at the time of the interview using a computerized questionnaire. A computerized questionnaire reduces processing time and the costs associated with data entry and data transmission, and it reduces transcription errors.

Some editing of data was done directly at the time of the interview. Specifically, where a particular response appeared to be inconsistent with previous answers or outside of expected values, the interviewer was prompted, through message screens on the computer, to confirm answers with the respondent and, if needed, to modify the information.

5.2 Social survey processing steps

Data processing involves a series of steps to convert the electronic questionnaire responses from their initial raw format to a high-quality, user-friendly database involving a comprehensive set of variables for analysis. A series of data operations are executed to clean files of inadvertent errors, rigorously edit the data for consistency, code open-ended questions, create useful variables for data analysis, and finally to systematize and document the variables for ease of analytical usage.

The 2017 APS used a set of social survey processing tools developed at Statistics Canada called the “Social Survey Processing Environment” (SSPE). The SSPE involves SAS software programs, custom applications and manual processes for performing the following systematic steps:

Processing steps (described in the sections that follow):

  1. Receipt of raw data and record clean up
  2. Variable recodes and multiple response questions
  3. Verification of questionnaire flows: response paths, valid skips and question non-response
  4. Coding
  5. Edit and imputation
  6. Derived variables and census linkage
  7. Creation of final data files and Data Dictionary (codebook)

5.3 Receipt of raw data and record clean up

Following the receipt of raw data from the electronic questionnaire applications, a number of preliminary cleaning procedures were implemented for the 2017 APS at the level of individual records. These included the removal of all personal identifier information, such as names and addresses, from the files as part of a rigorous set of ongoing mechanisms for protecting the confidentiality of respondents. Duplicate records were also resolved at this stage. Clean-up procedures further included a review of all respondent records to ensure that each respondent was “in scope” and had a sufficiently completed questionnaire. Note that the criteria used to determine whether a respondent was in scope were applied before any editing or imputation was done. Specific criteria for determining who would and would not be a final APS respondent are provided below.

5.3.1 Definition of a respondent

  1. To be “in scope”, respondents must have been at least 15 years of age as of January 15, 2017 and have met at least one Aboriginal identity criterion (see section 2.2 for complete criteria).
  2. To have a “complete” questionnaire, respondents aged 15 and over must have provided valid responses (i.e., not “Don’t know” or “Refused”) to specified key questions in the area of labour and in either the area of health or the area of education.

Those who did not meet the above criteria were removed from the database. As per the rules above, all “partial” respondents, who were in-scope according to part 1 of the definition but did not fulfill the content-completion requirements in part 2, were among those removed from the final database. Please refer to section 6.4 of this document for more information on partial respondents.
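The two-part respondent definition above can be sketched in code. This is a minimal illustration, not the actual SSPE logic (which was implemented in SAS); the field names (age, identity_criteria_met, labour_ok, health_ok, education_ok) are hypothetical placeholders for the real APS processing variables.

```python
# Sketch of the respondent-definition check, with illustrative field names.
def is_final_respondent(rec: dict) -> bool:
    """A final respondent must be in scope (part 1) AND complete (part 2)."""
    # Part 1: at least 15 years of age and at least one identity criterion met.
    in_scope = rec["age"] >= 15 and rec["identity_criteria_met"] >= 1
    # Part 2: valid answers to key labour questions, plus health OR education.
    complete = rec["labour_ok"] and (rec["health_ok"] or rec["education_ok"])
    return in_scope and complete

# A "partial" respondent: in scope, but key content questions unanswered.
partial = {"age": 32, "identity_criteria_met": 1,
           "labour_ok": True, "health_ok": False, "education_ok": False}
```

Under this sketch, `partial` is in scope but fails the completion test, so the record would be dropped from the final database.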

5.4 Variable recodes and multiple response questions

This stage of processing involved changes at the level of individual variables. Variables could be dropped, recoded, re-sized or left as is. Formatting changes were intended to facilitate processing as well as analysis of the data by end-users. One such change was the conversion of multiple-response questions (“Mark-all-that-apply” questions) to corresponding sets of single-response variables which are easier to use. For each response category associated with the original question, a variable was created with YES/NO response values. An example is provided below.

Original multiple-response question:

LW_Q05 - How did you go about looking for work?

  1. Contacted potential employer(s) directly
  2. Searched the Internet
  3. Through friend(s)/relative(s)
  4. Placed or answered newspaper ad(s)
  5. Contacted public employment agency (Service Canada Centre/Canada Employment Centre, provincial employment centre)
  6. Community bulletin boards/radio
  7. Contacted Aboriginal organization or Aboriginal employment agency
  8. Through co-worker(s)
  9. Was referred by another employer
  10. Was referred by a union
  11. Other - Specify

DK, RF

Final variables in single-response YES/NO format:

LW_Q05A - How did you go about looking for work?

- Contacted potential employer(s) directly

  1. Yes
  2. No

DK, RF

LW_Q05B - How did you go about looking for work?

- Searched the Internet

  1. Yes
  2. No

DK, RF

LW_Q05C - How did you go about looking for work?

- Through friend(s)/relative(s)

  1. Yes
  2. No

DK, RF

LW_Q05K - How did you go about looking for work?

- Other - Specify

  1. Yes
  2. No

DK, RF
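The conversion shown above can be sketched as a small routine. This is an illustrative Python sketch only (the SSPE itself used SAS programs); the one-digit sentinel values for “Don’t know” and “Refused” are simplified stand-ins for the survey’s reserve codes.

```python
# Sketch: explode a mark-all-that-apply answer into per-category Yes/No values.
RESERVE = {"DK": 7, "RF": 8}  # simplified one-digit sentinels for this sketch

def explode_multi_response(selected, n_categories):
    """Map selected category numbers to per-category 1=Yes / 2=No values."""
    if isinstance(selected, str):  # a "DK" or "RF" propagates to every item
        return {k: RESERVE[selected] for k in range(1, n_categories + 1)}
    return {k: (1 if k in selected else 2) for k in range(1, n_categories + 1)}

# A respondent who picked categories 1 and 3 of the 11 LW_Q05 categories:
flags = explode_multi_response({1, 3}, 11)
```

Each key of `flags` corresponds to one of the single-response variables (LW_Q05A, LW_Q05B, and so on), holding 1 for Yes and 2 for No.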

5.5 Flows: response paths, valid skips and question non-response

Another set of data processing procedures for the 2017 APS was the verification of questionnaire flows or skip patterns. All response paths and skip patterns built into the questionnaire were verified to ensure that the universe or “target population” for each question was accurately captured during processing. Special attention was paid to distinctions between valid skips and non-response. These concepts are explained below to help users better understand question universes as well as statistical outputs for APS survey variables.

Response

An answer directly relevant to the content of the question that can be categorized into pre-existing answer categories, including “Other-specify”.

Valid skip

Indicates that the question was skipped because it did not apply to the respondent’s situation, as determined by valid answers to an earlier question. In such cases, the respondent is not considered to be part of the target population or universe for that question. As noted below, where a question was skipped due to an undetermined path (that is, a “Don’t know” or “Refusal” to a previous question caused the skip), the respondent is coded to “Not stated” for that question.

Don’t know

The respondent was unable to provide a response for one or more reasons (due to lack of recall, or because they were responding for someone else, for example).

Refusal

The respondent refused to respond, perhaps due to the sensitivity of the question.

Not stated

This indicates that the question response is missing and there is an undetermined path for the respondent, such as when a respondent did not answer the previous filter question or where an inconsistency was found in a series of responses.

Special codes have been assigned to each of these response types to facilitate user recognition and data analysis. For instance, “valid skip” codes end in “6”, with any preceding digits set to “9” (for example, “996” for a 3-digit variable). All “Don’t know” responses end in “7”, with any preceding digits set to “9” (for example, “997”); refusals end in “8” (for example, “998”); and “Not stated” values end in “9” (for example, “999”). Further, respondents who chose not to share their census data have distinct reserve codes for their census records. These reserve codes differ for each variable, depending on how many categories the variable has and the length of the variable.
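The reserve-code scheme described above is mechanical enough to express in a few lines. A minimal sketch, assuming only the rule stated in the text (final digit identifies the response type, all preceding digits are 9):

```python
# Sketch of the reserve-code scheme: last digit encodes the response type,
# preceding digits are all 9, padded to the variable's width.
LAST_DIGIT = {"valid_skip": 6, "dont_know": 7, "refusal": 8, "not_stated": 9}

def reserve_code(kind: str, width: int) -> str:
    """Build the reserve code for a variable of the given width, e.g. '996'."""
    return "9" * (width - 1) + str(LAST_DIGIT[kind])
```

For a 3-digit variable this yields “996”, “997”, “998” and “999”, matching the examples in the text.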

5.6 Coding

5.6.1 “Other-specify” items

Data processing also includes the coding of “Other-specify” items, also referred to as “write-in responses”. For most questions on the APS questionnaire, pre-coded answer categories were supplied and the interviewers were trained to assign a respondent’s answers to the appropriate category. However, in the event that a respondent’s answer could not be easily assigned to an existing category, many questions also allowed the interviewer to enter a long-answer text response in the “Other-specify” category.

All questions with “Other-specify” categories were closely examined during processing. Based on a qualitative review of the types of text responses given, coding guidelines were developed for each question. Based on these coding guidelines, many of the long answers provided were re-coded back into one of the pre-existing listed categories. Responses that were unique and different from existing categories were kept as “Other”. For some questions, one or more new categories were created when there were sufficient numbers of responses to warrant them. In the case of questions where “Other-specify” responses constituted less than roughly 5% of overall responses to the question, coding was not performed and responses were left in “Other”.

Approximately 18,000 responses across 31 questionnaire items were recorded under “Other-specify” and reviewed for coding. These will be taken into account when refining the answer categories for future cycles of the survey.

5.6.2 Open-ended questions and standard classifications

A few questions on the 2017 APS questionnaire were recorded by interviewers in a completely open-ended format. These included questions related to the respondent’s occupation and industry of work as well as their major field of postsecondary study, where applicable. These responses were coded using a combination of automated and interactive coding procedures. Standardized classification systems were used to code these responses. Appendix C provides details of these classifications.

A standardized classification was also used to code Aboriginal languages that respondents spoke or understood as well as the first language learned in childhood. For languages, interviewers had been provided a comprehensive drop-down menu of languages to choose from, but write-in responses were also captured as needed. Overall, 67 Aboriginal language categories were used to code APS language data.

 

5.7 Edit and imputation

After the coding stage of processing, a series of customized edits were performed on the data. These consisted of validity checks within and across variables to identify gaps, inconsistencies, extreme outliers and other problems in the data. To resolve the problematic data identified by the edits, corrections were performed based on logical edit rules. In some cases, corresponding data were taken from the respondent’s answers to the census. This is referred to as imputation.

An example of a validity check within a single question involves the variable recording the number of jobs a respondent held in the last week, for which an interviewer could record a minimum of 2 and a maximum of 20 jobs. To remove outlier responses suspected of being invalid, an edit was built to ensure that the reported number of jobs did not exceed the upper limit of 20.
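A within-question range edit of this kind can be sketched as follows. This is an illustrative Python sketch under the stated bounds of 2 to 20 jobs, not the actual SSPE edit (which was implemented in SAS); the failure flag is a hypothetical convention.

```python
# Sketch of a within-question validity edit with the stated bounds (2 to 20).
MIN_JOBS, MAX_JOBS = 2, 20

def edit_multiple_jobs(n_jobs):
    """Return n_jobs if within the allowed range, else flag it for resolution."""
    if n_jobs is None:
        return None                 # missing values are handled separately
    if MIN_JOBS <= n_jobs <= MAX_JOBS:
        return n_jobs
    return "FAIL_EDIT"              # hypothetical flag for out-of-range values
```

Values flagged in this way would then be resolved by the edit rules or imputation described in this section.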

Additionally, many consistency edits across questions were performed to avoid any contradictions. For example, a person who had not reported ever having attended a specific postsecondary educational institution such as a university, a trade school, a college, CEGEP or other non-university institution, and then subsequently reported currently working toward a certificate, diploma or degree from one of these institutions, was assumed to have attended that type of institution. The response to the earlier question was changed from a “NO” to a “YES” for the specific type of institution where the edit was required.
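The cross-question consistency edit described above can be sketched as follows. The record field names (attended_university, working_toward_college, and so on) are illustrative placeholders, not the actual APS variables, and the real edit covered additional institution types such as CEGEP.

```python
# Sketch of a cross-question consistency edit: if a credential is in progress
# at an institution type the respondent reportedly never attended, correct the
# earlier attendance answer from NO to YES.
def consistency_edit(rec: dict) -> dict:
    fixed = dict(rec)
    for inst in ("university", "trade_school", "college"):
        if fixed.get(f"working_toward_{inst}") == "YES":
            fixed[f"attended_{inst}"] = "YES"
    return fixed
```

Applying this to a record reporting a university credential in progress but no university attendance flips the attendance answer to YES, as the text describes.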

For the 2017 APS, a series of important imputations was conducted in relation to Aboriginal identity classifications. These imputations were the following:

  1. First, those with missing data for question ID_Q25 on Status Indian or question ID_Q30 on membership in a First Nation or Indian band had values imputed from their responses to the equivalent questions in the census;
  2. Next, those with missing data for question ID_Q05 on Aboriginal self-identification would not have been asked the next question ID_Q10 on Aboriginal identity group. Due to the APS respondent definition, these respondents would have had to identify as either a Status Indian in ID_Q25 or a member of a First Nation or Indian band in ID_Q30 in order to be considered an APS respondent. If these respondents had self-identified as Aboriginal persons on the census, then they were imputed to have Aboriginal identity in question ID_Q05 and their Aboriginal identity group(s) for question ID_Q10 were imputed from their identity group(s) on the census;
  3. Next, those with missing data for question ID_Q10 who nevertheless identified with any of the Aboriginal identity groups on the census were imputed values for ID_Q10 based on their census identity group(s);
  4. Next, respondents who self-identified as Aboriginal persons in ID_Q05 but had missing data for ID_Q10, did not identify as Status Indians nor members of a First Nation or Indian band, and did not self-identify as Aboriginal persons on the census but identified as having Aboriginal ancestry on the census, had values for ID_Q10 imputed from their Aboriginal ancestry group on the census;
  5. Finally, those who had missing data for question ID_Q05 or question ID_Q10 but who identified as being Status Indians in ID_Q25 or members of a First Nation or Indian band in ID_Q30, and did not self-identify as Aboriginal persons on the census, were imputed as not having Aboriginal identity in ID_Q05. These persons are still considered to be APS respondents due to their affirmative response(s) in ID_Q25 and/or ID_Q30 and the APS respondent definition, and in the derived variable for Aboriginal identity they are grouped as “Aboriginal responses not included elsewhere”.
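The first step of the imputation cascade above can be sketched in simplified form. This is a hedged illustration only: the real imputation involved the full five-step cascade and census linkage machinery, and the use of `None` for missing values is a convention of this sketch.

```python
# Simplified sketch of imputation step 1: missing ID_Q25/ID_Q30 values are
# filled from the respondent's equivalent census responses (None = missing).
def impute_from_census(aps: dict, census: dict,
                       fields=("ID_Q25", "ID_Q30")) -> dict:
    out = dict(aps)
    for f in fields:
        if out.get(f) is None and census.get(f) is not None:
            out[f] = census[f]      # donor value comes from the census record
    return out
```

Subsequent steps of the cascade (for ID_Q05 and ID_Q10) apply analogous donor rules, but conditioned on the combinations of census identity and ancestry responses enumerated above.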

Finally, although all of these edits across topics were performed systematically using computer-programmed edits, some cases involving very complex combinations of information were reviewed and corrected manually.

5.8 Derived variables and Census linkage

In order to facilitate more in-depth analysis of the rich APS dataset, approximately 240 derived variables were created by combining items on the questionnaire. Derived variables (DVs) were created across all major content domains. In addition, approximately 230 variables from the 2016 Census were linked to the final 2017 APS analytical file.

Some simple DVs involved the collapsing of categories into broader categories. In other cases, two or more variables were combined to create a new or more complex variable useful for data analysts. Some of the DVs were based on linked variables from the census, including multiple census geographies and Inuit regions. Aboriginal ancestry was also taken from the census, since it is not measured directly by the 2017 APS. If a respondent refused census linkage, their data are suppressed in census and census-based variables.

For most DVs, there is a residual category labelled “Not stated” for when the responses to the DV source questions do not meet the conditions to place a respondent in any of the valid categories for the DV. In many, but not all cases, a respondent is included in the “Not stated” category if any part of the equation was not answered (that is, if any question used in the DV had been coded to “Don’t know”, “Refused” or “Not stated”).
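The “Not stated” rule for DVs can be sketched as a guard around the derivation logic. This is an illustrative sketch of the rule as stated for the common case (any reserve response in a source question sends the DV to “Not stated”); as the text notes, not all DVs follow it.

```python
# Sketch of the common "Not stated" rule for derived variables: if any source
# question carries a reserve response, the DV itself becomes "Not stated".
RESERVE_RESPONSES = {"Don't know", "Refused", "Not stated"}

def derive(sources: list, combine) -> str:
    """Apply combine() to the source answers unless any is a reserve response."""
    if any(s in RESERVE_RESPONSES for s in sources):
        return "Not stated"
    return combine(sources)

# Hypothetical two-source DV that simply concatenates its inputs:
value = derive(["YES", "Refused"], lambda s: "/".join(s))
```

Here `value` falls to “Not stated” because one source question was refused, while two valid answers would combine normally.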

Most DV names have a “D” in the first character position of the name. One exception is the geography DVs, since they reflect the corresponding census variable names. The other exceptions are the DVs indicating levels 1, 2 and 3 of the North American Industry Classification System (NAICS) Canada 2017, based on responses to the APS industry questions, and levels 1, 2, 3 and 4 of the National Occupational Classification (NOC) 2016, based on responses to the APS occupation questions. For all linked census variables, the census variable name was preserved as much as possible on the APS database. However, some census variables had to be renamed, since APS variable names are restricted to eight characters whereas some census variable names exceed this limit. In these cases, a note in the Data Dictionary indicates the original census variable name from which it was shortened.

The 2017 APS Data Dictionary identifies in detail which variables were derived and indicates the source variables from which the DVs were derived. Highlights of DVs are listed by theme in Appendix A along with other survey indicators. A complete list of linked census variables and their accompanying notes are provided in the 2017 APS Data Dictionary which accompanies the APS analytical file.

5.9 Creation of final data files and Data Dictionary (codebook)

Four final data files are created in data processing:

  1. the final processing file
  2. the analytical file
  3. the public use microdata file (PUMF)
  4. the Inuit share files

The final processing file is an in-house file that includes a number of temporary variables used exclusively for processing purposes. The analytical file, PUMF and the Inuit share files are dissemination files which are processed further for release purposes. Dissemination files are scheduled for distribution at various points in time following the APS release day, November 26, 2018.

The analytical file is distributed in Research Data Centres (RDCs) across Canada but can only be accessed by researchers who fulfill certain requirements. The analytical file is also used at Statistics Canada to produce data tables in response to client requests. The PUMF is constructed for wider public distribution. The Inuit share files are produced in accordance with data sharing agreements with the Inuit regions: Nunatsiavut, Nunavik, Nunavut and the Inuvialuit region. On all of these dissemination files, many steps are taken to ensure respondent confidentiality.

In order to transform the final, cleaned processing file to a final analytical file for researchers, a number of steps were taken. First, several measures were taken for the enhanced protection of respondent confidentiality. Next, person-level weights were added to the file. Finally, all temporary variables or variables used exclusively for processing purposes were removed from the final processing file.

Accompanying the 2017 APS analytical file are the record layout; SAS, SPSS and Stata syntax to load the file; and metadata in the form of a Data Dictionary that describes each variable and provides weighted and unweighted frequency counts.

In order to ensure the non-disclosure of confidential information, the level of detail of the PUMF is not as exhaustive as that of the analytical files kept by Statistics Canada. Actions are taken to protect against the recognition of respondents with potentially identifiable combinations of characteristics. These protective actions include the restriction of geographies included in the file, adjustments to survey weights, review of overlaps with other PUMFs being published, exclusion of variables, grouping of categories for some variables, capping of some extreme numerical values, as well as identification of unique records at risk and rare occurrences.

