3.4 Processing
3.4.1 Coding


Coding is any process that assigns a value (code) to a response. Coding entails either assigning a code directly to a given response or comparing the response to a set of codes and selecting the one that best describes it. The code can be a numeric value or a character string. There may be several ways to do this translation, and the coding approach chosen affects both the quality and the cost of the data produced.

Questionnaires usually have two types of questions—closed questions and open questions. The responses to these questions affect the type of coding performed. The following question is an example of a closed question:

To what degree is sport important in providing you with the following benefits?

(1) Very important

(2) Somewhat important

(3) Not important

The following question is an example of an open question:

What sports do you participate in?

Specify ______________

In the case of closed questions, the response categories are determined before collection, with the numerical code usually appearing on the questionnaire beside each response category. For open questions, coding occurs after collection and may be either manual or automated. For some questions, coding may be straightforward (e.g. marital status). In other cases, such as geography, industry and occupation, a standard coding system is strongly recommended when available. But for many questions where no standard coding system exists, determining a good coding scheme is a non-trivial task.
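As a minimal sketch of the closed-question case, the response categories and numerical codes from the sport example above can be held in a lookup table that is fixed before collection. The names CLOSED_CODES and code_closed_response are hypothetical and are used here only for illustration.

```python
# Hypothetical codebook for the closed question shown earlier. The codes are
# fixed before collection and appear beside each category on the questionnaire.
CLOSED_CODES = {
    "Very important": 1,
    "Somewhat important": 2,
    "Not important": 3,
}


def code_closed_response(category):
    """Return the numerical code for a selected category.

    Returns None when the category is not part of the codebook, so that
    unexpected values can be reviewed rather than silently coded.
    """
    return CLOSED_CODES.get(category)
```

An open question such as "What sports do you participate in?" has no such table at collection time, which is why its coding takes place afterwards.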

Automated coding systems

Manual coding requires interpretation and judgment on the part of the coder, and may vary between coders. Due to advances in technology, resource constraints and, most importantly, concerns about timeliness and quality, coding is becoming more and more automated.

In general, two files are input to an automated coding system. One file, referred to as the input file, contains the survey responses or administrative records that are to be coded. The other file, called the reference file, contains the predetermined code set. For each record in the input file, a search is performed in the reference file. If a match is found, the code in the reference file is assigned to the corresponding record from the input file. Otherwise, the code is left blank. The main advantages of an automated coding system are that the process becomes faster, more consistent and more economical.
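The matching step described above can be sketched as follows. This is an illustration under stated assumptions, not the system used by any particular survey: the reference file is represented as a dictionary of descriptions and codes, the sport codes are made up, and the normalize helper (lower-casing and collapsing spaces before matching) is an added assumption that the text does not specify.

```python
def normalize(text):
    """Lower-case and collapse whitespace (an assumed pre-matching step)."""
    return " ".join(text.lower().split())


def auto_code(input_records, reference):
    """Assign a code to each input record by exact lookup in the reference file.

    Records with no match keep a blank code (None).
    """
    lookup = {normalize(desc): code for desc, code in reference.items()}
    coded = []
    for record in input_records:
        code = lookup.get(normalize(record["response"]))
        coded.append({**record, "code": code})
    return coded


# Hypothetical reference file: description -> code (illustrative values only).
reference = {"ice hockey": "S01", "soccer": "S02", "swimming": "S03"}

# Hypothetical input file: one written response per record.
input_records = [
    {"id": 1, "response": "Soccer"},
    {"id": 2, "response": "  ice   hockey "},
    {"id": 3, "response": "underwater basket weaving"},
]

coded = auto_code(input_records, reference)

# Records left blank are the ones that would still need a code.
uncoded = [r for r in coded if r["code"] is None]
```

An exact lookup keeps the sketch simple and makes the result reproducible from one run to the next. A production system would likely need more elaborate matching rules, which are outside the scope of this example.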

There are already many automated systems in use at Statistics Canada. For example, the Labour Force Survey data files are collected from the Regional Offices of Statistics Canada and are run through an automated coding system that assigns industry and occupation codes based on the North American Industry Classification System (NAICS) and the National Occupational Classification (NOC). Only the rejected records (those whose written response has no match in the reference file) have to be coded manually.

Recently, machine learning techniques have been used for Statistics Canada’s Business Register to help assign industry codes from business names and business addresses. This improves the coverage of the Business Register, which is the sampling frame for the majority of business surveys at Statistics Canada, and ultimately improves the data quality of many business surveys.

