# 2 Sources of data 2.2 Type of data


There are many methods to collect data, but agencies like Statistics Canada primarily use three methods of data collection: censuses, sample surveys, and administrative data. This section presents the advantages and disadvantages of each, then describes other methods of data gathering.

## Census

In general, census refers to data collection about every unit in a group or population. If you collected data about the height of everyone in your class, that would be regarded as a census of your class. Censuses are often used not only to collect data about all units of a population, but also to list and count all units of a population. If you wanted to know how many people live in your street, you would need to list all of the dwellings in the street and then all people living at each of these dwellings. As you do so, you could decide to collect other information, such as age, sex and mother tongue. That would allow you to count how many men, women and children are living in your street. So, a census would be a straightforward means to count the number of units and to produce statistics on their characteristics as well.
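The street example above can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the records, the field names and the `adult_age` threshold of 18 are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical census of one street: one record per resident, listed by dwelling.
street_census = [
    {"dwelling": 1, "age": 42, "sex": "F", "mother_tongue": "English"},
    {"dwelling": 1, "age": 45, "sex": "M", "mother_tongue": "English"},
    {"dwelling": 1, "age": 9, "sex": "F", "mother_tongue": "English"},
    {"dwelling": 2, "age": 67, "sex": "M", "mother_tongue": "French"},
    {"dwelling": 3, "age": 31, "sex": "F", "mother_tongue": "Punjabi"},
    {"dwelling": 3, "age": 4, "sex": "M", "mother_tongue": "Punjabi"},
]


def classify(person, adult_age=18):
    """Assign a resident to one group; the age threshold is an assumption."""
    if person["age"] < adult_age:
        return "children"
    return "women" if person["sex"] == "F" else "men"


# Because every unit is listed, counts are obtained directly.
population_count = len(street_census)
dwelling_count = len({p["dwelling"] for p in street_census})
group_counts = Counter(classify(p) for p in street_census)

print(population_count, dwelling_count, dict(group_counts))
```

Since every resident is in the list, the same records give both the count of units and statistics on their characteristics.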

Here are some advantages and disadvantages of using a census compared to using a sample survey:

No sampling variability: There is no sampling variability attributed to the statistics produced from a census because they are calculated using the entire population.

High level of details: With a census, you would be able to produce statistics for small sub-groups of the population, as long as you collected the right classification variables.

Direct estimation of counts: A census allows for the direct estimation of the population counts, although some adjustments may be considered for units that couldn’t be reached.

High cost: Conducting a census of a large population can be very expensive.

Timeliness: A census generally takes longer to conduct than a sample survey, which means the lag between the reference date and the release of results could be much larger.

High response burden: Information needs to be received from every member of the population.

Less control on quality: If the size of the population is much larger than the sample size of a survey and resources are limited, it is possible that some compromises on quality control will be necessary. For example, it is possible that only part of the non-respondents will be reached for nonresponse follow-up.

Less detailed information: Due to the cost, response burden and scale of activities needed to carry out a census of a large population, the variables that can be measured for each unit are sometimes limited to a short list of identification and classification variables.

## Sample survey

A survey is any activity to collect information in an organized and methodical manner about the characteristics of the units of a population. At Statistics Canada, surveys use well-defined concepts, methods and procedures that will be described in the third section of this resource. A census can be seen as a type of survey, but the term survey is more often used to refer to a sample survey, which means a survey where the information is collected for some units of the target population only. If you collected data about the height of 10 students in a class of 30, that would be a sample survey of the class rather than a census. But ideally, you would want to select them randomly to make sure the 10 students are representative of all students in the class.
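The class example can be illustrated with a short Python sketch that compares a census of 30 students with a simple random sample of 10. The heights are simulated and the seed is arbitrary; both are assumptions made only so the example runs on its own.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical heights (cm) for a class of 30 students.
class_heights = [round(rng.gauss(165, 8), 1) for _ in range(30)]

# Census: measure every student in the class.
census_mean = statistics.mean(class_heights)

# Sample survey: measure 10 students selected at random, without replacement.
sample = rng.sample(class_heights, k=10)
sample_mean = statistics.mean(sample)

print(f"Census mean (30 students): {census_mean:.1f} cm")
print(f"Sample mean (10 students): {sample_mean:.1f} cm")
```

Random selection gives every student the same chance of being picked, which is what makes the 10 students a fair basis for describing the whole class.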

Here are some advantages and disadvantages of using a sample survey compared to using a census:

Lower cost: A sample survey costs less than a census because data is collected from only part of a population.

Faster results: Results are obtained far more quickly as fewer units need to be contacted and fewer data need to be processed.

Lower response burden: Fewer people have to respond to the survey.

More control on quality: The smaller scale of operations allows for better monitoring and quality control.

Sampling variability: If you selected multiple samples from the same population and computed statistics for each of those samples, the results would be a bit different from one sample to another. This source of uncertainty must be taken into account in the estimation of statistics from a sample survey.

Lower level of details: The sample may not be large enough to produce information about sub-groups of the population or small geographical areas.
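Sampling variability, described in the list above, can be made visible with a small simulation. The sketch below draws repeated simple random samples from a simulated population and measures how much the sample means differ from one sample to another. The population, the sample sizes and the number of repetitions are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 heights (cm).
population = [rng.gauss(165, 8) for _ in range(1000)]


def sample_means(sample_size, n_samples=200):
    """Means of repeated simple random samples drawn without replacement."""
    return [
        statistics.mean(rng.sample(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Spread (standard deviation) of the sample means for each sample size.
spreads = {
    size: statistics.pstdev(sample_means(size))
    for size in (10, 100, 1000)
}

for size, spread in spreads.items():
    print(f"n={size:>4}: spread of sample means = {spread:.3f} cm")
```

The spread shrinks as the sample grows. When the sample size equals the population size, every "sample" is the whole population, so the spread is zero: this is the census case, which has no sampling variability.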

## Administrative data

Administrative data is collected as a result of an organization's day-to-day operations. Examples include data on births, deaths, tax, car registrations and transactional data. These administrative files can be used as a substitute for a survey or to support surveys (as sampling frames, for imputation, to add new variables, etc.).

Here are some advantages and disadvantages of using administrative data compared to using a census or a sample survey:

Lower cost: Using administrative data is less expensive than censuses and surveys because there are no collection operations.

No sampling variance: There is no sampling variability attributed to the statistics produced from administrative data because they are calculated using entire groups of the population.

Time series: Data are collected on an ongoing basis, allowing for trend analysis.

No response burden: Since the data are already collected, there is no additional burden on the respondents.

High level of details: With administrative data, you would be able to produce statistics for small sub-groups of the population or small geographic areas, as long as the right classification variables are present in the file and the subgroups have good coverage (i.e. most units in the subgroups are present in the file).

Less flexibility: Unlike a survey, data users have limited control over which variables are collected. In some cases, the variables are limited to a few essential pieces of administrative information.

Lower coverage: Data is limited to the population on whom the administrative records are kept. Most of the time, this population differs from the target population, which results in both under-coverage and over-coverage.

Comparability over time: Definitions are created to serve specific purposes, but often change and evolve over time.

Concepts and definitions: The definitions are established by those who create and manage the file for their own purposes and these definitions might not be fit for use in another context.

Data quality: Data quality can be different from one data provider to another because they don’t give the same importance to the different dimensions of quality.

Ethics: With censuses and sample surveys, respondents are aware of what data is being collected. They usually give consent for that data to be used, since the vast majority of surveys are voluntary. With administrative data, it would be difficult to inform all units in the data set and request their consent. This means individuals and organizations that use administrative data to produce statistical information have a greater responsibility to ensure that data are used in ways that will benefit society and that they have considered data ethics in all steps of the process.
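The coverage issue mentioned in the list above can be made concrete with set operations. In this minimal sketch, the unit identifiers are invented for illustration: `target_population` stands for the units you want to describe and `admin_file` for the units that appear in an administrative file.

```python
# Hypothetical identifiers for the target population and for the units
# actually present in an administrative file.
target_population = {"A", "B", "C", "D", "E", "F", "G", "H"}
admin_file = {"A", "B", "C", "D", "E", "X", "Y"}

covered = target_population & admin_file         # in scope and on the file
under_coverage = target_population - admin_file  # in scope but missing from the file
over_coverage = admin_file - target_population   # on the file but out of scope

coverage_rate = len(covered) / len(target_population)

print(f"Covered: {sorted(covered)}")
print(f"Under-coverage: {sorted(under_coverage)}")
print(f"Over-coverage: {sorted(over_coverage)}")
print(f"Coverage rate: {coverage_rate:.1%}")
```

In practice the target population is rarely available as a complete list, so coverage usually has to be estimated rather than computed exactly as it is here.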

## Alternative sources of data

These sources of data are increasingly being used in the production of statistical information to replace or complement traditional methods.

Crowdsourcing involves collecting information from a large community of users and relies on the principle that citizens are the experts of their local environment. Prior to the legalization of cannabis in 2018, the Canadian government needed information on the size and activity of the existing black market for dried cannabis. This information was difficult to gather using a traditional sample survey. For one thing, the characteristic being measured was rare; a probabilistic sample would include many nonusers since they outnumber cannabis users in the Canadian population. Furthermore, some respondents may be hesitant to report the details of their cannabis consumption to an interviewer. Statistics Canada therefore decided to use crowdsourcing to gather the information. The agency established StatsCannabis, an anonymous web application that allowed cannabis users to report information about their previous purchases. The Canadian government subsequently used this information to help plan the transition to a legal market for cannabis.

Web scraping is a process through which information is gathered and copied from the web for further analysis. Starting January 2021, web scraped data are being used to model the price of computers and laptops in the Computer Equipment, Software and Supplies Index of the Consumer Price Index. This change in data collection method aims to improve the coverage of products available on the market and the timeliness of the information on their prices, considering rapid changes to the digital economy. As with administrative data, users of web scraped data have a greater responsibility to consider the ethics of data collection and to follow best practices to avoid inadvertent collection of personal information.
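To give a sense of what extracting prices from a web page involves, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the product names and the class names are hypothetical, and this is not a description of how Statistics Canada collects its price data. A real scraper would also fetch pages over the network and would need to respect each site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical product listing standing in for a downloaded web page.
PAGE = """
<ul>
  <li class="product"><span class="name">Laptop A</span><span class="price">$899.99</span></li>
  <li class="product"><span class="name">Laptop B</span><span class="price">$1,249.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect (name, price) pairs from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # "name" or "price" while inside such a span
        self.current_name = None
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field == "name":
            self.current_name = data.strip()
        elif self.current_field == "price":
            # Strip the currency symbol and thousands separator before converting.
            price = float(data.strip().lstrip("$").replace(",", ""))
            self.products.append((self.current_name, price))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


parser = PriceParser()
parser.feed(PAGE)
print(parser.products)
```

Only product names and prices are extracted here; keeping the extraction this narrow is one simple way to avoid picking up personal information by accident.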

Remote sensing is the acquisition of information about an object or phenomenon from a distant point. Statistics Canada uses remote sensing for its Crop Condition Assessment Program. The growth of vegetation on Canadian farms is observed on a weekly basis using satellite imagery; the data typically become available the same day that the satellite images are processed, allowing near-real-time monitoring of Canadian agriculture. This program provides valuable information while reducing collection costs and easing response burden on producers. Other examples of remote sensing include weather radar systems that track storms, and seismogram arrays that monitor vibrations in the earth.

Statistical registers are data sets created for statistical purposes that are continuously updated with information about all units of a population. They are usually created by integrating multiple data sources through record linkage, and they use algorithms or machine learning techniques to consolidate data and derive new variables. Statistics Canada’s Business Register is an example of a statistical register. It is continuously updated from enterprise tax data and sample survey data, and it is used as a sampling frame for a large number of business surveys and to produce the semi-annual Canadian Business Counts.
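The idea of integrating sources through record linkage can be sketched with a simple exact match on a shared identifier. Everything in this example is hypothetical, including the identifiers, the business names and the derived variable. Real registers typically rely on more elaborate linkage methods than an exact key match.

```python
# Hypothetical sources keyed by a shared business identifier.
tax_records = {
    "BN001": {"name": "Maple Bakery", "revenue": 250_000},
    "BN002": {"name": "North Tools", "revenue": 1_200_000},
}
survey_records = {
    "BN002": {"employees": 12},
    "BN003": {"employees": 3},
}


def link_records(*sources):
    """Combine all sources into one record per identifier (exact-match linkage)."""
    register = {}
    for source in sources:
        for unit_id, fields in source.items():
            register.setdefault(unit_id, {}).update(fields)
    return register


register = link_records(tax_records, survey_records)

# Derive a new variable wherever the linked record has both inputs.
for record in register.values():
    if "revenue" in record and "employees" in record:
        record["revenue_per_employee"] = record["revenue"] / record["employees"]

for unit_id, record in sorted(register.items()):
    print(unit_id, record)
```

Units that appear in only one source stay in the register with partial information, which shows why combining several sources improves both coverage and the set of available variables.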

Finally, open data and big data are other terms used to describe some types of data. Open data are structured, machine-readable data that are freely shared and that can be used without restrictions. Big data refers to data sets that have such a large number of records and variables that traditional software cannot process them within a reasonable time. They are also characterized by the three “Vs”: volume, variety and velocity.

﻿