# 1 Data, statistical information and statistics: 1.1 Definitions

Data, statistical information and statistics are closely related, but understanding the key differences between these concepts is important for anyone who needs to navigate the ever-rising ocean of information produced by modern society. Data are the raw materials for producing statistical information, of which statistics are a specific type.

## Data

Data are facts, figures, observations, or recordings that can take the form of images, sounds, text or physical measurements (e.g., distance, weight, wavelength). Data can be gathered and processed in order to form conclusions. Data can come from many sources, and they can be split into two groups based on the form they take: structured data and unstructured data.

Structured data are data that are organized into pre-defined items, each of which relates to a specific concept. A set of data gathered using a questionnaire or other fillable form is a good example of structured data: the questions on a questionnaire represent separate, well-defined concepts. In the case of a closed question, the answer will fit in one of several pre-defined categories. For an open question, the answer may take the form of text or a numerical value. If an answer was recorded for each question, the data are complete. If not, there are missing values.
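The idea of complete records and missing values can be sketched in a few lines of Python. The questionnaire below is entirely hypothetical (the variable names `age_group`, `commute_km` and `comment` are assumptions made for illustration), and `None` is used here as one possible way of marking an unanswered question.

```python
# Hypothetical questionnaire responses: one dictionary per respondent.
# "age_group" is a closed question, "commute_km" and "comment" are open ones.
# None marks a question that was left unanswered (a missing value).
responses = [
    {"respondent": 1, "age_group": "25-34", "commute_km": 12.5, "comment": "Bus is late"},
    {"respondent": 2, "age_group": "35-44", "commute_km": None, "comment": None},
    {"respondent": 3, "age_group": None, "commute_km": 3.0, "comment": "I walk"},
]


def missing_counts(records):
    """Count the missing (None) values for each variable across all records."""
    counts = {}
    for record in records:
        for variable, value in record.items():
            counts[variable] = counts.get(variable, 0) + (value is None)
    return counts


def is_complete(record):
    """A record is complete when every question has a recorded answer."""
    return all(value is not None for value in record.values())


print(missing_counts(responses))
print([r["respondent"] for r in responses if is_complete(r)])
```

In this sketch only the first respondent answered every question, so only that record is complete.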

For example, consider how each column in Table 1.1.1 on Canadian universities relates to a single, separate concept:

Table 1.1.1
Example of structured data

| Name of institution | City | Province | Established | Number of students |
| --- | --- | --- | --- | --- |
| Université Laval | Quebec | QC | 1852 | 43,000 |
| University of Waterloo | Waterloo | ON | 1955 | 30,000 |
| Dalhousie University | Halifax | NS | 1818 | 18,000 |
| Simon Fraser University | Burnaby | BC | 1965 | 30,000 |

Each row includes the values for one observation unit for which information was collected. Rows are referred to as observations or records. Concepts presented in each column are often called variables. Data sets are groupings of data that have common definitions of observation units and variables.

In order to be processed and analyzed, structured data need to be compiled in a digital data structure that naturally aligns with pre-defined concepts or variables, such as a spreadsheet, a database or a delimited text file. The data can then be read by statistical software that allows the data user to transform and summarize the data, to perform mathematical operations on them or to visualize them.
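As a minimal sketch of this step, the Python snippet below reads Table 1.1.1 from a comma-delimited text and turns each row into a record. The column names (`name`, `city`, `province`, `established`, `students`) are shortened labels chosen for this illustration, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Table 1.1.1 written as comma-delimited text. The first line names the
# variables; each following line is one observation (record).
raw = """name,city,province,established,students
Université Laval,Quebec,QC,1852,43000
University of Waterloo,Waterloo,ON,1955,30000
Dalhousie University,Halifax,NS,1818,18000
Simon Fraser University,Burnaby,BC,1965,30000
"""

reader = csv.DictReader(io.StringIO(raw))
records = []
for row in reader:
    # Values are read as text, so numeric variables are converted explicitly.
    row["established"] = int(row["established"])
    row["students"] = int(row["students"])
    records.append(row)

variables = reader.fieldnames  # the column names, i.e. the variables

# Two simple operations on the structured data.
oldest = min(records, key=lambda r: r["established"])
total_students = sum(r["students"] for r in records)

print(variables)
print(oldest["name"], total_students)
```

Because every value sits in a named column, operations such as finding the oldest institution or summing the number of students need no further preparation of the data.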

Unstructured data are any data that are not arranged according to a pre-defined model. To produce statistical information based on unstructured data, additional processing is needed to organize the information contained in the data. Table 1.1.2 presents examples of how text, images and sounds can be transformed into structured data that can be used for text analysis and for pattern and speech recognition.

Table 1.1.2
Transforming unstructured data into structured data

| Unstructured data | Processing | Structured data |
| --- | --- | --- |
| A text | Parsing, to split the text into a list of words; aggregation, to count how many times the same word occurs; use of dictionaries and rules to classify words. | A spreadsheet: each row holds one distinct word, and the three columns present the word, the number of occurrences and the category of the word. |
| An image | Assignment of RGB values to pixels; segmentation of the image into blocks of pixels based on red (R), green (G) and blue (B) components. | A database: each record is a group of pixels and the variables summarize the colour components in each group. |
| A recording of someone’s voice | Segmentation of the recording into distinct sounds; measurement of duration and frequencies. | A list of segments with duration and frequencies. |
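The first row of Table 1.1.2 can be sketched in Python as follows. This is only an illustration under simple assumptions: words are split with a basic regular expression, and `CATEGORIES` is a tiny hypothetical dictionary standing in for the dictionaries and rules a real text analysis would use.

```python
import re
from collections import Counter

# A short piece of unstructured text (made up for this example).
text = "Data are facts. Facts become information when data are organized."

# Hypothetical classification dictionary; words not listed fall into "other".
CATEGORIES = {
    "data": "noun",
    "facts": "noun",
    "information": "noun",
    "are": "verb",
    "become": "verb",
    "organized": "verb",
}


def text_to_table(text):
    """Return one row per distinct word: (word, occurrences, category)."""
    # Parsing: split the lower-cased text into a list of words.
    words = re.findall(r"[a-z']+", text.lower())
    # Aggregation: count how many times each word occurs.
    counts = Counter(words)
    # Classification: look each word up in the dictionary.
    return [
        (word, n, CATEGORIES.get(word, "other"))
        for word, n in sorted(counts.items())
    ]


table = text_to_table(text)
for row in table:
    print(row)
```

The result has the shape described in the table: one row per distinct word and three columns for the word, its number of occurrences and its category.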

With the increased use of computers and smartphones in all areas of our lives, a huge part of the digital data being created now are unstructured. Assessing the potential of these data and creating innovative ways of gathering, processing and analyzing them in order to produce valuable statistical information is one of the great challenges of the data revolution.

But what is the difference between statistical information and data?

## Statistical information

Statistical information is data that have been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges. Statistical information communicated to users should help them understand the story told by the data and convey the quality of the information presented. Statistical information can be presented in various formats: texts, tables, graphs, infographics, videos, or even databases.

Many examples of statistical information produced at Statistics Canada will be presented on the next page, but it is first important to understand one major part of the process of producing statistical information from data: the use of statistics!

## Statistics

In general, statistics relate to numerical data; in fact, the term “statistics” can refer to the science of dealing with numerical data itself. Statistics are also a type of information obtained through mathematical operations on data. Above all, statistics aim to provide useful information by means of numbers.

The statistics most commonly used to report statistical information are called descriptive statistics. For numeric variables, measures of central tendency provide the value that is the most representative of the units found in a data set. Measures of dispersion describe the spread of the data around the central tendency. For categorical variables, frequency distributions are used to summarize the data. Proportions, ratios and rates are also useful statistics for analyzing the data.
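These descriptive statistics can be computed with Python's standard library, as in the sketch below. The student counts come from Table 1.1.1; the `regions` list is a hypothetical categorical variable added only to illustrate a frequency distribution, and the population standard deviation is just one possible choice of dispersion measure.

```python
import statistics
from collections import Counter

# Numeric variable: number of students, taken from Table 1.1.1.
students = [43000, 30000, 18000, 30000]

# Measures of central tendency.
mean_students = statistics.mean(students)
median_students = statistics.median(students)

# Measures of dispersion.
range_students = max(students) - min(students)
stdev_students = statistics.pstdev(students)  # population standard deviation

# Categorical variable: a hypothetical region label for each institution.
regions = ["Central", "Central", "Atlantic", "West"]

# Frequency distribution and the corresponding proportions.
frequencies = Counter(regions)
proportions = {region: n / len(regions) for region, n in frequencies.items()}

print(mean_students, median_students, range_students, round(stdev_students))
print(dict(frequencies), proportions)
```

Here the mean (30,250) and the median (30,000) are close to each other, while the range of 25,000 shows how far apart the smallest and largest institutions are.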

When each row in a data set displays statistics that summarize the information for many units of observation, these data are called aggregate data. Conversely, when each row displays the information for a single unit of observation, the data are referred to as microdata.
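The distinction can be illustrated with a short Python sketch that starts from microdata (one row per institution, using values from Table 1.1.1) and produces aggregate data. Grouping by century of establishment is an arbitrary choice made for this example, and the function name `aggregate_by_century` is hypothetical.

```python
from collections import defaultdict

# Microdata: each row describes a single unit of observation.
microdata = [
    {"name": "Université Laval", "established": 1852, "students": 43000},
    {"name": "University of Waterloo", "established": 1955, "students": 30000},
    {"name": "Dalhousie University", "established": 1818, "students": 18000},
    {"name": "Simon Fraser University", "established": 1965, "students": 30000},
]


def aggregate_by_century(records):
    """Summarize the records into one row per century of establishment."""
    groups = defaultdict(list)
    for record in records:
        century = f"{record['established'] // 100 * 100}s"  # e.g. "1800s"
        groups[century].append(record["students"])
    return [
        {
            "century": century,
            "institutions": len(values),
            "total_students": sum(values),
            "mean_students": sum(values) / len(values),
        }
        for century, values in sorted(groups.items())
    ]


# Aggregate data: each row summarizes several units of observation.
aggregate = aggregate_by_century(microdata)
for row in aggregate:
    print(row)
```

Each row of `aggregate` contains statistics (a count, a total and a mean) rather than the values of any single institution, which is what makes it aggregate data.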

