# Statistics: Power from Data! Glossary

The definitions below are intended for readers who have questions about some terms used in statistics but who do not need highly technical definitions. In some cases, they are oversimplifications of highly complex concepts. For more detailed explanations, consult the references provided on the Bibliography page.

• ### Administrative data

Data collected as a result of an organization’s day-to-day operations.

• ### Aggregate data

Data set in which one record represents a summary of multiple observation units.

• ### Big data

Data sets that have such a large number of records and variables that they exceed the capacity of traditional software to process the information within a reasonable time.

• ### Box and whisker plot

Type of graph used to visualize the five-number summary, i.e. the median, the lower and upper quartiles, the minimum and the maximum. Synonym: box plot.
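
As a rough illustration, the five values that a box and whisker plot displays can be computed with Python's standard library. The data and the helper name `five_number_summary` are hypothetical, and the "inclusive" quartile method is only one of several conventions; other conventions can give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # method="inclusive" is an assumption of this sketch; other quartile
    # conventions exist and may produce slightly different results.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

scores = [2, 4, 4, 5, 7, 8, 9, 10, 12]  # hypothetical data
summary = five_number_summary(scores)
print(summary)  # (2, 4.0, 7.0, 9.0, 12)
```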

• ### Categorical variable

Characteristic that isn’t quantifiable. Synonym: qualitative variable.

• ### Census

In general, survey that aims to collect information about every unit of a population. A census is also used to list and count all units of a population.

• ### Central tendency

Measure of the location of the middle or the centre of a distribution.

• ### Closed question

In a questionnaire, a closed question gives the respondent a list of predefined answers and the respondent is supposed to select one or more answers from the list.

• ### Coefficient of variation

Ratio of the standard error of the estimate to the average value of the estimate across all possible samples.

• ### Confidence interval

The range of values around the estimate that is likely to include the unknown true population value with a given probability.

• ### Continuous variable

Numeric variable that assumes an infinite number of real values within a given interval.

• ### Crowdsourcing

Collection of data or information from a large community of users. It relies on the principle that citizens are the experts of their local environment.

• ### Data

Facts, figures, observations, or recordings that can take the form of image, sound, text or physical measurements (distance, weight, wave lengths, etc.). Data can be gathered and processed in order to form conclusions.

• ### Data capture

The process used to convert data into a machine-readable format.

• ### Data coding

The process that assigns a value (code) to a response. The code can be a numeric value or a character string.

• ### Data editing

Application of checks to detect missing, invalid or inconsistent values or to point to data records that are potentially in error.

• ### Data imputation

The process used to assign replacement values for missing, invalid or inconsistent data that have failed edits.
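
A minimal sketch of how data editing and data imputation can fit together is shown below. The ages, the 0 to 120 edit rule and the function name `passes_edit` are all assumptions made for illustration, and mean imputation is just one simple technique among many.

```python
from statistics import mean

# Hypothetical ages reported in a survey; None marks a missing response.
reported_ages = [34, None, 29, 210, 41, -5, 38]

def passes_edit(age):
    """Edit rule (an assumption for this sketch): age must be between 0 and 120."""
    return age is not None and 0 <= age <= 120

# Data editing: keep only the values that pass the check.
valid_ages = [age for age in reported_ages if passes_edit(age)]

# Data imputation: replace every failed value with the mean of the valid ones.
replacement = mean(valid_ages)  # (34 + 29 + 41 + 38) / 4 = 35.5
imputed_ages = [age if passes_edit(age) else replacement for age in reported_ages]
print(imputed_ages)  # [34, 35.5, 29, 35.5, 41, 35.5, 38]
```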

• ### Data item

The smallest piece of information that can be gathered from a source of information.

• ### Data processing

Transformation of raw data so they can be used to produce estimates or to carry out other data analyses.

• ### Data provider

Individual or organization that collects and processes data because information is needed for different purposes, and makes these data accessible to data users.

• ### Data set

Grouping of data that have common definitions of observation units and variables.

• ### Database

Structured set of data items, generally presented as tables.

• ### Delimited text file

A text file used to store data, in which each line represents a unit, and each line has fields separated by a delimiter. The most common delimiters are the comma, the tab and the colon.
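
For example, a comma-delimited file could be read in Python roughly as follows. The file content and field names are hypothetical, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical comma-delimited content; the first line holds the field names.
text = "name,province,age\nAlice,Ontario,34\nBenoit,Quebec,29\n"

# Each remaining line becomes one record (one unit), split on the delimiter.
reader = csv.DictReader(io.StringIO(text), delimiter=",")
records = list(reader)

print(len(records))            # 2
print(records[1]["province"])  # Quebec
```

Changing the `delimiter` argument (for instance to `"\t"`) is enough to read a tab-delimited file with the same code.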

• ### Discrete variable

Numeric variable that assumes only a finite number of real values within a given interval. The possible values can be enumerated and counted.

• ### Dispersion

Measure of the spread of a distribution around the central tendency.

• ### Frequency

The number of times a value occurs in a data set. It can also be a number of events or items. Synonym: count.

• ### Frequency distribution

Chart or table showing how many times each value or range of values of a variable appears in a data set.
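
A frequency distribution for a categorical variable can be tabulated with a simple count, as in this short sketch. The responses are hypothetical.

```python
from collections import Counter

# Hypothetical answers to "How do you usually get to school?"
responses = ["bus", "car", "bike", "car", "bus", "car"]

# Count how many times each value appears in the data set.
frequency = Counter(responses)

for value, count in frequency.most_common():
    print(f"{value}: {count}")
```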

• ### Interquartile range

Range of the 50% of data that is central to the distribution, i.e. the difference between the upper quartile and the lower quartile.

• ### Lower quartile

Value under which 25% of data points are found when they are arranged in increasing order. Synonym: first quartile.

• ### Margin of error

Half the width of the confidence interval associated with an estimate.
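
As a small worked example, the margin of error can be recovered from the two bounds of a confidence interval. The bounds below are hypothetical, and the last step assumes the interval is symmetric around the estimate, which is not always the case.

```python
def margin_of_error(lower, upper):
    """Half the width of a confidence interval with the given bounds."""
    return (upper - lower) / 2

# Hypothetical confidence interval for a percentage: 48.0% to 54.0%.
lower, upper = 48.0, 54.0
moe = margin_of_error(lower, upper)  # 3.0

# Assumption: the interval is symmetric, so the estimate sits at its midpoint.
estimate = lower + moe               # 51.0
print(f"{estimate} plus or minus {moe}")
```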

• ### Mean

Measure of central tendency which is the sum of all values divided by the number of values.

• ### Median

Value in the middle of a data set, meaning that 50% of data points have a value smaller than or equal to the median and 50% of data points have a value higher than or equal to the median. Synonym: second quartile.

• ### Metadata

Data about data or data elements, including data descriptions, ownership, access paths, access rights, quality or other information that provides context to data.

• ### Microdata

Data set in which one record represents one unit of observation.

• ### Missing value

Blank or absent data point.

• ### Mode

For categorical or discrete variables, it is the value(s) for which the highest frequency is observed. For continuous variables, the modal-class intervals are the peaks of the histogram. When the mode is unique, it can be used as a measure of central tendency.
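
The three measures of central tendency defined in this glossary (mean, median and mode) can be compared on the same data. This sketch uses Python's standard `statistics` module and hypothetical household sizes.

```python
import statistics

household_sizes = [1, 2, 2, 3, 4, 4, 4, 6]  # hypothetical discrete data

# Mean: sum of all values divided by the number of values (26 / 8).
mean_size = statistics.mean(household_sizes)      # 3.25

# Median: middle of the ordered data; with 8 values, the average of 3 and 4.
median_size = statistics.median(household_sizes)  # 3.5

# Mode: value(s) with the highest frequency; a list, since ties are possible.
modes = statistics.multimode(household_sizes)     # [4]

print(mean_size, median_size, modes)
```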

• ### Nominal variable

Categorical variable that describes a name, label or category without natural order.

• ### Non-sampling errors

All sources of error that are unrelated to sampling.

• ### Numeric variable

A quantifiable characteristic whose values are numbers. Synonym: quantitative variable.

• ### Open data

Structured, machine-readable data that are freely shared and that can be used without restrictions.

• ### Open question

In a questionnaire, an open question gives the respondent an opportunity to answer the question in their own words.

• ### Ordinal variable

Categorical variable whose values are defined by an order relation between the different categories.

• ### Primary source of information

Data from a primary source were collected for the purpose of producing statistics and statistical information.

• ### Questionnaire

Series of questions designed to elicit information on one or more topics from a respondent.

• ### Range

Difference between the largest value (maximum) and the smallest value (minimum).

• ### Record linkage

The process by which records or units from different data sources are joined together into a single file using non-unique identifiers, such as names, dates of birth, addresses and other characteristics. Synonyms: data matching, data linkage, entity resolution.
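
The sketch below illustrates the basic idea with two hypothetical files joined on a normalized name and a date of birth. It uses exact matching on that key as a deliberate simplification; real applications often need approximate or probabilistic matching to cope with spelling variations and errors. The function name `linkage_key` and all field names are assumptions.

```python
def linkage_key(record):
    """Build a comparison key from non-unique identifiers (name, date of birth)."""
    # Normalize case and repeated spaces so trivial differences do not block a match.
    name = " ".join(record["name"].lower().split())
    return (name, record["date_of_birth"])

# Two hypothetical data sources describing some of the same people.
file_a = [
    {"name": "Marie  Tremblay", "date_of_birth": "1990-04-12", "province": "Quebec"},
    {"name": "John Smith", "date_of_birth": "1985-01-30", "province": "Ontario"},
]
file_b = [
    {"name": "marie tremblay", "date_of_birth": "1990-04-12", "income": 52000},
    {"name": "John Smith", "date_of_birth": "1979-07-02", "income": 61000},
]

# Index the second file by key, then join each record of the first file to it.
index_b = {linkage_key(record): record for record in file_b}
linked = [
    {**record, **index_b[linkage_key(record)]}
    for record in file_a
    if linkage_key(record) in index_b
]
print(linked)  # one linked record: the two John Smiths have different birth dates
```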

• ### Remote sensing

Acquisition of information about an object or phenomenon from a distant point.

• ### Sample

A subset of the units of a population.

• ### Sample survey

Survey for which the information is collected for some units of the target population only.

• ### Sampling error

Difference between the estimate derived from a sample survey and the true value that would result if a census of the whole population were taken under the same conditions.

• ### Sampling variance

Average of the squared differences between an estimate and the average of the estimates across all possible samples.
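
For a very small population, every possible sample can be listed, so this quantity can be computed directly from its definition, along with the standard error and the coefficient of variation defined elsewhere in this glossary. The population, the sample size of 2 and the choice of the sample mean as the estimate are assumptions made for this sketch.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

population = [2, 4, 6, 8]  # hypothetical population of four units
sample_size = 2

# Estimate (here, the sample mean) from every possible sample of size 2,
# drawn without replacement: 3, 4, 5, 5, 6 and 7.
estimates = [mean(sample) for sample in combinations(population, sample_size)]

# Average of the estimates across all possible samples.
average_estimate = mean(estimates)  # 5

# Average squared difference between each estimate and that average.
sampling_variance = mean((e - average_estimate) ** 2 for e in estimates)  # 10 / 6

# Standard error: square root of the sampling variance.
standard_error = sqrt(sampling_variance)

# Coefficient of variation: standard error relative to the average estimate.
coefficient_of_variation = standard_error / average_estimate  # about 0.258

print(sampling_variance, standard_error, coefficient_of_variation)
```

In practice only one sample is drawn, so these quantities are themselves estimated from that sample rather than computed by enumeration.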

• ### Secondary source of information

Data from a secondary source were collected for a purpose other than producing statistical information.

• ### Semi-interquartile range

Half the value of the interquartile range.
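
The range, the interquartile range and the semi-interquartile range can be computed side by side, as sketched here. The data are hypothetical, and the "inclusive" quartile method is one convention among several.

```python
import statistics

values = [2, 4, 4, 5, 7, 8, 9, 10, 12]  # hypothetical data

# Lower and upper quartiles; the quartile method is an assumption of this sketch.
lower_quartile, _, upper_quartile = statistics.quantiles(
    values, n=4, method="inclusive"
)

data_range = max(values) - min(values)                 # 12 - 2 = 10
interquartile_range = upper_quartile - lower_quartile  # 9.0 - 4.0 = 5.0
semi_interquartile_range = interquartile_range / 2     # 2.5

print(data_range, interquartile_range, semi_interquartile_range)
```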

• ### Spreadsheet

A software application that displays a table of cells arranged in rows and columns, in which the change of the contents of one cell can cause recalculation of other cells based on user-defined formulas.

• ### Standard deviation

Square root of the variance.

• ### Standard error

Square root of the sampling variance.

• ### Statistical information

Data that have been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges.

• ### Statistical register

Data sets created for statistical purposes that are continuously updated with information about all units of a population.

• ### Statistics

Type of information obtained through mathematical operations on data.

• ### Structured data

Data that are organized into pre-defined items, each of which relates to a specific concept or data item.

• ### Survey

Any activity to collect information in an organized and methodical manner about the characteristics of the units of a population. The word survey is often used to refer to a sample survey, as opposed to a census.

• ### Unstructured data

Unstructured data are any data that are not arranged according to a pre-defined model.

• ### Upper quartile

Value under which 75% of data points are found when they are arranged in increasing order. Synonym: third quartile.

• ### Variable

Characteristic that can be measured and that can assume different values.

• ### Variance

Average of the squared differences between each data point and the centre of the distribution, measured using the mean.
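
Following this definition literally, the variance and the standard deviation can be computed as below. The data are hypothetical. Dividing by the number of data points matches the "average" in the definition and gives what is often called the population variance; a variance estimated from a sample usually divides by one less than the number of data points (`statistics.variance` in Python).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

centre = statistics.mean(values)  # 40 / 8 = 5

# Average of the squared differences between each data point and the mean.
variance = sum((x - centre) ** 2 for x in values) / len(values)  # 32 / 8 = 4.0

# Standard deviation: square root of the variance.
standard_deviation = variance ** 0.5  # 2.0

# The standard library gives the same population variance.
assert statistics.pvariance(values) == variance

print(variance, standard_deviation)
```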