Keyword search

Results

All (94)

All (94) (0 to 10 of 94 results)

1. A proposal for the problem of matching probabilities estimation in record linkage Archived
Articles and reports: 11-522-X202200100001
Description: Record linkage aims at identifying record pairs related to the same unit and observed in two different data sets, say A and B. Fellegi and Sunter (1969) suggest testing whether each record pair is generated from the set of matched or unmatched pairs. The decision function consists of the ratio between m(y) and u(y), the probabilities of observing a comparison y of a set of k>3 key identifying variables in a record pair under the assumptions that the pair is a match or a non-match, respectively. These parameters are usually estimated by means of the EM algorithm using as data the comparisons on all the pairs of the Cartesian product Ω=A×B. These observations (on the comparisons and on the pairs' status as match or non-match) are assumed to be generated independently of other pairs, an assumption characterizing most of the literature on record linkage and implemented in software tools (e.g. RELAIS, Cibella et al. 2012). On the contrary, comparisons y and matching status in Ω are deterministically dependent. As a result, estimates of m(y) and u(y) based on the EM algorithm are usually poor. This fact jeopardizes the effective application of the Fellegi-Sunter method, as well as the automatic computation of quality measures and the possibility of applying efficient methods for model estimation on linked data (e.g. regression functions), as in Chambers et al. (2015). We propose to explore Ω by a set of samples, each one drawn so as to preserve independence of comparisons among the selected record pairs. Simulations are encouraging.
Release date: 2024-03-25
2. A case study of using Splink: Census duplicate matching Archived
Articles and reports: 11-522-X202200100002
Description: The authors used the Splink probabilistic linkage package developed by the UK Ministry of Justice, to link census data from England and Wales to itself to find duplicate census responses. A large gold standard of confirmed census duplicates was available meaning that the results of the Splink implementation could be quality assured. This paper describes the implementation and features of Splink, gives details of the settings and parameters that we used to tune Splink for our particular project, and gives the results that we obtained.
Release date: 2024-03-25
3. Modelling intra-annual measurement in linked administrative and survey data Archived
Articles and reports: 11-522-X202200100012
Description: At Statistics Netherlands (SN) for some economic sectors two partly-independent intra-annual turnover index series are available: a monthly series based on survey data and a quarterly series based on value added tax data for the smaller units and re-used survey data for the other units. SN aims to benchmark the monthly turnover index series to the quarterly census data on a quarterly basis. This cannot currently be done because the tax data has a different quarterly pattern: the turnover is relatively large in the fourth quarter of the year and smaller in the first quarter. With the current study we aim to describe this deviating quarterly pattern at micro level. In the past we developed a mixture model using absolute turnover levels that could explain part of the quarterly patterns. Because the absolute turnover levels differ between the two series, in the current study we use a model based on relative quarterly turnover levels within a year.
Release date: 2024-03-25
4. Probabilistic or deterministic? Linkage methods tested for the Résil program Archived
Articles and reports: 11-522-X202200100019
Description: The purpose of this article is to compare the linkage results for individuals from French tax sources with those of the 2019 Enquête Annuelle de Recensement (EAR), obtained through different methods. Such a comparison will decide whether the Répertoires Statistiques d'Individus et de Logements (Résil) program should be equipped with a probabilistic matching tool for its administrative source identification and matching engine.
Release date: 2024-03-25
5. Record linkage techniques to identify 2021 Canadian Census dwellings in the new Statistical Building Register Archived
Articles and reports: 11-522-X202200100020
Description: The reconciliation of 2021 census dwellings with the new Statistical Building Register (SBgR) presented linkage challenges. The Census of Population collected information from various dwelling types. For a large proportion of the population, mailing addresses were at the centre: they were used for reaching out to people and collected as contact info. In parallel, the register environment has been evolving. The agency is transitioning from the Address Register (AR) to the SBgR holding both mailing and location addresses, while also covering non-residential buildings. The reconciliation was conducted using a combination of systems, notably the new Register Matching Engine (RME) for difficult cases. The RME holds an interesting range of sophisticated string comparators. A deterministic linkage approach was used, while incorporating some data knowledge like the entropy. Through metadata, the matching expert could also reduce the amounts of false positives and false negatives.
Release date: 2024-03-25
6. Emigration of Immigrants: Results from the Longitudinal Immigration Database
Articles and reports: 91F0015M2024002
Description: This paper examines the emigration of immigrants using the Longitudinal Immigration Database (IMDB). An indirect definition of emigration is proposed that leverages the information available in the IMDB. This study found that emigration of immigrants is a significant phenomenon. Certain characteristics of immigrants, such as having children, admission category and country of birth, have a strong correlation with emigration.
Release date: 2024-02-02
7. Examining the consistency of de facto marital status between tax data and the 2016 Census
Articles and reports: 91F0015M2023001
Description: Using record linkage, this article compares marital status as identified in the 2015 T1 tax data to what was provided in the 2016 Census.
Release date: 2023-07-11
8. Maximum entropy classification for record linkage
Articles and reports: 12-001-X202200100007
Description:
By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in machine learning to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
Release date: 2022-06-21
9. Measuring the undercoverage of two data sources with a nearly perfect coverage through capture and recapture in the presence of linkage errors Archived
Articles and reports: 11-522-X202100100006
Description:
In the context of its "admin-first" paradigm, Statistics Canada is prioritizing the use of non-survey sources to produce official statistics. This paradigm critically relies on non-survey sources that may have a nearly perfect coverage of some target populations, including administrative files or big data sources. Yet, this coverage must be measured, e.g., by applying the capture-recapture method, where they are compared to other sources with good coverage of the same populations, including a census. However, this is a challenging exercise in the presence of linkage errors, which arise inevitably when the linkage is based on quasi-identifiers, as is typically the case. To address the issue, a new methodology is described where the capture-recapture method is enhanced with a new error model that is based on the number of links adjacent to a given record. It is applied in an experiment with public census data.
Key Words: dual system estimation, data matching, record linkage, quality, data integration, big data.

Release date: 2021-10-22
10. Statistics Canada Quality Guidelines
Surveys and statistical programs – Documentation: 12-539-X
Description:
This document brings together guidelines and checklists on many issues that need to be considered in the pursuit of quality objectives in the execution of statistical activities. Its focus is on how to assure quality through effective and appropriate design or redesign of a statistical project or program from inception through to data evaluation, dissemination and documentation. These guidelines draw on the collective knowledge and experience of many Statistics Canada employees. It is expected that Quality Guidelines will be useful to staff engaged in the planning and design of surveys and other statistical projects, as well as to those who evaluate and analyze the outputs of these projects.
Release date: 2019-12-04

Data (2)

Data (2) (2 results)

1. Agriculture-Population Linkage Data for the 2001 Census Archived
Table: 95F0303X
Description:
This product presents selected 2001 and historical data from the Census of Agriculture - Census of Population Linkage database. The data are available at the Canada and province levels for free. The data variables include: age; sex; marital status; mother tongue; highest level of schooling; net farm income; as well as farm population counts and income profiles for census farm families and households.
(No linkage databases were created for the 1966 and 1976 Census years, so historical comparisons are not possible for those years.)
Release date: 2003-12-02
2. Indicators and Detailed Statistics (Econnections: Linking the Environment and the Economy) Archived
Table: 16-200-X
Description:
Part of Statistics Canada's Econnections: linking the environment and the economy statistical series, this product consists of a printed publication combined with a CD-ROM. The product offers summary indicators plus detailed statistics that quantify the relationship between economic activity and the environment. Information is presented for issues ranging from greenhouse gas emissions, water and energy use, to natural resource wealth, environmental expenditures and beyond. The printed publication provides convenient reference to the summary indicators, including analysis of important trends, while the CD-ROM offers straightforward access to dozens of detailed statistical tables that underlie the indicators. An electronic version of the printed publication is included on the CD-ROM and each indicator in the publication is hypertext linked to a group of related statistical tables, allowing the user to easily select detailed statistics for viewing in association with any given indicator. Simple analysis of the statistics can be done directly within the CD-ROM's software. For those who carry out more complex analysis, downloading of data from the CD-ROM in standard spreadsheet format is easily accomplished.

Release date: 2001-02-23

Analysis (73)

Analysis (73) (0 to 10 of 73 results)

1. A proposal for the problem of matching probabilities estimation in record linkage Archived
Articles and reports: 11-522-X202200100001
Description: Record linkage aims at identifying record pairs related to the same unit and observed in two different data sets, say A and B. Fellegi and Sunter (1969) suggest testing whether each record pair is generated from the set of matched or unmatched pairs. The decision function consists of the ratio between m(y) and u(y), the probabilities of observing a comparison y of a set of k>3 key identifying variables in a record pair under the assumptions that the pair is a match or a non-match, respectively. These parameters are usually estimated by means of the EM algorithm using as data the comparisons on all the pairs of the Cartesian product Ω=A×B. These observations (on the comparisons and on the pairs' status as match or non-match) are assumed to be generated independently of other pairs, an assumption characterizing most of the literature on record linkage and implemented in software tools (e.g. RELAIS, Cibella et al. 2012). On the contrary, comparisons y and matching status in Ω are deterministically dependent. As a result, estimates of m(y) and u(y) based on the EM algorithm are usually poor. This fact jeopardizes the effective application of the Fellegi-Sunter method, as well as the automatic computation of quality measures and the possibility of applying efficient methods for model estimation on linked data (e.g. regression functions), as in Chambers et al. (2015). We propose to explore Ω by a set of samples, each one drawn so as to preserve independence of comparisons among the selected record pairs. Simulations are encouraging.
Release date: 2024-03-25
2. A case study of using Splink: Census duplicate matching Archived
Articles and reports: 11-522-X202200100002
Description: The authors used the Splink probabilistic linkage package developed by the UK Ministry of Justice, to link census data from England and Wales to itself to find duplicate census responses. A large gold standard of confirmed census duplicates was available meaning that the results of the Splink implementation could be quality assured. This paper describes the implementation and features of Splink, gives details of the settings and parameters that we used to tune Splink for our particular project, and gives the results that we obtained.
Release date: 2024-03-25
3. Modelling intra-annual measurement in linked administrative and survey data Archived
Articles and reports: 11-522-X202200100012
Description: At Statistics Netherlands (SN) for some economic sectors two partly-independent intra-annual turnover index series are available: a monthly series based on survey data and a quarterly series based on value added tax data for the smaller units and re-used survey data for the other units. SN aims to benchmark the monthly turnover index series to the quarterly census data on a quarterly basis. This cannot currently be done because the tax data has a different quarterly pattern: the turnover is relatively large in the fourth quarter of the year and smaller in the first quarter. With the current study we aim to describe this deviating quarterly pattern at micro level. In the past we developed a mixture model using absolute turnover levels that could explain part of the quarterly patterns. Because the absolute turnover levels differ between the two series, in the current study we use a model based on relative quarterly turnover levels within a year.
Release date: 2024-03-25
4. Probabilistic or deterministic? Linkage methods tested for the Résil program Archived
Articles and reports: 11-522-X202200100019
Description: The purpose of this article is to compare the linkage results for individuals from French tax sources with those of the 2019 Enquête Annuelle de Recensement (EAR), obtained through different methods. Such a comparison will decide whether the Répertoires Statistiques d'Individus et de Logements (Résil) program should be equipped with a probabilistic matching tool for its administrative source identification and matching engine.
Release date: 2024-03-25
5. Record linkage techniques to identify 2021 Canadian Census dwellings in the new Statistical Building Register Archived
Articles and reports: 11-522-X202200100020
Description: The reconciliation of 2021 census dwellings with the new Statistical Building Register (SBgR) presented linkage challenges. The Census of Population collected information from various dwelling types. For a large proportion of the population, mailing addresses were at the centre: they were used for reaching out to people and collected as contact info. In parallel, the register environment has been evolving. The agency is transitioning from the Address Register (AR) to the SBgR holding both mailing and location addresses, while also covering non-residential buildings. The reconciliation was conducted using a combination of systems, notably the new Register Matching Engine (RME) for difficult cases. The RME holds an interesting range of sophisticated string comparators. A deterministic linkage approach was used, while incorporating some data knowledge like the entropy. Through metadata, the matching expert could also reduce the amounts of false positives and false negatives.
Release date: 2024-03-25
6. Emigration of Immigrants: Results from the Longitudinal Immigration Database
Articles and reports: 91F0015M2024002
Description: This paper examines the emigration of immigrants using the Longitudinal Immigration Database (IMDB). An indirect definition of emigration is proposed that leverages the information available in the IMDB. This study found that emigration of immigrants is a significant phenomenon. Certain characteristics of immigrants, such as having children, admission category and country of birth, have a strong correlation with emigration.
Release date: 2024-02-02
7. Examining the consistency of de facto marital status between tax data and the 2016 Census
Articles and reports: 91F0015M2023001
Description: Using record linkage, this article compares marital status as identified in the 2015 T1 tax data to what was provided in the 2016 Census.
Release date: 2023-07-11
8. Maximum entropy classification for record linkage
Articles and reports: 12-001-X202200100007
Description:
By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in machine learning to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
Release date: 2022-06-21
9. Measuring the undercoverage of two data sources with a nearly perfect coverage through capture and recapture in the presence of linkage errors Archived
Articles and reports: 11-522-X202100100006
Description:
In the context of its "admin-first" paradigm, Statistics Canada is prioritizing the use of non-survey sources to produce official statistics. This paradigm critically relies on non-survey sources that may have a nearly perfect coverage of some target populations, including administrative files or big data sources. Yet, this coverage must be measured, e.g., by applying the capture-recapture method, where they are compared to other sources with good coverage of the same populations, including a census. However, this is a challenging exercise in the presence of linkage errors, which arise inevitably when the linkage is based on quasi-identifiers, as is typically the case. To address the issue, a new methodology is described where the capture-recapture method is enhanced with a new error model that is based on the number of links adjacent to a given record. It is applied in an experiment with public census data.
Key Words: dual system estimation, data matching, record linkage, quality, data integration, big data.

Release date: 2021-10-22
10. Economic Immigrants in Gateway Cities: Factors Involved in Their Initial Location and Onward Migration Decisions Archived
Articles and reports: 11F0019M2018411
Geography: Census metropolitan area
Description:
Immigrants tend to reside disproportionately in larger Canadian cities, which may challenge their absorptive capacity. This study uses the linked Longitudinal Immigration Database and T1 Family File to examine the initial location and onward migration decisions of immigrants who are economic principal applicants (EPAs) and who have landed since the Immigration and Refugee Protection Act was passed. The main objective of the study is to identify the factors associated with initially residing and remaining in Canada’s three largest gateway cities: Montréal, Toronto and Vancouver (referred to as MTV).
Release date: 2018-12-07

Reference (19)

Reference (19) (10 to 20 of 19 results)

11. An Overview of the Issues Related to the Use of Personal Identifiers Archived
Surveys and statistical programs – Documentation: 85-602-X
Description:
The purpose of this report is to provide an overview of existing methods and techniques making use of personal identifiers to support record linkage. Record linkage can be loosely defined as a methodology for manipulating and / or transforming personal identifiers from individual data records from one or more operational databases and subsequently attempting to match these personal identifiers to create a composite record about an individual. Record linkage is not intended to uniquely identify individuals for operational purposes; however, it does provide probabilistic matches of varying degrees of reliability for use in statistical reporting. Techniques employed in record linkage may also be of use for investigative purposes to help narrow the field of search against existing databases when some form of personal identification information exists.
Release date: 2000-12-05
12. Creation of an occupational surveillance system in Canada: Combining data for a unique Canadian study Archived
Surveys and statistical programs – Documentation: 11-522-X19990015652
Description:
Objective: To create an occupational surveillance system by collecting, linking, evaluating and disseminating data relating to occupation and mortality with the ultimate aim of reducing or preventing excess risk among workers and the general population.
Release date: 2000-03-02
13. Overview of record linkage Archived
Surveys and statistical programs – Documentation: 11-522-X19990015660
Description:
There are many different situations in which one or more files need to be linked. With one file the purpose of the linkage would be to locate duplicates within the file. When there are two files, the linkage is done to identify the units that are the same on both files and thus create matched pairs. Often records that need to be linked do not have a unique identifier. Hierarchical record linkage, probabilistic record linkage and statistical matching are three methods that can be used when there is no unique identifier on the files that need to be linked. We describe the major differences between the methods. We consider how to choose variables to link, how to prepare files for linkage and how the links are identified. As well, we review tips and tricks used when linking files. Two examples, the probabilistic record linkage used in the reverse record check and the hierarchical record linkage of the Business Number (BN) master file to the Statistical Universe File (SUF) of unincorporated tax filers (T1) will be illustrated.
Release date: 2000-03-02
14. Creating and enhancing a population-based linked health database: methods, challenges, and applications Archived
Surveys and statistical programs – Documentation: 11-522-X19990015662
Description:
As the availability of both health utilization and outcome information becomes increasingly important to health care researchers and policy makers, the ability to link person-specific health data becomes a critical objective. This type of linkage of population-based administrative health databases has been realized in British Columbia. The database was created by constructing an historical file of all persons registered with the health care system, and then by probabilistically linking various program files to this 'coordinating' file. The first phase of development included the linkage of hospital discharge data, physician billing data, continuing care data, data about drug costs for the elderly, births data and deaths data. The second phase of development has seen the addition of data sources external to the Ministry of Health, including cancer incidence data, workers' compensation data, and income assistance data.
Release date: 2000-03-02
15. A comparison of two record linkage procedures Archived
Surveys and statistical programs – Documentation: 11-522-X19990015664
Description:
Much work on probabilistic methods of linkage can be found in the statistical literature. However, although many groups undoubtedly still use deterministic procedures, not much literature is available on these strategies. Furthermore there appears to exist no documentation on the comparison of results for the two strategies. Such a comparison is pertinent in the situation where we have only non-unique identifiers like names, sex, race etc. as common identifiers on which the databases are to be linked. In this work we compare a stepwise deterministic linkage strategy with the probabilistic strategy, as implemented in AUTOMATCH, for such a situation. The comparison was carried out on a linkage between medical records from the Regional Perinatal Intensive Care Centers database and education records from the Florida Department of Education. Social security numbers, available in both databases, were used to decide the true status of the record pair after matching. Match rates and error rates for the two strategies are compared and a discussion of their similarities and differences, strengths and weaknesses is presented.
Release date: 2000-03-02
16. Estimation using the generalized weight share method: The case of record linkage Archived
Surveys and statistical programs – Documentation: 11-522-X19990015680
Description:
To augment the amount of available information, data from different sources are increasingly being combined. These databases are often combined using record linkage methods. When there is no unique identifier, a probabilistic linkage is used. In that case, a record on a first file is associated with a probability that it is linked to a record on a second file, and then a decision is taken on whether a possible link is a true link or not. This usually requires a non-negligible amount of manual resolution. It might then be legitimate to evaluate if manual resolution can be reduced or even eliminated. This issue is addressed in this paper where one tries to produce an estimate of a total (or a mean) of one population, when using a sample selected from another population linked somehow to the first population. In other words, having two populations linked through probabilistic record linkage, we try to avoid any decision concerning the validity of links and still be able to produce an unbiased estimate for a total of one of the two populations. To achieve this goal, we suggest the use of the Generalised Weight Share Method (GWSM) described by Lavallée (1995).
Release date: 2000-03-02
17. Combining data sources: Air pollution and asthma consultations in 59 general practices throughout England and Wales - A case study Archived
Surveys and statistical programs – Documentation: 11-522-X19990015688
Description:
The geographical and temporal relationship between outdoor air pollution and asthma was examined by linking together data from multiple sources. These included the administrative records of 59 general practices widely dispersed across England and Wales for half a million patients and all their consultations for asthma, supplemented by a socio-economic interview survey. Postcode enabled linkage with: (i) computed local road density; (ii) emission estimates of sulphur dioxide and nitrogen dioxides; (iii) measured/interpolated concentration of black smoke, sulphur dioxide, nitrogen dioxide and other pollutants at practice level. Parallel Poisson time series analysis took into account between-practice variations to examine daily correlations in practices close to air quality monitoring stations. Preliminary analyses show small and generally non-significant geographical associations between consultation rates and pollution markers. The methodological issues relevant to combining such data, and the interpretation of these results, will be discussed.
Release date: 2000-03-02
18. Proposed Linkage of Occupation and Major Field of Study for SLID Archived
Surveys and statistical programs – Documentation: 75F0002M1996005
Description:
This paper examines a new variable which would show whether a person's job is related to his or her postsecondary education. This variable would help to explain other characteristics measured in the Survey of Labour and Income Dynamics (SLID), such as wages, supervisory roles, and job stability.
Release date: 1997-12-31
19. Data Quality of Income Data Using Computer-assisted Interviewing: SLID Experience Archived
Surveys and statistical programs – Documentation: 75F0002M1994015
Description:
This paper describes how the computer-assisted interviewing (CAI) income application was programmed for a Survey of Labour and Income Dynamics (SLID) test conducted in 1993.
Release date: 1995-12-30

Date modified: 2024-04-19
