Survey Methodology
Estimating the false negatives due to blocking in record linkage

by Abel Dasylva and Arthur Goussanou

  • Release date: January 6, 2022

Abstract

When linking massive data sets, blocking is used to select a manageable subset of record pairs at the expense of losing a few matched pairs. This loss is an important component of the overall linkage error, because blocking decisions are made early in the linkage process, with no way to revise them in subsequent steps. Yet measuring this contribution remains a major challenge, because it requires modelling all the pairs in the Cartesian product of the sources, not just those satisfying the blocking criteria. Previous error models are of little use because they typically do not meet this requirement. This paper addresses the issue with a new finite mixture model, which dispenses with clerical reviews, training data, and the assumption that the linkage variables are conditionally independent. The model applies when a standard blocking procedure is used to link a file to a register or a census with complete coverage, and both sources are free of duplicate records.
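To make the notion of a false negative due to blocking concrete, the following minimal sketch counts the matched pairs that a blocking criterion discards, out of all pairs in the Cartesian product of two sources. It is an illustration only and does not implement the paper's finite mixture model. The toy records, the `blocking_key` function (exact agreement on surname) and the `summarize_blocking` helper are hypothetical, and the true match status is known here only because the data are made up. In practice that status is unobserved, which is why an estimation model is needed.

```python
from itertools import product

# Hypothetical toy data: (record id, surname, birth year, true entity id).
# The true entity id is an assumption of this sketch; it is not observed in practice.
file_a = [
    ("a1", "smith", 1980, 1),
    ("a2", "jones", 1975, 2),
    ("a3", "smyth", 1990, 3),  # surname typo relative to the register
]
register_b = [
    ("b1", "smith", 1980, 1),
    ("b2", "jones", 1975, 2),
    ("b3", "smith", 1990, 3),
    ("b4", "brown", 1960, 4),
]


def blocking_key(record):
    """Assumed blocking criterion: exact agreement on surname."""
    return record[1]


def summarize_blocking(left, right):
    """Count pairs kept by blocking and matched pairs lost to it."""
    # Every pair in the Cartesian product of the two sources.
    all_pairs = list(product(left, right))
    # Matched pairs: both records refer to the same entity.
    matched = [(a, b) for a, b in all_pairs if a[3] == b[3]]
    # Pairs that satisfy the blocking criterion.
    blocked = [(a, b) for a, b in all_pairs if blocking_key(a) == blocking_key(b)]
    # False negatives due to blocking: matched pairs outside the blocks.
    missed = [(a, b) for a, b in matched if blocking_key(a) != blocking_key(b)]
    return {
        "all_pairs": len(all_pairs),
        "blocked_pairs": len(blocked),
        "matched_pairs": len(matched),
        "false_negatives": len(missed),
    }


summary = summarize_blocking(file_a, register_b)
print(summary)
```

In this toy case the blocking keeps 3 of the 12 pairs and loses one of the 3 matched pairs, because the misspelled surname places the two records in different blocks.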

Key Words: Indexing; Massive data sets; Entity resolution; Data integration; Machine learning; Classification.

How to cite

Dasylva, A., and Goussanou, A. (2021). Estimating the false negatives due to blocking in record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 47, No. 2. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2021002/article/00002-eng.htm.
