Survey Methodology
Maximum entropy classification for record linkage
by Danhyang Lee, Li-Chun Zhang and Jae Kwang Kim
- Release date: June 21, 2022
Abstract
Record linkage joins records that reside in separate files and are believed to relate to the same entity. In this paper we approach record linkage as a classification problem and adapt the maximum entropy classification method from machine learning to it, in both the supervised and unsupervised settings. The set of links is chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach, which generally requires clerical review to resolve the undecided cases.
Key Words: Probabilistic linkage; Density ratio; False link; Missing match; Survey sampling.
Table of contents
- Section 1. Introduction
- Section 2. Problems with the classical approach
- Section 3. Maximum entropy classification: Supervised
- Section 4. MEC for unsupervised record linkage
- Section 5. Discussion
- Section 6. Simulation study
- Section 7. Final remarks
- Acknowledgements
- Supplementary material
- References
How to cite
Lee, D., Zhang, L.-C. and Kim, J.K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2022001/article/00007-eng.htm.