Maximum entropy classification for record linkage
Section 4. MEC for unsupervised record linkage
Let $z$ be the $K$-vector of key variables, which may be imperfect for two reasons: it is not rich enough if the true $z$-values are not unique for each distinct entity underlying the two files to be linked, or it may be subject to errors if the observed $z$ is not equal to its true value. Let $A$ contain only the distinct $z$-vectors from the first file, after removing any other record that has a duplicated $z$-vector to some record that is retained in $A$. In other words, if the first file initially contains two or more records with exactly the same value of the combined key, then only one of them will be retained in $A$ for record linkage to the second file. Similarly, let $B$ contain the unique records from the second file. The reason for separate deduplication of keys is that no comparisons between the two files can distinguish among the duplicated $z$ in either file, which is an issue to be resolved otherwise.
Given the deduplicated files $A$ and $B$ preprocessed as above, the maximal MEC set consists only of the record pairs with perfect agreement of all the key variables. For probabilistic linkage beyond the maximal MEC set, one can follow the same scheme of MEC as in the supervised setting, as long as one is able to obtain an estimate of the probability ratio, given which one can form the MEC set of any chosen size. Nevertheless, to estimate the associated FLR (3.5) and MMR (3.6), an estimate of the number of matched record pairs is also needed.
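The preprocessing and the maximal MEC set described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `dedup_by_key` and `maximal_mec_set`, the record layout `(record id, key vector)` and the toy data are all assumptions made for the example, and keeping the first record among duplicates is an arbitrary choice.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Hypothetical record layout: (record id, key vector as a tuple)
Record = Tuple[Hashable, Tuple]


def dedup_by_key(records: Sequence[Record]) -> Dict[Tuple, Hashable]:
    """Keep one record (here: the first seen) per distinct key vector."""
    kept: Dict[Tuple, Hashable] = {}
    for rec_id, key in records:
        kept.setdefault(key, rec_id)  # later duplicates of a key are dropped
    return kept


def maximal_mec_set(file_a: Sequence[Record],
                    file_b: Sequence[Record]) -> List[Tuple[Hashable, Hashable]]:
    """Pairs that agree perfectly on every key variable after deduplication."""
    a = dedup_by_key(file_a)
    b = dedup_by_key(file_b)
    return sorted((a[k], b[k]) for k in a.keys() & b.keys())


# Toy files: "a3" duplicates the key of "a1" and is removed before linkage.
file_a = [("a1", ("ann", 1980)), ("a2", ("bob", 1975)), ("a3", ("ann", 1980))]
file_b = [("b1", ("ann", 1980)), ("b2", ("cy", 1990))]
links = maximal_mec_set(file_a, file_b)
```

Because each file holds at most one record per key vector after deduplication, exact agreement can link any record to at most one record in the other file.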
4.1 Algorithm of unsupervised MEC
The idea now is to apply (3.7) and (3.9) jointly. The parameter values associated with the maximal MEC set satisfy (3.7) and (3.9) automatically, so probabilistic linkage requires one to assume different values for at least some of the parameters. Moreover, unless there is external information that dictates otherwise, one can only assume common support in the unsupervised setting. Let the distribution of the comparison vector over all record pairs be a mixture, in which the probability of observing a given comparison vector is given by (3.3) if a randomly selected record pair is a match, and is otherwise similarly given by (3.3) with a separate set of parameters. An iterative algorithm of unsupervised MEC is given below.
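- I. Set the initial parameter values and the initial MEC set, where the initial MEC set is the maximal MEC set.
- II. For the $t$-th iteration, let the match indicator of each record pair be 1 if the pair belongs to the current MEC set and 0 otherwise.
  - i. Update the estimates by using (4.4), which is discussed below, given the current indicators, and calculate the values which maximize the objective in (3.4) for the given quantities. Once these are obtained, the probability ratio can be updated.
  - ii. For the given estimates, find the MEC set whose size equals the current estimate of the number of matches, by deduplication in the descending order of the probability ratio over all record pairs. It maximizes the entropy with respect to the MEC set.
- III. Iterate until the estimates stop changing between successive iterations, or change by less than a small positive value.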
A theoretical convergence
property of the proposed algorithm and its proof are presented in the supplementary
materials.
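Step II.ii of the algorithm, forming an MEC set by deduplication in the descending order of the probability ratio, can be sketched as below. The function name `mec_set`, the dictionary layout mapping a record pair to its probability ratio, and the toy values are assumptions for illustration. How the ratios are estimated is outside the scope of the sketch.

```python
from typing import Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]  # (record in first file, record in second file)


def mec_set(ratios: Dict[Pair, float], size: int) -> List[Pair]:
    """Greedy one-to-one selection in descending order of the ratio.

    A pair is skipped when either of its records is already linked,
    which is the deduplication step; selection stops at `size` pairs.
    """
    used_a, used_b = set(), set()
    chosen: List[Pair] = []
    for (a, b), _ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        if len(chosen) >= size:
            break
        if a in used_a or b in used_b:
            continue  # record already linked by a pair with a larger ratio
        chosen.append((a, b))
        used_a.add(a)
        used_b.add(b)
    return chosen


# Toy ratios (hypothetical values): ("a1", "b2") competes with two better pairs.
ratios = {("a1", "b1"): 9.0, ("a1", "b2"): 7.0, ("a2", "b2"): 5.0, ("a3", "b3"): 2.0}
selected = mec_set(ratios, size=2)
```

In the toy data the pair `("a1", "b2")` has the second largest ratio but is skipped, because `a1` is already linked to `b1`.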
Notice that, insofar as the set of all record pairs is highly imbalanced, with the prevalence of matched pairs very close to 0, one could simply ignore the contributions from the matched pairs and fit the model (3.3) for the unmatched pairs to all the record pairs, in which case there is no updating of the corresponding parameters. Other possibilities of estimation will be discussed in Section 5.2.
Table 4.1 provides an overview of MEC for record linkage in the supervised or unsupervised setting. In the supervised setting, one observes the comparison outcomes for the matched record pairs, so that their probability can be estimated from them directly. In the unsupervised setting, by contrast, one cannot separate the estimation of this probability from that of the number of matched record pairs.
Table 4.1
MEC for record linkage in supervised or unsupervised setting
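| | Supervised | Unsupervised |
|---|---|---|
| Match status of record pairs | Observed | Unobserved |
| Probability ratio | Generally applicable | Generally applicable, under assumptions |
| Model of comparison vector | Multinomial if only discrete comparison scores; directly or via key variables and measurement errors | Multinomial if only discrete comparison scores; directly or via key variables and measurement errors |
| MEC set | Guided by FLR and MMR; requires an estimate of the number of matched pairs in addition | Guided by FLR and MMR; requires an estimate of the number of matched pairs in addition |
| Estimation | Directly from the observed matched pairs, and by (3.7) outside them | Jointly by (3.7) and (3.9) |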
4.2 Error rates
MEC for record linkage should generally be guided by the error rates, FLR and MMR, without being restricted to the estimate of the number of matched pairs. Note that the probability ratios of the pairs in any MEC set are among the largest ones over all record pairs, because MEC follows the descending order of the probability ratio, except for necessary deduplication when there are multiple pairs involving a given record. To exercise greater control of the FLR, fix a target FLR and consider the following bisection procedure.
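- i. Choose a threshold value for the probability ratio and form the corresponding MEC set, such that the probability ratio of every record pair in the set is no smaller than the threshold.
- ii. Calculate the estimated FLR of the resulting MEC set. If it is above the target, then increase the threshold; if it is below the target, then reduce the threshold.

Iteration between the two steps would eventually lead to a threshold value that makes the estimated FLR as close as possible to the target, for the given probability ratio.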
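The bisection procedure above can be sketched as follows. Several ingredients are assumptions for illustration rather than taken from the text: the estimated FLR is computed as one minus the average estimated match probability of the selected pairs, the estimated FLR is treated as non-increasing in the threshold, and the names `mec_set_at`, `estimated_flr`, `bisect_threshold` and the toy values are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def mec_set_at(ratios: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Pairs with ratio >= threshold, deduplicated in descending order."""
    used_a, used_b = set(), set()
    chosen: List[Pair] = []
    for (a, b), ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        if ratio < threshold:
            break
        if a not in used_a and b not in used_b:
            chosen.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return chosen


def estimated_flr(chosen: List[Pair], match_prob: Dict[Pair, float]) -> float:
    """Assumed estimator: share of selected pairs expected to be false links."""
    if not chosen:
        return 0.0  # convention used here: an empty set has no false links
    return 1.0 - sum(match_prob[p] for p in chosen) / len(chosen)


def bisect_threshold(ratios: Dict[Pair, float], match_prob: Dict[Pair, float],
                     target: float, iters: int = 50) -> float:
    """Search for the smallest threshold whose estimated FLR is <= target."""
    lo, hi = min(ratios.values()), max(ratios.values())
    for _ in range(iters):
        mid = (lo + hi) / 2
        if estimated_flr(mec_set_at(ratios, mid), match_prob) > target:
            lo = mid  # estimated FLR too high: increase the threshold
        else:
            hi = mid  # target met: try a smaller threshold
    return hi


# Toy inputs (hypothetical values); every pair involves distinct records.
ratios = {("a1", "b1"): 8.0, ("a2", "b2"): 4.0, ("a3", "b3"): 2.0, ("a4", "b4"): 1.0}
match_prob = {("a1", "b1"): 0.99, ("a2", "b2"): 0.95,
              ("a3", "b3"): 0.70, ("a4", "b4"): 0.30}
threshold = bisect_threshold(ratios, match_prob, target=0.05)
final_set = mec_set_at(ratios, threshold)
```

With these toy values the search settles just above a ratio of 2, so the two most certain pairs are kept and the third pair, which would push the estimated FLR above the target, is excluded.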
The final MEC set can be chosen in light of the corresponding FLR estimate. It is also possible to take into consideration the estimated MMR, which depends on the estimated number of matched pairs given by the unsupervised MEC algorithm. Note that if the size of the MEC set equals this estimate, then the estimated FLR and the estimated MMR coincide, but not if the MEC set is guided by a given target value of FLR or MMR.
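The relation between the two estimated error rates can be illustrated with a small sketch. The estimators used here are assumptions for illustration, not formulas quoted from the text: the expected number of true links in the MEC set is taken to be the sum of the estimated match probabilities of its pairs, which is divided by the size of the set for the FLR and by the estimated number of matched pairs for the MMR. The function name `estimated_error_rates` and the toy values are hypothetical.

```python
from typing import List, Tuple


def estimated_error_rates(selected_probs: List[float],
                          n_matches_hat: float) -> Tuple[float, float]:
    """Return (estimated FLR, estimated MMR) under the assumed estimators.

    selected_probs: estimated match probabilities of the pairs in the MEC set.
    n_matches_hat:  estimated number of matched pairs over all record pairs.
    """
    expected_true_links = sum(selected_probs)
    flr = 1.0 - expected_true_links / len(selected_probs)
    mmr = 1.0 - expected_true_links / n_matches_hat
    return flr, mmr


probs = [0.99, 0.95, 0.70]  # hypothetical MEC set of three pairs

# Size of the MEC set equals the estimated number of matches: rates coincide.
flr_equal, mmr_equal = estimated_error_rates(probs, n_matches_hat=3)

# A larger estimated number of matches leaves the FLR unchanged but raises the MMR.
flr_other, mmr_other = estimated_error_rates(probs, n_matches_hat=4)
```

Under these assumed estimators the two rates share the same numerator, so they differ only through the denominator, which is why they agree exactly when the MEC set has the estimated number of matched pairs.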
In Section 6.2, we
investigate the performance of the MEC sets guided by the error rates through
simulations.
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© His Majesty the King in Right of Canada as represented by the Minister of Industry, 2022
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: Semi-annual
Ottawa