# A generalized Fellegi-Holt paradigm for automatic error localization 4. A generalized error localization problem

Let $\mathcal{G}$ be a finite set of allowed edit operations for a given application of automatic editing. Informally, I propose to generalize the error localization problem of Fellegi and Holt (1976) by replacing “the smallest subset of variables that can be imputed to make the record consistent” with “the shortest sequence of allowed edit operations that can be applied to make the record consistent”. To give a formal definition of this generalized error localization problem, some new notation and concepts need to be introduced.

Consider a sequence of points
$x={x}_{0},{x}_{1},\dots ,{x}_{t}=y$
in
${\mathbb{R}}^{p}.$
A *path* from
$x$
to
$y$
is defined as a sequence of *distinct* edit operations
${g}_{1},\dots ,{g}_{t}\in \mathcal{G}$
such that
${x}_{n}={g}_{n}\left({x}_{n-1}\right)$
for all
$n\in \left\{1,\dots ,t\right\}.$
(Note: In the case that
${g}_{n}$
contains free parameters, one should interpret
this equality as “there exist feasible parameter values such that
${g}_{n}$
maps
${x}_{n-1}$
to
${x}_{n}\u201d.)$
A path is denoted by
$P=\left[{g}_{1},\dots ,{g}_{t}\right].$
The set of all possible paths from
$x$
to
$y$
is denoted by
$\mathcal{P}\left(x,y\right).$ This set may be empty. Later, I will use
$\mathcal{P}\left(x;G\right)$ to denote, for a given subset
$G\subseteq \mathcal{G},$
the set of all paths starting in
$x$
that consist of the edit operations in
$G$
in some order (without specifying the free
parameters); if
$G$
contains
$t$
elements,
$\mathcal{P}\left(x;G\right)$ contains
$t!$
paths.

To each edit operation
$g\in \mathcal{G},$
one can associate a weight
${w}_{g}>0$
that expresses the costs of applying edit
operation
$g.$
In particular, the weight of an FH operation is
to be chosen equal to the confidence weight of the variable that it imputes.
Now the *length* of a path
$P=\left[{g}_{1},\dots ,{g}_{t}\right]$
can be defined as the sum of the weights of
its constituent edit operations:
$\ell \left(P\right)={\displaystyle {\sum}_{n=1}^{t}{w}_{{g}_{n}}},$
where, by convention, the empty path has
length zero. The *distance* from
$x$
to
$y$
is defined as the length of the shortest path
that connects
$x$
to
$y:$

$$d\left(x,y\right)=\{\begin{array}{ll}\mathrm{min}\left\{\ell \left(P\right)\text{|}P\in \mathcal{P}\left(x,y\right)\right\}\hfill & \text{if}\mathcal{P}\left(x,y\right)\ne \varnothing ,\hfill \\ \infty \hfill & \text{otherwise}\text{.}\hfill \end{array}$$

In
general,
$d\left(x,y\right)$
satisfies the standard axioms of a metric *except* that it need not be symmetric in
$x$
and
$y;$
it is a so-called *quasimetric* (Scholtus 2014). Accordingly,
$d\left(x,y\right)$
represents “the distance from
$x$
to
$y\u201d$
rather than “the distance between
$x$
and
$y\u201d.$

The distance from $x$ to any closed, non-empty subset $D\subseteq {\mathbb{R}}^{p}$ is defined as the distance to the nearest $y\in D:$ $d\left(x,D\right)=\mathrm{min}\left\{d\left(x,y\right)\text{|}y\in D\right\}.$ For the purpose of error localization, the closed, non-empty subset of ${\mathbb{R}}^{p}$ that is of particular interest is the set ${D}_{0}$ of all points that satisfy (2.1).

I can now formulate the generalized error localization problem.

**Problem.** Consider a given set of consistent records
${D}_{0},$
a given set of allowed edit operations
$\mathcal{G},$
and a given record
$x.$
If
$d\left(x,{D}_{0}\right)=\infty ,$
then the error localization problem for
$x$
is infeasible. Otherwise, any shortest path leading
to a record
$y\in {D}_{0}$
such that
$d\left(x,y\right)<\infty $
is called a *feasible
solution* to the error localization problem for
$x.$
A feasible solution is called *optimal* if it leads to a record
${x}^{*}\in {D}_{0}$
such that

$$d\left(x,{x}^{*}\right)=d\left(x,{D}_{0}\right).\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(4.1)$$

Formally, then, the generalized error localization problem consists of finding an optimal path of edit operations.

**Remark 1.** In general, there may be infinitely many
records
${x}^{*}$
in
${D}_{0}$
that satisfy (4.1) and can be reached by the
same path of edit operations. To solve the error localization problem, it is
sufficient to find an optimal path. Constructing an associated record
${x}^{*}\in {D}_{0}$
may then be regarded as a generalization of
the consistent imputation problem; cf. the discussion on imputation at the end
of Section 3.

**Remark 2.** The above error localization problem is
infeasible for records that cannot be mapped onto
${D}_{0}$
by any combination of distinct edit operations
in
$\mathcal{G}.$
To avoid this situation,
$\mathcal{G}$
should be chosen sufficiently large so that
$d\left(x,{D}_{0}\right)<\infty $
for all
$x\in {\mathbb{R}}^{p}.$
In what follows, I tacitly assume that
$\mathcal{G}$
has this property. An easy way
$\u2013$
not necessarily the only way
$\u2013$
to achieve this is by letting
$\mathcal{G}$
contain at least all FH operations. That this
is sufficient follows from the fact that any two points in
${\mathbb{R}}^{p}$
are connected by a path that concatenates the
FH operations associated with the coordinates on which they differ.

**Remark 3.** It is not difficult to see that the above error
localization problem reduces to the original problem of Fellegi and Holt (1976)
in the special case that
$\mathcal{G}$
contains only the FH operations.

**Remark 4.** As with the original Fellegi-Holt-based error
localization problem, it can be shown that, under certain assumptions,
minimizing
$d\left(x,y\right)$
over all
$y\in {D}_{0}$
for a given observed record
$x$
is approximately equivalent to maximizing the
likelihood of the associated unobserved error-free record. The argument closely
follows that of Kruskal (1983, pages 38-39) for the so-called Levenshtein
distance in the context of approximate string matching. This requires first of
all that the edits (2.1) be hard edits, i.e., failed only by erroneous values.
In addition, it must be assumed that the stochastic “error generating process”
$\mathcal{E}$
introduced in Section 3 has the following
properties:

- There exists a one-to-one correspondence between the set of errors that can occur under $\mathcal{E}$ and the set of allowed edit operations $\mathcal{G}$ that correct them.
- The errors in $\mathcal{E}$ occur independently of each other.
- The error corresponding to operation $g$ occurs with known probability ${p}_{g}.$

Finally, analogous to (2.3), the weights ${w}_{g}$ should be chosen according to

$${w}_{g}=-\mathrm{log}\left(\frac{{p}_{g}}{1-{p}_{g}}\right).\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}\text{\hspace{1em}}(4.2)$$

Under these assumptions, Scholtus (2014) adapted the argument of Kruskal (1983) to show that the optimal solution to error localization problem (4.1) can be justified as an approximate maximum likelihood estimator. [Note: The derivation in Scholtus (2014) assumed in addition that all ${p}_{g}\ll 1,$ in which case ${w}_{g}\approx -\mathrm{log}{p}_{g}.$ This assumption is unnecessary; cf. Liepins (1980).]

## Report a problem on this page

Is something not working? Is there information outdated? Can't find what you're looking for?

Please contact us and let us know how we can help you.

- Date modified: