# A generalized Fellegi-Holt paradigm for automatic error localization

## 4. A generalized error localization problem

Let $\mathcal{G}$ be a finite set of allowed edit operations for a given application of automatic editing. Informally, I propose to generalize the error localization problem of Fellegi and Holt (1976) by replacing “the smallest subset of variables that can be imputed to make the record consistent” with “the shortest sequence of allowed edit operations that can be applied to make the record consistent”. To give a formal definition of this generalized error localization problem, some new notation and concepts need to be introduced.

Consider two points $x, y \in \mathbb{R}^p$. A path from $x$ to $y$ is defined as a sequence of distinct edit operations $g_1, \dots, g_t \in \mathcal{G}$ together with intermediate points $x = x_0, x_1, \dots, x_t = y$ in $\mathbb{R}^p$ such that $x_n = g_n(x_{n-1})$ for all $n \in \{1, \dots, t\}$. (Note: In the case that $g_n$ contains free parameters, one should interpret this equality as “there exist feasible parameter values such that $g_n$ maps $x_{n-1}$ to $x_n$”.) A path is denoted by $P = [g_1, \dots, g_t]$. The set of all possible paths from $x$ to $y$ is denoted by $\mathcal{P}(x, y)$; this set may be empty. Later, I will use $\mathcal{P}(x; G)$ to denote, for a given subset $G \subseteq \mathcal{G}$, the set of all paths starting in $x$ that consist of the edit operations in $G$ in some order (without specifying the free parameters); if $G$ contains $t$ elements, $\mathcal{P}(x; G)$ contains $t!$ paths.
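
As a concrete illustration, a path can be modeled as a list of functions applied in sequence. The following sketch is not from the paper; the operation names (`swap`, `drop_sign`) and the helper `apply_path` are invented for illustration:

```python
def apply_path(x, path):
    """Apply edit operations g_1, ..., g_t in order; return the final record."""
    for g in path:
        x = g(x)
    return x

# Two toy edit operations on 2-dimensional records (tuples):
swap = lambda r: (r[1], r[0])            # interchange the two fields
drop_sign = lambda r: (abs(r[0]), r[1])  # remove a sign error in field 1

x = (-3.0, 5.0)
y = (5.0, 3.0)
# The sequence [drop_sign, swap] is a path from x to y:
assert apply_path(x, [drop_sign, swap]) == y
# The empty path connects every record to itself:
assert apply_path(x, []) == x
```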

To each edit operation $g \in \mathcal{G}$, one can associate a weight $w_g > 0$ that expresses the cost of applying edit operation $g$. In particular, the weight of an FH operation is to be chosen equal to the confidence weight of the variable that it imputes. Now the length of a path $P = [g_1, \dots, g_t]$ can be defined as the sum of the weights of its constituent edit operations: $\ell(P) = \sum_{n=1}^{t} w_{g_n}$, where, by convention, the empty path has length zero. The distance from $x$ to $y$ is defined as the length of the shortest path that connects $x$ to $y$:

$d(x, y) = \min\left\{\ell(P) \mid P \in \mathcal{P}(x, y)\right\},$

with the convention that $d(x, y) = \infty$ if $\mathcal{P}(x, y)$ is empty.

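
Since paths consist of distinct edit operations, for a finite set $\mathcal{G}$ the distance $d(x, y)$ can in principle be computed by enumerating all ordered arrangements of subsets of $\mathcal{G}$. A minimal brute-force sketch; all names and the toy operations are assumptions for illustration, not the author's implementation:

```python
from itertools import chain, combinations, permutations

def distance(x, y, ops):
    """d(x, y): length of a shortest path of distinct edit operations
    mapping x to y; float('inf') if the set P(x, y) of paths is empty.
    `ops` is a list of (function, weight) pairs standing in for G."""
    best = 0.0 if x == y else float('inf')   # empty path has length zero
    subsets = chain.from_iterable(
        combinations(range(len(ops)), t) for t in range(1, len(ops) + 1))
    for subset in subsets:
        for order in permutations(subset):   # a t-element subset gives t! paths
            z = x
            for i in order:
                z = ops[i][0](z)
            if z == y:
                best = min(best, sum(ops[i][1] for i in order))
    return best

# Two toy operations on 2-dimensional records, with weights w_g:
swap = lambda r: (r[1], r[0])      # interchange the two fields (weight 1)
negate = lambda r: (-r[0], r[1])   # flip the sign of field 1   (weight 2)
ops = [(swap, 1.0), (negate, 2.0)]

assert distance((3.0, 5.0), (5.0, 3.0), ops) == 1.0   # path [swap]
assert distance((3.0, 5.0), (5.0, -3.0), ops) == 3.0  # path [negate, swap]
assert distance((3.0, 5.0), (7.0, 7.0), ops) == float('inf')
```

The enumeration is exponential in $|\mathcal{G}|$, so this is only meant to make the definition concrete, not to suggest a practical algorithm.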

In general, $d(x, y)$ satisfies the standard axioms of a metric except that it need not be symmetric in $x$ and $y$; it is a so-called quasimetric (Scholtus 2014). Accordingly, $d(x, y)$ represents “the distance from $x$ to $y$” rather than “the distance between $x$ and $y$”.

The distance from $x$ to any closed, non-empty subset $D \subseteq \mathbb{R}^p$ is defined as the distance to the nearest $y \in D$: $d(x, D) = \min\left\{d(x, y) \mid y \in D\right\}$. For the purpose of error localization, the closed, non-empty subset of $\mathbb{R}^p$ that is of particular interest is the set $D_0$ of all points that satisfy (2.1).

I can now formulate the generalized error localization problem.

Problem. Consider a given set of consistent records ${D}_{0},$ a given set of allowed edit operations $\mathcal{G},$ and a given record $x.$ If $d\left(x,{D}_{0}\right)=\infty ,$ then the error localization problem for $x$ is infeasible. Otherwise, any shortest path leading to a record $y\in {D}_{0}$ such that $d\left(x,y\right)<\infty$ is called a feasible solution to the error localization problem for $x.$ A feasible solution is called optimal if it leads to a record ${x}^{*}\in {D}_{0}$ such that

$d\left(x,{x}^{*}\right)=d\left(x,{D}_{0}\right).\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.1\right)$

Formally, then, the generalized error localization problem consists of finding an optimal path of edit operations.
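
A brute-force sketch of this search for a tiny numerical record may help fix ideas. The edit rules (a balance edit and a sign edit) and the two operations below are invented for illustration, and the exhaustive enumeration is only meant to make the definition concrete:

```python
from itertools import chain, combinations, permutations

def localize(x, ops, satisfies_edits):
    """Return (optimal path as indices into `ops`, d(x, D0));
    (None, float('inf')) if the problem is infeasible for x."""
    if satisfies_edits(x):
        return [], 0.0                       # x is already consistent
    best_path, best_len = None, float('inf')
    subsets = chain.from_iterable(
        combinations(range(len(ops)), t) for t in range(1, len(ops) + 1))
    for subset in subsets:
        for order in permutations(subset):
            z = x
            for i in order:
                z = ops[i][0](z)
            length = sum(ops[i][1] for i in order)
            if satisfies_edits(z) and length < best_len:
                best_path, best_len = list(order), length
    return best_path, best_len

# Record (turnover, costs, profit); D0 is given by the balance edit
# profit = turnover - costs together with the sign edit turnover >= 0.
ok = lambda r: r[2] == r[0] - r[1] and r[0] >= 0
flip_sign = (lambda r: (-r[0], r[1], r[2]), 1.0)         # undo a sign error
fix_profit = (lambda r: (r[0], r[1], r[0] - r[1]), 2.0)  # recompute profit

path, d = localize((-100.0, 60.0, -160.0), [flip_sign, fix_profit], ok)
assert path == [0, 1] and d == 3.0   # flip the sign, then restore the balance
```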

Remark 1. In general, there may be infinitely many records ${x}^{*}$ in ${D}_{0}$ that satisfy (4.1) and can be reached by the same path of edit operations. To solve the error localization problem, it is sufficient to find an optimal path. Constructing an associated record ${x}^{*}\in {D}_{0}$ may then be regarded as a generalization of the consistent imputation problem; cf. the discussion on imputation at the end of Section 3.

Remark 2. The above error localization problem is infeasible for records that cannot be mapped onto ${D}_{0}$ by any combination of distinct edit operations in $\mathcal{G}.$ To avoid this situation, $\mathcal{G}$ should be chosen sufficiently large so that $d\left(x,{D}_{0}\right)<\infty$ for all $x\in {ℝ}^{p}.$ In what follows, I tacitly assume that $\mathcal{G}$ has this property. An easy way (not necessarily the only way) to achieve this is by letting $\mathcal{G}$ contain at least all FH operations. That this is sufficient follows from the fact that any two points in ${ℝ}^{p}$ are connected by a path that concatenates the FH operations associated with the coordinates on which they differ.
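
The closing argument of this remark can be made concrete: imputing, one coordinate at a time, each coordinate on which $x$ and $y$ differ yields a path from $x$ to $y$. A small sketch, where the helper `fh_op` is an assumed, simplified stand-in for an FH operation with its free parameter already fixed:

```python
def fh_op(j, value):
    """Simplified FH operation: impute coordinate j with `value`
    (a specific choice of the operation's free parameter)."""
    return lambda x: x[:j] + (value,) + x[j + 1:]

x = (1.0, 2.0, 3.0)
y = (1.0, 5.0, 7.0)
diff = [j for j in range(len(x)) if x[j] != y[j]]   # coordinates 1 and 2
z = x
for g in [fh_op(j, y[j]) for j in diff]:            # any order works
    z = g(z)
assert z == y
```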

Remark 3. It is not difficult to see that the above error localization problem reduces to the original problem of Fellegi and Holt (1976) in the special case that $\mathcal{G}$ contains only the FH operations.

Remark 4. As with the original Fellegi-Holt-based error localization problem, it can be shown that, under certain assumptions, minimizing $d\left(x,y\right)$ over all $y\in {D}_{0}$ for a given observed record $x$ is approximately equivalent to maximizing the likelihood of the associated unobserved error-free record. The argument closely follows that of Kruskal (1983, pages 38-39) for the so-called Levenshtein distance in the context of approximate string matching. This requires first of all that the edits (2.1) be hard edits, i.e., failed only by erroneous values. In addition, it must be assumed that the stochastic “error generating process” $ℰ$ introduced in Section 3 has the following properties:

• There exists a one-to-one correspondence between the set of errors that can occur under $ℰ$ and the set of allowed edit operations $\mathcal{G}$ that correct them.
• The errors in $ℰ$ occur independently of each other.
• The error corresponding to operation $g$ occurs with known probability ${p}_{g}.$

Finally, analogous to (2.3), the weights ${w}_{g}$ should be chosen according to

${w}_{g}=-\mathrm{log}\left(\frac{{p}_{g}}{1-{p}_{g}}\right).\text{ }\text{ }\text{ }\text{ }\text{ }\left(4.2\right)$

Under these assumptions, Scholtus (2014) adapted the argument of Kruskal (1983) to show that the optimal solution to error localization problem (4.1) can be justified as an approximate maximum likelihood estimator. [Note: The derivation in Scholtus (2014) assumed in addition that all ${p}_{g}\ll 1,$ in which case ${w}_{g}\approx -\mathrm{log}{p}_{g}.$ This assumption is unnecessary; cf. Liepins (1980).]
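
A quick numeric check of the weight choice (4.2), viewed as the log-odds against the error occurring, and of the approximation $w_g \approx -\log p_g$ for small $p_g$ mentioned in the note:

```python
import math

def weight(p_g):
    """w_g = -log(p_g / (1 - p_g)), as in (4.2)."""
    return -math.log(p_g / (1.0 - p_g))

# A rarer error gets a larger weight, so it is corrected less readily:
assert weight(0.01) > weight(0.10)
# For p_g << 1 the weight is close to -log p_g:
assert abs(weight(0.001) - (-math.log(0.001))) < 0.002
```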
