# A generalized Fellegi-Holt paradigm for automatic error localization 6. An error localization algorithm

In this section, I propose a relatively simple algorithm to solve the error localization problem of Section 4, using the theoretical result from the previous section.

## Description for Figure 6.1

Figure describing an algorithm that finds all optimal paths of edit operations for problem (4.1).

**Step 0 :** Let $x$
be a given record and $\mathcal{G}$ a given set of allowed edit operations. Initialize $\mathcal{L}:=\varnothing ;$
${\mathcal{B}}_{0}:=\{\varnothing \};$
$W:=\infty ;$ and $t:=1.$

**Step 1 : **Determine all subsets $G\subseteq \mathcal{G}$
of cardinality $t$ that satisfy these conditions:

- Every subset of $t-1$ elements in $G$ is part of ${\mathcal{B}}_{t-1}.$
- It holds that ${\sum}_{g\in G}{w}_{g}\le W}.$

**Step 2 : **For each $G$ found in step 1, construct $\mathcal{P}\left(x;G\right)$
and, for each path $P\in \mathcal{P}\left(x;G\right),$ evaluate whether it can lead to a consistent record. If so, then:

- if $\ell \left(P\right)<W,$ define $\mathcal{L}:=\left\{P\right\}$ and $W:=\ell \left(P\right);$
- if $\ell \left(P\right)=W,$ define $\mathcal{L}:=\mathcal{L}\cup \left\{P\right\}.$

If *none* of the paths $P\in \mathcal{P}\left(x;G\right)$ lead to a consistent record, add $G$ to ${\mathcal{B}}_{t}.$

**Step 3 : **If $t<R$
and ${\mathcal{B}}_{t}\ne \varnothing ,$
define $t:=t+1$ and return to step 1.

In practical applications of error localization in official statistics, it is not unusual to have records of over 100 variables. To obtain a problem that is computationally feasible, existing applications of automatic editing based on the Fellegi-Holt paradigm usually specify an upper bound $M$ on the number of variables that may be imputed in a single record (e.g., $M=12$ or $M=15).$ de Waal and Coutinho (2005) argued that the introduction of such an upper bound is reasonable because a record that requires more than, say, fifteen imputations should be considered unfit for automatic editing anyway. Following this tradition, one can also introduce an upper bound $R$ on the number of distinct edit operations that may be applied to a single record. Even with this additional restriction, the search space of potential solutions to (4.1) will usually be too large in practice to find the optimal solution by an exhaustive search.

Figure 6.1 summarizes the proposed error localization
algorithm. Its basic set-up was inspired by the *apriori algorithm* of Agrawal and Srikant (1994) for data mining.
Upon completion, the algorithm returns a set
$\mathcal{L}$
containing all paths of allowed edit
operations that correspond to an optimal solution to (4.1), as well as the
optimal path length
$W.$
[Note: An error localization problem may have
multiple optimal solutions, and it may be beneficial to find all of them (Giles
1988; de Waal et al. 2011, pages 66-67).]

After initialization in step 0, the algorithm cycles through steps 1, 2, and 3 at most $R$ times. In step 1 of the algorithm, the search space is limited by using the following fact: if $G$ has a proper subset $H\subset G$ for which $\mathcal{P}\left(x;H\right)$ contains a path that leads to a consistent record, then $\mathcal{P}\left(x;G\right)$ can contain only suboptimal solutions. Thus, any set $G$ that has such a subset may be ignored by the algorithm. Similarly, $G$ may also be ignored whenever the total weight of the edit operations in $G$ exceeds the path length of the best feasible solution found so far.

During the ${t}^{\text{th}}$ iteration, the number of subsets $G$ encountered in step 1 of the algorithm equals $\left(\begin{array}{c}N\\ t\end{array}\right).$ For each of these subsets, the conditions in step 1 have to be checked. If a subset $G$ passes these checks, in step 2 all $t!$ paths in $\mathcal{P}\left(x;G\right)$ are evaluated using the theory of Section 5. The idea behind the apriori algorithm is that, as $t$ becomes larger, the majority of subsets will not pass the checks in the first step, so that the total amount of computational work remains limited. In the context of data mining, this desirable behavior has indeed been observed in practice. Whether it also occurs in the context of error localization remains to be seen.

One possible improvement to the algorithm can be made
by observing that the order in which edit operations are applied does not
matter in all cases. Sometimes two paths in
$\mathcal{P}\left(x;G\right)$ are *equivalent* in the sense
that any record that can be reached from
$x$
by the first path can also be reached by the
second path, and vice versa. This property defines an equivalence relation on
$\mathcal{P}\left(x;G\right).$ Let
$\tilde{\mathcal{P}}\left(x;G\right)$ be a set that contains one representative from each equivalence class
of
$\mathcal{P}\left(x;G\right)$ under this relation. Clearly, the algorithm in Figure 6.1 remains
correct if in step 2 the search is limited to
$\tilde{\mathcal{P}}\left(x;G\right)$ instead of
$\mathcal{P}\left(x;G\right).$ Scholtus (2014) provides a simple method for constructing
$\tilde{\mathcal{P}}\left(x;G\right)$ from
$\mathcal{P}\left(x;G\right).$

A detailed example illustrating the above algorithm can be found in Scholtus (2014).

- Date modified: