A generalized Fellegi-Holt paradigm for automatic error localization 6. An error localization algorithm
In this section, I propose a relatively simple algorithm to solve the error localization problem of Section 4, using the theoretical result from the previous section.
Description for Figure 6.1
Figure describing an algorithm that finds all optimal paths of edit operations for problem (4.1).
Step 0 : Let be a given record and a given set of allowed edit operations. Initialize and
Step 1 : Determine all subsets of cardinality that satisfy these conditions:
- Every subset of elements in is part of
- It holds that
Step 2 : For each found in step 1, construct and, for each path evaluate whether it can lead to a consistent record. If so, then:
- if define and
- if define
If none of the paths lead to a consistent record, add to
Step 3 : If and define and return to step 1.
In practical applications of error localization in official statistics, it is not unusual to have records of over 100 variables. To obtain a problem that is computationally feasible, existing applications of automatic editing based on the Fellegi-Holt paradigm usually specify an upper bound on the number of variables that may be imputed in a single record (e.g., or de Waal and Coutinho (2005) argued that the introduction of such an upper bound is reasonable because a record that requires more than, say, fifteen imputations should be considered unfit for automatic editing anyway. Following this tradition, one can also introduce an upper bound on the number of distinct edit operations that may be applied to a single record. Even with this additional restriction, the search space of potential solutions to (4.1) will usually be too large in practice to find the optimal solution by an exhaustive search.
Figure 6.1 summarizes the proposed error localization algorithm. Its basic set-up was inspired by the apriori algorithm of Agrawal and Srikant (1994) for data mining. Upon completion, the algorithm returns a set containing all paths of allowed edit operations that correspond to an optimal solution to (4.1), as well as the optimal path length [Note: An error localization problem may have multiple optimal solutions, and it may be beneficial to find all of them (Giles 1988; de Waal et al. 2011, pages 66-67).]
After initialization in step 0, the algorithm cycles through steps 1, 2, and 3 at most times. In step 1 of the algorithm, the search space is limited by using the following fact: if has a proper subset for which contains a path that leads to a consistent record, then can contain only suboptimal solutions. Thus, any set that has such a subset may be ignored by the algorithm. Similarly, may also be ignored whenever the total weight of the edit operations in exceeds the path length of the best feasible solution found so far.
During the iteration, the number of subsets encountered in step 1 of the algorithm equals For each of these subsets, the conditions in step 1 have to be checked. If a subset passes these checks, in step 2 all paths in are evaluated using the theory of Section 5. The idea behind the apriori algorithm is that, as becomes larger, the majority of subsets will not pass the checks in the first step, so that the total amount of computational work remains limited. In the context of data mining, this desirable behavior has indeed been observed in practice. Whether it also occurs in the context of error localization remains to be seen.
One possible improvement to the algorithm can be made by observing that the order in which edit operations are applied does not matter in all cases. Sometimes two paths in are equivalent in the sense that any record that can be reached from by the first path can also be reached by the second path, and vice versa. This property defines an equivalence relation on Let be a set that contains one representative from each equivalence class of under this relation. Clearly, the algorithm in Figure 6.1 remains correct if in step 2 the search is limited to instead of Scholtus (2014) provides a simple method for constructing from
A detailed example illustrating the above algorithm can be found in Scholtus (2014).
- Date modified: