A generalized Fellegi-Holt paradigm for automatic error localization

6. An error localization algorithm

In this section, I propose a relatively simple algorithm to solve the error localization problem of Section 4, using the theoretical result from the previous section.

Figure 6.1: An algorithm that finds all optimal paths of edit operations for problem (4.1).

Step 0: Let $x$ be a given record and $\mathcal{G}$ a given set of allowed edit operations. Initialize $ℒ := \emptyset$; $ℬ_{0} := \{\emptyset\}$; $W := \infty$; and $t := 1$.

Step 1: Determine all subsets $G \subseteq \mathcal{G}$ of cardinality $t$ that satisfy these conditions:

Every subset of $t - 1$ elements in $G$ is part of $ℬ_{t - 1}$.
It holds that $\sum_{g \in G} w_{g} \leq W$.

Step 2: For each $G$ found in step 1, construct $\mathcal{P}(x; G)$ and, for each path $P \in \mathcal{P}(x; G)$, evaluate whether it can lead to a consistent record. If so, then:

if $ℓ(P) < W$, define $ℒ := \{P\}$ and $W := ℓ(P)$;
if $ℓ(P) = W$, define $ℒ := ℒ \cup \{P\}$.

If none of the paths $P \in \mathcal{P}(x; G)$ leads to a consistent record, add $G$ to $ℬ_{t}$.

Step 3: If $t < R$ and $ℬ_{t} \neq \emptyset$, define $t := t + 1$ and return to step 1.
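The steps of Figure 6.1 can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. It assumes that the length of a path equals the total weight of its edit operations (as the comparison with $W$ in step 1 suggests) and that a user-supplied function `can_reach_consistent` decides whether a path can lead to a consistent record; that function is a hypothetical stand-in for the test developed in Section 5. All function and parameter names are invented for the example.

```python
from itertools import combinations, permutations
from math import inf


def localize_errors(x, operations, weights, can_reach_consistent, R):
    """Return (optimal_paths, W) for record x, following steps 0 to 3.

    weights maps each operation label to its weight; R bounds the number
    of distinct edit operations applied to one record.
    """
    ops = sorted(operations)
    optimal_paths = []            # the set L, empty after step 0
    blocked_prev = {frozenset()}  # B_{t-1}; B_0 contains only the empty set
    W = inf
    t = 1
    while True:
        # Step 1: keep subsets whose (t-1)-subsets are all in B_{t-1}
        # and whose total weight does not exceed W.
        candidates = [
            combo
            for combo in combinations(ops, t)
            if all(frozenset(combo) - {g} in blocked_prev for g in combo)
            and sum(weights[g] for g in combo) <= W
        ]
        # Step 2: evaluate every ordering of each candidate subset.
        blocked_now = set()       # B_t
        for combo in candidates:
            feasible = False
            length = sum(weights[g] for g in combo)  # assumed path length
            for path in permutations(combo):
                if can_reach_consistent(x, path):
                    feasible = True
                    if length < W:
                        optimal_paths, W = [path], length
                    elif length == W:
                        optimal_paths.append(path)
            if not feasible:
                blocked_now.add(frozenset(combo))
        # Step 3: continue only while t < R and B_t is non-empty.
        if t < R and blocked_now:
            blocked_prev, t = blocked_now, t + 1
        else:
            return optimal_paths, W


def toy_oracle(x, path):
    # Hypothetical feasibility test: only these two paths repair the record.
    return path == ("c",) or path == ("a", "b")


paths, W = localize_errors(
    None, {"a", "b", "c"}, {"a": 1, "b": 1, "c": 2}, toy_oracle, R=3
)
print(paths, W)
```

In this toy run the single operation `c` and the ordered pair `a`, `b` both have length 2, so both paths are returned; the pairs containing `c` are never evaluated because `{c}` is not in $ℬ_{1}$. The equality test on path lengths is exact here because the weights are integers; with floating-point weights a tolerance would be needed.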

In practical applications of error localization in official statistics, it is not unusual to have records of over 100 variables. To obtain a problem that is computationally feasible, existing applications of automatic editing based on the Fellegi-Holt paradigm usually specify an upper bound $M$ on the number of variables that may be imputed in a single record (e.g., $M = 12$ or $M = 15$). De Waal and Coutinho (2005) argued that the introduction of such an upper bound is reasonable because a record that requires more than, say, fifteen imputations should be considered unfit for automatic editing anyway. Following this tradition, one can also introduce an upper bound $R$ on the number of distinct edit operations that may be applied to a single record. Even with this additional restriction, the search space of potential solutions to (4.1) will usually be too large in practice to find the optimal solution by an exhaustive search.

Figure 6.1 summarizes the proposed error localization algorithm. Its basic set-up was inspired by the apriori algorithm of Agrawal and Srikant (1994) for data mining. Upon completion, the algorithm returns a set $ℒ$ containing all paths of allowed edit operations that correspond to an optimal solution to (4.1), as well as the optimal path length $W$. [Note: An error localization problem may have multiple optimal solutions, and it may be beneficial to find all of them (Giles 1988; de Waal et al. 2011, pages 66-67).]

After initialization in step 0, the algorithm cycles through steps 1, 2, and 3 at most $R$ times. In step 1 of the algorithm, the search space is limited by using the following fact: if $G$ has a proper subset $H \subset G$ for which $\mathcal{P}(x; H)$ contains a path that leads to a consistent record, then $\mathcal{P}(x; G)$ can contain only suboptimal solutions. Thus, any set $G$ that has such a subset may be ignored by the algorithm. Similarly, $G$ may also be ignored whenever the total weight of the edit operations in $G$ exceeds the path length of the best feasible solution found so far.
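The first pruning rule can be motivated by a short derivation. It rests on an assumption that is suggested by the second condition in step 1 but not restated in this section: every path that uses exactly the operations in a set has length equal to that set's total weight, and all weights $w_g$ are strictly positive. Let $P_H$ be a path through $H$ that leads to a consistent record, and let $P_G$ be any path through a set $G \supset H$. Then:

```latex
\ell(P_G) = \sum_{g \in G} w_g
          = \sum_{g \in H} w_g + \sum_{g \in G \setminus H} w_g
          > \sum_{g \in H} w_g
          = \ell(P_H) \geq W .
```

The last inequality holds because $H$ has fewer elements than $G$ and is therefore processed in an earlier iteration, after which $W$ is at most $\ell(P_H)$. Under the stated assumption, no path through $G$ can match the best length found so far, so skipping $G$ loses no optimal solution.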

During the $t^{\text{th}}$ iteration, the number of subsets $G$ encountered in step 1 of the algorithm equals $\binom{N}{t}$. For each of these subsets, the conditions in step 1 have to be checked. If a subset $G$ passes these checks, in step 2 all $t!$ paths in $\mathcal{P}(x; G)$ are evaluated using the theory of Section 5. The idea behind the apriori algorithm is that, as $t$ becomes larger, the majority of subsets will not pass the checks in the first step, so that the total amount of computational work remains limited. In the context of data mining, this desirable behavior has indeed been observed in practice. Whether it also occurs in the context of error localization remains to be seen.
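To give a feel for these counts, the following sketch tabulates, for each iteration $t$, the number of subsets checked in step 1 and the number of paths per subset that step 2 would evaluate if no subset were pruned. The function name and the choice of ten edit operations are illustrative assumptions; the figures are worst-case counts, not measured running times.

```python
from math import comb, factorial


def worst_case_work(n_operations, max_size):
    """Rows of (t, subsets of size t, paths per subset, subsets * paths)."""
    rows = []
    for t in range(1, max_size + 1):
        subsets = comb(n_operations, t)   # binomial coefficient N over t
        paths = factorial(t)              # t! orderings of one subset
        rows.append((t, subsets, paths, subsets * paths))
    return rows


for row in worst_case_work(10, 4):
    print(row)
```

With ten operations, the fourth iteration already involves 210 subsets of 24 paths each, or 5040 path evaluations without pruning, which is why the checks in step 1 matter as $t$ grows.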

One possible improvement to the algorithm can be made by observing that the order in which edit operations are applied does not always matter. Sometimes two paths in $\mathcal{P}(x; G)$ are equivalent in the sense that any record that can be reached from $x$ by the first path can also be reached by the second path, and vice versa. This property defines an equivalence relation on $\mathcal{P}(x; G)$. Let $\tilde{\mathcal{P}}(x; G)$ be a set that contains one representative from each equivalence class of $\mathcal{P}(x; G)$ under this relation. Clearly, the algorithm in Figure 6.1 remains correct if in step 2 the search is limited to $\tilde{\mathcal{P}}(x; G)$ instead of $\mathcal{P}(x; G)$. Scholtus (2014) provides a simple method for constructing $\tilde{\mathcal{P}}(x; G)$ from $\mathcal{P}(x; G)$.
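The equivalence relation can be illustrated with a brute-force sketch that groups the orderings of $G$ by the set of records they can reach and keeps one path per group. This is not the construction given by Scholtus (2014); it applies the definition directly and is only practical for toy records with small finite domains. The two operation types ("impute" a variable with any value from `DOMAIN`, "swap" two variables) and all names are hypothetical.

```python
from itertools import permutations

DOMAIN = (0, 1)  # assumed finite domain of every variable in the toy record


def apply_operation(op, record):
    """Set of records obtainable from `record` by applying `op` once."""
    kind, *idx = op
    if kind == "impute":
        (i,) = idx
        return {record[:i] + (v,) + record[i + 1:] for v in DOMAIN}
    if kind == "swap":
        i, j = idx
        r = list(record)
        r[i], r[j] = r[j], r[i]
        return {tuple(r)}
    raise ValueError(f"unknown operation: {op!r}")


def reachable(x, path):
    """All records that can be reached from x along `path`."""
    current = {x}
    for op in path:
        current = {y for r in current for y in apply_operation(op, r)}
    return current


def representatives(x, G, reachable_fn=reachable):
    """One path per equivalence class of the orderings of G."""
    classes = {}
    for path in permutations(sorted(G)):
        key = frozenset(reachable_fn(x, path))
        classes.setdefault(key, path)  # keep the first path seen per class
    return list(classes.values())


x = (0, 1, 0)
print(representatives(x, {("impute", 0), ("impute", 2)}))
print(representatives(x, {("impute", 0), ("swap", 0, 1)}))
```

In the first call the two imputations can be applied in either order with the same reachable records, so a single representative suffices. In the second call, imputing before swapping reaches different records than swapping before imputing, so both orderings must be kept.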

A detailed example illustrating the above algorithm can be found in Scholtus (2014).

Survey Methodology, ISSN: 1492-0921


Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2016-06-22
