# A generalized Fellegi-Holt paradigm for automatic error localization

## 8. Conclusion

In this article, a new formulation of the error localization problem in automatic editing was proposed: find the (weighted) minimal number of edit operations needed to make an observed record consistent with the edits. This problem can be seen as a generalization of the problem posed in the seminal paper by Fellegi and Holt (1976), because the operation that imputes a new value for one variable at a time is an important special case of an edit operation.
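To make the formulation concrete, the following Python sketch (all names hypothetical; a brute-force search, not a practical algorithm) finds a minimum-weight set of edit operations that restores consistency with a set of linear edits:

```python
from itertools import combinations

def satisfies(record, edits, tol=1e-9):
    """Check a record against linear edits given as (coeffs, b, op),
    with op in {'>=', '=='}: sum_j a_j x_j + b >= 0 or == 0."""
    for coeffs, b, op in edits:
        val = sum(a * x for a, x in zip(coeffs, record)) + b
        if (op == '>=' and val < -tol) or (op == '==' and abs(val) > tol):
            return False
    return True

def localize(record, edits, operations):
    """operations: list of (weight, func), where func maps a record to
    an amended record. Brute force: try every subset of operations
    (applied in index order) and keep the cheapest consistent one."""
    best_w, best_set = float('inf'), None
    for r in range(len(operations) + 1):
        for subset in combinations(range(len(operations)), r):
            amended = record
            for i in subset:
                amended = operations[i][1](amended)
            w = sum(operations[i][0] for i in subset)
            if w < best_w and satisfies(amended, edits):
                best_w, best_set = w, subset
    return best_w, best_set

# Toy example: the balance edit x0 = x1 + x2 is violated.
edits = [((1, -1, -1), 0, '==')]
record = (100, 60, 30)
operations = [
    (1.0, lambda x: (x[1] + x[2], x[1], x[2])),  # impute x0 := x1 + x2
    (1.0, lambda x: (x[0], x[0] - x[2], x[2])),  # impute x1 := x0 - x2
    (0.5, lambda x: (x[0], x[2], x[1])),         # swap x1 and x2
]
print(localize(record, edits, operations))  # a single imputation suffices
```

Enumerating all subsets is exponential in the number of operations; the point of the sketch is only to state the objective, not to compete with the algorithm of Section 6.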

The main focus here has been on developing the mathematical theory behind the new error localization problem. It turns out that FM elimination, a technique that has been used in the past to solve the Fellegi-Holt-based error localization problem, can also be applied in the context of the new problem (Section 5). Nevertheless, solving the new error localization problem is computationally challenging, at least for the numbers of variables, edits, and edit operations that would be encountered in practical applications at statistical institutes. A possible error localization algorithm was outlined in Section 6. More efficient algorithms could, and probably should, be developed. As with FM elimination, it may be possible to adapt other ideas that have been used to solve the Fellegi-Holt-based problem to the generalized problem considered here.

The discussion in this article was restricted to numerical data and linear edits. The original Fellegi-Holt paradigm has also been applied to categorical and mixed data. Several authors, including Bruni (2004) and de Jonge and van der Loo (2014), have shown that a large class of edits for mixed data can be reformulated in terms of numerical data and linear edits, with the additional restriction that some of the variables have to be integer-valued. In principle, this means that the results in this article could also be applied to mixed data. To accommodate the fact that some variables are integer-valued, Pugh’s (1992) extension of FM elimination to integers could be used; see also de Waal et al. (2011) for a discussion of this extended elimination technique in the context of Fellegi-Holt-based error localization. It remains to be seen whether this approach is computationally feasible.

Remark 4 in Section 4 hinted at an analogy between error localization in statistical microdata and the field of approximate string matching. In approximate string matching, text strings are compared under the assumption that they may have been partially corrupted (Navarro 2001). Various distance functions have been proposed for this task. The Hamming distance, which counts the number of positions on which two strings differ, may be seen as an analogue of the Fellegi-Holt-based target function (2.2). The generalized error localization problem defined in this paper has its counterpart in the use of the Levenshtein distance or “edit distance” for approximate string matching. It may be interesting to explore this analogy further. In particular, efficient algorithms have been developed for computing edit distances between strings; it might be possible to apply some of the underlying ideas also to the generalized error localization problem.
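As a concrete illustration of the analogy, the Levenshtein distance can be computed with the textbook dynamic program (a standard algorithm, not part of this article's methodology), in which each insertion, deletion, or substitution counts as one edit operation:

```python
def levenshtein(s, t):
    """Levenshtein ("edit") distance via dynamic programming:
    prev[j] holds the distance between s[:i-1] and t[:j]."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute
        prev = curr
    return prev[-1]

# The Hamming distance is undefined here (unequal lengths); the edit
# distance handles insertions and deletions as well as substitutions.
print(levenshtein("kitten", "sitting"))  # 3
```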

The new error localization algorithm was applied successfully to a small synthetic data set (Section 7). Overall, the results of this simulation study suggest that the new error localization approach has the potential to substantially improve the quality of automatic editing compared to the approach that is currently used in practice. However, this does require that sufficient information be available to identify all – or at least most – of the relevant edit operations in a particular application. Possible gains in the quality of error localization also have to be weighed in practice against the higher computational demands of the generalized error localization problem.

An obvious candidate for applying the new methodology in practice would be the SBS. However, more research is needed before this method could be applied during regular production. To apply the method in a particular context, it is necessary first to specify the relevant edit operations. Ideally, each edit operation should correspond to a combination of amendments to the data that human editors consider to be a correction for one particular error. In addition, a suitable set of weights ${w}_{g}$ has to be determined for these edit operations. This would require information about the relative frequencies of the most common types of amendments made during manual editing. Both aspects could be investigated based on historical data before and after manual editing, editing instructions and other documentation used by the editors, and interviews with editors and/or supervisors of editing.

On a more fundamental level, a question of demarcation arises between deductive correction methods and automatic editing under the new error localization problem. In principle, many known types of error could be resolved either by automatic correction rules or by error localization using edit operations. Each approach has its own advantages and disadvantages (Scholtus 2014). It is likely that some compromise will produce the best results, with some errors handled deductively and others by edit operations. However, it is not obvious how best to make this division in practice.

Ultimately, the aim of the new methodology proposed in this article is to improve the usefulness of automatic editing in practice. So far, the results are promising.

## Acknowledgements

The views expressed in this article are those of the
author and do not necessarily reflect the policies of *Statistics Netherlands*. The author would like to thank Jeroen
Pannekoek, Ton de Waal, and Mark van der Loo for their comments on
earlier versions of this article, as well as the Associate Editor and two
anonymous referees.

## Appendix

### Fourier-Motzkin elimination

Consider a system of linear constraints (2.1) and let ${x}_{f}$ be the variable to be eliminated. First, suppose that ${x}_{f}$ is involved only in inequalities. For ease of exposition, suppose that the edits are normalized so that all inequalities use the $\ge $ operator. The FM elimination method considers all pairs $\left(r,s\right)$ of inequalities in which the coefficients of ${x}_{f}$ have opposite signs; that is, ${a}_{rf}{a}_{sf}<0.$ Suppose without loss of generality that ${a}_{rf}<0$ and ${a}_{sf}>0.$ From the original pair of edits, the following implied constraint is derived:

$$\sum_{j=1}^{p} a_{j}^{*} x_{j} + b^{*} \ge 0, \qquad (\text{A.1})$$

with ${a}_{j}^{*}={a}_{sf}{a}_{rj}-{a}_{rf}{a}_{sj}$ and ${b}^{*}={a}_{sf}{b}_{r}-{a}_{rf}{b}_{s}.$ Note that ${a}_{f}^{*}=0,$ so ${x}_{f}$ is not involved in (A.1). An inequality of the form (A.1) is derived from each of the above-mentioned pairs $\left(r,s\right).$ The full implied system of constraints obtained by FM elimination now consists of these derived constraints, together with all original constraints that do not involve ${x}_{f}.$
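The pairwise combination step can be sketched in Python as follows (a minimal illustration with a hypothetical representation: each inequality $\sum_j a_j x_j + b \ge 0$ is stored as a pair `(a, b)`):

```python
def fm_eliminate(constraints, f):
    """One FM elimination step for inequalities sum_j a[j]*x[j] + b >= 0.
    Eliminates variable f by combining every pair of constraints whose
    coefficients of x_f have opposite signs, as in (A.1)."""
    keep = [(a, b) for a, b in constraints if a[f] == 0]
    neg = [(a, b) for a, b in constraints if a[f] < 0]   # role of r
    pos = [(a, b) for a, b in constraints if a[f] > 0]   # role of s
    derived = []
    for a_r, b_r in neg:
        for a_s, b_s in pos:
            # a*_j = a_sf * a_rj - a_rf * a_sj; the coefficient of x_f is 0
            a_star = [a_s[f] * arj - a_r[f] * asj
                      for arj, asj in zip(a_r, a_s)]
            b_star = a_s[f] * b_r - a_r[f] * b_s
            derived.append((a_star, b_star))
    return keep + derived

# Example: x0 - 2 >= 0 and -x0 + 5 >= 0 imply the trivial edit 3 >= 0.
system = [([1, 0], -2), ([-1, 0], 5)]
print(fm_eliminate(system, 0))  # [([0, 0], 3)]
```

Note that the number of derived constraints can grow quadratically with each eliminated variable, which is the main computational drawback of FM elimination.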

If there are linear equalities that involve ${x}_{f},$ the above technique could be applied after replacing each linear equality with two equivalent linear inequalities. de Waal and Quere (2003) suggested a more efficient alternative for this case. Suppose that the ${r}^{\text{th}}$ constraint in (2.1) is an equality that involves ${x}_{f}.$ This constraint can be rewritten as

$$x_{f} = \frac{-1}{a_{rf}} \left( b_{r} + \sum_{j \ne f} a_{rj} x_{j} \right). \qquad (\text{A.2})$$

By substituting the expression on the right-hand side of (A.2) for ${x}_{f}$ in all other constraints, one again obtains an implied system of constraints that does not involve ${x}_{f}$ and that can be rewritten in the form (2.1).
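The substitution shortcut of de Waal and Quere (2003) can be sketched in the same hypothetical representation, now with an operator field `'>='` or `'=='` on each constraint:

```python
def substitute(constraints, r, f):
    """Eliminate x_f using equality constraint r by substituting the
    right-hand side of (A.2) into all other constraints."""
    a_r, b_r, _ = constraints[r]
    out = []
    for i, (a, b, op) in enumerate(constraints):
        if i == r:
            continue                  # the equality itself is dropped
        if a[f] == 0:
            out.append((a, b, op))    # x_f not involved: keep as is
            continue
        scale = a[f] / a_r[f]
        # new coefficient of x_f is a[f] - scale * a_r[f] = 0
        new_a = [aj - scale * arj for aj, arj in zip(a, a_r)]
        out.append((new_a, b - scale * b_r, op))
    return out

# Example: from x0 - x1 == 0 and x0 - 3 >= 0, derive x1 - 3 >= 0.
eq = ([1, -1], 0, '==')
ineq = ([1, 0], -3, '>=')
print(substitute([eq, ineq], 0, 0))
```

Unlike the pairwise combination step, this substitution adds no new constraints, which is why it is the more efficient choice whenever an equality involving ${x}_{f}$ is available.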

For a proof that FM elimination has the fundamental property mentioned in Section 2, see, e.g., de Waal et al. (2011, pages 69-70).

## References

Agrawal, R., and Srikant, R. (1994). *Fast
Algorithms for Mining Association Rules*. Technical report, IBM Almaden
Research Center, San Jose, California.

Bruni, R. (2004). Discrete models for data imputation. *Discrete Applied
Mathematics*, 144, 59-69.

Chen, B., Thibaudeau, Y. and Winkler, W.E. (2003). *A Comparison Study of ACS If-Then-Else, NIM, DISCRETE Edit and Imputation
Systems Using ACS Data*. Working Paper No. 7, UN/ECE Work Session on
Statistical Data Editing, Madrid.

de Jonge, E., and van der Loo, M. (2014). *Error Localization as a Mixed Integer Problem with the Editrules
Package*. Discussion Paper 2014-07, Statistics Netherlands, The Hague.
Available at: http://www.cbs.nl.

de Waal, T. (2003). Solving the error localization problem by means of vertex
generation. *Survey Methodology*,
29, 1, 71-79.

de Waal, T., and Coutinho, W. (2005). Automatic editing for business surveys:
An assessment for selected algorithms. *International Statistical Review*, 73, 73-102.

de Waal, T., and Quere, R. (2003). A fast and simple algorithm for
automatic editing of mixed data. *Journal of Official Statistics*, 19, 383-402.

de Waal, T., Pannekoek, J. and Scholtus, S. (2011). *Handbook of
Statistical Data Editing and Imputation*. Hoboken, New Jersey: John Wiley
& Sons, Inc.

EDIMBUS (2007). *Recommended Practices for Editing and Imputation in
Cross-Sectional Business Surveys*. Eurostat manual prepared by ISTAT,
Statistics Netherlands, and SFSO.

Fellegi, I.P., and Holt, D. (1976). A systematic approach to automatic edit
and imputation. *Journal of the American Statistical Association*, 71, 17-35.

Garfinkel, R.S., Kunnathur, A.S. and Liepins, G.E. (1988). Error localization
for erroneous data: Continuous data, linear constraints. *SIAM Journal on
Scientific and Statistical Computing*,
9, 922-931.

Ghosh-Dastidar, B., and Schafer, J.L. (2006). Outlier detection and editing
procedures for continuous multivariate data. *Journal of Official Statistics*, 22, 487-506.

Giles, P. (1988). A model for generalized edit and imputation of survey data. *The Canadian Journal of Statistics*,
16, 57-73.

Granquist, L. (1995). Improving the traditional editing process. In *Business
Survey Methods*, (Eds., B.G. Cox, D.A. Binder, B.N. Chinnappa,
A. Christianson, M.J. Colledge and P.S. Kott), John Wiley &
Sons, Inc., 385-401.

Granquist, L. (1997). The new view on editing. *International Statistical
Review*, 65, 381-387.

Granquist, L., and Kovar, J. (1997). Editing of survey data: How much is enough?
In *Survey Measurement and Process Quality*, (Eds., L.E. Lyberg, P. Biemer,
M. Collins, E.D. de Leeuw, C. Dippo, N. Schwartz and
D. Trewin), John Wiley & Sons, Inc., 415-435.

Hedlin, D. (2003). Score functions to reduce business survey editing at the
U.K. Office for National Statistics. *Journal of Official Statistics*, 19, 177-199.

Hidiroglou, M.A., and Berthelot, J.-M. (1986). Statistical editing and imputation
for periodic business surveys. *Survey
Methodology*, 12, 1, 73-83.

Kovar, J., and Whitridge, P. (1990). Generalized edit and imputation system;
Overview and applications. *Revista Brasileira de Estadistica*, 51, 85-100.

Kruskal, J.B. (1983). An overview of sequence comparison. In *Time Warps,
String Edits, and Macromolecules: The Theory and Practice of Sequence
Comparison*, (Eds., D. Sankoff and J.B. Kruskal), Addison-Wesley,
1-44.

Lawrence, D., and McKenzie, R. (2000). The general application of significance
editing. *Journal of Official Statistics*, 16, 243-253.

Liepins, G.E. (1980). *A Rigorous,
Systematic Approach to Automatic Data Editing and its Statistical Basis*.
Report ORNL/TM-7126, Oak Ridge National Laboratory.

Liepins, G.E., Garfinkel, R.S. and Kunnathur, A.S. (1982). Error localization
for erroneous data: A survey. *TIMS/Studies
in the Management Sciences*, 19, 205-219.

Little, R.J.A., and Smith, P.J. (1987). Editing and imputation of quantitative
survey data. *Journal of the American Statistical Association*, 82, 58-68.

Naus, J.I., Johnson, T.G. and Montalvo, R. (1972). A probabilistic model
for identifying errors in data editing. *Journal of the American Statistical
Association*, 67, 943-950.

Navarro, G. (2001). A guided tour to approximate string matching. *ACM Computing
Surveys*, 33, 31-88.

Pannekoek, J., Scholtus, S. and van der Loo, M. (2013). Automated and manual
data editing: A view on process design and methodology. *Journal of Official
Statistics*, 29, 511-537.

Pugh, W. (1992). The omega test: A fast and practical integer programming algorithm
for data dependence analysis. *Communications of the ACM*, 35, 102-114.

R Development Core Team (2015). *R: A Language and Environment for
Statistical Computing*. Vienna, Austria: R Foundation for Statistical
Computing. URL: http://www.R-project.org/.

Ragsdale, C.T., and McKeown, P.G. (1996). On solving the continuous data editing
problem. *Computers & Operations Research*, 23, 263-273.

Riera-Ledesma, J., and Salazar-González, J.J. (2003). *New Algorithms for the Editing and Imputation Problem*. Working
Paper No. 5, UN/ECE Work Session on Statistical Data Editing, Madrid.

Riera-Ledesma, J., and Salazar-González, J.J. (2007). A branch-and-cut algorithm
for the continuous error localization problem in data cleaning. *Computers & Operations Research*, 34,
2790-2804.

Schaffer, J. (1987). Procedure for solving the data-editing problem with both
continuous and discrete data types. *Naval Research Logistics*, 34, 879-890.

Scholtus, S. (2011). Algorithms for correcting sign errors and rounding errors
in business survey data. *Journal of Official Statistics*, 27, 467-490.

Scholtus, S. (2014). *Error
Localisation using General Edit Operations*. Discussion Paper 2014-14,
Statistics Netherlands, The Hague. Available at: http://www.cbs.nl.

Tempelman, D.C.G. (2007). *Imputation
of Restricted Data*. Ph.D. Thesis, University of Groningen.
Available at: http://www.cbs.nl.

van der Loo, M., and de Jonge,
E. (2012). *Automatic Data Editing with Open Source R*. Working Paper No.
33, UN/ECE Work Session on Statistical Data Editing, Oslo.

Williams, H.P. (1986). Fourier’s method of linear programming and its dual. *The American Mathematical Monthly*,
93, 681-695.
