A generalized Fellegi-Holt paradigm for automatic error localization 8. ConclusionA generalized Fellegi-Holt paradigm for automatic error localization 8. Conclusion

In this article, a new formulation was proposed of the error localization problem in automatic editing. It was suggested to find the (weighted) minimal number of edit operations needed to make an observed record consistent with the edits. The new error localization problem can be seen as a generalization of the problem proposed in a seminal paper by Fellegi and Holt (1976), because the operation that imputes a new value for one variable at a time is an important special case of an edit operation.

The main focus here has been on developing the mathematical theory behind the new error localization problem. It turns out that FM elimination, a technique that has been used in the past to solve the Fellegi-Holt-based error localization problem, can be applied also in the context of the new problem (Section 5). Nevertheless, the task of solving the new error localization problem is challenging from a computational point of view, at least for the numbers of variables, edits, and edit operations that would be encountered in practical applications at statistical institutes. A possible error localization algorithm was outlined in Section 6. More efficient algorithms probably could and should be developed. Similarly to FM elimination, it may be possible to adapt other ideas that have been used to solve the Fellegi-Holt-based problem to the generalized problem considered here.

The discussion in this article was restricted to numerical data and linear edits. The original Fellegi-Holt paradigm has been applied also to categorical and mixed data. Several authors, including Bruni (2004) and de Jonge and van der Loo (2014), have shown that a large class of edits for mixed data can be re-formulated in terms of numerical data and linear edits, with the additional restriction that some of the variables have to be integer-valued. In principle, this means that the results in this article could be applied also to mixed data. To accommodate the fact that some variables are integer-valued, Pugh’s (1992) extension of FM elimination to integers could be used; see also de Waal et al. (2011) for a discussion of this extended elimination technique in the context of Fellegi-Holt-based error localization. It remains to be seen whether this approach is computationally feasible.

Remark 4 in Section 4 hinted at an analogy between error localization in statistical microdata and the field of approximate string matching. In approximate string matching, text strings are compared under the assumption that they may have been partially corrupted (Navarro 2001). Various distance functions have been proposed for this task. The Hamming distance, which counts the number of positions on which two strings differ, may be seen as an analogue of the Fellegi-Holt-based target function (2.2). The generalized error localization problem defined in this paper has its counterpart in the use of the Levenshtein distance or “edit distance” for approximate string matching. It may be interesting to explore this analogy further. In particular, efficient algorithms have been developed for computing edit distances between strings; it might be possible to apply some of the underlying ideas also to the generalized error localization problem.

The new error localization algorithm was applied successfully to a small synthetic data set (Section 7). Overall, the results of this simulation study suggest that the new error localization approach has the potential to achieve a substantial improvement of the quality of automatic editing compared to the approach that is currently used in practice. However, this does require that sufficient information be available to identify all $-$ or at least most $-$ of the relevant edit operations in a particular application. Possible gains in the quality of error localization also have to be weighed in practice against the higher computational demands of the generalized error localization problem.

An obvious candidate for applying the new methodology in practice would be the SBS. However, more research is needed before this method could be applied during regular production. To apply the method in a particular context, it is necessary first to specify the relevant edit operations. Ideally, each edit operation should correspond to a combination of amendments to the data that human editors consider to be a correction for one particular error. In addition, a suitable set of weights $w_{g}$ has to be determined for these edit operations. This would require information about the relative frequencies of the most common types of amendments made during manual editing. Both aspects could be investigated based on historical data before and after manual editing, editing instructions and other documentation used by the editors, and interviews with editors and/or supervisors of editing.

On a more fundamental level, a question of demarcation arises between deductive correction methods and automatic editing under the new error localization problem. In principle, many known types of error could be resolved either by automatic correction rules or by error localization using edit operations. Each approach has its own advantages and disadvantages (Scholtus 2014). It is likely that some compromise will produce the best results, with some errors handled deductively and others by edit operations. However, it is not obvious how best to make this division in practice.

Ultimately, the aim of the new methodology proposed in this article is to improve the usefulness of automatic editing in practice. So far, the results are promising.

Acknowledgements

The views expressed in this article are those of the author and do not necessarily reflect the policies of Statistics Netherlands. The author would like to thank Jeroen Pannekoek, Ton de Waal, and Mark van der Loo for their comments on earlier versions of this article, as well as the Associate Editor and two anonymous referees.

Appendix

Fourier-Motzkin elimination

Consider a system of linear constraints (2.1) and let $x_{f}$ be the variable to be eliminated. First, suppose that $x_{f}$ is involved only in inequalities. For ease of exposition, suppose that the edits are normalized so that all inequalities use the $\geq$ operator. The FM elimination method considers all pairs $(r, s)$ of inequalities in which the coefficients of $x_{f}$ have opposite signs; that is, $a_{r f} a_{s f} < 0.$ Suppose without loss of generality that $a_{r f} < 0$ and $a_{s f} > 0.$ From the original pair of edits, the following implied constraint is derived:

$\sum_{j = 1}^{p} a_{j}^{*} x_{j} + b^{*} \geq 0, (A .1)$

with $a_{j}^{*} = a_{s f} a_{r j} - a_{r f} a_{s j}$ and $b^{*} = a_{s f} b_{r} - a_{r f} b_{s} .$ Note that $a_{f}^{*} = 0,$ so $x_{f}$ is not involved in (A.1). An inequality of the form (A.1) is derived from each of the above-mentioned pairs $(r, s) .$ The full implied system of constraints obtained by FM elimination now consists of these derived constraints, together with all original constraints that do not involve $x_{f} .$

If there are linear equalities that involve $x_{f},$ the above technique could be applied after replacing each linear equality with two equivalent linear inequalities. de Waal and Quere (2003) suggested a more efficient alternative for this case. Suppose that the $r^{th}$ constraint in (2.1) is an equality that involves $x_{f} .$ This constraint can be rewritten as

$x_{f} = \frac{- 1}{a_{r f}} (b_{r} + \sum_{j \neq f} a_{r j} x_{j}) . (A .2)$

By substituting the expression on the right-hand-side of (A.2) for $x_{f}$ in all other constraints, one again obtains an implied system of constraints that does not involve $x_{f}$ and that can be rewritten in the form (2.1).

For a proof that FM elimination has the fundamental property mentioned in Section 2, see, e.g., de Waal et al. (2011, pages 69-70).

References

Agrawal, R., and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Technical report, IBM Almaden Research Center, San Jose, California.

Bruni, R. (2004). Discrete models for data imputation. Discrete Applied Mathematics, 144, 59-69.

Chen, B., Thibaudeau, Y. and Winkler, W.E. (2003). A Comparison Study of ACS If-Then-Else, NIM, DISCRETE Edit and Imputation Systems Using ACS Data. Working Paper No. 7, UN/ECE Work Session on Statistical Data Editing, Madrid.

de Jonge, E., and van der Loo, M. (2014). Error Localization as a Mixed Integer Problem with the Editrules Package. Discussion Paper 2014-07, Statistics Netherlands, The Hague. Available at: http://www.cbs.nl.

de Waal, T. (2003). Solving the error localization problem by means of vertex generation. Survey Methodology, 29, 1, 71-79.

de Waal, T., and Coutinho, W. (2005). Automatic editing for business surveys: An assessment for selected algorithms. International Statistical Review, 73, 73-102.

de Waal, T., and Quere, R. (2003). A fast and simple algorithm for automatic editing of mixed data. Journal of Official Statistics, 19, 383-402.

de Waal, T., Pannekoek, J. and Scholtus, S. (2011). Handbook of Statistical Data Editing and Imputation. Hoboken, New Jersey: John Wiley & Sons, Inc.

EDIMBUS (2007). Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys. Eurostat manual prepared by ISTAT, Statistics Netherlands, and SFSO.

Fellegi, I.P., and Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Garfinkel, R.S., Kunnathur, A.S. and Liepins, G.E. (1988). Error localization for erroneous data: Continuous data, linear constraints. SIAM Journal on Scientific and Statistical Computing, 9, 922-931.

Ghosh-Dastidar, B., and Schafer, J.L. (2006). Outlier detection and editing procedures for continuous multivariate data. Journal of Official Statistics, 22, 487-506.

Giles, P. (1988). A model for generalized edit and imputation of survey data. The Canadian Journal of Statistics, 16, 57-73.

Granquist, L. (1995). Improving the traditional editing process. In Business Survey Methods, (Eds., B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge and P.S. Kott), John Wiley & Sons, Inc., 385-401.

Granquist, L. (1997). The new view on editing. International Statistical Review, 65, 381-387.

Granquist, L., and Kovar, J. (1997). Editing of survey data: How much is enough? In Survey Measurement and Process Quality, (Eds., L.E. Lyberg, P. Biemer, M. Collins, E.D. de Leeuw, C. Dippo, N. Schwartz and D. Trewin), John Wiley & Sons, Inc., 415-435.

Hedlin, D. (2003). Score functions to reduce business survey editing at the U.K. Office for National Statistics. Journal of Official Statistics, 19, 177-199.

Hidiroglou, M.A., and Berthelot, J.-M. (1986). Statistical editing and imputation for periodic business surveys. Survey Methodology, 12, 1, 73-83.

Kovar, J., and Whitridge, P. (1990). Generalized edit and imputation system; Overview and applications. Revista Brasileira de Estadistica, 51, 85-100.

Kruskal, J.B. (1983). An overview of sequence comparison. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, (Eds., D. Sankoff and J.B. Kruskal), Addison-Wesley, 1-44.

Lawrence, D., and McKenzie, R. (2000). The general application of significance editing. Journal of Official Statistics, 16, 243-253.

Liepins, G.E. (1980). A Rigorous, Systematic Approach to Automatic Data Editing and its Statistical Basis. Report ORNL/TM-7126, Oak Ridge National Laboratory.

Liepins, G.E., Garfinkel, R.S. and Kunnathur, A.S. (1982). Error localization for erroneous data: A survey. TIMS/Studies in the Management Sciences, 19, 205-219.

Little, R.J.A., and Smith, P.J. (1987). Editing and imputation of quantitative survey data. Journal of the American Statistical Association, 82, 58-68.

Naus, J.I., Johnson, T.G. and Montalvo, R. (1972). A probabilistic model for identifying errors in data editing. Journal of the American Statistical Association, 67, 943-950.

Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33, 31-88.

Pannekoek, J., Scholtus, S. and van der Loo, M. (2013). Automated and manual data editing: A view on process design and methodology. Journal of Official Statistics, 29, 511-537.

Pugh, W. (1992). The omega test: A fast and practical integer programming algorithm for data dependence analysis. Communications of the ACM, 35, 102-114.

R Development Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. URL: http://www.R-project.org/.

Ragsdale, C.T., and McKeown, P.G. (1996). On solving the continuous data editing problem. Computers & Operations Research, 23, 263-273.

Riera-Ledesma, J., and Salazar-González, J.J. (2003). New Algorithms for the Editing and Imputation Problem. Working Paper No. 5, UN/ECE Work Session on Statistical Data Editing, Madrid.

Riera-Ledesma, J., and Salazar-González, J.J. (2007). A branch-and-cut algorithm for the continuous error localization problem in data cleaning. Computers & Operations Research, 34, 2790-2804.

Schaffer, J. (1987). Procedure for solving the data-editing problem with both continuous and discrete data types. Naval Research Logistics, 34, 879-890.

Scholtus, S. (2011). Algorithms for correcting sign errors and rounding errors in business survey data. Journal of Official Statistics, 27, 467-490.

Scholtus, S. (2014). Error Localisation using General Edit Operations. Discussion Paper 2014-14, Statistics Netherlands, The Hague. Available at: http://www.cbs.nl.

Tempelman, D.C.G. (2007). Imputation of Restricted Data. Ph. D. Thesis, University of Groningen. Available at: http://www.cbs.nl.

van der Loo, M., and de Jonge, E. (2012). Automatic Data Editing with Open Source R. Working Paper No. 33, UN/ECE Work Session on Statistical Data Editing, Oslo.

Williams, H.P. (1986). Fourier’s method of linear programming and its dual. The American Mathematical Monthly, 93, 681-695.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified:: 2016-06-22

Language selection

Search and menus

Search

A generalized Fellegi-Holt paradigm for automatic error localization 8. ConclusionA generalized Fellegi-Holt paradigm for automatic error localization 8. Conclusion

Acknowledgements

Appendix

Fourier-Motzkin elimination

References