A generalized Fellegi-Holt paradigm for automatic error localization

4. A generalized error localization problem

Let $G$ be a finite set of allowed edit operations for a given application of automatic editing. Informally, I propose to generalize the error localization problem of Fellegi and Holt (1976) by replacing “the smallest subset of variables that can be imputed to make the record consistent” with “the shortest sequence of allowed edit operations that can be applied to make the record consistent”. To give a formal definition of this generalized error localization problem, some new notation and concepts need to be introduced.

Consider a sequence of points $x = x_{0}, x_{1}, \dots, x_{t} = y$ in $ℝ^{p} .$ A path from $x$ to $y$ is defined as a sequence of distinct edit operations $g_{1}, \dots, g_{t} \in G$ such that $x_{n} = g_{n} (x_{n - 1})$ for all $n \in \{ 1, \dots, t \} .$ (Note: In the case that $g_{n}$ contains free parameters, one should interpret this equality as “there exist feasible parameter values such that $g_{n}$ maps $x_{n - 1}$ to $x_{n}$”.) A path is denoted by $P = [g_{1}, \dots, g_{t}] .$ The set of all possible paths from $x$ to $y$ is denoted by $P (x, y) .$ This set may be empty. Later, I will use $P (x; G')$ to denote, for a given subset $G' \subseteq G,$ the set of all paths starting in $x$ that consist of the edit operations in $G'$ in some order (without specifying the free parameters); if $G'$ contains $t$ elements, $P (x; G')$ contains $t!$ paths.
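The count of $t!$ paths can be checked with a minimal sketch that enumerates every ordering of a subset of edit operations. The operation names are hypothetical labels chosen for illustration only; free parameters are left unspecified, as in the definition above.

```python
from itertools import permutations
from math import factorial

# Hypothetical labels standing in for a subset of three edit operations.
subset = ["flip_sign_profit", "swap_revenue_costs", "impute_turnover"]

# Each ordering of the subset is one candidate path (parameters unspecified).
paths = [list(order) for order in permutations(subset)]

for path in paths:
    print(path)

# A subset with t elements yields t! orderings.
assert len(paths) == factorial(len(subset))
```

Enumerating all orderings is only practical for small subsets, since the number of paths grows factorially in $t .$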

To each edit operation $g \in G,$ one can associate a weight $w_{g} > 0$ that expresses the costs of applying edit operation $g .$ In particular, the weight of an FH operation is to be chosen equal to the confidence weight of the variable that it imputes. Now the length of a path $P = [g_{1}, \dots, g_{t}]$ can be defined as the sum of the weights of its constituent edit operations: $ℓ (P) = \sum_{n = 1}^{t} w_{g_{n}},$ where, by convention, the empty path has length zero. The distance from $x$ to $y$ is defined as the length of the shortest path that connects $x$ to $y :$

$d (x, y) = \begin{cases} \min \{ ℓ (P) \mid P \in P (x, y) \} & \text{if } P (x, y) \neq \emptyset, \\ \infty & \text{otherwise} . \end{cases}$

In general, $d (x, y)$ satisfies the standard axioms of a metric except that it need not be symmetric in $x$ and $y$; it is a so-called quasimetric (Scholtus 2014). Accordingly, $d (x, y)$ represents “the distance from $x$ to $y$” rather than “the distance between $x$ and $y$”.
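The path length, the distance and its possible asymmetry can be illustrated with a brute-force sketch. It assumes a toy setting that is not part of the original text: records are one-dimensional tuples, and the two edit operations (`negate`, `set_zero`) and their weights are hypothetical and have no free parameters, so every path can be applied directly.

```python
from itertools import permutations
from math import inf

# Hypothetical parameter-free edit operations on one-dimensional records.
OPERATIONS = {
    "negate": lambda x: (-x[0],),
    "set_zero": lambda x: (0,),
}
# Hypothetical weights w_g > 0.
WEIGHTS = {"negate": 1.0, "set_zero": 2.0}


def path_length(path):
    """Sum of the weights of the operations on the path (0 for the empty path)."""
    return sum(WEIGHTS[g] for g in path)


def apply_path(x, path):
    """Apply the operations of the path to x, one after the other."""
    for g in path:
        x = OPERATIONS[g](x)
    return x


def distance(x, y):
    """Length of the shortest path of distinct operations from x to y, else inf."""
    best = inf
    for t in range(len(OPERATIONS) + 1):
        for path in permutations(OPERATIONS, t):
            if apply_path(x, path) == y:
                best = min(best, path_length(path))
    return best


print(distance((5,), (0,)))  # reachable through "set_zero"
print(distance((0,), (5,)))  # no path exists in this toy setting
```

In this toy setting the record $(0)$ can be reached from $(5)$ but not the other way round, which is the kind of asymmetry that makes $d$ a quasimetric rather than a metric.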

The distance from $x$ to any closed, non-empty subset $D \subseteq ℝ^{p}$ is defined as the distance to the nearest $y \in D :$ $d (x, D) = \min \{ d (x, y) \mid y \in D \} .$ For the purpose of error localization, the closed, non-empty subset of $ℝ^{p}$ that is of particular interest is the set $D_{0}$ of all points that satisfy (2.1).

I can now formulate the generalized error localization problem.

Problem. Consider a given set of consistent records $D_{0},$ a given set of allowed edit operations $G,$ and a given record $x .$ If $d (x, D_{0}) = \infty,$ then the error localization problem for $x$ is infeasible. Otherwise, any shortest path leading to a record $y \in D_{0}$ such that $d (x, y) < \infty$ is called a feasible solution to the error localization problem for $x .$ A feasible solution is called optimal if it leads to a record $x^{*} \in D_{0}$ such that

$d (x, x^{*}) = d (x, D_{0}) . \qquad (4.1)$

Formally, then, the generalized error localization problem consists of finding an optimal path of edit operations.
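For a small set of edit operations without free parameters, the problem can be solved by exhaustive search. The following sketch is an illustration under stated assumptions, not the algorithm of the paper: the record layout (revenue, costs, profit), the consistency check `is_consistent` standing in for the edits (2.1), the two operations and their weights are all hypothetical.

```python
from itertools import permutations
from math import inf


def is_consistent(record):
    """Hypothetical stand-in for the edits (2.1): membership of D_0."""
    revenue, costs, profit = record
    return revenue >= 0 and costs >= 0 and revenue - costs == profit


# Hypothetical parameter-free edit operations and weights.
OPERATIONS = {
    "swap_revenue_costs": lambda r: (r[1], r[0], r[2]),
    "flip_sign_profit": lambda r: (r[0], r[1], -r[2]),
}
WEIGHTS = {"swap_revenue_costs": 1.0, "flip_sign_profit": 0.5}


def localize(x):
    """Return d(x, D_0) and all optimal paths, by trying every path."""
    best_length, best_paths = inf, []
    for t in range(len(OPERATIONS) + 1):
        for path in permutations(OPERATIONS, t):
            y = x
            for g in path:
                y = OPERATIONS[g](y)
            if not is_consistent(y):
                continue  # this path does not end in D_0
            length = sum(WEIGHTS[g] for g in path)
            if length < best_length:
                best_length, best_paths = length, [list(path)]
            elif length == best_length:
                best_paths.append(list(path))
    return best_length, best_paths


print(localize((100, 250, 150)))  # two feasible repairs, one is cheaper
print(localize((100, 40, 70)))    # infeasible with these two operations
```

The second call returns an infinite distance because neither operation, alone or combined, yields a consistent record; this is the infeasible case of the problem statement.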

Remark 1. In general, there may be infinitely many records $x^{*}$ in $D_{0}$ that satisfy (4.1) and can be reached by the same path of edit operations. To solve the error localization problem, it is sufficient to find an optimal path. Constructing an associated record $x^{*} \in D_{0}$ may then be regarded as a generalization of the consistent imputation problem; cf. the discussion on imputation at the end of Section 3.

Remark 2. The above error localization problem is infeasible for records that cannot be mapped onto $D_{0}$ by any combination of distinct edit operations in $G .$ To avoid this situation, $G$ should be chosen sufficiently large so that $d (x, D_{0}) < \infty$ for all $x \in ℝ^{p} .$ In what follows, I tacitly assume that $G$ has this property. An easy way (not necessarily the only way) to achieve this is by letting $G$ contain at least all FH operations. That this is sufficient follows from the fact that any two points in $ℝ^{p}$ are connected by a path that concatenates the FH operations associated with the coordinates on which they differ.
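The connecting path in Remark 2 can be written down directly. In the sketch below, an FH operation is represented simply by the index of the variable it imputes, and the confidence weights are hypothetical values; the function name `fh_path` is likewise introduced only for illustration.

```python
def fh_path(x, y):
    """Indices of the coordinates on which x and y differ.

    Each index stands for the FH operation that imputes that variable,
    so the returned list is a path of distinct FH operations from x to y.
    """
    return [j for j, (a, b) in enumerate(zip(x, y)) if a != b]


# Hypothetical confidence weights, one per variable.
confidence_weights = [1.0, 2.0, 1.5]

x = (100, 250, 150)
y = (100, 250, -150)

path = fh_path(x, y)
length = sum(confidence_weights[j] for j in path)
print(path, length)
```

The length of this path is an upper bound on $d (x, y)$ whenever $G$ contains all FH operations, since a shorter path through other edit operations may exist.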

Remark 3. It is not difficult to see that the above error localization problem reduces to the original problem of Fellegi and Holt (1976) in the special case that $G$ contains only the FH operations.

Remark 4. As with the original Fellegi-Holt-based error localization problem, it can be shown that, under certain assumptions, minimizing $d (x, y)$ over all $y \in D_{0}$ for a given observed record $x$ is approximately equivalent to maximizing the likelihood of the associated unobserved error-free record. The argument closely follows that of Kruskal (1983, pages 38-39) for the so-called Levenshtein distance in the context of approximate string matching. This requires first of all that the edits (2.1) be hard edits, i.e., failed only by erroneous values. In addition, it must be assumed that the stochastic “error generating process” $ℰ$ introduced in Section 3 has the following properties:

There exists a one-to-one correspondence between the set of errors that can occur under $ℰ$ and the set of allowed edit operations $G$ that correct them.
The errors in $ℰ$ occur independently of each other.
The error corresponding to operation $g$ occurs with known probability $p_{g} .$

Finally, analogous to (2.3), the weights $w_{g}$ should be chosen according to

$w_{g} = - \log \left( \frac{p_{g}}{1 - p_{g}} \right) . \qquad (4.2)$

Under these assumptions, Scholtus (2014) adapted the argument of Kruskal (1983) to show that the optimal solution to error localization problem (4.1) can be justified as an approximate maximum likelihood estimator. [Note: The derivation in Scholtus (2014) assumed in addition that all $p_{g} ≪ 1,$ in which case $w_{g} \approx - \log p_{g} .$ This assumption is unnecessary; cf. Liepins (1980).]
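The role of (4.2) can be checked numerically. Under the independence assumption, the probability that exactly the errors in a given set occur is the product of $p_{g}$ over the errors in the set and of $1 - p_{g}$ over the others, so its negative logarithm equals a constant plus the sum of the weights $w_{g}$ of the errors in the set. The sketch below verifies this identity for every subset; the operation names and probabilities are hypothetical.

```python
from itertools import combinations
from math import isclose, log

# Hypothetical error probabilities p_g for three edit operations.
p = {"flip_sign": 0.05, "swap": 0.02, "impute": 0.10}

# Weights chosen according to (4.2).
w = {g: -log(p_g / (1 - p_g)) for g, p_g in p.items()}


def neg_log_likelihood(errors):
    """-log P(exactly the errors in `errors` occur), assuming independence."""
    return -sum(log(p[g]) if g in errors else log(1 - p[g]) for g in p)


# Term that does not depend on which errors occurred.
constant = -sum(log(1 - p_g) for p_g in p.values())

for t in range(len(p) + 1):
    for errors in combinations(p, t):
        total_weight = sum(w[g] for g in errors)
        # Minimizing the summed weights maximizes the likelihood.
        assert isclose(neg_log_likelihood(errors), constant + total_weight)

for g in p:
    print(g, w[g], -log(p[g]))  # exact weight next to the small-p approximation
```

Note that (4.2) yields a positive weight only when $p_{g} < 1/2 ,$ which is consistent with the requirement $w_{g} > 0 .$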

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2016-06-22
