A generalized Fellegi-Holt paradigm for automatic error localization

3. Edit operations

Continuing with the notation from Section 2, I define an edit operation $g$ to be an affine function of the general form

$g(x) = T x + S α + c, \qquad (3.1)$

where $T$ and $S$ are known coefficient matrices of dimensions $p \times p$ and $p \times m$, respectively, $α = (α_{1}, \dots, α_{m})'$ is a vector of free parameters that may occur in $g$, and $c$ is a $p$-vector of known constants. In the special case that $g$ does not involve any free parameters ($m = 0$), the second term in (3.1) vanishes. Sometimes, it may be useful to impose one or several linear constraints on the free parameters in $g$:

$R α + d ⊙ 0, \qquad (3.2)$

with $R$ a known matrix, and $d$ a known vector of constants. (Note: Matrix-vector notation will be used throughout this article because it leads to a concise description of results; however, using matrices to represent edits and edit operations is probably not the most efficient way to implement these results on a computer.)
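As a minimal sketch of (3.1) and (3.2), the following Python code evaluates an edit operation and checks the constraints on its free parameters using plain lists. The function names `apply_edit_operation` and `satisfies_constraints` are hypothetical, and the sketch assumes that each row of (3.2) is read as either an equality or a $\geq$ inequality; the small operation in the usage lines is made up purely for illustration.

```python
def apply_edit_operation(x, T, c, S=None, alpha=()):
    """Return g(x) = T x + S alpha + c, with all inputs given as plain lists."""
    p = len(x)
    result = []
    for i in range(p):
        value = sum(T[i][k] * x[k] for k in range(p)) + c[i]
        if S is not None:
            # The second term of (3.1); it vanishes when m = 0.
            value += sum(S[i][k] * alpha[k] for k in range(len(alpha)))
        result.append(value)
    return result


def satisfies_constraints(alpha, R, d, relations, tol=1e-9):
    """Check R alpha + d row by row; each relation is '>=' or '=' (assumption)."""
    for row, d_i, relation in zip(R, d, relations):
        value = sum(r * a for r, a in zip(row, alpha)) + d_i
        if relation == "=" and abs(value) > tol:
            return False
        if relation == ">=" and value < -tol:
            return False
    return True


# Illustrative operation with p = 2 and m = 1: g(x) = (x_1, 2 * alpha + 1).
T = [[1, 0], [0, 0]]
S = [[0], [2]]
c = [0, 1]
edited = apply_edit_operation([3, 4], T, c, S, [5])
alpha_ok = satisfies_constraints([5], R=[[1]], d=[0], relations=[">="])
```

Storing $T$, $S$ and $c$ as dense nested lists keeps the sketch close to the notation; as the note above says, this is probably not the most efficient representation in practice.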

As a first example, consider the operation that replaces one of the original values in $x$ by an arbitrary new value (imputation). I will call this an FH operation, in view of its central role in automatic editing based on the Fellegi-Holt paradigm. Let $I$ denote the $p \times p$ identity matrix and $e_{i}$ the $i$th standard basis vector in $ℝ^{p}$. The FH operation that imputes the variable $x_{j}$ is given by (3.1) with $T = I - e_{j} e_{j}'$, $S = e_{j}$, and $c = 0$. This yields: $g(x) = x + e_{j} (α - x_{j}) = (x_{1}, \dots, x_{j - 1}, α, x_{j + 1}, \dots, x_{p})'$, with $α \in ℝ$ a free parameter that represents the imputed value. It should be noted that for a record of $p$ variables, $p$ distinct FH operations can be defined.
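The FH operation can be sketched in Python by building $T = I - e_{j} e_{j}'$, $S = e_{j}$ and $c = 0$ explicitly and applying (3.1). The helper names `fh_operation` and `apply_fh` are hypothetical, and the index `j` is taken to be 1-based to match the text.

```python
def fh_operation(p, j):
    """Build (T, S, c) for the FH operation that imputes x_j (j is 1-based)."""
    # T is the identity matrix with the j-th diagonal entry set to zero.
    T = [[1 if (r == k and r != j - 1) else 0 for k in range(p)] for r in range(p)]
    # S is the j-th standard basis vector, stored as a p x 1 matrix.
    S = [[1] if r == j - 1 else [0] for r in range(p)]
    c = [0] * p
    return T, S, c


def apply_fh(x, j, alpha):
    """Return g(x) = T x + S alpha + c for the FH operation on x_j."""
    p = len(x)
    T, S, c = fh_operation(p, j)
    return [
        sum(T[r][k] * x[k] for k in range(p)) + S[r][0] * alpha + c[r]
        for r in range(p)
    ]


imputed = apply_fh([10, 20, 30], j=2, alpha=99)  # second value replaced by alpha
```

Only the $j$th entry changes, whatever the original value of $x_{j}$ was, which is why the free parameter $α$ fully determines the imputed value.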

To further illustrate the concept of an edit operation, some other examples will now be given. For notational convenience, I restrict attention to the case $p = 3.$

An edit operation that changes the sign of one of the variables:
$g \left( \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} \right) = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} -x_{1} \\ x_{2} \\ x_{3} \end{pmatrix}.$
An edit operation that interchanges the values of two adjacent items:
$g \left( \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} \right) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} x_{2} \\ x_{1} \\ x_{3} \end{pmatrix}.$
An edit operation that transfers an amount between two items, where the amount transferred may equal at most $K$ units in either direction:
$g \left( \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} \right) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix} α + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} x_{1} + α \\ x_{2} \\ x_{3} - α \end{pmatrix},$
with the constraint that $-K \leq α \leq K$.
An edit operation that imputes two variables simultaneously using a fixed ratio:
$g \left( \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} \right) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} α_{1} \\ α_{2} \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} α_{1} \\ α_{2} \\ x_{3} \end{pmatrix},$
with the constraint that $α = (α_{1}, α_{2})'$ satisfies $10 α_{1} - α_{2} = 0$.
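The four examples above can be checked numerically with a short Python sketch. The helper `g`, the record `x = [10, 20, 30]`, the bound `K = 5` and the chosen parameter values are all illustrative assumptions, not part of the original text.

```python
def g(x, T, c, S=None, alpha=()):
    """Evaluate g(x) = T x + S alpha + c on plain Python lists."""
    p = len(x)
    out = []
    for i in range(p):
        value = sum(T[i][k] * x[k] for k in range(p)) + c[i]
        if S is not None:
            value += sum(S[i][k] * alpha[k] for k in range(len(alpha)))
        out.append(value)
    return out


ZERO = [0, 0, 0]
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
x = [10, 20, 30]  # illustrative record

# Sign change of x_1 (no free parameters, m = 0).
sign_flip = g(x, [[-1, 0, 0], [0, 1, 0], [0, 0, 1]], ZERO)

# Interchange of x_1 and x_2 (no free parameters, m = 0).
swap = g(x, [[0, 1, 0], [1, 0, 0], [0, 0, 1]], ZERO)

# Transfer of alpha units from x_3 to x_1, subject to -K <= alpha <= K.
K, alpha = 5, 4
assert -K <= alpha <= K
transfer = g(x, IDENTITY, ZERO, S=[[1], [0], [-1]], alpha=[alpha])

# Joint imputation of x_1 and x_2 with 10 * alpha_1 - alpha_2 = 0.
alpha_1, alpha_2 = 3, 30
assert 10 * alpha_1 - alpha_2 == 0
ratio = g(x, [[0, 0, 0], [0, 0, 0], [0, 0, 1]], ZERO,
          S=[[1, 0], [0, 1], [0, 0]], alpha=[alpha_1, alpha_2])
```

Note that the transfer operation leaves the sum $x_{1} + x_{3}$ unchanged for every admissible $α$, whereas the ratio operation discards the original values of $x_{1}$ and $x_{2}$ entirely.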

Intuitively, an edit operation is supposed to “reverse the effects” of a particular type of error that may have occurred in the observed data. That is to say, if the error associated with edit operation $g$ actually occurred in the observed record $x,$ then $g (x)$ is the record that would have been observed if that error had not occurred. Somewhat more formally, it is assumed here that errors occurring in the data can be modeled by a stochastic “error generating process” $ℰ,$ and that each edit operation acts as a “corrector” for one particular error that can occur under $ℰ$ (see Remark 4 in the next section).

If the edit operation $g$ contains free parameters, the record $g (x)$ might not be determined uniquely even when the restrictions (2.1) and (3.2) are taken into account. In that case, one has to “impute” values for the free parameters that occur in an edit operation, which in turn means that some of the variables in $x$ are imputed via the affine transformation given by (3.1). As in traditional Fellegi-Holt-based editing, finding appropriate “imputations” for the free parameters will not be considered part of the error localization problem here. On the other hand, if $g$ does not contain any free parameters, the imputed values in $g (x)$ follow directly from the edit operation itself and the distinction between error localization and imputation is blurred.
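To illustrate why $g(x)$ need not be unique, the sketch below computes the admissible range of $α$ for the transfer operation from the examples above. The two non-negativity edits $x_{1} \geq 0$ and $x_{3} \geq 0$ are hypothetical stand-ins for the restrictions (2.1), which are not reproduced in this section; the function name `transfer_alpha_range` is likewise an assumption.

```python
def transfer_alpha_range(x, K):
    """Range of alpha for which g(x) = (x_1 + alpha, x_2, x_3 - alpha)
    satisfies the hypothetical edits x_1 >= 0 and x_3 >= 0, together
    with the constraint -K <= alpha <= K. Returns None if no alpha fits."""
    lower = max(-K, -x[0])  # from x_1 + alpha >= 0 and alpha >= -K
    upper = min(K, x[2])    # from x_3 - alpha >= 0 and alpha <= K
    return (lower, upper) if lower <= upper else None


feasible = transfer_alpha_range([-2, 5, 7], K=5)    # a whole interval of values
infeasible = transfer_alpha_range([-8, 5, 7], K=5)  # no admissible alpha
```

In the first case every $α$ in the returned interval yields a record that passes the assumed edits, so a value still has to be imputed; in the second case the transfer operation alone cannot repair the record.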

In any particular application, only a small subset of potential edit operations of the form (3.1) would have a substantively meaningful interpretation, in the sense that the associated types of errors are known to occur. In what follows, I assume that a finite set of specific edit operations of the form (3.1) has been identified as relevant for a particular application. This will be called the set of allowed edit operations for that application. Some suggestions on how to construct this set will be given in Section 8.

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2016-06-22
