A generalized Fellegi-Holt paradigm for automatic error localization 3. Edit operations

Continuing with the notation from Section 2, I define an edit operation $g$ to be an affine function of the general form

$g\left(x\right)=Tx+S\alpha +c,\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.1\right)$

where $T$ and $S$ are known coefficient matrices of dimensions $p×p$ and $p×m,$ respectively, $\alpha ={\left({\alpha }_{1},\dots ,{\alpha }_{m}\right)}^{\prime }$ is a vector of free parameters that may occur in $g,$ and $c$ is a $p$-vector of known constants. In the special case that $g$ does not involve any free parameters $\left(m=0\right),$ the second term in (3.1) vanishes. Sometimes, it may be useful to impose one or several linear constraints on the free parameters in $g:$

$R\alpha +d\odot 0,\text{ }\text{ }\text{ }\text{ }\text{ }\left(3.2\right)$

with $R$ a known matrix, and $d$ a known vector of constants. (Note: Matrix-vector notation will be used throughout this article because it leads to a concise description of results; however, using matrices to represent edits and edit operations is probably not the most efficient way to implement these results on a computer.)

As a first example, consider the operation that replaces one of the original values in $x$ by an arbitrary new value (imputation). I will call this an FH operation, in view of its central role in automatic editing based on the Fellegi-Holt paradigm. Let $I$ denote the $p×p$ identity matrix and ${e}_{i}$ the ${i}^{\text{th}}$ standard basis vector in ${ℝ}^{p}.$ The FH operation that imputes the variable ${x}_{j}$ is given by (3.1) with $T=I-{e}_{j}{{e}^{\prime }}_{j},$ $S={e}_{j},$ and $c=0.$ This yields:

$g\left(x\right)=\left(I-{e}_{j}{{e}^{\prime }}_{j}\right)x+{e}_{j}\alpha ={\left({x}_{1},\dots ,{x}_{j-1},\alpha ,{x}_{j+1},\dots ,{x}_{p}\right)}^{\prime },$

with $\alpha \in ℝ$ a free parameter that represents the imputed value. It should be noted that for a record of $p$ variables, $p$ distinct FH operations can be defined.
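
The general form (3.1) and the FH operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors and matrices are stored as plain lists, the function names `apply_edit_operation` and `fh_operation` are hypothetical and do not come from the article, and the variable index `j` is 0-based here whereas the text counts from 1.

```python
def apply_edit_operation(T, S, c, x, alpha=()):
    """Evaluate g(x) = T x + S alpha + c for list-based matrices and vectors."""
    p = len(x)
    result = []
    for i in range(p):
        value = sum(T[i][k] * x[k] for k in range(p))
        # With m = 0 free parameters this sum is empty and the term vanishes.
        value += sum(S[i][k] * alpha[k] for k in range(len(alpha)))
        value += c[i]
        result.append(value)
    return result


def fh_operation(p, j):
    """Return (T, S, c) for the FH operation that imputes variable j (0-based).

    T is the identity matrix with its j-th diagonal entry set to zero,
    S is the j-th standard basis vector stored as a p x 1 matrix, and c = 0.
    """
    T = [[1 if (r == k and r != j) else 0 for k in range(p)] for r in range(p)]
    S = [[1 if r == j else 0] for r in range(p)]
    c = [0] * p
    return T, S, c


# Illustrative record with p = 3; the numbers are made up.
x = [12, 7, 30]
T, S, c = fh_operation(p=3, j=1)
imputed = apply_edit_operation(T, S, c, x, alpha=[18])
print(imputed)  # [12, 18, 30]
```

Only the second entry of the record changes, and it takes the value supplied for the free parameter; the other entries pass through untouched.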

To further illustrate the concept of an edit operation, some other examples will now be given. For notational convenience, I restrict attention to the case $p=3.$

• An edit operation that changes the sign of one of the variables:
• $g\left(\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)\right)=\left(\begin{array}{rrr}\hfill -1& \hfill 0& \hfill 0\\ \hfill 0& \hfill 1& \hfill 0\\ \hfill 0& \hfill 0& \hfill 1\end{array}\right)\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)+\left(\begin{array}{c}0\\ 0\\ 0\end{array}\right)=\left(\begin{array}{c}-{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right).$
• An edit operation that interchanges the values of two adjacent items:
• $g\left(\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)\right)=\left(\begin{array}{rrr}\hfill 0& \hfill 1& \hfill 0\\ \hfill 1& \hfill 0& \hfill 0\\ \hfill 0& \hfill 0& \hfill 1\end{array}\right)\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)+\left(\begin{array}{c}0\\ 0\\ 0\end{array}\right)=\left(\begin{array}{c}{x}_{2}\\ {x}_{1}\\ {x}_{3}\end{array}\right).$
• An edit operation that transfers an amount between two items, where the amount transferred may equal at most $K$ units in either direction:
• $g\left(\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)\right)=\left(\begin{array}{rrr}\hfill 1& \hfill 0& \hfill 0\\ \hfill 0& \hfill 1& \hfill 0\\ \hfill 0& \hfill 0& \hfill 1\end{array}\right)\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)+\left(\begin{array}{r}\hfill 1\\ \hfill 0\\ \hfill -1\end{array}\right)\alpha +\left(\begin{array}{c}0\\ 0\\ 0\end{array}\right)=\left(\begin{array}{c}{x}_{1}+\alpha \\ {x}_{2}\\ {x}_{3}-\alpha \end{array}\right).$
• with the constraint that $-K\le \alpha \le K.$
• An edit operation that imputes two variables simultaneously using a fixed ratio:
• $g\left(\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)\right)=\left(\begin{array}{rrr}\hfill 0& \hfill 0& \hfill 0\\ \hfill 0& \hfill 0& \hfill 0\\ \hfill 0& \hfill 0& \hfill 1\end{array}\right)\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right)+\left(\begin{array}{rr}\hfill 1& \hfill 0\\ \hfill 0& \hfill 1\\ \hfill 0& \hfill 0\end{array}\right)\left(\begin{array}{c}{\alpha }_{1}\\ {\alpha }_{2}\end{array}\right)+\left(\begin{array}{c}0\\ 0\\ 0\end{array}\right)=\left(\begin{array}{c}{\alpha }_{1}\\ {\alpha }_{2}\\ {x}_{3}\end{array}\right).$
• with the constraint that $\alpha ={\left({\alpha }_{1},{\alpha }_{2}\right)}^{\prime }$ satisfies $10{\alpha }_{1}-{\alpha }_{2}=0.$
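
The four examples above can be checked numerically with a short Python sketch. It is a minimal illustration, not an implementation from the article: the helper names `apply_edit_operation` and `satisfies_constraints` are hypothetical, the record values are made up, and the constraint check assumes that each row of (3.2) carries either a `>=` or an `==` operator.

```python
def apply_edit_operation(T, S, c, x, alpha=()):
    """Evaluate g(x) = T x + S alpha + c for list-based matrices and vectors."""
    p = len(x)
    return [
        sum(T[i][k] * x[k] for k in range(p))
        + sum(S[i][k] * alpha[k] for k in range(len(alpha)))
        + c[i]
        for i in range(p)
    ]


def satisfies_constraints(R, d, ops, alpha):
    """Check R alpha + d (op) 0 row by row; ops holds '>=' or '==' per row."""
    for row, offset, op in zip(R, d, ops):
        value = sum(r * a for r, a in zip(row, alpha)) + offset
        if op == ">=" and value < 0:
            return False
        if op == "==" and value != 0:
            return False
    return True


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ZERO = [0, 0, 0]
NO_PARAMS = [[], [], []]  # S has zero columns when m = 0

x = [40, 25, 15]  # made-up record with p = 3

# Sign change of the first variable.
sign_T = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]
flipped = apply_edit_operation(sign_T, NO_PARAMS, ZERO, x)

# Interchange of the first two variables.
swap_T = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
swapped = apply_edit_operation(swap_T, NO_PARAMS, ZERO, x)

# Transfer of alpha units from the third to the first variable, |alpha| <= K.
K = 10
transfer_S = [[1], [0], [-1]]
transfer_R, transfer_d, transfer_ops = [[1], [-1]], [K, K], [">=", ">="]
transferred = apply_edit_operation(IDENTITY, transfer_S, ZERO, x, alpha=[5])

# Joint imputation of the first two variables with 10 * alpha_1 - alpha_2 = 0.
ratio_T = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
ratio_S = [[1, 0], [0, 1], [0, 0]]
ratio_R, ratio_d, ratio_ops = [[10, -1]], [0], ["=="]
ratio_imputed = apply_edit_operation(ratio_T, ratio_S, ZERO, x, alpha=[3, 30])

print(flipped)        # [-40, 25, 15]
print(swapped)        # [25, 40, 15]
print(transferred)    # [45, 25, 10]
print(ratio_imputed)  # [3, 30, 15]
print(satisfies_constraints(transfer_R, transfer_d, transfer_ops, [12]))  # False
```

Writing $-K\le \alpha \le K$ as the two rows $\alpha +K\ge 0$ and $-\alpha +K\ge 0$ puts the transfer constraint in the form (3.2), which is why a transfer of 12 units is rejected when $K=10.$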

Intuitively, an edit operation is supposed to “reverse the effects” of a particular type of error that may have occurred in the observed data. That is to say, if the error associated with edit operation $g$ actually occurred in the observed record $x,$ then $g\left(x\right)$ is the record that would have been observed if that error had not occurred. Somewhat more formally, it is assumed here that errors occurring in the data can be modeled by a stochastic “error generating process” $ℰ,$ and that each edit operation acts as a “corrector” for one particular error that can occur under $ℰ$ (see Remark 4 in the next section).

If the edit operation $g$ contains free parameters, the record $g\left(x\right)$ might not be determined uniquely even when the restrictions (2.1) and (3.2) are taken into account. In that case, one has to “impute” values for the free parameters that occur in an edit operation, which in turn means that some of the variables in $x$ are imputed via the affine transformation given by (3.1). As in traditional Fellegi-Holt-based editing, finding appropriate “imputations” for the free parameters will not be considered part of the error localization problem here. On the other hand, if $g$ does not contain any free parameters, the imputed values in $g\left(x\right)$ follow directly from the edit operation itself and the distinction between error localization and imputation is blurred.
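
A small worked illustration may help here. Assume, purely hypothetically (this edit is not taken from the article), that $p=3$ and that the restrictions (2.1) consist of the single balance edit ${x}_{1}+{x}_{2}-{x}_{3}=0.$ Applying the FH operation that imputes ${x}_{1}$ gives $g\left(x\right)={\left(\alpha ,{x}_{2},{x}_{3}\right)}^{\prime },$ and substituting this record into the edit yields

$\alpha +{x}_{2}-{x}_{3}=0\text{ }\Rightarrow \text{ }\alpha ={x}_{3}-{x}_{2},$

so the free parameter, and hence $g\left(x\right),$ is determined uniquely. If instead the only restriction were the non-negativity edit ${x}_{1}\ge 0,$ every $\alpha \ge 0$ would yield a consistent record, and a value for $\alpha$ would still have to be imputed.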

In any particular application, only a small subset of potential edit operations of the form (3.1) would have a substantively meaningful interpretation, in the sense that the associated types of errors are known to occur. In what follows, I assume that a finite set of specific edit operations of the form (3.1) has been identified as relevant for a particular application. This will be called the set of allowed edit operations for that application. Some suggestions on how to construct this set will be given in Section 8.
