A generalized Fellegi-Holt paradigm for automatic error localization

7. Simulation study
To test the potential usefulness of the new error localization approach, I conducted a small simulation study using the R environment for statistical computing (R Development Core Team 2015). A prototype of the algorithm in Figure 6.1 was implemented in R. This prototype made liberal use of the existing functionality for Fellegi-Holt-based automatic editing available in the editrules package (van der Loo and de Jonge 2012; de Jonge and van der Loo 2014). The program was not optimized for computational efficiency, but it turned out to work sufficiently fast for the relatively small error localization problems encountered in this simulation study. (Note: The R code used in this study is available from the author upon request.)
The simulation study involved records of five numerical variables that should satisfy the following nine linear edit rules:
$$\begin{array}{lll} x_1 + x_2 & = x_3, & \\ x_3 - x_4 & = x_5, & \\ x_j & \ge 0, & j \in \{1,2,3,4\}, \\ x_1 & \ge x_2, & \\ x_5 & \ge 0.1\,x_3, & \\ x_5 & \le 0.5\,x_3. & \end{array}$$
Edits of this form might typically be encountered in structural business statistics (SBS), as part of a much larger set of edit rules (Scholtus 2014).
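For concreteness, the nine edit rules above can be expressed as a single record check. The sketch below is an illustrative Python translation (the study itself used R with the editrules package); the function name `satisfies_edits` and the tolerance for the balance edits are my own choices.

```python
def satisfies_edits(x, tol=1e-9):
    """Check the nine linear edit rules for a record x = (x1, x2, x3, x4, x5).

    The two balance edits are tested up to a small tolerance `tol`;
    the seven inequality edits are tested exactly.
    """
    x1, x2, x3, x4, x5 = x
    return (
        abs(x1 + x2 - x3) <= tol                    # x1 + x2 = x3
        and abs(x3 - x4 - x5) <= tol                # x3 - x4 = x5
        and all(v >= 0 for v in (x1, x2, x3, x4))   # x_j >= 0, j in {1,...,4}
        and x1 >= x2                                # x1 >= x2
        and x5 >= 0.1 * x3                          # x5 >= 0.1 * x3
        and x5 <= 0.5 * x3                          # x5 <= 0.5 * x3
    )
```

For example, the record $(500, 250, 750, 600, 150)$ satisfies all nine edits, whereas flipping the sign of $x_5$ violates the second balance edit.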
I created a random error-free data set of 2,000 records by drawing from a multivariate normal distribution (using the mvtnorm package) with the following parameters:
$$\mu =\left(\begin{array}{c}500\\ 250\\ 750\\ 600\\ 150\end{array}\right)\quad \text{and} \quad \Sigma =\left(\begin{array}{rrrrr} 10{,}000 & -1{,}250 & 8{,}750 & 7{,}500 & 1{,}250 \\ -1{,}250 & 5{,}000 & 3{,}750 & 4{,}000 & -250 \\ 8{,}750 & 3{,}750 & 12{,}500 & 11{,}500 & 1{,}000 \\ 7{,}500 & 4{,}000 & 11{,}500 & 11{,}750 & -250 \\ 1{,}250 & -250 & 1{,}000 & -250 & 1{,}250 \end{array}\right).$$
Only records that satisfied all of the above edits were added to the data set. Note that $\Sigma$ is a singular covariance matrix that incorporates the two equality edits. Technically, the resulting data follow a so-called truncated multivariate singular normal distribution; see de Waal et al. (2011, pages 318ff.) or Tempelman (2007).
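A data set of this form can be reproduced (up to the random seed) by exploiting the two balance edits: draw $(x_1, x_2, x_4)$ from the implied trivariate normal distribution, set $x_3 = x_1 + x_2$ and $x_5 = x_3 - x_4$, and reject any draw that violates an inequality edit. The following is an illustrative Python sketch, not the study's own R/mvtnorm code; the restricted covariance entries are read off from $\Sigma$, taking $\operatorname{cov}(x_1,x_2) = -1{,}250$ and $\operatorname{cov}(x_2,x_5) = \operatorname{cov}(x_4,x_5) = -250$ (the signs required for $\Sigma$ to be singular along the two balance edits).

```python
import math
import random

MU = (500.0, 250.0, 600.0)            # means of (x1, x2, x4)
COV = [[10000.0, -1250.0,  7500.0],   # covariance of (x1, x2, x4),
       [-1250.0,  5000.0,  4000.0],   # read off from the rows/columns
       [ 7500.0,  4000.0, 11750.0]]   # 1, 2 and 4 of Sigma

def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def simulate(n, rng=random):
    """Draw n error-free records satisfying all nine edits by rejection sampling."""
    L = cholesky(COV)
    out = []
    while len(out) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x1, x2, x4 = (MU[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                      for i in range(3))
        x3 = x1 + x2          # balance edit 1 holds by construction
        x5 = x3 - x4          # balance edit 2 holds by construction
        if (min(x1, x2, x3, x4) >= 0 and x1 >= x2
                and 0.1 * x3 <= x5 <= 0.5 * x3):
            out.append((x1, x2, x3, x4, x5))   # keep only valid records
    return out
```

Because $x_3$ and $x_5$ are computed from the other variables, the equality edits hold exactly in the generated data, mirroring the singularity of $\Sigma$.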
Table 7.1 lists the nine allowed edit operations that were considered in this study. Note that the first five lines contain the FH operations for this data set. As indicated in the table, each edit operation has an associated type of error. A synthetic data set to be edited was created by randomly adding errors of these types to the above-mentioned error-free data set. The probability of each type of error is listed in the fourth column of Table 7.1. The associated “ideal” weight according to (4.2) is shown in the last column.
To limit the amount of computational work, I only considered records that required three edit operations or fewer. Records without errors were also removed. This left 1,025 records to be edited, each containing one, two, or three of the errors listed in Table 7.1.
Table 7.1: Allowed edit operations, associated error types, error probabilities and confidence weights

| name | operation | associated type of error | $p_g$ | $w_g$ |
|------|-----------|--------------------------|-------|-------|
| FH1  | impute $x_1$ | erroneous value of $x_1$ | 0.10 | 2.20 |
| FH2  | impute $x_2$ | erroneous value of $x_2$ | 0.08 | 2.44 |
| FH3  | impute $x_3$ | erroneous value of $x_3$ | 0.06 | 2.75 |
| FH4  | impute $x_4$ | erroneous value of $x_4$ | 0.04 | 3.18 |
| FH5  | impute $x_5$ | erroneous value of $x_5$ | 0.02 | 3.89 |
| IC34 | interchange $x_3$ and $x_4$ | true values of $x_3$ and $x_4$ interchanged | 0.07 | 2.59 |
| TF21 | transfer an amount from $x_2$ to $x_1$ | part of the true value of $x_1$ reported as part of $x_2$ | 0.09 | 2.31 |
| CS4  | change the sign of $x_4$ | sign error in $x_4$ | 0.11 | 2.09 |
| CS5  | change the sign of $x_5$ | sign error in $x_5$ | 0.13 | 1.90 |
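The tabulated weights match $w_g = \ln\bigl((1-p_g)/p_g\bigr)$, the log odds against error type $g$, to two decimals; I assume this corresponds to expression (4.2). The sketch below computes these weights and illustrates the error-injection step in Python (the study used R). The perturbation used for the imputation-type errors FH1-FH5 and the transferred amount for TF21 are illustrative choices of mine, not taken from the paper.

```python
import math
import random

# Error probabilities p_g from Table 7.1
P = {"FH1": 0.10, "FH2": 0.08, "FH3": 0.06, "FH4": 0.04, "FH5": 0.02,
     "IC34": 0.07, "TF21": 0.09, "CS4": 0.11, "CS5": 0.13}

def weight(p):
    """Confidence weight as log odds against the error; matches Table 7.1."""
    return math.log((1.0 - p) / p)

def add_errors(record, rng=random):
    """Perturb one record: each error type occurs independently with prob. P[g]."""
    x = list(record)
    applied = []
    for g, p in P.items():
        if rng.random() >= p:
            continue
        applied.append(g)
        if g.startswith("FH"):            # erroneous value of x_j
            j = int(g[2]) - 1
            x[j] *= rng.uniform(0.5, 2.0)  # illustrative perturbation
        elif g == "IC34":                 # true values of x3 and x4 interchanged
            x[2], x[3] = x[3], x[2]
        elif g == "TF21":                 # part of x1 reported as part of x2
            t = 0.5 * x[0]                # illustrative transferred amount
            x[0] -= t
            x[1] += t
        elif g == "CS4":                  # sign error in x4
            x[3] = -x[3]
        elif g == "CS5":                  # sign error in x5
            x[4] = -x[4]
    return x, applied
```

For instance, `weight(0.10)` returns approximately 2.20, the value listed for FH1.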
Several error localization approaches were applied to this data set. First of all, I tested error localization according to the Fellegi-Holt paradigm (i.e., using only the edit operations FH1–FH5) and according to the new paradigm (i.e., using all edit operations in Table 7.1). Both approaches were tested once using the “ideal” weights listed in Table 7.1 and once with all weights equal to 1 (“no weights”). The latter case simulates a situation where the relevant edit operations would be known, but not their respective frequencies. Finally, to test the robustness of the new error localization approach to a lack of information about relevant edit operations, I also applied this approach with one of the non-FH operations in Table 7.1 missing from the set of allowed edit operations.
The quality of error localization was evaluated in two ways. Firstly, I evaluated how well the optimal paths of edit operations found by the algorithm matched the true distribution of errors, using the following contingency table for all $1{,}025 \times 9 = 9{,}225$ combinations of records and edit operations:
|  | edit operation was suggested | edit operation was not suggested |
|---|---|---|
| associated error occurred | $TP$ | $FN$ |
| associated error did not occur | $FP$ | $TN$ |
From this table, I computed indicators that measure the proportion of false negatives, false positives, and overall wrong decisions, respectively:
$$\alpha =\frac{FN}{TP+FN};\qquad \beta =\frac{FP}{FP+TN};\qquad \delta =\frac{FN+FP}{TP+FN+FP+TN}.$$
Similar indicators are discussed by de Waal et al. (2011, pages 410-411). I also computed $\overline{\rho} = 1 - \rho,$ with $\rho$ the fraction of records in the data set for which the error localization algorithm found exactly the right solution. A good error localization algorithm should have low scores on all four indicators.
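The four indicators are straightforward to compute from the contingency counts; the following Python helper is illustrative (the function and argument names are my own):

```python
def quality_indicators(tp, fn, fp, tn, n_exact, n_records):
    """Compute the indicators alpha, beta, delta and rho-bar.

    tp, fn, fp, tn : counts from the contingency table over all
                     record-by-operation combinations.
    n_exact        : number of records for which the algorithm found
                     exactly the right solution.
    n_records      : total number of records edited.
    """
    alpha = fn / (tp + fn)                    # proportion of false negatives
    beta = fp / (fp + tn)                     # proportion of false positives
    delta = (fn + fp) / (tp + fn + fp + tn)   # proportion of wrong decisions
    rho_bar = 1.0 - n_exact / n_records       # fraction not solved exactly
    return alpha, beta, delta, rho_bar
```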
It should be noted that the above quality indicators put the original FellegiHolt approach at a disadvantage, as this approach does not use all the edit operations listed in Table 7.1. Therefore, I also calculated a second set of quality indicators $\alpha ,\beta ,\delta ,$ and $\overline{\rho}$ that look at erroneous values rather than edit operations. In this case, $\alpha $ measures the proportion of values in the data set that were affected by errors but left unchanged by the optimal solution of the error localization problem, and similarly for the other measures.
Table 7.3 displays the results of the simulation study for both sets of quality indicators. In both cases, a considerable improvement in the quality of the error localization results is seen for the approach that used all edit operations, compared to the approach that used only FH operations. In addition, leaving one relevant edit operation out of the set of allowed edit operations had a negative effect on the quality of error localization. In some cases this effect was quite large, particularly in terms of the indicators based on edit operations, but the results of the new error localization approach still remained substantially better than those of the Fellegi-Holt approach. Contrary to expectation, not using different confidence weights actually improved the quality of the error localization results somewhat for this data set under the Fellegi-Holt approach (both sets of indicators) and to some extent also under the new approach (only the second set of indicators). Finally, using all edit operations led to an increase in computing time compared to using only FH operations, but this increase was not dramatic.
Table 7.3: Results of the simulation study. The first four indicator columns refer to edit operations, the last four to erroneous values.

| approach | $\alpha$ | $\beta$ | $\delta$ | $\overline{\rho}$ | $\alpha$ | $\beta$ | $\delta$ | $\overline{\rho}$ | time* |
|----------|-----|-----|-----|-----|-----|-----|-----|-----|------|
| Fellegi-Holt (weights) | 74% | 12% | 23% | 80% | 19% | 10% | 13% | 32% | 46 |
| Fellegi-Holt (no weights) | 70% | 12% | 21% | 74% | 13% | 8% | 9% | 24% | 33 |
| all operations (weights) | 14% | 3% | 5% | 24% | 10% | 5% | 7% | 17% | 98 |
| all operations (weights) except IC34 | 29% | 5% | 9% | 35% | 15% | 9% | 11% | 29% | 113 |
| all operations (weights) except TF21 | 34% | 5% | 10% | 37% | 10% | 5% | 7% | 18% | 80 |
| all operations (weights) except CS4 | 28% | 6% | 9% | 39% | 10% | 5% | 7% | 17% | 80 |
| all operations (weights) except CS5 | 35% | 7% | 10% | 47% | 11% | 6% | 7% | 18% | 82 |
| all operations (no weights) | 27% | 5% | 8% | 36% | 6% | 4% | 5% | 13% | 99 |