# A generalized Fellegi-Holt paradigm for automatic error localization

## 7. Simulation study

To test the potential usefulness of the new error localization approach, I conducted a small simulation study, using the R environment for statistical computing (R Development Core Team 2015). A prototype of the algorithm in Figure 6.1 was implemented in R, making liberal use of the existing functionality for Fellegi-Holt-based automatic editing in the editrules package (van der Loo and de Jonge 2012; de Jonge and van der Loo 2014). The program was not optimized for computational efficiency, but it turned out to work sufficiently fast for the relatively small error localization problems encountered in this simulation study. (Note: The R code used in this study is available from the author upon request.)

The simulation study involved records of five numerical variables that should satisfy the following nine linear edit rules:

$$\begin{aligned}
x_1 + x_2 &= x_3, & \\
x_3 - x_4 &= x_5, & \\
x_j &\ge 0, & j \in \{1, 2, 3, 4\}, \\
x_1 &\ge x_2, & \\
x_5 &\ge -0.1\, x_3, & \\
x_5 &\le 0.5\, x_3. &
\end{aligned}$$

Edits of this form might typically be encountered for SBS, as part of a much larger set of edit rules (Scholtus 2014).
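As an illustration, a record can be checked against these nine edits with a few lines of code. The sketch below is in Python rather than the R used in the study, and the tolerance handling is my own addition (the paper does not specify one):

```python
import math

# Illustrative helper (not from the paper): check the nine edit rules on a
# record x = (x1, ..., x5), with a small tolerance for the two balance
# equalities, as is usual for floating-point data.
def satisfies_edits(x, tol=1e-8):
    x1, x2, x3, x4, x5 = x
    return (
        math.isclose(x1 + x2, x3, abs_tol=tol)        # x1 + x2 = x3
        and math.isclose(x3 - x4, x5, abs_tol=tol)    # x3 - x4 = x5
        and all(v >= -tol for v in (x1, x2, x3, x4))  # non-negativity
        and x1 >= x2 - tol                            # x1 >= x2
        and x5 >= -0.1 * x3 - tol                     # x5 >= -0.1 x3
        and x5 <= 0.5 * x3 + tol                      # x5 <= 0.5 x3
    )
```

For example, the record (3, 1, 4, 2, 2) satisfies all nine edits, while (1, 3, 4, 2, 2) violates the rule $x_1 \ge x_2$.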

I created a random error-free data set of 2,000 records by drawing from a multivariate normal distribution (using the mvtnorm package) with the following parameters:

Only records that satisfied all of the above edits were added to the data set. Note that $\Sigma$ is a singular covariance matrix that incorporates the two equality edits. Technically, the resulting data follow a so-called truncated multivariate singular normal distribution; see de Waal et al. (2011, pages 318ff) or Tempelman (2007).
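The generation of error-free records can be sketched as rejection sampling. The paper's mean vector and (singular) covariance matrix are not reproduced in this extract, so the parameter values below are hypothetical stand-ins; the two equality edits are built in by drawing only $(x_1, x_2, x_4)$ and deriving $x_3$ and $x_5,$ which induces a singular covariance for the full vector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters for (x1, x2, x4); the paper's actual values are
# not reproduced here.
MU = np.array([60.0, 40.0, 60.0])
COV = np.diag([15.0, 10.0, 12.0]) ** 2

def draw_valid_records(n):
    """Rejection sampling: keep only draws that satisfy all nine edits."""
    records = []
    while len(records) < n:
        x1, x2, x4 = rng.multivariate_normal(MU, COV)
        x3 = x1 + x2          # equality edit x1 + x2 = x3
        x5 = x3 - x4          # equality edit x3 - x4 = x5
        ok = (min(x1, x2, x3, x4) >= 0 and x1 >= x2
              and -0.1 * x3 <= x5 <= 0.5 * x3)
        if ok:
            records.append((x1, x2, x3, x4, x5))
    return records
```

Discarding the rejected draws is what makes the accepted records follow a truncated (singular) multivariate normal distribution.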

Table 7.1 lists the nine allowed edit operations that were considered in this study. Note that the first five lines contain the FH operations for this data set. As indicated in the table, each edit operation has an associated type of error. A synthetic data set to be edited was created by randomly adding errors of these types to the above-mentioned error-free data set. The probability of each type of error is listed in the fourth column of Table 7.1. The associated “ideal” weight according to (4.2) is shown in the last column.
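The tabulated weights are consistent with (4.2) taking the log-odds form $w_g = \ln\left((1 - p_g)/p_g\right);$ for instance, $\ln(0.90/0.10) \approx 2.20.$ Assuming that form (the formula itself is not reproduced in this extract), the last column of Table 7.1 can be recomputed as:

```python
import math

# Assumed form of the "ideal" weight in (4.2): the log-odds of the error
# probability p_g. This reproduces the w_g column of Table 7.1.
def ideal_weight(p):
    return math.log((1 - p) / p)
```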

To limit the amount of computational work, I considered only records that required three or fewer edit operations. Records without errors were also removed. This left 1,025 records to be edited, each containing one, two, or three of the errors listed in Table 7.1.

Table 7.1
Allowed edit operations for the simulation study
| name | operation | associated type of error | ${p}_{g}$ | ${w}_{g}$ |
|------|-----------|--------------------------|-----------|-----------|
| FH1 | impute ${x}_{1}$ | erroneous value of ${x}_{1}$ | 0.10 | 2.20 |
| FH2 | impute ${x}_{2}$ | erroneous value of ${x}_{2}$ | 0.08 | 2.44 |
| FH3 | impute ${x}_{3}$ | erroneous value of ${x}_{3}$ | 0.06 | 2.75 |
| FH4 | impute ${x}_{4}$ | erroneous value of ${x}_{4}$ | 0.04 | 3.18 |
| FH5 | impute ${x}_{5}$ | erroneous value of ${x}_{5}$ | 0.02 | 3.89 |
| IC34 | interchange ${x}_{3}$ and ${x}_{4}$ | true values of ${x}_{3}$ and ${x}_{4}$ interchanged | 0.07 | 2.59 |
| TF21 | transfer an amount from ${x}_{2}$ to ${x}_{1}$ | part of the true value of ${x}_{1}$ reported as part of ${x}_{2}$ | 0.09 | 2.31 |
| CS4 | change the sign of ${x}_{4}$ | sign error in ${x}_{4}$ | 0.11 | 2.09 |
| CS5 | change the sign of ${x}_{5}$ | sign error in ${x}_{5}$ | 0.13 | 1.90 |
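The four non-FH operations can be viewed as simple record transformations, and error injection as applying their inverses at random with the probabilities $p_g$ from Table 7.1. The sketch below is illustrative Python (the study used R); the transferred fraction in TF21 is a hypothetical choice, as the paper's error-generation details are not reproduced here:

```python
import random

def ic34(x):            # interchange x3 and x4 (indices 2 and 3, 0-based)
    x = x[:]; x[2], x[3] = x[3], x[2]; return x

def tf21(x, amount):    # transfer an amount from x2 to x1
    x = x[:]; x[1] -= amount; x[0] += amount; return x

def cs4(x):             # change the sign of x4
    x = x[:]; x[3] = -x[3]; return x

def cs5(x):             # change the sign of x5
    x = x[:]; x[4] = -x[4]; return x

# Error probabilities for the non-FH error types (Table 7.1).
ERROR_PROBS = {"IC34": 0.07, "TF21": 0.09, "CS4": 0.11, "CS5": 0.13}

def add_errors(x, rng=random):
    """Inject each non-FH error type independently with probability p_g.
    Each error is the inverse of its edit operation; FH-type errors
    (overwriting a single value) are omitted here for brevity."""
    if rng.random() < ERROR_PROBS["IC34"]:
        x = ic34(x)                    # the interchange is its own inverse
    if rng.random() < ERROR_PROBS["TF21"]:
        x = tf21(x, -0.3 * x[0])       # report part of x1 under x2
                                       # (the fraction 0.3 is hypothetical)
    if rng.random() < ERROR_PROBS["CS4"]:
        x = cs4(x)
    if rng.random() < ERROR_PROBS["CS5"]:
        x = cs5(x)
    return x
```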

Several error localization approaches were applied to this data set. First of all, I tested error localization according to the Fellegi-Holt paradigm (i.e., using only the edit operations FH1 to FH5) and according to the new paradigm (i.e., using all edit operations in Table 7.1). Both approaches were tested once using the “ideal” weights listed in Table 7.1 and once with all weights equal to 1 (“no weights”). The latter case simulates a situation where the relevant edit operations are known, but not their respective frequencies. Finally, to test the robustness of the new approach to a lack of information about relevant edit operations, I also applied it with one of the non-FH operations in Table 7.1 removed from the set of allowed edit operations.

The quality of error localization was evaluated in two ways. First, I evaluated how well the optimal paths of edit operations found by the algorithm matched the true distribution of errors, using the following contingency table for all $\text{1,025}×9=\text{9,225}$ combinations of records and edit operations:

|  | edit operation was suggested | edit operation was not suggested |
|--|------------------------------|----------------------------------|
| error of this type was present | $TP$ | $FN$ |
| error of this type was not present | $FP$ | $TN$ |
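A plausible way to tally these counts over all record-by-operation combinations is sketched below; `suggested` holds the set of operations in the algorithm's optimal path for each record, and `true` the operations that would undo the errors actually added (the data structures are illustrative, not the author's):

```python
# The nine allowed edit operations of Table 7.1.
OPS = ["FH1", "FH2", "FH3", "FH4", "FH5", "IC34", "TF21", "CS4", "CS5"]

def confusion_counts(suggested, true):
    """Tally TP/FN/FP/TN over all (record, operation) pairs."""
    tp = fn = fp = tn = 0
    for sugg, act in zip(suggested, true):
        for op in OPS:
            if op in act and op in sugg:
                tp += 1
            elif op in act:
                fn += 1
            elif op in sugg:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn
```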

From this table, I computed indicators that measure the proportion of false negatives, false positives, and overall wrong decisions, respectively:

$$\alpha = \frac{FN}{TP + FN}, \qquad \beta = \frac{FP}{FP + TN}, \qquad \delta = \frac{FN + FP}{TP + FN + FP + TN}.$$

Similar indicators are discussed by de Waal et al. (2011, pages 410-411). I also computed $\overline{\rho }=1-\rho ,$ with $\rho$ the fraction of records in the data set for which the error localization algorithm found exactly the right solution. A good error localization algorithm should have low scores on all four indicators.
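Under the assumption that $\alpha,$ $\beta,$ and $\delta$ are the usual false-negative, false-positive, and misclassification proportions (consistent with the prose above), the four indicators can be computed as follows; the record-level sets used for $\overline{\rho}$ are illustrative:

```python
def indicators(tp, fn, fp, tn):
    """Assumed definitions: alpha = FN rate among needed operations,
    beta = FP rate among unneeded ones, delta = overall error rate."""
    alpha = fn / (tp + fn)
    beta = fp / (fp + tn)
    delta = (fn + fp) / (tp + fn + fp + tn)
    return alpha, beta, delta

def rho_bar(suggested, true):
    """1 - (fraction of records for which exactly the right set of
    edit operations was found)."""
    exact = sum(s == t for s, t in zip(suggested, true))
    return 1 - exact / len(true)
```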

It should be noted that the above quality indicators put the original Fellegi-Holt approach at a disadvantage, as this approach does not use all the edit operations listed in Table 7.1. Therefore, I also calculated a second set of quality indicators $\alpha ,\beta ,\delta ,$ and $\overline{\rho }$ that look at erroneous values rather than edit operations. In this case, $\alpha$ measures the proportion of values in the data set that were affected by errors but left unchanged by the optimal solution of the error localization problem, and similarly for the other measures.

Table 7.3 displays the results of the simulation study for both sets of quality indicators. In both cases, a considerable improvement in the quality of the error localization results is seen for the approach that used all edit operations, compared to the approach that used only FH operations. In addition, leaving one relevant edit operation out of the set of allowed edit operations had a negative effect on the quality of error localization. In some cases this effect was quite large, particularly in terms of edit operations used, but the results of the new error localization approach still remained substantially better than those of the Fellegi-Holt approach. Contrary to expectation, not using different confidence weights actually improved the quality of the error localization results somewhat for this data set under the Fellegi-Holt approach (both sets of indicators) and to some extent also under the new approach (only the second set of indicators). Finally, it is seen that using all edit operations led to an increase in computing time compared to using only FH operations, but this increase was not dramatic.

Table 7.3
Quality of error localization in terms of edit operations used and identified erroneous values; computing time required
The first block of indicator columns refers to edit operations, the second to identified erroneous values.

| approach | $\alpha$ | $\beta$ | $\delta$ | $\overline{\rho }$ | $\alpha$ | $\beta$ | $\delta$ | $\overline{\rho }$ | time* |
|----------|----------|---------|----------|--------------------|----------|---------|----------|--------------------|-------|
| Fellegi-Holt (weights) | 74% | 12% | 23% | 80% | 19% | 10% | 13% | 32% | 46 |
| Fellegi-Holt (no weights) | 70% | 12% | 21% | 74% | 13% | 8% | 9% | 24% | 33 |
| all operations (weights) | 14% | 3% | 5% | 24% | 10% | 5% | 7% | 17% | 98 |
| except IC34 | 29% | 5% | 9% | 35% | 15% | 9% | 11% | 29% | 113 |
| except TF21 | 34% | 5% | 10% | 37% | 10% | 5% | 7% | 18% | 80 |
| except CS4 | 28% | 6% | 9% | 39% | 10% | 5% | 7% | 17% | 80 |
| except CS5 | 35% | 7% | 10% | 47% | 11% | 6% | 7% | 18% | 82 |
| all operations (no weights) | 27% | 5% | 8% | 36% | 6% | 4% | 5% | 13% | 99 |