A generalized Fellegi-Holt paradigm for automatic error localization 6. An error localization algorithm

In this section, I propose a relatively simple algorithm to solve the error localization problem of Section 4, using the theoretical result from the previous section.

Figure 1 of the article 14538

Description for Figure 6.1

Figure describing an algorithm that finds all optimal paths of edit operations for problem (4.1).

Step 0 : Let x MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaahIhaaaa@38C0@  be a given record and G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8Nb XFeaaa@430C@  a given set of allowed edit operations. Initialize :=; MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8Ne HWKaaiOoaiabg2da9iabgwGiglaacUdaaaa@4649@   0 :={ }; MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8hl Hi0damaaBaaaleaapeGaaGimaaWdaeqaaOGaaiOoaiabg2da98qada GadaWdaeaapeGaeyybIymacaGL7bGaayzFaaGaai4oaaaa@49D7@   W:=; MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEfacaGG6aGaeyypa0JaeyOhIuQaai4oaaaa@3C8F@  and t:=1. MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshacaGG6aGaeyypa0JaaGymaiaac6caaaa@3BE9@

Step 1 : Determine all subsets GG MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeacqGHgksZtuuDJXwAK1uy0HwmaeHbfv3ySLgzG0uy0Hgi p5wzaGqbaiab=zq8hbaa@45D9@  of cardinality t MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaadshaaaa@38B8@  that satisfy these conditions:

  1. Every subset of t1 MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshacqGHsislcaaIXaaaaa@3A60@  elements in G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@388B@  is part of t1 . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8hl Hi0damaaBaaaleaapeGaamiDaiabgkHiTiaaigdaa8aabeaakiaac6 caaaa@4614@
  2. It holds that gG w g W . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbmaaqababaGaam4Da8aadaWgaaWcbaWdbiaadEgaa8aabeaak8qa cqGHKjYOcaWGxbaaleaacaWGNbGaeyicI4Saam4raaqab0GaeyyeIu oakiaac6caaaa@4287@

Step 2 : For each G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@388B@  found in step 1, construct P( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8Nb XF0aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaaa aa@4740@  and, for each path PP( x;G ), MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadcfacqGHiiIZtuuDJXwAK1uy0HwmaeHbfv3ySLgzG0uy0Hgi p5wzaGqbaiab=zq8hnaabmaapaqaa8qacaWH4bGaai4oaiaadEeaai aawIcacaGLPaaacaGGSaaaaa@4A49@  evaluate whether it can lead to a consistent record. If so, then:

If none of the paths PP( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadcfacqGHiiIZtuuDJXwAK1uy0HwmaeHbfv3ySLgzG0uy0Hgi p5wzaGqbaiab=zq8hnaabmaapaqaa8qacaWH4bGaai4oaiaadEeaai aawIcacaGLPaaaaaa@4999@  lead to a consistent record, add G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@388B@  to t . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8hl Hi0damaaBaaaleaapeGaamiDaaWdaeqaaOGaaiOlaaaa@446C@

Step 3 : If t<R MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshacqGH8aapcaWGsbaaaa@3A93@  and t , MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8hl Hi0damaaBaaaleaapeGaamiDaaWdaeqaaOWdbiabgcMi5kabgwGigl aacYcaaaa@47BA@  define t:=t+1 MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrVfpeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshacaGG6aGaeyypa0JaamiDaiabgUcaRiaaigdaaaa@3D12@  and return to step 1.

In practical applications of error localization in official statistics, it is not unusual to have records of over 100 variables. To obtain a problem that is computationally feasible, existing applications of automatic editing based on the Fellegi-Holt paradigm usually specify an upper bound M MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaad2eaaaa@389B@ on the number of variables that may be imputed in a single record (e.g., M = 12 MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaad2eacqGH9aqpcaaIXaGaaGOmaaaa@3B18@ or M = 15 ) . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbmaabiaabaGaamytaiabg2da9iaaigdacaaI1aaacaGLPaaacaGG Uaaaaa@3C95@ de Waal and Coutinho (2005) argued that the introduction of such an upper bound is reasonable because a record that requires more than, say, fifteen imputations should be considered unfit for automatic editing anyway. Following this tradition, one can also introduce an upper bound R MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaadkfaaaa@38A0@ on the number of distinct edit operations that may be applied to a single record. Even with this additional restriction, the search space of potential solutions to (4.1) will usually be too large in practice to find the optimal solution by an exhaustive search.

Figure 6.1 summarizes the proposed error localization algorithm. Its basic set-up was inspired by the apriori algorithm of Agrawal and Srikant (1994) for data mining. Upon completion, the algorithm returns a set MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae8Ne HWeaaa@4257@ containing all paths of allowed edit operations that correspond to an optimal solution to (4.1), as well as the optimal path length W . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEfacaGGUaaaaa@3957@ [Note: An error localization problem may have multiple optimal solutions, and it may be beneficial to find all of them (Giles 1988; de Waal et al. 2011, pages 66-67).]

After initialization in step 0, the algorithm cycles through steps 1, 2, and 3 at most R MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaadkfaaaa@38A0@ times. In step 1 of the algorithm, the search space is limited by using the following fact: if G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ has a proper subset H G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadIeacqGHckcZcaWGhbaaaa@3B5E@ for which P( x;H ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaamisaaGaayjkaiaawMcaaa aa@476D@ contains a path that leads to a consistent record, then P( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaaa aa@476C@ can contain only suboptimal solutions. Thus, any set G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ that has such a subset may be ignored by the algorithm. Similarly, G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ may also be ignored whenever the total weight of the edit operations in G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ exceeds the path length of the best feasible solution found so far.

During the t th MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshapaWaaWbaaSqabeaapeGaaeiDaiaabIgaaaaaaa@3AF0@ iteration, the number of subsets G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ encountered in step 1 of the algorithm equals ( N t ) . MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbmaabmaapaqaaKqzaeqbaeqabiqaaaGcbaqcLbqapeGaamOtaaGc paqaaKqzaeWdbiaadshaaaaakiaawIcacaGLPaaajugybiaac6caaa a@3E45@ For each of these subsets, the conditions in step 1 have to be checked. If a subset G MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadEeaaaa@3895@ passes these checks, in step 2 all t ! MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaadshacaGGHaaaaa@3967@ paths in P( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpipeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaaa aa@476C@ are evaluated using the theory of Section 5. The idea behind the apriori algorithm is that, as t MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa Wdbiaadshaaaa@38B2@ becomes larger, the majority of subsets will not pass the checks in the first step, so that the total amount of computational work remains limited. In the context of data mining, this desirable behavior has indeed been observed in practice. Whether it also occurs in the context of error localization remains to be seen.

One possible improvement to the algorithm can be made by observing that the order in which edit operations are applied does not matter in all cases. Sometimes two paths in P( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaaa aa@475C@ are equivalent in the sense that any record that can be reached from x MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaqaaaaaaaaa WdbiaahIhaaaa@38BA@ by the first path can also be reached by the second path, and vice versa. This property defines an equivalence relation on P( x;G ). MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaai aac6caaaa@480E@ Let P ˜ ( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaieaaqaaaaa aaaaWdbiqa=bfapaGbaGaapeWaaeWaa8aabaWdbiaahIhacaGG7aGa am4raaGaayjkaiaawMcaaaaa@3CF5@ be a set that contains one representative from each equivalence class of P( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaaa aa@475C@ under this relation. Clearly, the algorithm in Figure 6.1 remains correct if in step 2 the search is limited to P ˜ ( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaieaaqaaaaa aaaaWdbiqa=bfapaGbaGaapeWaaeWaa8aabaWdbiaahIhacaGG7aGa am4raaGaayjkaiaawMcaaaaa@3CF5@ instead of P( x;G ). MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaai aac6caaaa@480E@ Scholtus (2014) provides a simple method for constructing P ˜ ( x;G ) MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaaieaaqaaaaa aaaaWdbiqa=bfapaGbaGaapeWaaeWaa8aabaWdbiaahIhacaGG7aGa am4raaGaayjkaiaawMcaaaaa@3CF5@ from P( x;G ). MathType@MTEF@5@5@+= feaagKart1ev2aqatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr 4rNCHbGeaGqiFu0Je9sqqrpepC0xbbL8F4rqqrpepeea0xe9Lqpe0x e9q8qqvqFr0dXdHiVc=bYP0xH8peuj0lXxdrpe0=1qpeeaY=rrVue9 Fve9Fve8meaabaqaciGacaGaaeqabaWaaeaaeaaakeaatuuDJXwAK1 uy0HwmaeHbfv3ySLgzG0uy0Hgip5wzaGqbaabaaaaaaaaapeGae83f XJ1aaeWaa8aabaWdbiaahIhacaGG7aGaam4raaGaayjkaiaawMcaai aac6caaaa@480E@

A detailed example illustrating the above algorithm can be found in Scholtus (2014).

Date modified: