Survey Methodology
A method for estimating the effect of classification errors on statistics for two domains

by Yanzhe Li, Sander Scholtus and Arnout van Delden

  • Release date: January 3, 2024

Abstract

Quantifying the accuracy (bias and variance) of published output is crucial in official statistics. Output in official statistics is nearly always broken down into subpopulations according to some classification variable, for example mean income by level of education. Such output is referred to as domain statistics. In this paper, we limit ourselves to binary classification variables. In practice, misclassifications occur, and these contribute to the bias and variance of domain statistics. Existing analytical and numerical methods for estimating this effect have two disadvantages: they require the misclassification probabilities to be known beforehand, and the bias and variance estimates they produce are themselves biased. We present a new method, referred to as the EM bootstrap method, in which a Gaussian mixture model is estimated by an Expectation-Maximisation (EM) algorithm and combined with a bootstrap. The new method does not require the misclassification probabilities to be known beforehand, although it is more efficient when a small audit sample is used to obtain a starting value for the misclassification probabilities in the EM algorithm. We compared the performance of the new method with that of two currently available numerical methods: the bootstrap method and the SIMEX method. Previous research has shown that, for non-linear parameters, the bootstrap outperforms the analytical expressions. For nearly all conditions tested, the bias and variance estimates obtained by the EM bootstrap method are closer to their true values than those obtained by the bootstrap and SIMEX methods. We end the paper by discussing the results and possible future extensions of the method.
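
As a rough illustration of the kind of model described above, the following Python sketch fits a two-component Gaussian mixture by EM when the observed domain labels are error-prone. It is not the authors' implementation: the model form (one normal distribution per true domain, and a per-domain probability that the observed label is correct), the function names, the starting value of 0.9 and all simulation settings are assumptions made for illustration only. The bootstrap step is omitted; one way to add it would be to refit the model on resampled data.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def class_means(y, labels):
    """Mean of y within label 0 and within label 1."""
    means = []
    for k in (0, 1):
        vals = [v for v, c in zip(y, labels) if c == k]
        means.append(sum(vals) / len(vals))
    return means


def em_noisy_labels(y, observed, n_iter=100):
    """EM for a two-domain Gaussian mixture with error-prone labels.

    Hypothetical model: true domain s in {0, 1}, y | s ~ N(mu[s], sigma[s]^2),
    and the observed label equals s with probability correct[s].
    """
    n = len(y)
    # Starting values: naive statistics based on the observed labels.
    mu = class_means(y, observed)
    sigma = []
    for k in (0, 1):
        vals = [v for v, c in zip(y, observed) if c == k]
        sigma.append(math.sqrt(sum((v - mu[k]) ** 2 for v in vals) / len(vals)))
    pi1 = sum(observed) / n
    correct = [0.9, 0.9]  # assumed start; an audit sample could supply this
    for _ in range(n_iter):
        # E-step: posterior probability that each unit truly belongs to domain 1.
        tau = []
        for v, c in zip(y, observed):
            joint = []
            for k in (0, 1):
                prior = pi1 if k == 1 else 1.0 - pi1
                p_label = correct[k] if c == k else 1.0 - correct[k]
                joint.append(prior * p_label * normal_pdf(v, mu[k], sigma[k]))
            tau.append(joint[1] / (joint[0] + joint[1]))
        # M-step: weighted updates of all parameters.
        weights = [[1.0 - t for t in tau], tau]
        sizes = [n - sum(tau), sum(tau)]
        for k in (0, 1):
            mu[k] = sum(w * v for w, v in zip(weights[k], y)) / sizes[k]
            var = sum(w * (v - mu[k]) ** 2 for w, v in zip(weights[k], y)) / sizes[k]
            sigma[k] = math.sqrt(var)
            correct[k] = sum(w for w, c in zip(weights[k], observed) if c == k) / sizes[k]
        pi1 = sizes[1] / n
    return {"mu": mu, "sigma": sigma, "pi1": pi1, "correct": correct}


# Simulated data (all settings are illustrative assumptions).
random.seed(1)
n_units = 2000
true_domain = [1 if random.random() < 0.5 else 0 for _ in range(n_units)]
y = [random.gauss(5.0 if s == 1 else 0.0, 1.0) for s in true_domain]
# Each label is recorded correctly with probability 0.8.
observed = [s if random.random() < 0.8 else 1 - s for s in true_domain]

naive_means = class_means(y, observed)  # domain means from error-prone labels
est = em_noisy_labels(y, observed)      # domain means under the mixture model
print("naive domain means:", naive_means)
print("EM domain means:   ", est["mu"])
print("estimated P(label correct):", est["correct"])
```

In this simulated setting the naive domain means are pulled towards each other by the misclassified units, whereas the mixture model estimates the domain means and the misclassification probabilities jointly from the data.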

Key Words: Bias; Variance; Misclassification; Binary classifier; Gaussian mixture model; EM algorithm.

How to cite

Li, Y., Scholtus, S. and van Delden, A. (2023). A method for estimating the effect of classification errors on statistics for two domains. Survey Methodology, Statistics Canada, Catalogue No. 12‑001‑X, Vol. 49, No. 2. Paper available at http://www.statcan.gc.ca/pub/12-001-x/2023002/article/00002-eng.htm.
