Accuracy estimation with clustered dataset - ARCHIVED

Articles and reports: 11-522-X20020016737

Description:

When the dataset used for machine learning comes from cluster sampling (e.g., patients drawn from a sample of hospital wards), the usual cross-validation estimate of the error rate can be biased and misleading. This technical paper describes a cross-validation procedure adapted to this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under either a cluster sampling or a simple random sampling hypothesis, is compared with the true value. The results highlight the impact of the sampling design on inference: clustering has a significant impact, and the split between learning set and test set should result from a random partition of the clusters, not from a random partition of the examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is unreliable for model selection. These results are illustrated with a real application: automatic identification of spoken language.
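
The partition rule stated in the description (assign whole clusters, not individual examples, to the learning and test sets) can be sketched as follows. This is a minimal illustration and not necessarily the procedure used in the paper: the function name cluster_kfold, its parameters, and the ward labels are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_kfold(cluster_ids, k, seed=0):
    """Return k (train_idx, test_idx) pairs that hold out whole clusters.

    cluster_ids[i] is the cluster label of example i (hypothetical input).
    """
    # Group example indices by the cluster they belong to.
    by_cluster = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        by_cluster[c].append(i)

    clusters = sorted(by_cluster)
    if k > len(clusters):
        raise ValueError("k cannot exceed the number of clusters")

    # Randomly partition the clusters (not the examples) into k folds.
    random.Random(seed).shuffle(clusters)

    folds = []
    for f in range(k):
        held_out = clusters[f::k]
        held_out_set = set(held_out)
        test_idx = [i for c in held_out for i in by_cluster[c]]
        train_idx = [i for i, c in enumerate(cluster_ids)
                     if c not in held_out_set]
        folds.append((train_idx, test_idx))
    return folds


# Illustrative data: 12 patients drawn from 5 hospital wards (made up).
wards = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"]
folds = cluster_kfold(wards, k=3, seed=1)
```

In every fold, no ward contributes examples to both the learning set and the test set, so the error estimate is not flattered by within-cluster similarity. Libraries such as scikit-learn offer comparable group-aware splitters (for example GroupKFold).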

Issue Number: 2002001
Author(s): Chauchat, Jean-Hughes; Pellegrino, François; Rakotomalala, Ricco
Format    Release date          More information
CD-ROM    September 13, 2004
PDF       September 13, 2004