Statistical Methodology Research and Development Program Achievements, 2022/2023
1.  Data integration

1.1  Integration of probability and non-probability samples

PROJECT: Handling non-probability samples through inverse probability weighting

Non-probability samples are being increasingly explored at Statistics Canada and other National Statistical Offices as an alternative to probability samples. However, it is well known that the use of a non-probability sample alone may produce estimates with significant bias due to the unknown nature of the underlying selection mechanism. To reduce this bias, data from a non-probability sample can be integrated with data from a probability sample provided that both samples contain auxiliary variables in common.

In this research, we focused on inverse probability weighting methods, which involve modelling the probability of participation in the non-probability sample. As a starting point, we considered the logistic model along with the pseudo maximum likelihood method of Chen, Li and Wu (2020). In previous years, we proposed a variable selection procedure based on a modified Akaike Information Criterion (AIC) that properly accounts for the data structure and the probability sampling design. We also proposed a simple rank-based method of forming homogeneous post-strata. In addition, we extended the Classification and Regression Trees (CART) algorithm to this data integration scenario, while again properly accounting for the probability sampling design. Our modified version of CART is called nppCART. Finally, we proposed a bootstrap variance estimator that reflects two sources of variability: the probability sampling design and the participation model.
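
To make the estimation step concrete, the sketch below implements, in R, the pseudo maximum likelihood method of Chen, Li and Wu (2020) under a logistic participation model, followed by the resulting inverse probability weighting (Hájek-type) estimator of a mean. The input names are hypothetical (xA and yA for the non-probability sample; xB and design weights dB for the probability sample), and refinements such as variable selection and the formation of homogeneous post-strata are omitted.

# Pseudo maximum likelihood estimation of the logistic participation model
# (Chen, Li and Wu, 2020) by Newton-Raphson. xA: covariate matrix of the
# non-probability sample; xB, dB: covariate matrix and design weights of the
# probability sample (same columns, including an intercept).
fit_pseudo_mle <- function(xA, xB, dB, tol = 1e-8, maxit = 50) {
  theta <- rep(0, ncol(xA))
  for (it in seq_len(maxit)) {
    pB    <- plogis(drop(xB %*% theta))           # participation probabilities
    score <- colSums(xA) - colSums(dB * pB * xB)  # pseudo-score function
    hess  <- -crossprod(xB, dB * pB * (1 - pB) * xB)
    step  <- drop(solve(hess, score))
    theta <- theta - step
    if (max(abs(step)) < tol) break
  }
  theta
}

# Hájek-type inverse probability weighting estimator of the population mean
# of y, observed only in the non-probability sample.
ipw_mean <- function(yA, xA, theta) {
  pA <- plogis(drop(xA %*% theta))
  sum(yA / pA) / sum(1 / pA)
}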

We applied different inverse probability weighting methods to real probability and non-probability survey data collected by Statistics Canada. A main conclusion of our experiments is that inverse probability weighting methods are successful at bias reduction, but often some bias remains. We also observed the importance of forming homogeneous groups to stabilize estimates. The nppCART algorithm performed well with these data. Logistic regression with main effects only is also a reasonable option provided the estimated participation probabilities from the logistic model are used to form homogeneous groups.

Progress:

In the current year, we completed the writing of a paper that was submitted to Survey Methodology. After revision, taking into account the reviewers’ comments, the paper was accepted for publication in the journal (Beaumont, Bosa, Brennan, Charlebois and Chu, 2023).

For more information, please contact:
Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

References

Beaumont, J.-F., Bosa, K., Brennan, A., Charlebois, J. and Chu, K. (2023). Handling non-probability samples through inverse probability weighting with an application to Statistics Canada’s crowdsourcing data. Survey Methodology (accepted in 2023 and expected to appear in 2024).

Chen, Y., Li, P. and Wu, C. (2020). Doubly robust inference with non-probability survey samples. Journal of the American Statistical Association, 115, 2011-2021.

PROJECT: An Approximate Bayesian Approach to Integrating Data from Probability and Non-Probability Samples

In this data integration project, we consider the scenario where the survey and auxiliary variables are observed in both a probability and non-probability sample. Our objective is to use data from the non-probability sample to improve the efficiency of survey-weighted estimates obtained from the probability sample. Recently, Sakshaug, Wiśniowski, Ruiz and Blom (2019) and Wiśniowski, Sakshaug, Ruiz and Blom (2020) proposed a Bayesian approach to integrating data from both samples for the estimation of model parameters. In their approach, the non-probability sample data are used to determine the prior distribution of model parameters, and the posterior distribution is obtained under the assumption that the probability sampling design is ignorable (or not informative). The goal of this project was to extend this Bayesian approach to the prediction of finite population parameters under a non-ignorable (or informative) probability sampling design.

In previous years, we proposed an approximate Bayesian procedure that accounts for the probability sampling design by conditioning on appropriate survey-weighted statistics, following Wang, Kim and Yang (2018), and conducted simulation experiments. The main conclusion of our experiments was that our Bayesian approach may yield efficiency gains over survey-weighted estimators, even in a situation where the non-probability sample is highly informative, provided the prior variance of model parameters is carefully chosen. However, it also led to efficiency losses in a scenario where the correlation between the survey and auxiliary variables was weak.
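
The essence of the approach can be illustrated with a minimal normal-normal sketch: the non-probability sample supplies a prior for the mean, the probability sample supplies an approximate normal likelihood through a survey-weighted estimator and its design variance (in the spirit of Wang, Kim and Yang, 2018), and the posterior is the usual precision-weighted combination. The function below is a hypothetical illustration, not the procedure of You, DaSylva and Beaumont (2023); in particular, the inflation factor k stands in for the careful choice of prior variance mentioned above.

# Minimal normal-normal sketch of approximate Bayesian data integration.
# theta_np, v_np: estimate and variance from the non-probability sample
# (prior); theta_p, v_p: survey-weighted estimate and its design variance
# (approximate likelihood); k >= 1 inflates the prior variance to guard
# against bias in the non-probability data.
approx_bayes <- function(theta_np, v_np, theta_p, v_p, k = 1) {
  prior_prec <- 1 / (k * v_np)
  lik_prec   <- 1 / v_p
  post_var   <- 1 / (prior_prec + lik_prec)
  post_mean  <- post_var * (prior_prec * theta_np + lik_prec * theta_p)
  c(mean = post_mean, variance = post_var)
}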

Progress:

This project was presented at the June 2022 meeting of Statistics Canada’s Advisory Committee on Statistical Methods (ACSM) and at the 2022 Joint Statistical Meetings. Following ACSM advice, we conducted additional simulation experiments to strengthen our conclusions. We are also finalizing the writing of a paper that we plan to submit to a peer-reviewed statistical journal (You, DaSylva and Beaumont, 2023).

For more information, please contact:
Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

References

Sakshaug, J.W., Wiśniowski, A., Ruiz, D.A.P. and Blom, A.G. (2019). Supplementing small probability samples with nonprobability samples: A Bayesian approach. Journal of Official Statistics, 35, 653-681.

Wang, Z., Kim, J.K. and Yang, S. (2018). Approximate Bayesian inference under informative sampling. Biometrika, 105, 91-102.

Wiśniowski, A., Sakshaug, J.W., Ruiz, D.A.P. and Blom, A.G. (2020). Integrating probability and nonprobability samples for survey inference. Journal of Survey Statistics and Methodology, 8, 120-147.

You, Y., DaSylva, A. and Beaumont, J.-F. (2023). An approximate Bayesian approach to estimation of population means by integrating data from probability and non-probability samples. Draft manuscript to be submitted to a peer-reviewed statistical journal.

PROJECT: Statistical data integration using a prediction approach

We investigated how a big non-probability database can be used to improve estimates from a small probability sample through data integration techniques. In the situation where the survey variable is observed in both data sources, Kim and Tam (2021) proposed two design-consistent estimators that can be justified through dual frame survey theory. In the previous year, we determined conditions ensuring that these estimators are more efficient than the Horvitz-Thompson estimator when the probability sample is selected using either Poisson sampling or simple random sampling without replacement. We also studied, through a simulation study, a class of predictors, following Särndal and Wright (1984), that handles the case where the non-probability database contains auxiliary variables but no survey variable. The probability survey collects the survey variables, and we assume that the non-probability database can be linked to the probability sample. This case is relevant to a survey on postal traffic conducted by La Poste in France. This project involves a collaboration with La Poste as well as the Toulouse School of Economics and the University of Besançon.
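
In the first setting, where the survey variable is observed in both sources, the basic form of the combined estimator can be sketched as follows. This is a simplified illustration in the spirit of Kim and Tam (2021), not a verbatim transcription of their estimators; the inputs are hypothetical (yB: y-values in the non-probability database; yA, dA, deltaA: y-values, design weights and big-data membership indicators in the probability sample).

# Data integration estimator of a population total: the big data total is
# combined with a design-weighted correction for the part of the population
# missed by the non-probability database.
kt_total <- function(yB, yA, dA, deltaA) {
  sum(yB) + sum(dA * (1 - deltaA) * yA)
}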

Progress:

The initial version of our paper was submitted to Survey Methodology. During the year, we revised our paper following the reviewers’ recommendations. The paper has recently been accepted for publication in the journal (Medous, Goga, Ruiz-Gazen, Beaumont, Dessertaine and Puech, 2023).

For more information, please contact:
Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

References

Kim, J.-K., and Tam, S.-M. (2021). Data integration by combining big data and survey sample data for finite population inference. International Statistical Review, 89, 382-401.

Medous, E., Goga, C., Ruiz-Gazen, A., Beaumont, J.-F., Dessertaine, A. and Puech, P. (2023). QR prediction for statistical data integration. Survey Methodology, 49 (to appear).

Särndal, C.-E., and Wright, R. (1984). Cosmetic form of estimators in survey sampling. Scandinavian Journal of Statistics, 11, 146-156.

PROJECT: Disaggregation Plan for the Survey of Household Spending Food Data

The Survey of Household Spending (SHS) collects detailed information on household expenditures, using both a questionnaire and a one-week expenditure diary. The SHS has been conducted every other year since 2017, and data collection is continuous throughout the collection year to account for seasonal variations in spending. Food purchased from stores is collected via the SHS diary and represents a fairly large portion of a household budget (11% in 2019). It is made up of about 250 different food categories, and since respondents report their food purchases only for a short one-week period, the domains for which statistics can be produced are strongly limited (due to high CVs). Currently, the survey releases a biennial table of detailed food spending at the national and provincial levels, and for other domains, it often limits the food information to the 8 highest-level food categories.

The objective of this research is to develop a disaggregation plan for the SHS food data, with the aim that a similar approach could be used by other programs that use continuous collection throughout the year and have access to alternative data sources. Two alternative data sources now available at Statistics Canada were identified: retail scanner data and non-probabilistic (volunteer) household panel spending data. The project aims to apply data integration methods to obtain adjusted estimates from these alternative data sources. Small area estimation methods would then be used, with these adjusted estimates as auxiliary variables, to improve the precision of the released information along three potential dimensions: time (monthly or quarterly instead of biennial), location (metropolitan areas and regions within provinces instead of national/provincial) and contents (more detailed food categories instead of only the 8 highest-level food categories).

Progress:

The progress made during 2022-2023 consists of an initial evaluation of inverse propensity weighting methods for non-probability samples, aimed at improving the representativity of the food spending data from the non-probabilistic panel. The inverse propensity weighting methods for non-probabilistic datasets developed by Chen, Li and Wu (2020), and extended by Beaumont, Bosa, Brennan, Charlebois and Chu (2023), were adapted to the panel data. Auxiliary variables potentially able to explain frequent participation in the panel, and available in both the panel and a probability sample, were identified. Logistic regression with variable selection methods was applied, with various model options, including whether to allow pairwise interactions between auxiliary variables and whether to create homogeneous classes based on the predicted propensities.

These methods did not succeed in notably reducing the bias in the 2019 panel data when compared to the 2019 SHS, overall and along the domains of province, income quintile, and household size. As the investigation of the panel data began, it became clear that there were significant limitations and that participation bias was not the main contributor to the bias observed; rather, widespread incomplete reporting and variations in reporting among different sub-types of food items (e.g., with or without a scannable bar code) were even more consequential for the suitability of this data source for the purposes envisaged. An internal report has been written (Charlebois, 2023) detailing the attempt to apply the data integration methods to the panel data source. The next steps for this project include investigating the second alternative data source, retail scanner data, with a view towards producing small area estimates with the Fay-Herriot model using SHS data.

For more information, please contact:
Joanne Charlebois (613-875-5407, joanne.charlebois@statcan.gc.ca).

References

Beaumont, J.-F., Bosa, K., Brennan, A., Charlebois, J. and Chu, K. (2023). Handling non-probability samples through inverse probability weighting with an application to Statistics Canada’s crowdsourcing data. Survey Methodology (accepted in 2023 and expected to appear in 2024).

Charlebois, J. (2023). Nielsen Homescan Spending Data: Weighting by inverse propensity modelling of “frequent participation” in the Panel. Internal report, Social Survey Methods Division, Statistics Canada.

Chen, Y., Li, P. and Wu, C. (2020). Doubly robust inference with non-probability survey samples. Journal of the American Statistical Association, 115, 2011-2021.

1.2  Record linkage

PROJECT: Identification of duplicate records

Duplicate records are records from the same unit in a data source, regardless of whether they are identical. Their identification is required when the source is used to produce official statistics, such as a sampling frame or a census. Fortini, Liseo, Nuccitelli and Scanu (2001), Tancredi and Liseo (2011), Sadinle (2017) and Steorts, Hall and Fienberg (2016) have described Bayesian models that perform this task in an automated manner, i.e., without expensive clerical reviews. Yet, these models involve computer-intensive procedures and tend to assume that the linkage variables are conditionally independent, which is seldom the case in practice.

Progress:

A new model has been proposed for applications where one can reasonably assume that each unit is associated with at most two records because duplication is rare, as with private dwellings in the Census of Population. The duplication is modeled through the number of links from a given record, as in a recent model of linkage errors (Dasylva and Goussanou, 2022a), while extending the latter to account for the multiplicity of false positives from other units. The model was presented at the Annual Meeting of the Statistical Society of Canada, with a corresponding proceedings paper (Dasylva and Goussanou, 2022b) that includes Monte Carlo simulations based on public census data.

For more information, please contact:
Abel Dasylva (613-408-4850, abel.dasylva@statcan.gc.ca).

References

Dasylva, A., and Goussanou, A. (2022a). On the consistent estimation of linkage errors without training data. Japanese Journal of Statistics and Data Science. Available at https://doi.org/10.1007/s42081-022-00153-3.

Dasylva, A., and Goussanou, A. (2022b). A new model for the automated identification of duplicate records. In Proceedings of the Survey Methods Section, Statistical Society of Canada. Available at https://ssc.ca/sites/default/files/imce/dasylva_ssc2022.pdf.

Fortini, M., Liseo, B., Nuccitelli, A. and Scanu, M. (2001). On Bayesian record linkage. Research in Official Statistics, 4, 185-198.

Sadinle, M. (2017). Bayesian estimation of bipartite matchings for record linkage. Journal of the American Statistical Association, 112, 600-612.

Steorts, R., Hall, R. and Fienberg, S. (2016). A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association, 111, 1660-1672.

Tancredi, A., and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size problems. Annals of Applied Statistics, 5, 1553-1585.

PROJECT: Capture-recapture with linkage errors

To reduce the response burden and costs, Statistics Canada is prioritizing the use of administrative sources for the production of official statistics (Rancourt, 2018). To do so, the Agency has developed a set of related quality indicators, including the under-coverage and over-coverage of a given source (Sirois, 2021), which are currently measured through costly field operations or clerical reviews (Oyarzun and Rodrigue, 2023). The capture-recapture method may offer a cost-effective alternative for measuring the under-coverage through a comparison to another source under standard assumptions, which include the independence of the sources and their perfect linkage. However, it must be modified when the linkage is imperfect, typically because there is no unique identifier that is common to the two sources. Many such adaptations have been proposed that retain the standard capture-recapture assumptions apart from perfect linkage. Ding and Fienberg (1994), Di Consiglio and Tuoto (2015) and de Wolf, van der Laan and Zult (2019) have described solutions that still require clerical reviews. Racinskij, Smith and van der Heijden (2019) have instead proposed a solution that dispenses with clerical reviews, at the expense of making the strong assumption that the linkage variables are conditionally independent. Dasylva, Goussanou and Nambeu (2021) have proposed a solution that builds on the error model described by Dasylva and Goussanou (2022), while dispensing with clerical reviews and allowing more general forms of dependence among the linkage variables. However, when the linkage does not have perfect recall, the coverage must be estimated by fitting this model repeatedly based on a log-linear specification of the interactions in the matched pairs, which is inefficient.
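
To fix ideas, a minimal sketch of the capture-recapture calculation under imperfect linkage follows. It uses the Lincoln-Petersen form with the observed number of links corrected through the linkage precision and recall; this conveys the general idea rather than the specific estimator of Dasylva, Goussanou and Nambeu (2021), and all inputs are hypothetical.

# Lincoln-Petersen estimate of the population size with a simple correction
# for linkage errors. n1, n2: counts in the two sources; n_links: number of
# linked pairs; precision: proportion of links that are true matches;
# recall: proportion of true matches that are linked (both estimable without
# clerical reviews following Dasylva and Goussanou, 2022).
lp_adjusted <- function(n1, n2, n_links, precision, recall) {
  m <- n_links * precision / recall  # corrected number of true matches
  n1 * n2 / m                        # estimated population size
}
# The under-coverage of source 1 can then be gauged as
# 1 - n1 / lp_adjusted(n1, n2, n_links, precision, recall).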

Progress:

The model described by Dasylva and Goussanou (2022) has been generalized into a multivariate model. Whereas the original model is based on the number of links from a given record for a single linkage rule, the multivariate extension is based on a vector of such variables, each corresponding to a distinct linkage rule from a set of mutually exclusive rules, i.e., each record pair is linked by at most one rule. This extension is used to estimate the coverage based on a log-linear specification of the interactions in the matched pairs, and it leads to a more efficient estimator of the coverage than that obtained by fitting the univariate model repeatedly when the recall is not perfect. The new model is described in a report (Dasylva, Goussanou and Nambeu, 2023) and has been submitted to a peer-reviewed journal.

For more information, please contact:
Abel Dasylva (613-408-4850, abel.dasylva@statcan.gc.ca).

References

Dasylva, A., and Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Japanese Journal of Statistics and Data Science. Available at https://doi.org/10.1007/s42081-022-00153-3.

Dasylva, A., Goussanou, A. and Nambeu, C.-O. (2021). Measuring the undercoverage of two data sources with a nearly perfect coverage through capture and recapture in the presence of linkage errors. Proceedings: Symposium 2021, Adopting Data Science in Official Statistics to Meet Society’s Emerging Needs, Statistics Canada. Available at https://www150.statcan.gc.ca/n1/pub/11-522-x/2021001/article/00006-eng.pdf.

Dasylva, A., Goussanou, A. and Nambeu, C.-O. (2023). Measuring the coverage of two data sources through capture-recapture with linkage errors. Internal report, Statistics Canada.

de Wolf, P.-P., van der Laan, J. and Zult, D. (2019). Connection correction methods for linkage error in capture-recapture. Journal of Official Statistics, 35, 577-597.

Di Consiglio, L., and Tuoto, T. (2015). Coverage evaluation on probabilistically linked data. Journal of Official Statistics, 31, 415-429.

Ding, Y., and Fienberg, S.E. (1994). Dual system estimation of Census undercount in the presence of matching error. Survey Methodology, 20, 2, 149-158. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1994002/article/14422-eng.pdf.

Oyarzun, J., and Rodrigue, J.-F. (2023). Building key quality indicators for the Statistical Building Register. Presentation to the Advisory Committee on Statistical Methods, Statistics Canada, May 2023.

Racinskij, V., Smith, P. and van der Heijden, P. (2019). Linkage free dual system estimation. Available at https://arxiv.org/abs/1903.10894.

Rancourt, E. (2018). Admin-first as a statistical paradigm for Canadian official statistics: Meaning, challenges and opportunities. Proceedings: Symposium 2018, Combine to Conquer: Innovations in the Use of Multiple Sources of Data, Statistics Canada. Available at https://www.statcan.gc.ca/eng/conferences/symposium2018/program/03a2_rancourt-eng.pdf.

Sirois, M. (2021). Coverage quality indicators. Internal presentation, Statistics Canada, June 2021.

PROJECT: Probabilistic record linkage based on recursive partitioning without training data

The probabilistic method of record linkage (Fellegi and Sunter, 1969) is often preferred when linking records without a unique identifier (Enamorado, Fifield and Imai, 2019; Bianchi Santiago, Colon Jordan and Valdes, 2020). For example, at Statistics Canada, it is used intensively in the Social Data Linkage Environment (Statistics Canada, 2022), which provides the linked data for many social, health and environmental studies. The probabilistic method is appealing because it provides a principled way of minimizing the linkage errors for a given set of features. However, it falls short of prescribing these features, which are still selected manually from experience, a labour-intensive process that does not guarantee the optimality of the chosen features. Another challenge is the estimation of the decision parameters because of interactions among the features. Finally, the probabilistic method is often perceived as complex and non-intuitive by those familiar with the deterministic or hierarchical methods of record linkage. Recursive partitioning is ideally placed to address these limitations of the probabilistic method, but its application to record linkage has been hampered by the common lack of training data. In earlier attempts at using decision trees for record linkage, Elfeky, Verykios, Elmagarmid, Ghanem and Huwait (2003) train the tree on the result of an unsupervised two-means clustering procedure, which has its own limitations (Quadir and Bao, 2016; Quadir, 2017), while Feigenbaum (2016) trains the tree on a clerical review sample, which may be costly to source.

Progress:

A new record linkage methodology has been developed that blends a particular form of recursive partitioning with the probabilistic method, while dispensing with training data or the ground truth (Chen, 2022; Dasylva and Chen, 2022). In this methodology, recursive partitioning is used to select the features and partition the record pairs into groups where the conditional match probability (i.e., the conditional probability that two records represent the same unit given the observed features) is as homogeneous as possible. The actual procedure is a variation of that proposed by Breiman, Friedman, Olshen and Stone (1984), in which the tree cost function is estimated with the model described by Dasylva and Goussanou (2022), implicitly accounting for all interactions among the selected features at each node. The resulting tree is then used to establish optimal links, by assigning a weight to each leaf and linking with a positive probability the pairs whose leaf weight is no less than a threshold. The methodology has been implemented in R with the rpart package (Therneau, Atkinson and Ripley, 2009) and a custom splitting function.
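
For readers unfamiliar with the rpart interface, the skeleton below shows how a user-written splitting rule is plugged in through a method list, following the pattern of the rpart vignette on user-written split functions. The node cost here is an ordinary within-node sum of squares, used only as a stand-in; the actual methodology instead estimates the cost from the model of Dasylva and Goussanou (2022), and the variable and data names are hypothetical.

# Skeleton of a custom rpart splitting method. The deviance below is a plain
# within-node sum of squares; the project's methodology replaces it with a
# cost estimated from the model of Dasylva and Goussanou (2022).
library(rpart)

init_fun <- function(y, offset, parms, wt) {
  list(y = c(y), parms = parms, numresp = 1, numy = 1,
       summary = function(yval, dev, wt, ylevel, digits)
         paste("mean =", format(yval), ", deviance =", format(dev)))
}

eval_fun <- function(y, wt, parms) {  # node label and node cost
  wmean <- sum(y * wt) / sum(wt)
  list(label = wmean, deviance = sum(wt * (y - wmean)^2))
}

split_fun <- function(y, wt, x, parms, continuous) {
  # Goodness of each candidate split of a continuous feature: the reduction
  # in the within-node sum of squares. (The continuous = FALSE branch for
  # categorical features is omitted in this sketch.)
  n <- length(y)
  left.sum  <- cumsum(y * wt)[-n]
  left.wt   <- cumsum(wt)[-n]
  right.sum <- sum(y * wt) - left.sum
  right.wt  <- sum(wt) - left.wt
  goodness  <- left.sum^2 / left.wt + right.sum^2 / right.wt -
               sum(y * wt)^2 / sum(wt)
  list(goodness = goodness,
       direction = sign(left.sum / left.wt - right.sum / right.wt))
}

# Hypothetical usage, with one row per record pair and agreement features:
# fit <- rpart(pair_outcome ~ ., data = pairs,
#              method = list(init = init_fun, eval = eval_fun, split = split_fun))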

For more information, please contact:
Abel Dasylva (613-408-4850, abel.dasylva@statcan.gc.ca).

References

Bianchi Santiago, J., Colon Jordan, H. and Valdes, D. (2020). Record linkage of crashes with injuries and medical cost in Puerto Rico. Transportation Research Record, 2674(10), 739-748.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth.

Chen, W. (2022). Optimal feature extraction for probabilistic record linkage with model-based trees. Internal report, Statistics Canada.

Dasylva, A., and Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Japanese Journal of Statistics and Data Science. Available at https://doi.org/10.1007/s42081-022-00153-3.

Dasylva, A., and Chen, W. (2022). Probabilistic record linkage through recursive partitioning without training data. Presentation at the monthly meeting of the ONS-UNECE Machine Learning group, April 2022.

Elfeky, M., Verykios, V., Elmagarmid, A., Ghanem, T. and Huwait, A. (2003). Record linkage: A machine learning approach, a toolbox, and a digital government web service. Available at http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=504FD1F7CC64E71A806D75F14453621E?doi=10.1.1.11.1113&rep=rep1&type=pdf.

Enamorado, T., Fifield, B. and Imai, K. (2019). Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review, 113, 353-371.

Feigenbaum, J.J. (2016). A machine learning approach to Census record linking. Working paper, available at https://scholar.harvard.edu/files/jfeigenbaum/files/feigenbaum-censuslink.pdf.

Fellegi, I., and Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Quadir, T., and Bao, C. (2016). Application of Machine Learning Algorithms in G-Link. Internal Report, Statistics Canada.

Quadir, T. (2017). Automated/semi-automated Estimation of Thresholds for Weights in G-Link. Internal Report, Statistics Canada.

Statistics Canada (2022). Social Data Linkage Environment. https://www.statcan.gc.ca/en/sdle/overview.

Therneau, T., Atkinson, B. and Ripley, B. (2009). Rpart: recursive partitioning and regression trees [Computer software manual]. Available at http://CRAN.R-project.org/package=rpart.

PROJECT: Private set intersection with linkage errors

Many important questions about international trade are currently not addressed because national statistical organizations cannot easily link their corresponding micro-data due to privacy concerns and regulatory requirements. For example, one could study the use of preferential tariffs by different types of Dutch exporters in the context of the Comprehensive Economic and Trade Agreement (CETA) between Canada and the EU, if one could link Dutch exports to Canadian imports. Privacy-enhancing technologies such as secure multi-party computation can enable such a linkage, while mitigating the privacy risks (United Nations, 2023). Thus, under the appropriate governance and regulatory framework, they may be the difference between doing and not doing a study. In particular, the linkage may be based on a previously proposed three-party private set intersection protocol (Bruno, De Cubellis, De Fausti, Scannapieco and Vaccari, 2021), and the target totals may be computed without any bias from the linked data set if there is a unique identifier. However, without such an identifier, linkage errors may arise and bias the computed totals.

Progress:

A statistical methodology has been developed to assess the linkage errors and adjust the estimated totals accordingly, when the private set intersection is based on quasi-identifiers. This methodology estimates the rates of linkage error according to Dasylva and Goussanou (2022), and it adjusts the totals using a variation of the weight adjustment scheme proposed by Judson, Parker and Larsen (2013), in which the false positives are also accounted for. The methodology is described in the final report of the UNECE project on Input Privacy Preservation (UNECE, 2023), along with simulations that demonstrate the effectiveness of the proposed approach.

For more information, please contact:
Abel Dasylva (613-408-4850, abel.dasylva@statcan.gc.ca).

References

Bruno, M., De Cubellis, M., De Fausti, F., Scannapieco, M. and Vaccari, C. (2021). Privacy set intersection with analytics - An experimental protocol (PSI De Cristofaro). UNECE-IPP presentation.

Dasylva, A., and Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Japanese Journal of Statistics and Data Science. Available at https://doi.org/10.1007/s42081-022-00153-3.

Judson, D.H., Parker, J. and Larsen, M.D. (2013). Adjusting sample weights for linkage-eligibility using SUDAAN. National Center for Health Statistics. Available at https://www.cdc.gov/nchs/data/datalinkage/adjusting_sample_weights_for_linkage_eligibility_using_sudaan.pdf.

United Nations (2023). United Nations Guide on Privacy-Enhancing Technologies for Official Statistics. United Nations Committee of Experts on Big Data and Data Science for Official Statistics, New York. Available at https://unstats.un.org/bigdata.

UNECE (2023). UNECE project on input privacy preservation: Final report. United Nations Economic Commission for Europe. Available at https://statswiki.unece.org/display/hlgbas/Input+Privacy-Preservation+for+Official+Statistics+Project+outcome.

PROJECT: Variance estimation for record linkage error-rates obtained via clerical review of stratified systematic samples of linked pairs

A common method of estimating record linkage error-rates is to use a manual, clerical review process. A sample of confirmed and rejected pairs from the linkage is sent to independent clerical reviewers. The reviewers make decisions about each pair in the sample, and their decisions are compared to the outcomes from the linkage process to estimate false-match and missed-match rates. The Social Data Linkage Environment (SDLE) Methodology Unit at Statistics Canada currently uses a variant of this scheme in which the sample is drawn using a stratified systematic sampling design. This is also the method currently programmed into the clerical review tool of Statistics Canada’s Generalized System for Record Linkage (G-Link). Our unit does not currently provide estimates of design variance for our error-rate estimates. The goal of our project was to find a method of producing such estimates. This problem is interesting since no unbiased estimator exists for the design variance of the Horvitz-Thompson estimator under systematic sampling.

Progress:

In order to find a suitable estimator, we employed the methodology outlined by Wolter (1984) in a paper about variance estimation under systematic sampling. We considered a list of potential estimators, meant to be broadly representative of the variance estimators for systematic sampling available in the literature. We conducted a simulation study to evaluate the performance of these estimators. Our population of interest was the set of all pairs in a stratum from a typical SDLE linkage, together with the decisions clerical reviewers would make about those pairs. We created artificial versions of SDLE linkage strata by sampling from actual SDLE linkages, and we simulated clerical review decisions for these pairs using two different methods. One method was based on the Fellegi-Sunter record linkage model (Fellegi and Sunter, 1969), and the other was based on a decision tree trained using clerical review data (see Chen (2022) for a discussion of the use of decision trees as linkage models). Our estimators were evaluated in terms of bias, mean square error, and confidence interval coverage. Through this comparison, we were able to identify one estimator that seems to perform better than the others for our purposes: the “overlapping strata” estimator discussed in Yates (1981), page 231. In addition to our main line of investigation, we also computed the intra-class correlation coefficients for our artificial populations to compare the efficiency of systematic sampling and simple random sampling in this context, and we used our variance estimates to derive optimal sample allocations for our clerical review samples. Finally, we have incorporated variance estimates into one of the standard programs used in the SDLE production process. A detailed account of our work can be found in Loewen and Millar (2023).
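
For reference, one simple member of that family of estimators is shown below: the successive-difference estimator of the variance of a systematic sample mean (see Wolter, 1984), which is closely related to the overlapping-strata idea of treating adjacent units as strata. This is a generic textbook form, not necessarily the exact estimator adopted for SDLE production.

# Successive-difference estimator of the variance of the mean of a systematic
# sample. y: sampled values in frame order; f: the sampling fraction n/N.
v_sucdiff <- function(y, f) {
  n <- length(y)
  (1 - f) / n * sum(diff(y)^2) / (2 * (n - 1))
}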

For more information, please contact:
Goldwyn Millar (343-553-3930, goldwyn.millar@statcan.gc.ca).

References

Chen, W. (2022). Optimal feature extraction for probabilistic record linkage with model-based trees. Statistics Canada Coop student report, research supervised by Abel Dasylva, Statistics Canada, Ottawa.

Fellegi, I.P., and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Loewen, R., and Millar, G. (2023). Variance estimation for record linkage error-rates obtained via clerical review of stratified systematic samples of linked pairs. PowerPoint Presentation delivered at Methodology Seminar on May 10th, 2023, Internal Document, Statistics Canada, Ottawa.

Wolter, K. (1984). An investigation of some estimators of variance for systematic sampling. Journal of the American Statistical Association, 79, 781-790.

Yates, F. (1981). Sampling Methods for Censuses and Surveys, 4th edition, New York, Macmillan Publishing Co.

1.3  Small area estimation

PROJECT: The use of random forests in small area estimation

When domain sample sizes are small, design-consistent direct estimators of population parameters can be unstable. To improve the precision of direct estimators, the Fay-Herriot area level model is often used. It has two components: a sampling model and a linking model. The latter specifies the relationship between the population parameters of interest and auxiliary variables available at the domain level. In its original form, the Fay-Herriot model assumes a linear linking model with constant error variance. It also requires estimating the smooth design variance of direct estimators, i.e., the model expectation of their design variance. Design-based variance estimators could be considered as estimators of the smooth design variances, but they are typically unstable for small sample sizes. To solve this problem, design-based variance estimates are usually smoothed, often using a log-linear smoothing model.
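
In its basic form, the resulting predictor combines the direct estimate of each domain with a synthetic regression estimate through a shrinkage factor gamma_i = sigma_v^2 / (sigma_v^2 + psi_i), where psi_i is the smoothed sampling variance and sigma_v^2 the linking-model variance. A minimal R sketch, assuming sigma2_v has already been estimated (e.g., by REML):

# Fay-Herriot predictor: shrink each direct estimate y toward the synthetic
# (regression) estimate, with more shrinkage where the sampling variance psi
# is large. X is the matrix of domain-level auxiliary variables.
fh_predict <- function(y, X, psi, sigma2_v) {
  w <- 1 / (sigma2_v + psi)                      # GLS weights
  beta <- solve(crossprod(X, w * X), crossprod(X, w * y))
  gamma <- sigma2_v / (sigma2_v + psi)           # shrinkage factors
  gamma * y + (1 - gamma) * drop(X %*% beta)
}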

The assumptions underlying the Fay-Herriot and smoothing models are not always satisfied in practice, and it may be difficult and time-consuming to adequately correct the models. In this context, it may be desirable to have access to non-parametric methods, especially when the number of domains is large, because they depend less strongly on the validity of model assumptions. We are particularly interested in random forests for two reasons: i) they can be easily applied to the case of a mixture of categorical and continuous auxiliary variables, and ii) they produce predictions that always remain within the range of observed values. We consider a bootstrap procedure for the estimation of the mean square prediction error.

Progress:

We have developed a few non-parametric versions of the Empirical Best (EB) predictor, in which random forests replace the parametric models. We have evaluated the properties of our proposed EB predictors using real data and through simulation studies. Our results show that random forests are promising, but further studies are needed before drawing firm conclusions. This project was presented at the 2023 Colloque francophone sur les sondages in Paris and at the 2023 annual conference of the Statistical Society of Canada. In the next year, we plan to complete our empirical studies and write a paper to be submitted to a peer-reviewed statistical journal.

For more information, please contact: 
Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

PROJECT: Sample allocation under the Fay-Herriot small area model

Small area estimation procedures have gained popularity over the last two decades because of the growing appetite for more granular estimates. Surveys are usually not designed to meet all users’ needs, which is why models such as the well-known Fay-Herriot model are used to compensate for small sample sizes in some domains. Moreover, a theoretical framework exists for many small area procedures (see Rao and Molina, 2015), which helps democratize this methodology.

The main objective of this project is to determine an effective method of allocating a sample to domains such that the resulting small area estimates, under the Fay-Herriot model, have sufficient precision for the largest number of domains possible. In other words, we focus on the optimization of the sample allocation when the Fay-Herriot model is used at the estimation stage. We consider the case where domains coincide with strata, as in Longford (2006). We also consider the theoretical scenario where the Fay-Herriot model parameters are known. This allows us to compute the best predictor and the variance of its prediction error, which is known as the g1 variance term in the literature. An important difference with standard design-based sample allocation is the substitution of the sampling variance of the direct estimator with the g1 variance term. This allows us to optimize, at the allocation stage, the precision of final domain estimates (or small area estimates).
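
Since g1_i = sigma_v^2 psi_i / (sigma_v^2 + psi_i) increases with the sampling variance psi_i, a target g1 value translates directly into a minimum domain sample size. The small R sketch below illustrates this correspondence under the simplifying assumption psi_i = S_i^2 / n_i; it conveys the mechanism, not the exact allocation rule of the project.

# Smallest domain sample size for which the g1 variance term of the best
# predictor meets a target, assuming the sampling variance is psi = S2 / n.
n_for_g1 <- function(S2, sigma2_v, g1_target) {
  stopifnot(g1_target < sigma2_v)  # g1 can never exceed sigma2_v
  psi_max <- g1_target * sigma2_v / (sigma2_v - g1_target)
  ceiling(S2 / psi_max)
}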

Progress:

We proposed a new simple allocation method, which aims to reach target g1 variances for the most important domains. This method was tested using data from the Canadian Labour Force Survey and compared with the method proposed by Longford (2006) as well as more traditional sample allocation methods. The main conclusion is that sample allocation has more impact on direct survey estimates than on small area estimates. Another conclusion is that our proposed method, by design, allows us to reach the desired precision for more domains than the alternative methods considered.

This project was presented at the 2023 Statistical Society of Canada Annual Meeting in Ottawa (Bosa and Beaumont, 2023). We are planning to write a paper summarizing our findings that will be submitted to the proceedings of the conference and/or to a peer-reviewed statistical journal.

For more information, please contact:
Keven Bosa (613-863-8964, keven.bosa@statcan.gc.ca).

References

Bosa, K., and Beaumont, J.-F. (2023). How to allocate the sample to maximize benefits from small area estimation techniques? Presentation at the Statistical Society of Canada Annual Meeting, May 2023.

Longford, N.T. (2006). Sample size calculation for small-area estimation. Survey Methodology, 32, 1, 87-96. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2006001/article/9259-eng.pdf.

Rao, J.N.K., and Molina, I. (2015). Small Area Estimation (2nd edition). NJ: John Wiley & Sons, Inc.

PROJECT: Relative efficiency of area-level and unit-level small-area estimators when unit-level auxiliary data are available

Small-area estimation is widely used in many statistical agencies to produce reliable statistics through a model-based approach with either area-level or unit-level models. This theory is well documented in many leading sources such as Rao and Molina (2015). In this research, we compare the two modeling approaches in the situation where we have complete auxiliary information for all the units in the population. This follows up on the work by Hidiroglou and You (2016) and Fay (2018).

We examine the unit-level empirical best linear unbiased predictor (EBLUP) and compare it to two area-level model estimators under different scenarios for generating the population of interest. The first area-level estimator uses direct estimates from the simple design-weighted estimator, and the second uses direct estimates from the survey regression estimator, which is constructed from the auxiliary information.
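
For reference, the survey regression direct estimator of a domain mean has the familiar generalized-regression form sketched below. The inputs are hypothetical (unit-level y, X and design weights w in the domain sample, and the known population mean vector Xbar of the auxiliary variables), and for simplicity the regression coefficients are estimated within the domain sample.

# Survey regression direct estimator of a domain mean: the regression
# prediction at the population mean of x, plus a design-weighted mean of the
# residuals.
svyreg_mean <- function(y, X, w, Xbar) {
  beta <- solve(crossprod(X, w * X), crossprod(X, w * y))
  drop(Xbar %*% beta) + sum(w * (y - drop(X %*% beta))) / sum(w)
}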

Progress:

We conducted various simulations by generating populations under different linear models of the auxiliary variables. The simulations tend to support the results noted by Hidiroglou and You (2016) and Fay (2018). When the auxiliary variables in the unit-level model provide a sufficiently good linear approximation to the survey variables, the unit-level model produces very efficient small area estimates. We can also produce efficient estimates using an area-level model by first producing direct estimates with the survey regression estimator based on the same auxiliary variables. When used as inputs to the area-level model, these direct estimates produce an area-level estimator with properties very similar to those of the EBLUP from the unit-level model. There appears to be no advantage in using one estimator over the other. We also found that both estimators are more efficient than the area-level estimator based on direct estimates from the simple expansion or design-weighted estimator. We also developed theory that supports these empirical findings.

These results are the basis of a presentation by J.N.K. Rao (Rao, Estevao, Beaumont and Bosa, 2023) at the 2023 Annual Meeting of the Statistical Society of Canada at Carleton University in Ottawa.

For more information, please contact:
Victor Estevao (613‑863‑9038, victor.estevao@statcan.gc.ca).

References

Fay, R.E. (2018). Further comparisons of unit- and area-level small area estimators. In Proceedings of the Survey Research Methods Section, 2018 Joint Statistical Meetings, Vancouver, Canada.

Hidiroglou, M.A., and You, Y. (2016). Comparison of unit level and area level small area estimators. Survey Methodology, 42, 1, 41-61. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016001/article/14540-eng.pdf.

Rao, J.N.K., Estevao, V., Beaumont, J.-F. and Bosa, K. (2023). Relative efficiency of area‑level and unit‑level small‑area estimators when unit level auxiliary data is available. Presentation at the Statistical Society of Canada Annual Meeting, May 2023, Ottawa, Canada.

Rao, J.N.K., and Molina, I. (2015). Small Area Estimation (2nd edition). NJ: John Wiley & Sons, Inc.

PROJECT: Sampling variance smoothing methods for small area proportion estimation

In this project, we consider sampling variance smoothing methods, including the use of design effects and the generalized variance function (GVF), for small area estimation. In particular, we propose several methods, including the average smoothed estimator and the weighted average estimator, for sampling variance smoothing. The proposed smoothing methods can be used as a standard approach in small area estimation and simplify the smoothing procedure for model-based small area estimation.
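
As one simple illustration of the design-effect route, the sketch below smooths the direct variance estimates of domain proportions by pooling estimated design effects across domains. This is a generic version of the idea, not necessarily the exact average smoothed or weighted average estimator proposed in the paper.

# Smooth direct variance estimates v of domain proportions p (with domain
# sample sizes n) by averaging the estimated design effects across domains.
smooth_var <- function(p, n, v) {
  deff <- v / (p * (1 - p) / n)  # domain-level design effects
  mean(deff) * p * (1 - p) / n   # smoothed sampling variances
}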

Progress:

The proposed smoothing methods perform very well for small area proportion estimation. We presented this work at the Symposium 2022 conference, and a conference paper has been written and submitted (You and Hidiroglou, 2022). We completed the Labour Force Survey (LFS) application and additional simulation studies. A modified version of the paper has been submitted to the Journal of Official Statistics (JOS) and accepted for publication (You and Hidiroglou, 2023).

For more information, please contact:
Yong You (613‑863‑9263, yong.you@statcan.gc.ca).

References

You, Y., and Hidiroglou, M. (2022). Application of sampling variance smoothing methods for small area proportion estimation. Proceedings: Symposium 2022, Data Disaggregation: Building a more-representative data portrait of society, Statistics Canada, Ottawa, Canada (to appear).

You, Y., and Hidiroglou, M. (2023). Application of sampling variance smoothing methods for small area proportion estimation. Journal of Official Statistics, to appear in the December 2023 issue.

PROJECT: HB inference for small area estimation using different priors for variance components

Hierarchical Bayes (HB) modeling is very popular in small area estimation, and prior specification is very important in this approach. In this project, we study the impact of priors on variance components for small area estimation based on the HB models of You and Chapman (2006) and You (2021). In particular, we investigate the flat prior and inverse gamma priors for the variance components through a simulation study and real data analysis.
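
To illustrate the role of the prior, consider a Fay-Herriot HB model with known sampling variances: given the m area effects v_i, the full conditional of the model variance sigma_v^2 under an IG(a, b) prior is IG(a + m/2, b + sum(v_i^2)/2), and a flat prior on sigma_v^2 corresponds to the limiting case a = -1, b = 0. The sketch below draws from this conditional within a Gibbs sampler; it is a generic illustration, not the exact samplers of You and Chapman (2006) or You (2021).

# Draw sigma2_v from its full conditional given the area effects v, under an
# IG(a, b) prior; a = -1, b = 0 recovers the flat prior on sigma2_v
# (the conditional is proper provided a + m/2 > 0).
draw_sigma2_v <- function(v, a, b) {
  m <- length(v)
  1 / rgamma(1, shape = a + m / 2, rate = b + sum(v^2) / 2)
}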

Progress:

We have studied prior specifications for variance components in the HB models of You and Chapman (2006) and You (2021). We conducted a simulation study and applied the models to Labour Force Survey (LFS) data. Our results indicate that the use of an inverse gamma prior for the variance component can be very effective in the HB models. A research paper has been submitted and will be published in Statistics in Transition new series (You, 2023).

For more information, please contact:
Yong You (613‑863‑9263, yong.you@statcan.gc.ca).

References

You, Y. (2021). Small area estimation using Fay-Herriot area level model with sampling variance smoothing and modeling. Survey Methodology, 47, 2, 361-370. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2021002/article/00007-eng.pdf.

You, Y. (2023). An empirical study of hierarchical Bayes small area estimators using different priors for model variances. Statistics in Transition new series, to appear.

You, Y., and Chapman, B. (2006). Small area estimation using area level models and estimated sampling variances. Survey Methodology, 32, 1, 97-103. Paper available at https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2006001/article/9263-eng.pdf.

PROJECT: Estimation of the poverty measures for small areas under a two-fold nested error linear regression model: comparison of two methods

Demand for reliable statistics at a local area (small area) level has greatly increased in recent years. Traditional area-specific estimators based on probability samples are not adequate because of small, or even zero, sample sizes in a local area. As a result, methods based on models linking the areas are widely used. The World Bank has focused on estimating poverty measures, in particular the poverty incidence and the poverty gap, known as FGT measures (after Foster, Greer and Thorbecke, 1984), using a simulated census method called ELL (after Elbers, Lanjouw and Lanjouw, 2001), based on a one-fold nested error model for a suitable transformation of the welfare variable. Modified ELL methods leading to significant gains in efficiency over ELL have also been proposed under the one-fold model. An advantage of ELL and modified ELL methods is that distributional assumptions on the random effects in the model are not needed.
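
The FGT measures share a simple common form: FGT_alpha = (1/N) sum over units with welfare y_i below the poverty line z of ((z - y_i)/z)^alpha, where alpha = 0 gives the poverty incidence (head-count ratio) and alpha = 1 the poverty gap. A direct R transcription for a vector of welfare values:

# FGT poverty measures of Foster, Greer and Thorbecke (1984) for welfare
# values y and poverty line z: alpha = 0 is the incidence, alpha = 1 the gap.
fgt <- function(y, z, alpha = 0) {
  mean(((z - y) / z)^alpha * (y < z))
}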

In this research, we focused on two-fold random effect models involving area and subarea random effects. We extended ELL and modified ELL to two-fold nested error models to estimate poverty indicators for areas and subareas.

Progress:

We developed extensions of the ELL method and proposed two modified methods to estimate FGT poverty measures in small areas under a two-fold nested error linear regression model for unit-level data that includes area and subarea effects. Our simulation results indicated that the modified ELL estimators lead to large efficiency gains over ELL at both the area and subarea levels. Further, a modified ELL method retaining both estimated area and subarea effects in the model (called MELL2) performs significantly better in terms of mean squared error (MSE) for sampled subareas than the modified ELL retaining only estimated area effects (called MELL1). We have written a paper detailing the results of our research and submitted it to a peer-reviewed statistical journal (Sohrabi and Rao, 2023).

For more information, please contact:
Maryam Sohrabi (343-553-4529, maryam.sohrabi@statcan.gc.ca).

References

Elbers, C., Lanjouw, J.O. and Lanjouw, P. (2001). Welfare in Villages and Towns: Micro-Level Estimation of Poverty and Inequality. Unpublished manuscript, The World Bank.

Foster, J., Greer, J. and Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica: Journal of the Econometric Society, 52(3), 761-766.

Sohrabi, M., and Rao, J.N.K. (2023). Estimation of the poverty measures for small areas under a two-fold nested error linear regression model: Comparison of two methods. Draft manuscript submitted to a peer-reviewed statistical journal. arXiv:2306.04907 [stat].

PROJECT: Guiding Principles: Using the 2021 Census of Population Data to Produce Statistics on DDAP Groups of Interest

The gain in momentum of movements for Indigenous rights, racial justice, and economic equality has changed the data that must be collected. To help address the change in data needs, Statistics Canada implemented the Disaggregated Data Action Plan (DDAP) to better understand and highlight the challenges faced by population groups such as women, Indigenous peoples, racialized populations, and people living with daily limitations. As a result, more data centred on these diverse population groups, at more levels of geography, will become available for public use, with the goal of promoting fairness and inclusion in decision making. However, as survey designs change to account for the new requirements, questions have arisen about the data sources to be used, ethical implications, and the appropriateness of various sampling methods. In an effort to address this, the Census Operations DDAP Research Project was funded to help document sampling methods to consider in DDAP contexts, as well as ethical and practical considerations. The research project also focuses on the role of the 2021 Canadian Census of Population in identifying some of the targeted DDAP subgroups and provides a link between theory and practice while accounting for factors such as respondent burden and privacy.

Progress:

To fulfill the previously mentioned goals, the first outcome of the research project was a suite of tables of population counts for various DDAP subgroups of interest, based on the 2021 Census data, along with distributions of these populations cross-tabulated by socio-demographic variables such as age, gender, and province (intersectionality variables). These population tables are used to gain clarity on the sampling possibilities of various subpopulations, and to guide survey designs by allowing appropriate sampling methods to be chosen based on the subpopulation’s size.

The second outcome of the research project was the document “Guiding Principles: Using the 2021 Census of Populations Data to Produce Statistics on DDAP Groups of Interest” (Pearce, Sallier and Laperrière, 2023), consisting of three chapters. The first chapter presents the organizational context of DDAP at Statistics Canada; the second chapter covers the various existing data sources for DDAP initiatives as well as ethical considerations. Lastly, the third chapter is the result of a literature review that aimed to list sampling methods that can be used for DDAP initiatives. Indeed, to complement population size tables in determining an appropriate sampling method, population characteristics such as social connectedness, frequenting of known locations, and hiddenness also need to be considered. Chapter three therefore considers these various factors, in theory as well as in practice, and lists the pros and cons of the methods presented. It also provides concrete examples of applications of these methods at Statistics Canada, as well as a section on practical considerations. This document of guiding principles and recommendations is currently being revised for internal release; it aims to centralize the available information to promote consistency and comparability within Statistics Canada, while also providing a coherent set of guiding principles for decision makers working at all levels.

For more information, please contact:
Kenza Sallier (343-998-8623, kenza.sallier@statcan.gc.ca).

Reference

Pearce, A., Sallier, K. and Laperrière, C. (2023). Guiding Principles: Using the 2021 Census of Populations Data to Produce Statistics on DDAP Groups of Interest. Internal Document, May 2023, Statistics Canada.
