Statistical Methodology Research and Development Program Achievements, 2020/2021
3. Theory and framework

3.1 Inference from non-probability samples

There is growing interest among National Statistical Offices in producing Official Statistics from non-probability sample data, such as big data or data from volunteer web surveys. Indeed, Statistics Canada has recently conducted several online volunteer surveys, called crowdsourcing surveys, to evaluate the impacts of the COVID-19 pandemic on different aspects of the life of the Canadian population. The main motivations for using non-probability samples are their low cost, low respondent burden and quick turnaround, since they allow estimates to be produced shortly after the information needs have been identified. However, non-probability samples are well known to produce estimates that may be subject to significant selection bias. Beaumont and Rao (2021) discuss this important limitation, along with an illustration, and describe some remedies that involve integrating data from the non-probability sample with data from a probability sample. Renaud and Beaumont (2020) describe four recent research initiatives to leverage non-probability sample data.

How to obtain meaningful estimates and make valid inferences from non-probability samples is an important question that still requires research and experimentation. The following four sub-projects address this question.

SUB-PROJECT: Bias reduction of non-probability sample estimators through propensity score weighting methods with application to crowdsourcing data

The goal of this project is to study and develop propensity score weighting methods that combine data from a non-probability sample that contains variables of interest and auxiliary variables with data from a probability sample that contains the same auxiliary variables (or a subset of them). Propensity score weighting involves modelling the probability of participation in the non-probability sample.

Progress:

We first considered a logistic model (see Chen, Li and Wu, 2019) and developed a variable selection procedure based on a modified Akaike Information Criterion (AIC). Our modified AIC properly accounts for the data structure and the possibly complex probability sampling design. We also developed a simple method of forming homogeneous post-strata. Moreover, we extended the Classification and Regression Trees (CART) algorithm (Breiman, Friedman, Stone and Olshen, 1984) to this data structure, and developed a pruning procedure that again properly accounts for the probability sampling design. Our new algorithm is called nppCART. Some details about the pruning procedure are given in Beaumont and Chu (2020). We also developed a bootstrap variance estimator that reflects two sources of variability: the probability sampling design and the participation model. Our methods are currently being evaluated using crowdsourcing data and Labour Force Survey (LFS) data. A draft paper has been written and is close to completion.
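As a rough illustration of the propensity score weighting idea (the logistic participation model of Chen, Li and Wu, 2019, not nppCART or the variance estimator described above), the sketch below fits the participation model by maximizing the pseudo-log-likelihood that combines the non-probability sample with the design-weighted probability sample, then inverse-weights the non-probability sample. All data are synthetic and the gradient-ascent fitting routine is a simplification for illustration only.

```python
import math
import random

random.seed(42)

# Synthetic data: non-probability sample A observes y and x;
# probability sample B observes x and a design weight d.
xA = [[1.0, random.gauss(0.0, 1.0)] for _ in range(200)]
yA = [2.0 + 1.5 * x[1] + random.gauss(0.0, 0.5) for x in xA]
xB = [[1.0, random.gauss(0.3, 1.0)] for _ in range(150)]
dB = [random.uniform(5.0, 15.0) for _ in range(150)]

def fit_propensity(xA, xB, dB, steps=2000, lr=0.05):
    """Maximize the pseudo-log-likelihood
       l(theta) = sum_{i in A} x_i'theta - sum_{i in B} d_i log(1 + exp(x_i'theta))
    by gradient ascent; this fits a logistic participation model."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = [sum(x[j] for x in xA) for j in range(2)]
        for x, d in zip(xB, dB):
            eta = theta[0] * x[0] + theta[1] * x[1]
            p = 1.0 / (1.0 + math.exp(-eta))
            grad[0] -= d * p * x[0]
            grad[1] -= d * p * x[1]
        theta = [t + lr * g / len(xA) for t, g in zip(theta, grad)]
    return theta

theta = fit_propensity(xA, xB, dB)

# Inverse-propensity (Hajek-type) estimate of the population mean of y.
pi = [1.0 / (1.0 + math.exp(-(theta[0] * x[0] + theta[1] * x[1]))) for x in xA]
y_hat = sum(y / p for y, p in zip(yA, pi)) / sum(1.0 / p for p in pi)
```

In practice the estimated propensities would also feed the post-stratification and variance estimation steps described above; the sketch stops at the point estimate.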

SUB-PROJECT: An approximate Bayesian approach to improving probability sample estimators using a supplementary non-probability sample

A Bayesian method of combining data from a probability sample and a non-probability sample was recently published in the Journal of Official Statistics (Sakshaug, Wisniowski, Ruiz and Blom, 2019). These authors dealt with the estimation of model parameters when the dependent variable y and the vector of explanatory variables x are observed in both samples. They used the non-probability sample to determine the prior mean for the model parameters, under the assumption that the probability sampling design is ignorable. The goal of this project is to extend their method to the estimation of finite population parameters, under a possibly non-ignorable probability sampling design, and to determine whether the resulting model-based estimates are more efficient than standard survey-weighted estimates.

Progress:

We have made the extension to the estimation of a finite population mean under a non-ignorable probability sampling design. We have also conducted preliminary simulation experiments that are summarized in You (2021c). We plan to complete the simulation study and write a paper to be presented at the 2021 Statistics Canada Symposium.
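The exact approach is the one summarized in You (2021c); as a stylized illustration of the general idea only, the snippet below shows a conjugate normal-normal combination in which the non-probability sample supplies the prior mean and the probability sample supplies the likelihood through its survey-weighted estimate and variance. All numbers are hypothetical.

```python
# Hypothetical inputs: prior (from the non-probability sample) and
# survey-weighted estimate with its design-based variance.
prior_mean, prior_var = 50.2, 4.0
svy_mean, svy_var = 48.1, 1.0

# Normal-normal conjugate update: precisions add, and the posterior mean
# is the precision-weighted average of the two sources.
post_prec = 1.0 / prior_var + 1.0 / svy_var
post_mean = (prior_mean / prior_var + svy_mean / svy_var) / post_prec
post_var = 1.0 / post_prec
```

The posterior mean lies between the two inputs and its variance is smaller than either one, which is the sense in which borrowing strength from the non-probability sample can improve on the survey-weighted estimate when the prior is not badly misspecified.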

SUB-PROJECT: Mean square error estimation for non-probability sample estimates using small area estimation techniques

Administrative data and other non-probability sources are increasingly being considered by National Statistical Offices as a means of directly obtaining the information sought on a population. However, estimates from these non-probability sources are often subject to a number of errors, and there is a need to develop indicators of their accuracy.

Progress:

We developed an estimator of the conditional Mean Square Error (MSE) of non-probability sample estimates applicable when a probability sample with the same variables is also available. Our conditional MSE estimator is obtained by making use of Small Area Estimation techniques. We have evaluated the method using data from the Longitudinal Social Development Data program (LSDDP). The LSDDP file is constructed from administrative files and allows for the estimation of labour force characteristics. The Labour Force Survey (LFS) is used as the probability sample. We have started writing an article that summarizes the findings.
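The estimator developed here relies on Small Area Estimation techniques and is detailed in the forthcoming article; a much simpler building block, shown below purely for intuition, conveys the idea of benchmarking a non-probability estimate against an unbiased probability-sample estimate. The function name and inputs are hypothetical.

```python
def mse_hat(theta_np, theta_prob, var_prob):
    """Crude MSE estimate for a non-probability estimate theta_np, using the
    probability-sample estimate theta_prob (with variance var_prob) as an
    unbiased benchmark. If the two estimates are independent,
    E[(theta_np - theta_prob)^2] = MSE(theta_np) + var_prob,
    so subtracting var_prob (floored at zero) estimates MSE(theta_np)."""
    return max((theta_np - theta_prob) ** 2 - var_prob, 0.0)
```

For example, a non-probability estimate of 52.0 benchmarked against a probability-sample estimate of 50.0 with variance 1.5 yields an estimated MSE of 2.5. The actual project refines this kind of benchmarking through Small Area Estimation models to stabilize the estimator and condition on the observed sample.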

SUB-PROJECT: Statistical data integration using a prediction approach

We consider the problem where a non-probability sample is available that contains a vector of auxiliary variables, x, for each sample unit. We assume that this non-probability sample covers a significant portion of the population. A probability sample is also available that contains x as well as the variable of interest y for each sample unit. The indicator of participation in the non-probability sample is available in the probability sample. This scenario is relevant to a survey on postal traffic conducted by La Poste in France. Alain Dessertaine proposed a predictor for that scenario. We developed variance estimators, including a bootstrap variance estimator, for evaluating the quality of the proposed predictor. The details are given in an internal draft report.
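The specific predictor and its variance estimators are given in the internal draft report; the following is only a generic sketch of a prediction approach to this scenario, with hypothetical data. A working model for y given x is fitted on the probability-sample units flagged as members of the non-probability sample, predictions are summed over the covered part of the population, and the uncovered remainder is estimated with the design-weighted probability sample.

```python
import random

random.seed(1)

# Probability sample: x, y, design weight d and the indicator z of
# membership in the non-probability sample (hypothetical data).
n = 100
xs = [random.uniform(0.0, 10.0) for _ in range(n)]
ys = [3.0 + 2.0 * x + random.gauss(0.0, 1.0) for x in xs]
ds = [20.0] * n                   # each sampled unit represents 20 units
zs = [random.random() < 0.7 for _ in range(n)]

# Non-probability sample: x only, covering a large part of the population.
x_np = [random.uniform(0.0, 10.0) for _ in range(1400)]

# Fit a working model y ~ x by least squares on the probability-sample
# units that belong to the non-probability sample.
px = [x for x, z in zip(xs, zs) if z]
py = [y for y, z in zip(ys, zs) if z]
mx, my = sum(px) / len(px), sum(py) / len(py)
b1 = (sum((u - mx) * (v - my) for u, v in zip(px, py))
      / sum((u - mx) ** 2 for u in px))
b0 = my - b1 * mx

# Predict the total over the covered part, and estimate the uncovered
# remainder with the design-weighted probability sample.
t_covered = sum(b0 + b1 * x for x in x_np)
t_rest = sum(d * y for d, y, z in zip(ds, ys, zs) if not z)
t_hat = t_covered + t_rest
```

The quality of such a predictor depends on the working model, which is why the project's variance estimators, including the bootstrap one, are needed to evaluate it.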

Progress:

The collaboration with La Poste, the Toulouse School of Economics and the University of Besançon continued, with the objective of writing a joint paper. The results of this project will first be presented in an invited session at the Colloque francophone sur les sondages in the fall of 2021.

For more information, please contact:
Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

References

Breiman, L., Friedman, J.H., Stone, C.J. and Olshen, R.A. (1984). Classification and regression trees. CRC Press.

Chen, Y., Li, P. and Wu, C. (2019). Doubly robust inference with non-probability survey samples. Journal of the American Statistical Association (published online).

Sakshaug, J.W., Wisniowski, A., Ruiz, D.A.P. and Blom, A.G. (2019). Supplementing small probability samples with nonprobability samples: A Bayesian approach. Journal of Official Statistics, 35, 653-681.

3.2 Machine Learning framework

Statistics Canada uses machine learning techniques to solve large-scale data problems. At the same time, national statistical institutes are facing unprecedented pressure to demonstrate to citizens, businesses and users that they are trustworthy and transparent institutions. Statistics Canada has developed a framework for the responsible use of machine learning techniques, including guidelines for constructing and implementing ethically and methodologically sound processes. This framework is built on four themes (respect for people, respect for data, sound application and sound methods) and a number of attributes for each theme. It is aligned with the Directive on Automated Decision-Making and its Algorithmic Impact Assessment Tool, developed by the Treasury Board Secretariat (2020).

Progress:

Following testing and a review by management, the framework for responsible machine learning processes was formally adopted by Statistics Canada in July 2020. A checklist accompanying the guidelines has been finalized and is used to help assess responsible machine learning processes. A process has been developed for evaluating machine learning applications that will go into production; it includes an independent review by experts, adequate documentation and completion of the project-related checklist. The project methodology, as well as the review carried out by the experts, is presented to a scientific review committee, which issues recommendations on the proposed methodology. In the past year, seven projects were evaluated using this framework. Next steps include developing a dashboard for recording reviews, developing a template for the documentation to be provided to reviewers, and conducting a new literature review on explainability and interpretability to stay current in this area.

For more information, please contact:
Keven Bosa (613-863-8964, keven.bosa@statcan.gc.ca).

3.3 Quality indicator research

In order to provide users with quality indicators for programs that combine administrative data sources, the Quality Secretariat has initiated work to develop a composite indicator that combines quality indicators related to different stages of data processing (record linkage, imputation, geocoding, etc.) into a single indicator. The objective is to give a global view of the quality of an estimate by taking into account several factors that can introduce errors. A first program will publish these indicators along with the estimates in the summer of 2021. This project will be presented at the European Establishment Statistics Workshop in September 2021 (Beaulieu and Lebrasseur, 2021).
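The Quality Secretariat's actual combination rule is not described here; purely as a hypothetical illustration of a composite indicator, one simple option is a weighted geometric mean of per-stage quality scores, so that poor quality at any single processing stage pulls the composite down. The function and scores below are invented for the example.

```python
def composite_indicator(scores, weights=None):
    """Combine per-stage quality scores (each in [0, 1], 1 = best) into a
    single indicator via a weighted geometric mean. Illustrative rule only:
    a low score at any stage strongly lowers the composite."""
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)
    composite = 1.0
    for s, w in zip(scores, weights):
        composite *= s ** (w / total)
    return composite

# Hypothetical scores for record linkage, imputation and geocoding.
overall = composite_indicator([0.95, 0.80, 0.90])
```

A multiplicative rule like this is one of several possible designs; an additive weighted average would instead let a strong stage compensate for a weak one.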

For more information, please contact:
Martin Beaulieu (613-854-2406, martin-j.beaulieu@statcan.gc.ca).
