### Statistical Methodology Research and Development Program 2014-2015 Achievements

# Research Projects

## Archived Content

Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived. Please "contact us" to request a format other than those available.

- Research, development and consultation in SRID
- Edit and imputation
- Sampling and estimation
- Small area estimation
- Data analysis research (DAR)
- Data collection
- Disclosure control resource centre
- Record linkage research

For more information on the program as a whole, contact:

**Mike Hidiroglou** (613-951-0251, mike.hidiroglou@statcan.gc.ca).

## Research, development and consultation in SRID

The Statistical Research and Innovation Division (SRID) was created within the Methodology Branch on June 21, 2006. SRID is responsible for researching, developing, promoting, monitoring and guiding the adoption of new and innovative techniques in statistical methodology in support of Statistics Canada’s statistical programs. Its mandate also includes providing technical leadership, advice and guidance to employees elsewhere in the Methodology Branch. This assistance takes the form of advice on methodological problems that arise in existing projects or during the development of new projects. SRID is also jointly involved with other employees in research projects sponsored by the Methodology Research and Development Program.

SRID was involved in many research, development and consultation projects in 2014-2015. The division grew in the fall of 2014, when the Data Analysis Resource Centre and the Generalized Systems Group joined SRID. Details of the progress made are incorporated into the review of the research topics reported later.

Yong You completed specifications for the development of small area estimation (SAE) under Hierarchical Bayes (HB) methods with unknown sampling variances. The specifications were reviewed by Mike Hidiroglou and Victor Estevao, and implemented and tested by Victor Estevao in the SAE prototype. Two new functions were added to carry out this estimation under a matched model and an unmatched log-linear model.

Susana Rubin-Bleuer developed specifications for unit-level modeling assuming non-negligible sampling fractions in all areas. The specifications were reviewed by Mike Hidiroglou and Victor Estevao, and implemented and tested by Victor Estevao in the SAE prototype. A new macro parameter and a new set of inputs were added to the programs to provide this functionality. The validation of the inputs was modified to handle either negligible or non-negligible sampling fractions across all areas.

Susana Rubin-Bleuer was involved in several SAE research and consultation projects. She continued testing a variety of estimators and used the Small Area Estimation system to produce estimates for 212 small domains required by the System of National Accounts, using data from the Research and Development in Canadian Industry survey. She presented the results and questions about the methodology to the Advisory Committee on Statistical Methods in October 2014.

Harold Mantel investigated, for the Economic Analysis Division, methods for completing datasets created by merging different administrative files when not all records match. A general approach (sequential regression imputation) was recommended, and work is now underway to implement and evaluate the methodology for the National Accounts Longitudinal Microfile (NALMF), an enterprise-level dataset created by merging data from PD7, T2, T4 and GST files.

Jean-François Beaumont was involved in a number of research projects related to estimation methods. The main topics of interest were:

- Adaptive collection designs;
- Design and weighting issues related to nonresponse follow-up;
- Estimation methods in the presence of influential units;
- Variance estimation in two-phase sampling;
- Bootstrap methods, including the mean bootstrap;
- The analysis of survey data.

Mike Hidiroglou collaborated on a joint paper, submitted to Survey Methodology, with Jay Kim of Iowa State University and Christian Nambeux. He also revised a paper, co-authored with Yong You and submitted to Survey Methodology, entitled “Comparison of Unit Level and Area Level Small Area Estimators”.

In addition to participating in the research activities of the Methodology Research and Development Program (MRDP) as research project managers and active researchers, SRID staff was involved in the following activities:

- Staff advised members of other methodology divisions on the following technical matters:
    - Jean-François Beaumont and Mike Hidiroglou gave technical advice on design and estimation issues for the new Job Vacancy Survey.
    - Jean-François Beaumont was consulted a number of times on variance estimation issues arising from the use of the System for Estimation of Variance due to Non-response and Imputation (SEVANI) in the Integrated Business Statistics Program (IBSP).
    - SRID staff participated actively in the business and household surveys technical committees.
    - Harold Mantel worked with members of the other methodology divisions on how best to produce confidence intervals for proportions. A review of methods and some empirical comparisons under stratified simple random sampling (SRS) and cluster sampling led to a tentative recommendation to use the logit transform or the modified Clopper-Pearson method. The work will be presented to the Advisory Committee on Statistical Methods (ACSM) in May for comments and advice.
    - Harold Mantel advised Labour Force Survey (LFS) methodologists on how to deal with lower than expected yields from the new LFS design.

- SRID staff gave presentations to the Advisory Committee on Statistical Methods:
    - Mike Hidiroglou and Victor Estevao presented, at the fall meeting, a simulation study comparing a number of small area estimation procedures to direct estimation.
    - Jean-François Beaumont, Wesley Yung, Mike Hidiroglou, Elisabeth Neusy and David Haziza presented a simulation study comparing different sampling designs for subsampling nonrespondents for follow-up.
    - Cynthia Bocci, Jean-François Beaumont and David Haziza presented an adaptive data collection procedure for the prioritization of telephone call attempts.

- SRID consulted with members of the Methods and Standards Committee, as well as with a number of other Statistics Canada managers, in determining the priorities for the research program.
- Staff continued activities in different Branch committees such as the Branch Learning and Development Committee, and the Methodology Branch Informatics Committee. In particular, they have participated actively in finding and discussing papers of the month.
- Jean-François Beaumont presented a weight smoothing approach at a conference held in Montréal.
- Jean-François Beaumont presented on outlier-robust estimation at a conference in Québec in honour of Louis-Paul Rivest.
- Jean-François Beaumont presented work on adaptive data collection, co-authored with Cynthia Bocci and David Haziza, at the Colloque francophone sur les sondages in Dijon.
- Jean-François Beaumont chaired the Scientific Committee of the 2014 Methodology Symposium.
- SRID continued to actively support the journal Survey Methodology. Mike Hidiroglou has been its editor since January 2010. Five people within SRID contribute to the journal: in addition to the editor, one is an associate editor and three are assistant editors.
- Pierre Lavallée and Jean-François Beaumont wrote a paper on weighting issues to appear in a SAGE handbook of Survey Methodology.
- Mike Hidiroglou presented an invited paper co-authored with Victor Estevao entitled “A comparison of small area and direct estimators via simulation” at SAE 2014 that took place in Poznan, Poland.
- Susana Rubin-Bleuer attended the 2014 annual conference of the Statistical Society of Canada (SSC). As president of the SSC Survey Methods Section (SMS) for 2013-2014, she organized and chaired a workshop on Applied Analysis of Survey Data, an SMS-sponsored session on Big Data and a ‘President Invited Session’ on Surveys and Society and, jointly with Mary Thompson, a memorial session for David Binder. She also wrote the participation report (topics discussed, main ideas brought back, etc.) on behalf of all Statistics Canada conference participants.
- Susana Rubin-Bleuer successfully proposed and organized an invited session on ‘Small Area Estimation for Business and economic data’ sponsored by the International Association of Survey Statisticians.

For further information, contact:

Mike Hidiroglou (613-951-0251, mike.hidiroglou@statcan.gc.ca).

## Edit and imputation

Inconsistent responses and missing values occur in all surveys. These nonsampling errors may generate bias and increase the variance of standard survey estimates developed under the assumption of complete response. It has become particularly important to address this issue in recent years in light of declining response rates. The main goals of this project are:

- to develop new methods of preventing or handling missing data so as to reduce the effect of the nonresponse error;
- to study properties of existing methods under different scenarios to better understand how and when to use them;
- to develop computer tools implementing new or existing methods that could be beneficial to statistical programs.

This year, there were three subprojects under this topic. The progress is described below for each of these subprojects.

**Progress:**

**Estimation of the variance of an estimator of the total of a product of two variables that may have been imputed**

Variance estimation for an estimator of the total of a product of two variables, both of which may be imputed, is not covered in the literature. Current methods require that one of the two variables have complete response for the variance to be estimated correctly.

Examples:

- This problem mainly arises when estimating a total when the domain indicator and the variable of interest may be imputed.
- The problem is also encountered with totals for a variable of interest when proportions by domain are collected (rather than the variable of interest itself). This problem exists in certain surveys that are part of the Integrated Business Statistics Program (IBSP).

The purpose of this project was to find one or more methods to estimate the variance of an estimator of the total of a product of two variables that may have been imputed. During the year, part of the theory was developed and the first basic computer simulations were run. However, there are no concrete results to date. The theory, simulations and documentation should be finalized in the future. Depending on the outcomes of this project, SEVANI (System for Estimation of Variance due to Non-response and Imputation) could be modified to include estimation of a total composed of two variables.

**Quantification of variance components in the presence of imputed data**

The goal of this project is to quantify, on a relative basis, the different components of the variance of the estimator obtained after imputation. More specifically, we want to compare the total variance of this estimator with its sampling variance.

In the past year, we produced preliminary results for auxiliary value imputation and mean imputation under simple sample designs such as Bernoulli sampling and simple random sampling without replacement. Our findings, especially those relating to auxiliary value imputation, are in line with those obtained by Beaumont, Haziza and Bocci (2011). It also appears that, for mean imputation, the proportion of the total variance due to sampling increases as the size of the domain of interest decreases. This finding will require further, more detailed verification.

In the next year, we plan to continue this work (particularly for more complex survey designs and imputation methods) and to formulate assumptions that allow us to predict the asymptotic behaviour of the variance components, especially when the domain is small. We will build on the work of Isaki and Fuller (1982). Simulation studies will be presented to support the theory developed.

The results of our study might help SEVANI users determine whether certain components can be omitted when estimating variance. We will make a series of recommendations on the most efficient method of estimating variance based on such parameters as the relative size of the domain in the population of respondents, the sampling fraction, the response rate, the imputation method and the quality of the imputation model.

**Tool for comparing various outlier detection methods**

We continued our work on an outlier detection tool developed in SAS (version 9.2 and later). The tool currently incorporates eight of the detection methods most used at Statistics Canada. The goal was to bring all of these methods into the same environment, to reduce development time for users and to make it possible to compare methods in order to choose the best one for the survey data at hand. It is an easy-to-use tool with a graphics module for viewing results.

Development of a new detection method using time series data has been completed, but the method has not yet been incorporated in the tool and needs to be adapted to make it compatible. It will allow outliers to be detected using the time dimension of the data, which is not possible with the methods currently available.

Additional surveys used the tool to analyze new outlier detection options (e.g., the Survey of Employment, Payrolls and Hours and the Longitudinal and International Study of Adults). The International Accounts and Trade Division also used the outlier detection tool for a freight study. The Consumer Price Index (CPI) Division used three methods, in production, to detect basket weight outliers. In addition, following development of an outlier detection method for the traveller accommodation sub-index, two other sub-indexes will use the same detection method for their collected data.

Implementation of MM estimation was analyzed, and the option of adding other robust methods has not been ruled out. The goal would be to have a method with a higher breakdown point than the M estimation method already available.
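To illustrate what a higher breakdown point buys in outlier detection, the following minimal Python sketch compares classical z-scores (mean and standard deviation) with robust scores based on the median and the median absolute deviation (MAD). It is an illustration only: the function names, the example data and the cut-off of 3 are assumptions, and the median/MAD rule is not claimed to be one of the methods in the tool or to be M or MM estimation.

```python
import statistics

def classical_scores(values):
    """z-scores based on the mean and standard deviation (low breakdown point)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def robust_scores(values):
    """Scores based on the median and the MAD (high breakdown point).

    The constant 1.4826 rescales the MAD so that it is comparable to a
    standard deviation when the bulk of the data is roughly normal.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust scores are undefined")
    return [(v - med) / (1.4826 * mad) for v in values]

def flag(scores, cutoff=3.0):
    """Indices of the values whose absolute score exceeds the cut-off."""
    return [i for i, s in enumerate(scores) if abs(s) > cutoff]

# Hypothetical data: seven ordinary values and two gross errors.
data = [10, 11, 9, 10, 12, 10, 11, 500, 480]
print("classical flags:", flag(classical_scores(data)))
print("robust flags:   ", flag(robust_scores(data)))
```

In this example the two gross errors inflate the mean and standard deviation enough to mask themselves, whereas the median and MAD are unaffected by them.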

A new format (PDF) for saving graphics was added in addition to HTML. A few minor adjustments were made to the functionalities on the recommendation of users.

For further information, contact:

Jean-François Beaumont (613-863-9024, jean-francois.beaumont@statcan.gc.ca).

### References

Beaumont, J.-F., Haziza, D. and Bocci, C. (2011). On variance estimation under auxiliary value imputation in sample surveys. *Statistica Sinica*, 21, 515-537.

Isaki, C.T., and Fuller, W.A. (1982). Survey design under the regression superpopulation model. *Journal of the American Statistical Association*, 77, 89-96.

## Sampling and Estimation

This progress report covers seven research projects on sampling and estimation:

- Estimation of variances and disclosure control using the bootstrap method.
- Adapting index estimation and aggregation formulae to use auxiliary information.
- On the cornerstones of coordinated bootstrap.
- On the quality of the approximation to the variance of the Generalized REGression (GREG) estimator obtained by linearization.
- On weighting respondents when a follow-up subsample of non-respondents is taken.
- Confidence intervals.
- Covariance estimation with partially overlapping samples.

**Progress:**

**1. Estimation of variances and disclosure control using the bootstrap method**

This project builds on the 2013-2014 research project on variance estimation using the bootstrap method and the study of disclosure control. Three criteria are examined in creating bootstrap weights for surveys with complex designs:

- Produce a “good” estimate of the variances from weights.
- Ensure that the method used is compatible with the survey design chosen (for example, there might be reluctance to use the Rao-Wu method if sampling fractions are high, or the Preston method if the design is not stratified simple random sampling).
- Protect against disclosure of the variables underlying the survey design when disseminating microdata.

Four bootstrap methods were considered: Rao-Wu-Yue, Preston, Poisson and the generalized method. Simulations made it possible to examine the accuracy of these four methods in the context of a stratified multi-stage design. This research project began in 2013-2014 and, during that time, was the subject of two presentations to the Household Survey Methods Division (HSMD) Technical Committee (December 2012, December 2013). In 2014-2015, work continued on assessing the methods, and one new estimation method that caught the team’s attention was examined: the method presented in the Kim and Wu article that appeared in the June 2013 issue of *Survey Methodology*. Lastly, disclosure risk was also examined during the period covered.
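As a rough illustration of how bootstrap weights of this kind are built, the sketch below implements the basic Rao-Wu rescaling for a stratified design: in each stratum with n_h sampled primary sampling units (PSUs), n_h − 1 PSUs are drawn with replacement and each design weight is multiplied by n_h / (n_h − 1) times the number of times its PSU was drawn. It treats each record as its own PSU and ignores sampling fractions, which is the reason for the caution about high sampling fractions noted above. The function names and example data are assumptions, not the project's code.

```python
import random
from collections import defaultdict

def rao_wu_bootstrap_weights(strata, weights, n_reps, seed=0):
    """Return n_reps lists of Rao-Wu rescaled bootstrap weights.

    strata  : stratum label of each sampled PSU
    weights : design weight of each sampled PSU
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, h in enumerate(strata):
        by_stratum[h].append(idx)

    replicates = []
    for _rep in range(n_reps):
        rep = [0.0] * len(weights)
        for members in by_stratum.values():
            n_h = len(members)
            if n_h < 2:
                raise ValueError("each stratum needs at least two PSUs")
            # Draw n_h - 1 PSUs with replacement; count the hits per PSU.
            hits = defaultdict(int)
            for _draw in range(n_h - 1):
                hits[rng.choice(members)] += 1
            factor = n_h / (n_h - 1)
            for idx in members:
                rep[idx] = weights[idx] * factor * hits[idx]
        replicates.append(rep)
    return replicates

def bootstrap_variance(y, weights, replicates):
    """Mean squared deviation of the replicate totals from the full-sample total."""
    full = sum(w * v for w, v in zip(weights, y))
    totals = [sum(w * v for w, v in zip(rep, y)) for rep in replicates]
    return sum((t - full) ** 2 for t in totals) / len(totals)

# Hypothetical sample: two strata with 4 and 3 PSUs.
strata = ["a"] * 4 + ["b"] * 3
weights = [10.0] * 4 + [5.0] * 3
y = [3.0, 5.0, 4.0, 6.0, 10.0, 12.0, 9.0]
reps = rao_wu_bootstrap_weights(strata, weights, n_reps=500)
print("bootstrap variance of the estimated total:", bootstrap_variance(y, weights, reps))
```

Disseminating such replicate weights on a microdata file can reveal stratum and cluster membership, since units in the same PSU share the same multipliers; this is the kind of disclosure concern raised in the third criterion.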

The results obtained for the four methods were examined more closely; in particular, the team devoted special attention to the unexpected performance of the Poisson method despite the fact that it does not include aspects of the survey design when calculating the variance. To this end, intra-cluster correlations were examined and the theoretical development of the method (in the context of the study) was discussed without, however, reaching any conclusive results. In addition, the generalized method was tested on real survey data, specifically, data from the Canadian Community Health Survey (CCHS). The Kim and Wu method was examined in greater detail without, however, leading to conclusive results. The goal was to adapt the method proposed in the article to our methods (Rao-Wu-Yue, Poisson and generalized) to reduce the number of replications required to estimate the variance. The theoretical challenge to adapt these formulas proved greater than anticipated. We were unable to reproduce meaningful variance estimates in simulation using this method. Lastly, the risk of disclosure was estimated by a clustering method and that analysis allowed us to quantify the proposed hypotheses. Documentation of the results has begun and a report will be forthcoming covering the entire project.

**2. Adapting index estimation and aggregation formulae to use auxiliary information**

Index estimation and aggregation works on two levels. In the first level referred to as the elementary index, price variations are typically aggregated using unweighted geometric means (Jevons index). In the second level, the various elementary indices are aggregated with basket weight information in a Laspeyres-type formula. In this project, we studied how we can improve the representativeness of the final estimation by introducing auxiliary information when we go from the first to the second level of aggregation. This project is a follow-up of work started in 2013. The goal of this second step is to further refine the simulation results to incorporate the effect of “chaining”, a process frequently applied during a basket update and to estimate the impact of using auxiliary information on the bias of the index computation.
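The two levels of aggregation described above can be sketched in a few lines of Python. The example below is a minimal illustration, not the CPI production method: the function names and the numbers are hypothetical, and the weighted geometric mean is shown only as one possible way of introducing auxiliary information (such as market shares) at the elementary level.

```python
import math

def jevons_index(price_relatives):
    """Elementary index: unweighted geometric mean of price relatives
    (current price divided by base price)."""
    logs = [math.log(r) for r in price_relatives]
    return math.exp(sum(logs) / len(logs))

def weighted_geometric_index(price_relatives, shares):
    """Geometric mean weighted by auxiliary shares (e.g., market shares)."""
    total = sum(shares)
    weighted_logs = sum(s * math.log(r) for r, s in zip(price_relatives, shares))
    return math.exp(weighted_logs / total)

def laspeyres_type_aggregate(elementary_indices, basket_weights):
    """Second level: weighted arithmetic mean of elementary indices,
    with basket weights normalized to sum to one."""
    total = sum(basket_weights)
    return sum(w * i for i, w in zip(elementary_indices, basket_weights)) / total

# Hypothetical price relatives for two elementary aggregates.
food = jevons_index([1.02, 1.05, 0.99])
fuel = weighted_geometric_index([1.10, 1.20], shares=[0.7, 0.3])
print("all-items index:", laspeyres_type_aggregate([food, fuel], basket_weights=[80, 20]))
```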

Simulated data were created based on the price distributions of Canada’s Consumer Price Index (CPI). Generic programs were developed so that they can be reused with future updates to the dataset if need be. Several versions of the CPI estimates were computed, one for each version of the basket weights. This allowed the computation of the Laspeyres and Paasche indexes, and from them the Fisher index, which serves as the reference for comparing the proposed method with the current method. The stability of the market shares used as weights by type of store was also assessed. While market shares evolve constantly for some commodity classes, the study revealed that the distribution of sales by type of store is stable enough to be updated only at the same time as the basket weights, i.e., every two years.
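For readers less familiar with the reference index, the textbook relationship between the three indexes can be sketched as follows: the Laspeyres index prices the base-period basket, the Paasche index prices the current-period basket, and the Fisher index is their geometric mean. The prices and quantities below are hypothetical and the functions are a minimal sketch, not the programs developed for the study.

```python
import math

def laspeyres(p0, p1, q0):
    """Cost of the base-period basket at current prices relative to base prices."""
    return sum(a * q for a, q in zip(p1, q0)) / sum(a * q for a, q in zip(p0, q0))

def paasche(p0, p1, q1):
    """Cost of the current-period basket at current prices relative to base prices."""
    return sum(a * q for a, q in zip(p1, q1)) / sum(a * q for a, q in zip(p0, q1))

def fisher(p0, p1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche indexes."""
    return math.sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))

# Hypothetical two-commodity example: the first price doubles and
# consumers shift toward the second commodity.
p0, p1 = [1.0, 2.0], [2.0, 2.0]
q0, q1 = [10.0, 5.0], [5.0, 10.0]
print(laspeyres(p0, p1, q0), paasche(p0, p1, q1), fisher(p0, p1, q0, q1))
```

Computing the Paasche index requires current-period quantities, which is why different versions of the basket weights are needed before the Fisher index can serve as a benchmark.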

The objective of the research was to determine whether using auxiliary information in a new aggregation structure for the CPI leads to more representative estimates, and how this auxiliary information should be introduced into the aggregation structure. Results thus far show that it does lead to more representative estimates. In the short term, using a weighted geometric mean at the lowest level of aggregation and creating new elementary aggregates by type of store lead to similar results. Future work will verify whether this similarity persists in the long term. The research will also assess whether the same method can be used when, within an elementary aggregate, scanner data are available for some stores (i.e., much more information is available) while classic field collection continues for others. A comprehensive paper covering both phases of the project will be written in the coming months; work on it has already begun with an update of the literature review. The objective is to submit this paper to a relevant refereed journal.

**3. On the cornerstones of coordinated bootstrap**

The project consists in developing a theoretical framework for coordinated bootstrap. The rationale for the project is a lack of rigorous methodology for coordinating bootstrap weights although there are some ad hoc methods. The idea is not necessarily to obtain a better coordination method but rather to illustrate in general how coordination can function. The theoretical method can then be compared to a survey’s current method.

The goal of the research is to find a theoretical justification for using coordinated bootstrap weights. The Berger and Priam (2010) covariance estimator was adapted to the Beaumont and Patak (2012) generalized bootstrap. A Rao-Wu form of the estimator was derived. Simulation work was started (programming only), as was a draft article. The project will continue in 2015 (40 days this time). The goal is to complete the simulation studies and a first draft of the article. If the findings are conclusive, the method could perhaps be used for any survey that produces results for several cycles (for example, to estimate the difference between two cycles).

**4. On the quality of the approximation to the variance of the GREG estimator obtained by linearization**

A workable algebraic expression of the exact variance with respect to a sample design of the generalized regression estimator of a total, GREG, is difficult (if not impossible) to obtain. As a result, one rather finds in the literature, e.g., Särndal, Swensson and Wretman (1992), an expression for the approximation of that variance obtained using linearization.

Practitioners know that the exact variance of the GREG estimator is valid regardless of whether or not the underlying linear model agrees with the data: one says that GREG draws inspiration from the model but does not depend on it – that is, the GREG is model assisted, but not model dependent. Of course, if the choice of the underlying linear model is ill-advised in light of the data, then that (exact) variance is expected to be larger than what it would have been had the model been properly chosen. Thus, given data, one could compare two competing model assisted estimators to see which one has the best properties for its variance estimator. In reality, however, one compares an approximation to the variance obtained using Taylor linearization whenever the exact variance cannot be readily obtained. This is the approach used in Hidiroglou and Slanta (2002) to compare the Narain-Horvitz-Thompson and Hajek estimators (both of which are instances of the GREG estimator) under Poisson sampling.

For the result of such comparisons to be valid, one needs to make (at least implicitly) the assumption that the approximation to the variance used is valid too, regardless of whether the underlying model is well-specified or not. Simulation studies we have performed have shown this to not be the case: the quality of the approximation to the variance of the Hajek estimator obtained using linearization can be very poor when the linear model underlying the choice of the Hajek estimator under Poisson sampling is at odds with the actual data. This fact is not known by practitioners and seems not to have been established in the literature on model assisted inference.
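The kind of check involved can be sketched with a small simulation. The Python code below draws repeated Poisson samples from a fixed population, computes the Hajek estimator of the total each time, and compares the Monte Carlo variance with the usual linearization approximation, the sum over the population of (1 − π_k)/π_k · (y_k − Ȳ)². The population, the inclusion probabilities and the function names are assumptions chosen for illustration (here y is roughly proportional to the size measure, so the constant-mean model behind the Hajek estimator fits poorly); this is not the simulation design used in the study, and no particular outcome is claimed.

```python
import random

def hajek_total(y_s, pi_s, N):
    """Hajek estimator of a total: N times the weighted sample mean."""
    denom = sum(1.0 / p for p in pi_s)
    return N * sum(v / p for v, p in zip(y_s, pi_s)) / denom

def linearized_variance(y, pi):
    """Taylor approximation to the variance of the Hajek total under
    Poisson sampling, computed from the whole population."""
    mean = sum(y) / len(y)
    return sum((1 - p) / p * (v - mean) ** 2 for v, p in zip(y, pi))

def monte_carlo_variance(y, pi, n_sims=2000, seed=1):
    """Empirical variance of the Hajek total over repeated Poisson samples."""
    rng = random.Random(seed)
    N = len(y)
    estimates = []
    for _ in range(n_sims):
        sample = [k for k in range(N) if rng.random() < pi[k]]
        if not sample:
            continue  # the estimator is undefined for an empty sample
        estimates.append(
            hajek_total([y[k] for k in sample], [pi[k] for k in sample], N)
        )
    m = sum(estimates) / len(estimates)
    return sum((e - m) ** 2 for e in estimates) / (len(estimates) - 1)

# Hypothetical population: inclusion probabilities proportional to a size
# measure x, with an expected sample size of 20 out of 100 units.
rng = random.Random(42)
x = [rng.uniform(1, 10) for _ in range(100)]
y = [3 * xi + rng.gauss(0, 1) for xi in x]
pi = [20 * xi / sum(x) for xi in x]
print("linearization:", linearized_variance(y, pi))
print("Monte Carlo:  ", monte_carlo_variance(y, pi))
```

Changing the relationship between y and the inclusion probabilities, or the expected sample size, shows how the agreement between the two figures varies from one survey situation to another.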

This research is to further investigate the importance of model fit for accurate variance estimation and try to tie conclusively (i.e., theoretically) the quality of the approximation to the mis-specification of the linear model underlying the GREG estimator.

The following progress has been achieved:

- A literature review was conducted, showing that very little is said about the quality of the approximation; one exception being the book by Kirk Wolter (Wolter, 2007) which contains words of warning to practitioners.
- Simulations were performed and have shown that survey situations exist in which the approximation of the variance of the Hajek estimator of a total can be very poor despite the relatively large sample size used.
- Fellow colleagues were asked about these findings and it was established that the potentially poor quality of the approximation of the variance obtained by linearization is not a widely known phenomenon.
- Expert-colleagues were consulted to help determine how best to show theoretically whether the approximation holds reasonably well or not in a given survey context.
- Some algebraic derivations were attempted and completed, showing the conditions under which the approximation is or is not reliable.
- The results obtained were presented at the conference of the Statistical Society of Canada (SSC) 2014 (Toronto, May 2014) in an attempt to further raise awareness of practitioners with respect to this issue.
- A six-page report was completed and translated, to be submitted to the Proceedings of the SSC once the institutional review has been successfully completed.
- A lengthier report (about 20 pages) has been produced and peer-reviewed; it is an expanded version of the Proceedings paper providing more extensive details about the work done.

**5. On weighting respondents when a follow-up subsample of non-respondents is taken**

Nonresponse is frequent in surveys and is expected to lead to biased estimates. A useful way to control nonresponse bias is to follow up a random subsample of nonrespondents after a certain point in time during data collection. Nonresponse bias can be eliminated through a proper weighting strategy, assuming that all the units selected in the subsample respond. Selecting a subsample of nonrespondents is sometimes done in practice because it can be less costly than following up with all the nonrespondents.

Nonresponse bias cannot be completely eliminated in practice because it is unlikely that all the units selected in the subsample will respond. However, nonresponse in the follow-up subsample can be treated using standard techniques such as nonresponse weighting or imputation.
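The basic two-phase weighting idea can be sketched as follows, under simplifying assumptions that are ours rather than the project's: nonrespondents are subsampled with a single known rate, every unit selected for follow-up responds, and there are no late respondents. Early respondents keep their design weight, followed-up units have their design weight divided by the subsampling rate, and nonrespondents not selected for follow-up get a weight of zero. The status labels and function name are hypothetical.

```python
def followup_weights(design_weights, status, subsampling_rate):
    """Final weights when a subsample of nonrespondents is followed up.

    status values (hypothetical labels):
      "respondent"   : responded before the subsample was selected
      "followup"     : nonrespondent selected for follow-up (assumed to respond)
      "not_selected" : nonrespondent not selected for follow-up
    """
    if not 0 < subsampling_rate <= 1:
        raise ValueError("subsampling_rate must be in (0, 1]")
    final = []
    for d, s in zip(design_weights, status):
        if s == "respondent":
            final.append(d)
        elif s == "followup":
            # Each followed-up unit also represents the nonrespondents
            # that were not selected for follow-up.
            final.append(d / subsampling_rate)
        elif s == "not_selected":
            final.append(0.0)
        else:
            raise ValueError(f"unknown status: {s}")
    return final

# Hypothetical sample of 10 units with design weight 5:
# 6 early respondents, 4 nonrespondents, half of whom are followed up.
status = ["respondent"] * 6 + ["followup"] * 2 + ["not_selected"] * 2
weights = followup_weights([5.0] * 10, status, subsampling_rate=0.5)
print(weights, sum(weights))
```

Late respondents do not fit any of these three statuses cleanly, since they respond after the subsample is drawn whether or not they were selected.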

The goal of this project was to focus on the weighting issue resulting from the occurrence of late respondents. Late respondents are those who eventually respond to the survey, but after the point in time at which the subsample was selected. They may or may not have been selected in the follow-up sample. We developed and investigated nonresponse weighting strategies that handle late respondents rather than simply discarding them.

In May 2014, we presented to the Advisory Committee on Statistical Methods (ACSM) two weighting approaches that lead to consistent estimators of totals, provided that consistent estimators of the unknown probabilities of being a late respondent are used. The simulation study suggested that the strategy used in the American Community Survey in the United States is the most promising. The simulation study was later expanded to include some strategies proposed by the discussant at the ACSM, and responses to the discussant’s recommendations were prepared. In addition, a draft version of an article intended for a peer-reviewed publication was written.

**6. Confidence intervals**

The purpose of the project is to examine the different methods of constructing confidence intervals for proportions in the context of complex sample designs. A large number of methods have been proposed for non-survey data, of which a few have been adapted for complex survey data. There are some empirical studies, but mostly restricted to the non-complex-survey case. Korn and Graubard (1998) and Liu and Kott (2009) did limited simulations of complex surveys. Others have examined output of various methods using specific survey data.

In 2013-2014, we ran simulations for simple random sampling. The purpose of the project in 2014-2015 was to extend those simulations to more complex sample designs. The primary methods under consideration were the bootstrap percentile method, the modified Clopper-Pearson interval and the logit-transform interval. In particular, the bootstrap percentile method has not previously been studied in simulations of complex surveys. The performance of these methods was studied and compared under stratified sampling and two-stage sampling.
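Of the three methods, the logit-transform interval is the simplest to write down: a normal-theory interval is built on the logit scale, using the delta-method standard error se(p̂) / (p̂(1 − p̂)), and the end points are transformed back to the proportion scale. The Python sketch below is a minimal illustration; the function name is hypothetical, and it assumes that a design-based variance estimate for the estimated proportion (for example, from bootstrap weights) is supplied by the caller.

```python
import math
from statistics import NormalDist

def logit_interval(p_hat, var_p_hat, level=0.95):
    """Logit-transform confidence interval for a proportion.

    p_hat     : estimated proportion, strictly between 0 and 1
    var_p_hat : design-based variance estimate of p_hat
    """
    if not 0 < p_hat < 1:
        raise ValueError("p_hat must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    logit = math.log(p_hat / (1 - p_hat))
    # Delta method: the standard error on the logit scale.
    se_logit = math.sqrt(var_p_hat) / (p_hat * (1 - p_hat))
    lower, upper = logit - z * se_logit, logit + z * se_logit
    # Back-transform to the proportion scale.
    return 1 / (1 + math.exp(-lower)), 1 / (1 + math.exp(-upper))

# Hypothetical small proportion with a standard error of 0.01.
print(logit_interval(0.02, 0.01 ** 2))
```

Unlike the symmetric interval p̂ ± z·se(p̂), the back-transformed interval always stays inside (0, 1) and is asymmetric around small proportions.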

A document (Neusy and Mantel, 2014) proposing guidelines for confidence intervals for proportions was written and then revised based on comments from several methodologists, including those in the Data Analysis Resource Centre (DARC). The document is intended for use by the Quality Secretariat and perhaps DARC. It will also provide guidance for G-Tab, and for surveys that produce confidence intervals for proportions.

Simulations were conducted using simple random sampling without replacement (SRSWOR), stratified SRSWOR and two-stage cluster sampling to explore the performance of the logit-transform, bootstrap percentile and modified Clopper-Pearson methods. Some of the results are summarized in a paper (Mantel and Neusy, 2015) that will be presented to the ACSM in May.

Results of the simulations should be recorded more completely in a working paper. This could include running more simulations to cover other scenarios. The Guidelines document will be finalized and distributed. A more extensive document on methods written in 2013, “Confidence Intervals for Small Proportions Estimated from Complex Surveys” (Mantel, 2013) will be revised.

**7. Covariance estimation with partially overlapping samples**

Covariance estimation is an important component in the estimation of the variance of a trend. The aim of this project is to develop, implement, and test design-based estimators of the covariance of a pair of estimators for a variety of sampling designs, including those with overlapping samples, based on a working paper by Gambino (2013). In particular, overlapping two-phase sampling designs where both phases follow a Poisson design are of interest.

The main objective of this project was to extend the theory in the working paper by Gambino (2013) to more complicated sampling designs and estimators. The theory has been extended to the estimation of the covariance of expansion and calibration estimators under general multi-phase and multi-stage sampling designs with overlap. An application has been made to the special case of a pair of Hajek estimators under two-phase overlapping Poisson sampling. A Statistical Analysis System (SAS) program that calculates the covariance has been written for this particular application. A working paper describing these results, together with proofs, has been completed. A simulation study and testing with real data were planned, but not completed.

For further information, contact:

Pierre Lavallée (613-850-0532, pierre.lavallee@statcan.gc.ca).

### References

Armstrong, J., and St-Jean, H. (1994). Generalized regression estimation for a two-phase sample of tax records. *Survey Methodology*, 20, 2, 97-105.

Beaumont, J.-F., and Patak, Z. (2012). On the generalized bootstrap for sample surveys with special attention to Poisson sampling. *International Statistical Review*, 80, 1, 127-148.

Berger, Y.G., and Priam, R. (2010). Estimation des corrélations entre les estimations transversales issues d’enquêtes répétées – Une application à la variance du changement. Recueil : Symposium 2010, Statistiques sociales : interaction entre recensements, enquêtes et données administratives, 268-277.

Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in survey sampling. *Journal of the American Statistical Association*, 87, 376-382.

Elliott, D.R., O’Neill, J.R. and Sanderson, R. (2012). Stochastic and Sampling Approaches to the Choice of Elementary Aggregate Formula. Discussion Paper. Office for National Statistics.

Gambino, J. (2013). Covariances and correlations in finite population sampling. *HSMD Working Paper* HSMD-2013-007E.

Hidiroglou, M.A., and Slanta, J. (2002). On the Use of Auxiliary Data in Poisson Sampling. Proceedings of the Joint Statistical Meetings.

International Labour Office (ILO), IMF, OECD, Eurostat, UNECE, and the World Bank (2004). Consumer Price Index Manual: Theory and Practice. Geneva, Switzerland: International Labour Office.

Kim, J.K., and Wu, C. (2013). Estimation parcimonieuse et efficace de la variance par rééchantillonnage pour les enquêtes complexes. *Techniques d’enquête*, 39, 1, 105-137.

Korn, E.L., and Graubard, B.I. (1998). Confidence intervals for proportions with small expected number of positive counts estimated from survey data. *Survey Methodology*, 24, 193-201.

Kroger, H., Särndal, C.-E. and Teikari, I. (1999). Poisson mixture sampling: A family of designs for coordinated selection using permanent random numbers. *Survey Methodology*, 25, 1, 3-11.

Liu, Y.K., and Kott, P.S. (2009). Evaluating alternative one-sided coverage intervals for a proportion. *Journal of Official Statistics*, 25, 569-588.

Lu, W.W., and Sitter, R.R. (2008). Disclosure risk and replication-based variance estimation. *Statistica Sinica*, 18, 1669-1687.

Mantel, H. (2013). Confidence Intervals for Small Proportions Estimated from Complex Surveys. Draft document.

Preston, J. (2009). Bootstrap rééchelonné pour l’échantillonnage stratifié à plusieurs degrés. *Techniques d’enquête*, 35, 2, 247-254.

Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992). Some recent work on resampling methods for complex surveys. *Survey Methodology*, 18, 2, 209-217.

Särndal, C.-E., Swensson, B. and Wretman, J. (1989). The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. *Biometrika*, 76, 527-537.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). *Model Assisted Survey Sampling*. Springer, New York.

Statistics Canada (1995). The Consumer Price Index Reference Paper Update based on 1992 Expenditures. Catalogue 62-553. Ottawa, ON: Statistics Canada.

Toninelli, D., and Beaulieu, M. (2013). Using frame information to enhance the quality of a price index. Proceedings: Symposium 2013, Producing reliable estimates from imperfect frames, 287-293.

Toninelli, D., Patak, Z. and Beaulieu, M. (2013). Enhancing the quality of price index estimates combining updated weights, a more representative sample design and a different aggregation structure. In *JSM Proceedings, Statistical Computing Section*. Alexandria, VA: American Statistical Association. 2038-2052.

Wolter, K. (2007). *Introduction to Variance Estimation*. Springer. 2nd Edition.

## Small area estimation

**Small Area Estimation for the Labour Force Survey**

The purpose of this research project was to evaluate whether small area estimation (SAE) could improve the estimation of the unemployment rate (UR) for the Labour Force Survey (LFS). The core LFS sample was enlarged by 16,600 additional households (the top-up sample) to improve the 3-month moving average UR estimates in Employment Insurance Economic Regions (EIERs). We took advantage of the top-up sample to model direct UR estimates in EIERs obtained from smaller samples, and compared the resulting small area estimates with the direct UR estimates obtained from the larger sample that includes the top-up sample. We simulated 1,000 reduced LFS samples by repeatedly sub-sampling from past LFS response data, then weighting, calibrating, and computing the corresponding bootstrap variance estimates. The direct estimates from these smaller samples were modelled with the area-level Fay-Herriot model, using Employment Insurance (EI) claims (with a 2-month lag) divided by census population counts as the auxiliary variable. We compared the design-based expectation of the resulting Empirical Best Linear Unbiased Predictors (EBLUPs) from the reduced samples with the current direct estimates from the larger LFS sample. In a preliminary trial, we fitted the model for a subset of EIERs in Ontario and Quebec over a number of months in 2011 and 2012. The design-based relative root mean squared error (RRMSE) was approximated from the sum, across the subsamples, of squared differences between the model estimates and the direct estimates from the larger sample. Preliminary results suggest that the small area estimates of the number of unemployed are quite close to the direct estimates obtained from the larger LFS sample, and that the model significantly improved the quality of the subsample estimates.
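
The RRMSE approximation can be sketched in Python as follows. The text only states that squared differences are summed across subsamples; dividing by the number of subsamples, taking the square root and dividing by the reference estimate are our assumptions about the normalization, and the function name and data are hypothetical.

```python
import math

def approx_rrmse(model_estimates, reference_estimate):
    """Approximate relative RMSE of model estimates over repeated subsamples.

    model_estimates:    one model-based estimate per simulated subsample.
    reference_estimate: direct estimate from the larger sample, used here
                        as a stand-in for the true value (an assumption).
    """
    if not model_estimates or reference_estimate == 0:
        raise ValueError("need at least one estimate and a non-zero reference")
    sum_sq = sum((m - reference_estimate) ** 2 for m in model_estimates)
    mse = sum_sq / len(model_estimates)
    return math.sqrt(mse) / abs(reference_estimate)

# Illustrative unemployment rates (in percent) for one area.
rrmse = approx_rrmse([7.1, 6.8, 7.4, 7.0], reference_estimate=7.0)
print(f"Approximate RRMSE: {rrmse:.4f}")
```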

**Extensions of the Pseudo-EBLUP estimator with application to SEPH**

Previously, we had conducted a design-based comparison of various design-based and model-based cross-sectional small area estimators for possible application to the Survey of Employment, Payroll and Hours (SEPH) (Rubin-Bleuer and Godbout, 2006; Rubin-Bleuer, Godbout and Morin, 2007), using a synthetic population that follows the characteristics of the SEPH population. This period, using the same synthetic population, we ran the unit-level model with the Statistics Canada Small Area Estimation System to ensure that the previous results were reproducible. We also developed formulas for the model-based estimator of the Mean Squared Error (MSE) of the pseudo-EBLUP estimator of a weighted mean when the sampling rate is non-negligible. We wrote a paper with these empirical and theoretical results for possible publication in a peer-reviewed journal (Rubin-Bleuer, Godbout and Jang, 2014). In addition, we compared the design-based expectation of the model-based MSE estimator with the corresponding design-based MSE for some industries and some months. Preliminary results showed that they were not similar.

**Implementation of SAE System for a business survey with application to RDCI**

The survey of Research and Development in Canadian Industry (RDCI) uses administrative data together with a sample of 2,000 “firms” or collection entities. RDCI is required to produce a fully imputed micro data file, and to produce estimates for 212 NAICS groups for the System of National Accounts (SNA). In IBSP, the current objectives and sample size of the RDCI do not support either of these. In this project, we conducted a feasibility study on the production of estimates for 212 small domains using small area estimation techniques. We investigated a variety of models, as well as techniques to deal with outliers, and used the Small Area Estimation system to obtain the estimates. A report detailing the methodology and discussing the results was written. We presented the results and questions about the methodology to the Advisory Committee in survey Methodology in October 2014 (Rubin-Bleuer, Julien and Pelletier, 2014).

**Positive variance estimators for the Fay-Herriot model**

The Restricted Maximum Likelihood (REML) method is generally used to estimate the variance of the random area effect under the Fay-Herriot small area model in order to obtain the EBLUP of a small area mean. When the REML estimate is zero, the weight of the direct sample estimator is zero and the EBLUP becomes a synthetic estimator. However, most practitioners are reluctant to use synthetic estimators for small area means, since these ignore the survey-based information and are often biased. This feature gave rise to a series of adjusted likelihood, or adjusted density maximization (ADM), variance estimation methods that always yield positive estimates. Some of these ADM estimators have a large bias, which affects the estimation of the Mean Squared Error (MSE) of the EBLUP. Previously, we had proposed the MIX variance estimator and studied some of its theoretical and finite sample properties (Rubin-Bleuer and You, 2012). In this period, we included new ADM variance estimators in our empirical comparison and added more performance measures to the study. In addition, we found that when the method of scoring was used to maximize the likelihood, the new ADM estimators yielded zero estimates in a proportion similar to REML, contrary to the theory. We solved this operational problem by writing and running a Statistical Analysis System (SAS) program that uses a grid maximization algorithm. A paper with the new results was written and submitted to *Survey Methodology* (Rubin-Bleuer and You, 2015).
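
The effect of a zero variance estimate on the EBLUP can be seen from the usual composite form of the Fay-Herriot predictor. The Python sketch below is a minimal illustration for a single area; it assumes that the synthetic (regression) estimate and the sampling variance are given, and the function name and numbers are hypothetical.

```python
def fay_herriot_predictor(direct, synthetic, sampling_var, model_var):
    """Composite predictor gamma * direct + (1 - gamma) * synthetic.

    gamma = model_var / (model_var + sampling_var) is the weight given
    to the direct estimator; model_var is the estimated variance of the
    random area effect.
    """
    if sampling_var <= 0 or model_var < 0:
        raise ValueError("sampling_var must be > 0 and model_var >= 0")
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma

# Illustrative values only.
zero_case = fay_herriot_predictor(8.0, 6.0, sampling_var=1.0, model_var=0.0)
positive_case = fay_herriot_predictor(8.0, 6.0, sampling_var=1.0, model_var=1.0)
print(zero_case)      # all weight on the synthetic estimate
print(positive_case)  # weight shared between direct and synthetic
```

With a strictly positive variance estimate, the direct estimator always receives some weight, which is the motivation for the positive variance estimators studied in this project.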

**Study of prior distributions on model variance in small area estimation using area level models**

In this project, we investigated the impact of the prior distribution on the model variance in hierarchical Bayes inference for small area estimation. We estimated the model variance through a simulation study based on different priors. We also investigated the estimates of small area parameters obtained under different priors on the model variance. Our results show that the Jeffreys reference prior performs best among all the priors considered. The bias, or overestimation, of the model variance is substantially reduced when the Jeffreys prior is used in the model. We also applied the Fay-Herriot model to a small Canadian Community Health Survey (CCHS) health data set to compare the hierarchical Bayes estimates based on the different priors on the model variance. The Jeffreys prior leads to the best model fit and the smallest coefficients of variation for the model-based estimates. Based on the simulation study and the real data analysis, we recommend the Jeffreys prior for the model variance in the Fay-Herriot model under hierarchical Bayes inference, particularly when the true model variance is relatively small and the number of small areas is also small. A research paper based on this project has been completed (You, 2015).

**Comparison and evaluation of unit level and area level models**

Area level models and unit level models are generally used in small area estimation to obtain efficient estimates for small areas. In this project, we considered confidence interval estimation based on the estimates from area level and unit level models. In particular, we were interested in the performance of the area level and unit level estimators under model misspecification. We developed direct estimators under probability proportional to size (PPS) sampling for the area level model and then obtained the corresponding area level model-based estimators. We considered the EBLUP and pseudo-EBLUP estimators for the unit level models. Our results show that the unit level model is more efficient than the area level model, and that both the EBLUP and pseudo-EBLUP estimators perform well when the model is correctly specified. However, under model misspecification, the pseudo-EBLUP estimators, which account for the sampling weights, perform much better than the EBLUP, particularly under an informative sampling design. A paper based on this project has been revised and submitted to a journal for publication (Hidiroglou and You, 2015).

**Small area unemployment rate estimation using the Fay-Herriot model for the Labour Force Survey**

The Canadian Labour Force Survey (LFS) produces monthly estimates of the unemployment rate at national and provincial levels. The LFS also releases unemployment estimates for sub-provincial areas such as Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs) across Canada. However, for some sub-provincial areas, the direct estimates are not reliable because the sample size is quite small. Small area estimation in the LFS concerns the estimation of unemployment rates for local areas such as CMAs and CAs using small area models. In this project, we applied the Fay-Herriot model to the direct estimates and obtained model-based unemployment rate estimates. The monthly Employment Insurance (EI) beneficiary rate at the CMA/CA level was used as an auxiliary covariate in the model. We also presented some methods for smoothing the sampling variances. We applied the Fay-Herriot model to the 2013 data and compared the direct and model-based estimates in terms of point estimates and coefficients of variation (CVs). Our results show that the cross-sectional Fay-Herriot model can provide efficient model-based small area unemployment rate estimates. A paper based on this project was written (You and Hidiroglou, 2014).

**Development of Prototype for Small-Area Estimation: 2014-2015**

A methodology was proposed for unit-level modeling assuming non-negligible sampling fractions in all areas (Rubin-Bleuer, 2014). This was reviewed, documented, implemented and tested in the SAE prototype. A new parameter and a new set of inputs were added to the programs to carry out this functionality. The validation of the inputs was modified to handle either negligible or non-negligible sampling fractions across all areas. A methodology was also proposed for area-level modeling under Hierarchical Bayes estimation with unknown sampling variances. This was reviewed, documented, implemented and tested in the SAE prototype. Two new functions were added to carry out this estimation under a matched model and an unmatched log-linear model. Consultation and support were provided to internal and external clients on the use of the SAE prototype (Estevao, Hidiroglou and You, 2015).

**Small area estimation with unit-level models for an informative survey design**

The goal of the project is to develop and study a simple small area estimation method, based on unit-level models, that produces reliable estimates when the survey design is informative. Such models usually rely on the very strong assumption that the survey design is not informative. This project addresses certain shortcomings that arise when that assumption does not hold. The method consists of using the EBLUP (Henderson, 1975) or the pseudo-EBLUP of You and Rao (2002), but adding survey design variables to the model as explanatory variables, such as the design weights or, for a design with selection probabilities proportional to size, the size measure. This simple approach is compared to that of Pfeffermann and Sverchkov (2007). The project examines an initial scenario corresponding to a single-stage stratified design where the strata are small. It will then study a second scenario corresponding to a two-stage design where the small areas are the primary sampling units. Simulations for the first scenario showed that fitting the augmented model yielded important gains in the precision of point estimates under informative designs, in terms of both the bias and the root mean square error. An article on the first scenario was submitted to *Survey Methodology* at the end of the 2011-2012 fiscal year and was accepted subject to revisions. During the 2014-2015 fiscal year, it was finalized (Verret, Rao and Hidiroglou, 2015). During the year, we also started developing the theory and planning the simulations for the second scenario.

For further information, contact:

Susana Rubin-Bleuer (613-863-9230, susana.rubin-bleuer@statcan.gc.ca).

### References

Henderson, C.R. (1975). Best linear unbiased estimation and prediction under a selection model. *Biometrics*, 31, 423-447.

Pfeffermann, D., and Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. *Journal of the American Statistical Association*, 102, 480, 1427-1439.

Rubin-Bleuer, S., and Godbout, S. (2006). Specifications for the coding of small area estimators, their design-based Monte Carlo mean squared errors and the estimators of the model-based mean squared errors of the pseudo-EBLUP estimator under heteroscedastic errors. *BSMD, Internal document.*

Rubin-Bleuer, S., and You, Y. (2012). A positive variance estimator for the Fay-Herriot small area model. SRID-2012-009E, Statistical Research and Innovation Division, Statistics Canada.

Rubin-Bleuer, S., Godbout, S. and Morin, Y. (2007). Evaluation of small domain estimators for the Survey of Employment, Payroll and Hours. *Proceedings of the Survey Methods Section of the Statistical Society of Canada*, 2007.

You, Y., and Rao, J.N.K. (2002). A pseudo-empirical best linear unbiased prediction approach to small area estimation using survey weights. *The Canadian Journal of Statistics*, 30, 3, 431-439.

## Data analysis research (DAR)

*The resources assigned to Data Analysis Research are used for conducting research on current analysis-related methodological problems that have been identified by analysts and methodologists; the resources are also used for research on problems that are anticipated to be of strategic importance in the foreseeable future. This report presents work on three projects.*

**Understanding Multiple Testing and False Detection Rates: Applications to Data Analysis.**

Analysts and methodologists routinely perform several independent significance tests, each at significance level α, without adjusting the testing procedure for multiple comparisons. This leads to a very high Type I error rate for the *family* of all the comparisons being tested. A very simple solution is applying the Bonferroni procedure. However, the Bonferroni procedure is not very popular because it is too conservative.
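
The size of the problem is easy to quantify for independent tests. The Python sketch below computes the family-wise Type I error rate and the Bonferroni-adjusted level; the function names are ours, and the values of α and of the number of tests are illustrative.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false rejection among m independent
    tests of true null hypotheses, each carried out at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / m

alpha, m = 0.05, 20
unadjusted = familywise_error_rate(alpha, m)
adjusted = familywise_error_rate(bonferroni_level(alpha, m), m)
print(f"Unadjusted: {unadjusted:.3f}, with Bonferroni: {adjusted:.3f}")
```

With 20 independent tests at the 5% level, the chance of at least one false rejection is well over one half, while the Bonferroni adjustment brings it back just below 5% at the cost of a much stricter per-test level.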

An alternative method which controls the false discovery rate (FDR) was proposed by Benjamini and Hochberg (1995). Thissen, Steinberg and Kuang (2002) provide a simple procedure for the implementation of the Benjamini-Hochberg method. So far, very little work has been done on the applicability of the FDR method to testing of multiple comparisons with survey data. This work can be used in many areas, in particular big data analysis.
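
For reference, the Benjamini-Hochberg step-up rule can be sketched in a few lines of Python. This is a minimal illustration for a list of p-values, not the implementation of Thissen, Steinberg and Kuang (2002); it ignores survey design considerations, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p_vals, q=0.05)
print(sum(decisions), "hypotheses rejected")
```

Because the threshold grows with the rank of the p-value, the procedure rejects at least as many hypotheses as the Bonferroni procedure applied at the same level.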

This project will use a simulation study to investigate the applicability of the FDR method in the context of survey data analysis. Work has been done on completing a review of the literature on this topic. Future work will include several simulation studies and a note.

**Pair-wise Estimator**

Survey pairs are created by comparing every respondent with every other respondent (without repeats) using some arbitrary function. Therefore, for a survey of size n there are (n(n-1)/2) pairs. For each pair the similarity measure based on an arbitrary distance function is calculated. Subsequently the population or overall mean of the distance across pairs is calculated. This work looks at the mean similarity for the whole population, for strata and for different domains.
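
The construction of pairs can be illustrated with a short Python sketch. It computes the unweighted mean of an arbitrary distance function over all n(n-1)/2 pairs; survey weights, strata and domains are deliberately ignored here, and the function name and data are hypothetical.

```python
from itertools import combinations

def mean_pairwise_distance(values, distance):
    """Mean of distance(a, b) over all unordered pairs of respondents.

    Returns the mean and the number of pairs, which is n(n-1)/2.
    """
    pairs = list(combinations(values, 2))
    if not pairs:
        raise ValueError("at least two respondents are needed")
    total = sum(distance(a, b) for a, b in pairs)
    return total / len(pairs), len(pairs)

# Illustrative data: ages of five respondents, absolute difference as distance.
ages = [23, 35, 35, 51, 62]
mean_dist, n_pairs = mean_pairwise_distance(ages, lambda a, b: abs(a - b))
print(n_pairs, mean_dist)
```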

There has been a review of asymptotic techniques and variance calculations in design based estimation. Work will continue on the variance of the pair-wise estimator for both simple random sampling and stratified sampling plans.

For further information, contact:

Karla Fox (613-851-8556, karla.fox@statcan.gc.ca).

### References

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society, Series B (Methodological)*, 57, 1, 289-300.

Thissen, D., Steinberg, L. and Kuang, D. (2002). Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false positive rate in multiple comparisons. *Journal of Educational and Behavioral Statistics*, 27, 1, 77-83.

## Data collection

The objective of the data collection research portfolio is to increase knowledge and to put in place cost-efficient, high-quality processes. The 2014-15 collection research activities cover projects related to the corporate risk of declining response rates in household surveys, to active collection management initiatives, and to preparing for the future multimode environment.

**Prototype of embedded experimental testing tool for collection strategies**

Statistical agencies conduct experimental or embedded designs in sample surveys to test the effects of various modifications in the survey process. Van den Brakel and Rensen (2005) proposed design-based analyses which have been implemented in Blaise’s X-Tool. Using this program, one can test hypotheses about differences between design-based estimates of finite population parameters observed under different survey approaches. Additional theoretical developments have been made by Van den Brakel since then. With the arrival of ICOS, and many new collection strategies, a new tool for analysis is required.

**Progress:**

During the reference period, we continued to increase our understanding of the theory of embedded designs and the functionalities of X-Tool. We were able to produce SAS code that can replicate X-Tool estimates for various scenarios. Simulation studies were conducted to further understand the link between the survey design and the variance estimator of the test statistic used by X-Tool. Additional simulations were also done to better understand the use of weights in van den Brakel’s formulas (design weights vs. final weights). A thorough review of the concept of power was also undertaken to better understand its role in experimental designs. Finally, we met with Dr. van den Brakel during the 2015 Methodology Symposium. We discussed the project with him, as well as the possibility of collaborating with him on the development of a workshop on embedded experiments for the 2016 Methodology Symposium.

**Profile of respondents to the Field 8 electronic collection test**

Field 8 (Social, health and labour statistics) conducted a test survey to evaluate electronic collection when households or persons are initially contacted by regular mail. Two subsamples were tested: an initial sample consisting of dwelling addresses and a second consisting of persons. The project’s objectives were to measure the difference in the participation and response rates between the two subsamples. Since the sample frame used contained a great deal of auxiliary information (from the NHS), it is possible to analyze the profiles of the participating and non-participating units, respondents and nonrespondents, etc. The goal of this research project is to conduct a detailed analysis of the different profiles and thus better understand the dynamics of electronic collection in household surveys.

**Progress:**

A first set of analyses was completed in order to address the following two questions:

- What types of households are more likely to log in, when receiving a letter addressed to the occupant?
- What types of individuals are more likely to log in, when receiving a letter addressed to them specifically?

Results were obtained mostly through logistic regressions with independent variables coming from several sources (National Household Survey (NHS), Household Survey Frame (HSF), Socio-economic File (SEF)) connected to the frame. Segmentation models were also explored. Conclusions drawn from these analyses helped better understand the typical profile of internet respondents and non-respondents, and this information can be used for the planning of future internet collection surveys. A report detailing the analyses was completed and published in methodology documentation (MEDOC).

A second set of analyses was done on sampled units that resulted in post-office returns. Given that there was a non-negligible portion of these, an analysis of the profile of these units was completed using methods similar to those used in the earlier work. Post-office returns for the targeted respondents sample are related to the mobility of people (people having moved since the last Census), while those for the dwelling-based sample relate more to the effectiveness of the survey frame in providing an accurate existing address for a given dwelling. Given that the proportion of post-office returns was so low for the dwelling-based sample, only the targeted respondents sample was examined.

Several presentations were made about this project, including presentations given to the Household Survey Methods Division (HSMD) Collection Research group, as well as at the Methodology Symposium and as a Methodology seminar.

**Intelligent Management Collection Strategy**

In recent years, a new active collection management strategy was developed in the context of the Integrated Business Statistics Program (IBSP). This strategy consists of periodically producing estimates and quality indicators, and deriving Measure of Impact (MI) scores that predict the impact of each unit on the quality indicators, in order to prioritize the units for telephone follow-up.

Even though follow-up efforts are focused on the units with the greatest potential impact on the quality of the estimates, they should ideally also take into account the probability that each follow-up action will be successful, which the current strategy does not do. Therefore, we would like to extend the methodology by exploring the benefits of incorporating response propensities. The first step would be to develop response propensity models using historical and frame information as well as collection paradata. These response probabilities would then be incorporated into the calculation of the MI scores that are used to prioritize units. Simulations would then be run to evaluate the efficiency gains obtained from the addition of these response probabilities.
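
One simple way to picture the idea is to weight each MI score by the estimated response propensity. The Python sketch below is purely hypothetical: the multiplicative form, the field names `mi_score` and `propensity`, and the values are our assumptions for illustration, not the method implemented in IBSP.

```python
def prioritize(units):
    """Order units by MI score multiplied by response propensity
    (an assumed, illustrative way of combining the two quantities)."""
    return sorted(units, key=lambda u: u["mi_score"] * u["propensity"], reverse=True)

# Hypothetical units: A has the largest impact but is unlikely to respond.
units = [
    {"id": "A", "mi_score": 0.90, "propensity": 0.10},
    {"id": "B", "mi_score": 0.60, "propensity": 0.50},
    {"id": "C", "mi_score": 0.30, "propensity": 0.80},
]
ranking = [u["id"] for u in prioritize(units)]
print(ranking)
```

Under this assumed rule, a unit with a high impact but a very low chance of responding can drop below units with a moderate impact and a good chance of success, which is the kind of trade-off the proposed simulations would evaluate.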

**Progress:**

Groves and Heeringa (2006) suggested building models using discrete-time hazard models, as in Singer and Willett (1993). Those models were logistic models predicting the odds of obtaining an interview on the next call, given a set of prior experiences with the sample case. We decided to use a similar approach, estimating the likelihood that a business will be finalized as a respondent before the next Rolling Estimate (RE) iteration. In IBSP, an RE iteration is run regularly during collection to refresh the priority list of units to be followed up by phone. Paradata from the BTHs of the 32 IBSP surveys for RY2013 were used in conjunction with frame information to derive potential predictor variables and build response propensity models. Paradata variables such as the number of call attempts, the number of contacts and the initial mode of collection have a significant impact on the response propensity of a business. However, the list of significant variables in the model differs from one survey to another. A report was written to present the results of the response propensity models and to explain how response propensities could be incorporated into active collection management.

**Assessing the relationship between questionnaire complexity and respondent burden in household surveys using a graph theoretic approach**

Historically, several researchers have proposed modelling questionnaires as graphs. By treating questions as nodes, and flows from question to question as edges, we can use existing graph theoretic algorithms to measure and model a number of aspects of questionnaires, including various measures of complexity and response burden. Picard (1965) proposed a graph-based questionnaire design theory. The coverage of a question is the surveyed population that is asked that question, and a flow is a sequence of questions that a specific population is asked. A simple flow structure eases the important tasks of editing and imputation, and ultimately makes the data analysis more transparent.
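
As a small illustration of the graph representation, the Python sketch below stores a hypothetical questionnaire as a directed acyclic graph and computes two simple measures: the number of distinct paths and the largest number of questions on any path. The questionnaire, the names and the choice of measures are ours; the sketch assumes that the flows contain no loops and that every question eventually leads to the end node.

```python
from functools import lru_cache

# Hypothetical questionnaire: each question maps to the questions that may follow it.
FLOWS = {
    "Q1": ["Q2", "Q3"],
    "Q2": ["Q4"],
    "Q3": ["Q4", "END"],
    "Q4": ["END"],
    "END": [],
}

def count_paths(flows, start, end):
    """Number of distinct question sequences leading from start to end."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == end:
            return 1
        return sum(paths_from(nxt) for nxt in flows[node])
    return paths_from(start)

def longest_path_length(flows, start, end):
    """Largest number of questions asked on any path from start to end."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        if node == end:
            return 0
        return 1 + max(longest_from(nxt) for nxt in flows[node])
    return longest_from(start)

n_paths = count_paths(FLOWS, "Q1", "END")
max_questions = longest_path_length(FLOWS, "Q1", "END")
print(n_paths, max_questions)
```

Caching the result for each node keeps the computation linear in the number of edges, even when the number of paths grows very quickly with the number of skip patterns.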

The electronic world offers opportunities as well as challenges. Because questionnaire flows will necessarily be stored in electronically retrievable formats, we can readily extract this information, measure things such as survey length and which survey paths are used most often, and provide measures of complexity. By using these measures at the design stage and building up metrics over time, we can help identify proposed surveys that are too complex.

**Progress:**

We have collaborated with the producers of the Questionnaire Design Tool (QDT) used in the Social Survey Processing Environment (SSPE) to help define ways in which the graphs could be automated. We have a SAS program that creates a visual graph and calculates another measure of complexity. Using this SAS program, we noted that we over-count the number of possible paths, as some ‘nodes’ are in fact conditional. This understanding has led to the development of a second graphical form of the questionnaire.

We have completed rendering specifications using the QDT output, and have a working prototype automating the graph. We have completed work on an algorithm for complexity measures that now corrects for the double counting (this was one of the original reasons Statistics Netherlands stopped working on this area). We have worked on a presentation of this prototype.

**Understanding Respondents: profiling respondents using Data Mining Techniques**

Data mining often uses adaptive parametric and nonparametric models for understanding information in large databases. Data mining techniques have been used extensively in marketing research for customer profiling studies. These techniques could be used to help Statistics Canada better understand its respondent population, which is an area of technical expertise that is of interest for methodologists and survey managers. This study will review the literature on analytical approaches to respondent profiling, with the view of understanding response behaviors and response motivation.

Additionally, this study will look at the feasibility of using clustering and segmentation techniques (such as k-means, hierarchical clustering and potentially adaptive decision tree clustering) with Statistics Canada paradata, auxiliary data and respondent data as an analysis tool to understand respondents’ behavior. Cluster analysis in this context is a method for identifying homogeneous groups of respondents based on the distance between their characteristics. Individuals in the same cluster share many characteristics, while being dissimilar to individuals outside their cluster. The analysis will focus on the methods’ ability to differentiate individuals into useful response clusters or groups.
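
To make the clustering idea concrete, the following Python sketch implements a basic k-means algorithm on a handful of hypothetical respondents described by two paradata variables (number of call attempts and days to respond). The data, the variable choice and the deterministic starting centroids are assumptions made for illustration; in practice the variables would normally be standardized first.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(cluster):
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def k_means(points, k, n_iter=20):
    """Basic k-means: returns a cluster label per point and the centroids."""
    centroids = [tuple(p) for p in points[:k]]  # simple deterministic start
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids

# Hypothetical respondents: (call attempts, days to respond).
respondents = [(1, 2), (8, 20), (2, 3), (9, 22), (1, 1), (10, 19)]
labels, centroids = k_means(respondents, k=2)
print(labels)
print(centroids)
```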

**Progress:**

An initial review of the literature on nonresponse categorization was completed. Non-respondent categories (or the reasons for nonresponse) were grouped using the following references:

- Guidelines on nonresponse reporting (Statistics Canada 2009).
- The categories used in monitoring nonresponse in the 2011 Census.
- Explaining Rising Nonresponse Rates in Cross-Sectional Surveys. Brick and Williams (2013).

A variety of survey data were investigated to find relevant data and context. A supplementary survey to the LFS, the Canadian Income Survey (CIS), was selected in order to have the auxiliary information on nonrespondents to the supplements.

**Quality of proxy response – Understanding the relation between the mode and instrument and New Paradata Application**

The purpose of this study is to identify proxy respondents in the Labour Force Survey of households. In this context, we would like to determine the factors that would influence the decision to respond by proxy. The hypothesis adopted in this work is that the decision to respond by proxy is an individual one and not made at the household level.

This project falls within the general context of evaluating proxy response quality. In the literature, studies on proxy reporting are mostly focused on evaluation of the potential bias on estimates resulting from use of this mode of reporting and its correction. The decision to be a proxy respondent is not really discussed. One of the reasons is that, in most surveys, the proxy respondent is known. This study is a first step to address the problem of measuring the potential bias of proxy responses in cases where the proxy respondent is not identified. This situation can arise with electronic collection. Moussa and Fox (2013) dealt with the methodology of evaluating this bias with multimodal collection. That approach would not be feasible with electronic collection unless there was prior identification of the proxy respondent.

**Progress:**

The work was carried out in the first part of the year.

**Development of a prioritization score for assigning cases to interviewers who become available in a computer-assisted telephone interview (CATI) survey**

The goal of this project is to develop a method of calculating a score that the call scheduler of a CATI system will use to select the case to be contacted by an interviewer who has become available.

**Progress:**

A general form of the score was proposed. It takes into account the status (firm appointment, flexible appointment, etc.) achieved by each of the sample’s cases at the time of the call. A calculation is made for cases where there is no commitment to the respondent. This calculation takes into account three parameters:

- The time since the last attempt;
- The level of priority of each case according to the survey managers;
- An indicator of the level of availability of each case to respond to the survey at the time the call is made.

The goal is to develop a way to calculate the third parameter taking into account the information available on each individual selected, that is, the information contained in the survey frame, from sources outside the survey, or collected in previous attempts. Models that can be used for this purpose have been identified. They now need to be evaluated.
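To make the structure of such a score concrete, here is a minimal sketch. The text above gives only the inputs, not the functional form, so everything specific below is a hypothetical choice for illustration: the product form, the 48-hour cap, the fixed scores for appointments and the field names.

```python
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    status: str                      # "firm_appointment", "flexible_appointment" or "none"
    hours_since_last_attempt: float
    manager_priority: float          # set by survey managers, assumed to lie in (0, 1]
    p_available: float               # modelled chance the case can respond right now

# Hypothetical fixed scores for cases with a commitment to the respondent.
COMMITMENT_SCORE = {"firm_appointment": 100.0, "flexible_appointment": 10.0}

def score(case, max_wait_hours=48.0):
    """Illustrative score only; not the form proposed in the project."""
    if case.status in COMMITMENT_SCORE:
        return COMMITMENT_SCORE[case.status]
    # No commitment: combine the three parameters listed above.
    waited = min(case.hours_since_last_attempt / max_wait_hours, 1.0)
    return waited * case.manager_priority * case.p_available

def next_case(cases):
    """Case the call scheduler would hand to a newly available interviewer."""
    return max(cases, key=score)

queue = [
    Case("A", "none", 24.0, 0.5, 0.4),
    Case("B", "none", 48.0, 1.0, 0.3),
    Case("C", "none", 6.0, 1.0, 0.9),
]
print(next_case(queue).case_id)
```

The third input, `p_available`, stands in for the availability indicator; in the project it would come from one of the models still to be evaluated.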

For further information, contact:

Michelle Simard (613-951-6910, michelle.simard@statcan.gc.ca).

### References

Bethlehem, J., and Hundepool, A. (2001). TADEQ, a tool for the Analysis and Documentation of Electronic Questionnaires. *Proceedings of the 2001 New Techniques and Technologies for Statistics Conference*, 9-17.

Bethlehem, J., and Hundepool, A. (2002). On the documentation and analysis of electronic questionnaires. *International Conference on Questionnaire Development, Evaluation and Testing,* 1-22.

Bouchon, B. (1976). Useful information and questionnaires. *Information and Control*, 32.4, 368-378.

Boulet, C. (2013). Analytical Potential of Field Collection Paradata from the 2011 Census of Population and National Household Survey. SSMD.

Brick, J.M., and Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional surveys. *The Annals of the American Academy of Political and Social Science*, 645.

Elliott, S. (2012). The application of graph theory to the development and testing of survey instruments. *Survey Methodology*, 38, 1, 11-21.

Fagan, J., and Greenberg, B.V. (1988). Using graph theory to analyse skip patterns in questionnaires. Technical report, Bureau of the Census. *Statistical Research Division, Washington DC*, 124.

Groves, R.M., and Couper, M.P. (1998). Nonresponse in Household Interview Surveys, New York, John Wiley & Sons, Inc. (Wiley series in Probability and Statistics. Survey methodology section), 344 pages.

Groves, R.M., and Heeringa, S.G. (2006). Responsive design for household surveys: tools for actively controlling survey errors and costs. *Journal of the Royal Statistical Society, Series A*, 169, 3, 439-457.

Hosmer, Jr., D.W., Lemeshow, S. and Sturdivant, R.X. (2013). Applied Logistic Regression. 3rd ed. Wiley Series in Probability and Statistics.

Kreuter, F. (2013). Improving surveys with paradata. Analytic uses of process information. *Wiley Series in Survey Methodology*.

Moussa, S., and Fox, K. (2013). The use of proxy response and mixed mode data collection. Proceedings: Symposium 2013, Producing reliable estimates from imperfect frames, Statistics Canada.

Picard, C.-F. (1965). *Théorie des questionnaires*. Gauthier-Villars, 20.

Singer, J.D., and Willett, J.B. (1993). It’s about time: using discrete-time survival analysis to study duration and timing events. *Journal of Educational and Behavioral Statistics*, 18, 2, 155-195.

Statistics Canada (2009). Guidelines for Reporting Nonresponse Rates.

Statistics Canada (2012). Guide to the Labour Force Survey. Statistics Canada Catalogue no. 71-543-G.

Van den Brakel, J., and Binder, D. (2000). Variance estimation for experiments embedded in complex sampling schemes. *Proceedings of the section on Survey Research Methods*, American Statistical Association, 805-810.

Van den Brakel, J.A., and Renssen, R.H. (2005). Analysis of experiments embedded in complex sampling designs. *Survey Methodology*, 31, 1, 23-40.

Willenborg, L.C.R.J. (1995). Data editing and imputation from a computational point of view. *Statistica Neerlandica,* 49.2, 139-151.

Willenborg, L.C.R.J. (1995). Testing the logical structure of questionnaires. *Statistica Neerlandica,* 49.1, 95-109.

## Disclosure control resource centre

As part of its mandate, the Disclosure Control Resource Centre (DCRC) provides advice and assistance to Statistics Canada programs on methods of disclosure risk assessment and control. It also shares information or provides advice on disclosure control practices with other departments and agencies. Finally, it oversees research activities in disclosure control methods.

**Progress:**

Information and advice on confidentiality were provided internally and to the Public Health Agency of Canada and the Canada Revenue Agency. A workshop, “Strategies and Methods for Disclosure Control”, was presented the day before Statistics Canada’s 2014 International Symposium on Methodological Issues. A paper and presentation on the Layered Perturbation Method were completed for the Advisory Committee on Statistical Methods (see reference).

A course, Strategies and Methods for Statistical Disclosure Control (H-0406C), was also prepared and given for the first time in March 2015 by Jean-Louis Tambay and Sébastien Labelle-Blanchet.

**Administrative data disclosure control**

**Development of disclosure control rules for administrative data.**

The project involves developing rules for controlling disclosure of personal administrative data in aggregate forms (tables and analytical results). We divided the types of administrative data into two groups—type A (health, justice, education, etc.) and type B (tax data)—to better adapt the rules to the specific needs and challenges of each group. We are focusing on dissemination of administrative data controlled by the Agency (with the GTAB system) rather than dissemination by the Real Time Direct Access (RTDA) program. Different approaches were examined and discussed (suppression, Barnardization, score method, data permutation, noise methods, etc.). Our approach attempts to preserve what is already available (rules already in place for RTDA and GTAB). In general, the approach is post-tabular (processing on outputs).

**Progress:**

Rules were developed for all statistics with type A administrative data. These rules were presented to the HSMD Technical Committee, commented on and approved. Specifications were produced and presented to the Microdata Access Division. They will be incorporated in the GTAB over the coming fiscal year.

The work to develop disclosure control rules for new statistics (level of change, percentage of change and moving average) is well under way and should be completed next year. A presentation on the proposal was given to the project team and to the divisions involved.

**Confidentialized plots**

Programs to enable outside researchers to publish results from Statistics Canada data have stringent disclosure rules to prevent release of individual-level data, as required by the Statistics Act. Increasingly, tools like Real Time Remote Access (RTRA) allow the publication of a variety of descriptive and model statistics, individually or tabulated. Guidelines for graphical output, on the other hand, are modest and cover mostly table-equivalent outputs like choropleth maps, pie charts and histograms. Outside researchers wanting to show a two-dimensional distribution of values have been frustrated by a general ban on scatter plots that does not necessarily apply to internal researchers. Binned plots can also incorporate survey weights more elegantly than simple scatter plots or bubble plots (particularly in the case of large volumes of data), so users without disclosure control constraints can benefit too. Because of this, they could level the playing field for displaying bivariate distributions, giving both internal and external users an appealing and disclosure-resistant tool.

This research project would start from tools developed by the Australian Bureau of Statistics (ABS) as part of their DataAnalyzer package to implement, evaluate and refine a standard method of showing a bivariate distribution using hexagonal binning, with the scale and offset of the hexagonal cells determined by the data. An alternative approach, rectangular binning combined with the adaptive controlled rounding techniques already used at Statistics Canada, will also be examined. Kernel smoothing is also of interest: if time permits, the project will investigate what changes might be necessary to introduce disclosure limitation into the existing methods of setting kernel parameters (those built into SAS and Stata). This would respond to researcher needs and provide another confidentiality-aware method of producing nonparametric graphical output.
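The idea of weighted hexagonal binning can be sketched in a few lines. This is not the ABS DataAnalyzer method: the cell width is fixed rather than determined by the data, there is no offset, and the minimum-count suppression rule (`min_count`) is a hypothetical stand-in for whatever disclosure rule would actually apply. Each point is assigned to the nearest centre of a triangular lattice, whose Voronoi cells are regular hexagons.

```python
from collections import defaultdict
from math import floor, sqrt

def hexbin_weighted(xs, ys, weights, width, min_count=5):
    """Sum survey weights per hexagonal cell; drop cells with few records."""
    height = width * sqrt(3.0)  # vertical spacing that makes the hexagons regular
    counts = defaultdict(int)
    totals = defaultdict(float)
    for x, y, w in zip(xs, ys, weights):
        # Nearest centre on lattice A: (i * width, j * height)
        ax = round(x / width) * width
        ay = round(y / height) * height
        # Nearest centre on lattice B: ((i + 0.5) * width, (j + 0.5) * height)
        bx = (floor(x / width) + 0.5) * width
        by = (floor(y / height) + 0.5) * height
        dist_a = (x - ax) ** 2 + (y - ay) ** 2
        dist_b = (x - bx) ** 2 + (y - by) ** 2
        centre = (ax, ay) if dist_a <= dist_b else (bx, by)
        counts[centre] += 1
        totals[centre] += w
    # Hypothetical rule: publish a cell only if enough records fall in it.
    return {c: totals[c] for c in counts if counts[c] >= min_count}

xs = [0.1, -0.1, 0.0, 0.5]
ys = [0.0, 0.1, -0.1, 0.9]
weights = [10, 20, 30, 5]
bins = hexbin_weighted(xs, ys, weights, width=1.0, min_count=3)
print(bins)
```

The returned dictionary maps each retained cell centre to its weighted total, which is what a plotting routine would shade; the isolated fourth point falls in a cell of its own and is not shown.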

**Progress:**

The ABS has released the DataAnalyzer R source code. We found non-confidential data from Lohr, comprising a variety of skewed and non-skewed continuous variables and, in some cases, weights, on which to test a prototype. We have partially understood and specified the ABS R code for porting into SAS, and we developed a prototype that performs binning with a variable bin size and orientation. We also assisted a HAD-based project that clusters observations before applying a kernel density estimator, a related confidentiality project that fulfils a need similar to hex binning. This work provided some insight into the strengths and weaknesses of SAS PROC FASTCLUS for disclosure control purposes, and motivated the development of an alternative clustering methodology.

**Data Perturbation to Reduce Cell Suppression**

The primary method of protecting sensitive cells in business survey tables is through cell suppression. When dealing with additive tabular data, secondary suppression must be performed to protect against residual disclosure. Alternatives to cell suppression include perturbed tabular adjustment and noise addition; these methods result in a loss of quality (through perturbation) instead of a loss of information (through suppression). If the values of sensitive cells are small relative to other cells in the same table, the resulting loss of quality using these methods could be very small, and preferable to cell suppression. In particular, this project will consider the case where only cells in a very small province or region (such as Nunavut) require sensitivity protection. The project will investigate the effectiveness of these methods, and their possible application, both in theory and using actual business data from previous surveys.

**Progress:**

A draft report on the findings was prepared (see reference).

**Disclosure Risk Measures**

In preparing Public Use Microdata Files (PUMFs) we often have to make decisions about the amount of detail to provide or make tradeoffs between different levels of detail for different characteristics. Measures of risk are available in the literature but they are generally not used at Statistics Canada. Most methods need some idea of the population distribution, which is hard to come by. Some risk measures factor in errors introduced in the data, which would be useful. The project aims to review the literature and previous research and to recommend a strategy or strategies for the calculation of risk.

**Progress:**

Due to unexpected events in other projects, no major progress was made in the period.

For further information, contact:

Michelle Simard (613-951-6910, michelle.simard@statcan.gc.ca).

## Record Linkage Research

Record linkage brings together records within the same file or across different files. It is an important tool for exploiting administrative data and may serve many purposes, e.g., creating a frame or collecting data. The research has three directions. The first is the exploration of new methods for linking data. The second focuses on solutions for measuring linkage errors and the quality of linked data. The third is analysis and estimation with linked data.

**Optimal deterministic record-linkage**

In many linkages, there is enough information to establish reliable links through clerical reviews. In such cases, clerical reviews provide a solid basis for automating linkage decisions (Newcombe 1988, Ch. 31, pp. 97), regardless of whether the record pairs are compared with a deterministic, hierarchical or probabilistic methodology. The Fellegi-Sunter decision rule (Fellegi and Sunter 1969) is the cornerstone of this automation, with or without conditional independence assumptions.
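For readers unfamiliar with the Fellegi-Sunter decision rule, the sketch below shows its usual form under conditional independence: each compared field contributes a log-likelihood-ratio weight, and the total is compared with two thresholds. The field names, the m- and u-probabilities and the thresholds are invented for illustration; in the methodology described here they would be estimated from clerical-review results.

```python
from math import log2

# Hypothetical parameters: m = P(field agrees | true match),
#                          u = P(field agrees | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "postal_code": (0.80, 0.02),
}

def match_weight(agreement):
    """Sum of per-field log2 likelihood ratios (conditional independence assumed)."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

def decide(agreement, lower=0.0, upper=8.0):
    """Three-way decision: link, non-link, or send the pair to clerical review."""
    weight = match_weight(agreement)
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"

pair = {"surname": True, "birth_year": False, "postal_code": False}
print(round(match_weight(pair), 2), decide(pair))
```

Pairs whose weight falls between the two thresholds are the ones that would be sent for clerical review.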

A methodology is proposed to draw a probabilistic clerical-review sample, manually review the pairs in accordance with good practices and estimate the linkage parameters from the review results. It also includes a methodology for measuring clerical-review errors, whereby a small subsample of pairs is subjected to multiple independent reviews, and the repeated measures are processed by an Expectation-Maximization (E-M) algorithm (Biemer 2015). The different components of this methodology are described in reports, including one on good practices for clerical-review. A SAS-based prototype has been developed and applied to the data from the 2011 Census Overcoverage Study (COS).

The proposed solution provides at least four immediate benefits, of which the first is a guarantee of optimal linkage whenever the clerical reviews are error-free and used as the gold standard. Note that this first benefit also applies to linkages implemented with G-LINK, where conditional independence assumptions are currently made and lead to biased estimates and suboptimal links. The second benefit is the minimization of clerical-review costs through optimal sampling designs. The third benefit is an automated solution for replacing many deterministic or hierarchical linkages by probabilistic linkages of an equivalent quality, as measured by the achieved rates of linkage error. Thus it facilitates the convergence of the different linkage methodologies at Statistics Canada.

**Building a sampling frame of connected record groups through probabilistic record linkage**

In the COS, overcoverage is estimated by building a sampling frame of groups of connected records, drawing a probabilistic sample from this frame and manually reviewing the sampled groups to verify the occurrence and nature of their overcoverage. The groups may be defined as entire connected components or as simple pairs in the linked census graph, that is, the graph where each census record is a node and each linked pair is an edge. The first option may produce a nearly unbiased overcoverage estimate, but it leads to a high manual review burden when a few very large components are sampled. The second option overestimates the overcoverage but produces a reduced manual review burden.

A methodology is proposed where a compromise is made between these two alternatives. It defines sampling units based on neighbourhoods, which are record-groups larger than pairs but smaller than connected components. Neighbourhoods based on the hop-count distance (defined by the minimum number of intermediate links between two connected records) are recommended but other alternatives are also described. This methodology also addresses the issues of sampling and estimation with the neighbourhoods, to avoid any bias when triple or higher-order enumerations occur in the census.
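A hop-count neighbourhood can be computed with a breadth-first search over the linked graph. In this sketch the hop count is taken to be the number of edges on the shortest path from the starting record, which is an assumption about the exact convention; the record identifiers and links are made up.

```python
from collections import deque

def neighbourhood(edges, start, max_hops):
    """Records within max_hops links of `start`, with their hop counts."""
    # Build an undirected adjacency list from the linked pairs.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the neighbourhood radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops

# Hypothetical linked pairs: a chain r1-r2-r3-r4 and a separate pair r5-r6.
links = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r5", "r6")]
print(neighbourhood(links, "r1", max_hops=2))
```

With `max_hops=1` the neighbourhood of a record reduces to the record and its directly linked partners, and with a large enough radius it grows to the whole connected component, which are the two extremes described above.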

The methodology is described in a report. It provides a basis for reducing clerical-review costs and for improving the quality of estimates in the upcoming 2016 COS. A SAS-based prototype has been developed to support the evaluation of the methodology through simulations and an application to the 2011 COS data.

**Expectation Maximization (EM) algorithm for record linkage**

In probabilistic record-linkage, links are based on parameters such as linkage weights and thresholds. With G-LINK, these parameters are currently set in a manual and iterative manner. However, in the 2011 COS, a SAS-based Expectation-Maximization (E-M) algorithm was used (Dasylva, Titus and Thibault 2015). This algorithm greatly simplifies linkages but currently suffers from certain limitations.

A methodology report discusses these limitations and proposes ways of accounting for correlations and clerical reviews in the E-M algorithm, while reusing the SAS PROC CATMOD procedure in the M-step. It also looks at other commercial or open source software packages, whether based on R or Stata or standalone. A SAS-based prototype has been developed to support the evaluation of the methodology through simulations and an application to the 2011 COS data.
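As background, the sketch below shows the basic E-M algorithm for linkage parameters in its simplest textbook form: a two-class mixture over binary agreement patterns with conditional independence between fields. It is not the improved algorithm described in the report (it ignores correlations and clerical-review information and does not use PROC CATMOD), and the toy agreement patterns, starting values and clamping constant are assumptions.

```python
def em_linkage(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1, eps=1e-6):
    """Estimate p = P(match), m[k] = P(agree | match), u[k] = P(agree | non-match)."""
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    clamp = lambda v: min(max(v, eps), 1.0 - eps)  # keep parameters off 0 and 1
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for pattern in patterns:
            like_m, like_u = p, 1.0 - p
            for k, agrees in enumerate(pattern):
                like_m *= m[k] if agrees else 1.0 - m[k]
                like_u *= u[k] if agrees else 1.0 - u[k]
            post.append(like_m / (like_m + like_u))
        # M-step: weighted agreement frequencies in each class.
        n_match = sum(post)
        n_non = len(patterns) - n_match
        p = clamp(n_match / len(patterns))
        m = [clamp(sum(g * pat[k] for g, pat in zip(post, patterns)) / n_match)
             for k in range(n_fields)]
        u = [clamp(sum((1 - g) * pat[k] for g, pat in zip(post, patterns)) / n_non)
             for k in range(n_fields)]
    return p, m, u

# Hypothetical agreement patterns on three fields (1 = agree, 0 = disagree).
pairs = ([(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 1)] * 2
         + [(1, 0, 0)] * 10 + [(0, 1, 0)] * 10 + [(0, 0, 1)] * 10 + [(0, 0, 0)] * 60)
p_hat, m_hat, u_hat = em_linkage(pairs)
print(round(p_hat, 3), [round(v, 3) for v in m_hat], [round(v, 3) for v in u_hat])
```

The estimated m- and u-probabilities can then be turned into linkage weights and thresholds instead of setting them manually.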

This improved E-M algorithm will be used for the estimation of linkage parameters in the 2016 COS. It is to be integrated with G-LINK in the future.

**Linear and Logistic regression with linked survey data**

Data linkage is a well-accepted method for finding matches between two sets of records. However, it can lead to linkage errors that include incorrect links and missed links. In general, ignoring these errors leads to biased estimates in analysis. There has been some work on how to make valid inferences in the presence of linkage errors. Chipperfield, Bishop and Campbell (2011) have adapted Maximum Likelihood methods for contingency tables and logistic regressions, while adjusting for incorrect links. Their method has been applied to linked census data based on the Febrl package (see Christen, Churches and Hegland 2004). Chambers, Chipperfield, Davis and Kovacevic (2009) have also used generalized estimating equations for estimating the parameters of a logistic model in the presence of linkage errors.

This research has addressed the problem of bias due to missed or incorrect links when analyzing linked data resulting from a probabilistic linkage between a survey (File X) and an administrative database (File Y) with G-LINK. It has aimed at an E-M based estimation methodology. For this fiscal year, two important contributions have been made. First, a reweighting scheme has been developed to address the problem of missed links and unlinked records, which may be viewed as a form of unit non-response. Second, a methodology has been developed for performing logistic regression with linked data, while adjusting for bad links that are akin to measurement errors. This methodology uses an implemented E-M algorithm.

**Quality Indicators for deterministic record-linkage**

MixMatch is a solution for deterministic linkage that is widely used at Statistics Canada. However, it does not provide any indicator or measure of link quality. This project aims at closing this gap without resorting to manual reviews.

Existing solutions (Block 2013; Belin and Rubin 1995) for measuring link quality have been reviewed and their limitations identified. A new solution is proposed that includes two components. The first component is a generic strength indicator for string comparators, the PLICE (Perfect Legitimate Inclusions Conflicts Extras), which is also to be incorporated into G-LINK. The second component is a methodology (Dasylva and Sinha 2014) for estimating the False Match Rate (FMR) and the Missed Match Rate (MMR) when the distribution of unmatched pairs is known. It does not require any manual review. The methodology has been adapted for MixMatch with a SAS-based prototype. This solution will be refined and tested with real data in the coming fiscal year.

For further information, contact:

Abel Dasylva (613-854-1918, abel.dasylva@statcan.gc.ca).

### References

Belin, T.R., and Rubin, D.B. (1995). A method for calibrating false-match rates in record linkage. *Journal of the American Statistical Association*, 90, 81-94.

Biemer, P. (2015). Latent Class Analysis of Survey Error, New Jersey: John Wiley & Sons, Inc.

Block, C. (2013). Elections Canada’s Practical Approach to Record Linkage. http://method/BiblioStat/Seminars/HSMD/ClaytonBlock_20130604_e.htm.

Chambers, R., Chipperfield, J., Davis, W. and Kovacevic, M. (2009). Inference based on Estimating Equations and Probability-Linked Data, Working Paper.

Chipperfield, J.O., Bishop, G.R. and Campbell, P. (2011). Maximum likelihood estimation for contingency tables and logistic regression with incorrectly linked data. *Survey Methodology*, 37, 1, 13-24.

Christen, P., Churches, T. and Hegland, M. (2004). Febrl - A parallel open source data linkage system. In *Proceedings of Pacific Asia Knowledge Discovery and Data Mining (PAKDD) Conference*, Springer LNAI, vol. 3056, pp. 638-647, Sydney.

Fellegi, I.P., and Sunter, A.B. (1969). A theory for record linkage. *Journal of the American Statistical Association*, 64(328), 1183-1210.

Newcombe, H.B. (1988). Handbook of Record Linkage, New York: Oxford University Press.
