Analysis

Results

All (12) (0 to 10 of 12 results)

1. A case study of using Splink: Census duplicate matching (Archived)
Articles and reports: 11-522-X202200100002
Description: The authors used the Splink probabilistic linkage package, developed by the UK Ministry of Justice, to link census data from England and Wales to itself to find duplicate census responses. A large gold standard of confirmed census duplicates was available, meaning that the results of the Splink implementation could be quality assured. This paper describes the implementation and features of Splink, gives details of the settings and parameters that we used to tune Splink for our particular project, and gives the results that we obtained.
Release date: 2024-03-25
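Splink builds on the Fellegi-Sunter model of probabilistic record linkage. As a rough illustration of the idea behind entry 1 (this is not Splink's API, and not the settings used in the paper), the sketch below scores one candidate pair of records from per-field m and u probabilities. The field names, probabilities, and prior are hypothetical, and the fields are assumed to be conditionally independent.

```python
import math

def match_weight(m, u, agrees):
    """Log2 Bayes factor contributed by one compared field.

    m = P(field agrees | the two records are the same person)
    u = P(field agrees | the two records are different people)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_probability(prior, weights):
    """Combine a prior match probability with the summed field weights."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** sum(weights)
    return posterior_odds / (1 + posterior_odds)

# Hypothetical (m, u) values for three census fields.
fields = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode": (0.80, 0.05),
}
# Hypothetical comparison outcome for one candidate pair.
agreement = {"surname": True, "date_of_birth": True, "postcode": False}

weights = [match_weight(m, u, agreement[name]) for name, (m, u) in fields.items()]
probability = match_probability(prior=0.001, weights=weights)
```

Agreement on a field that rarely agrees by chance (small u) adds a large positive weight, while disagreement on a usually reliable field (large m) subtracts weight.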
2. One-sided testing of population domain means in surveys
Articles and reports: 12-001-X202300100001
Description: Recent work in survey domain estimation allows for estimation of population domain means under a priori assumptions expressed in terms of linear inequality constraints. For example, it might be known that the population means are non-decreasing along ordered domains. Imposing the constraints has been shown to provide estimators with smaller variance and tighter confidence intervals. In this paper we consider a formal test of the null hypothesis that all the constraints are binding, versus the alternative that at least one constraint is non-binding. The test of constant versus increasing domain means is a special case. The power of the test is substantially better than the test with the same null hypothesis and an unconstrained alternative. The new test is used with data from the National Survey of College Graduates, to show that salaries are positively related to the subject’s father’s educational level, across fields of study and over several years of cohorts.
Release date: 2023-06-30
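To make the constraint in entry 2 concrete, the sketch below projects a set of unconstrained domain means onto the nearest non-decreasing sequence using the weighted pool-adjacent-violators algorithm. It is a generic illustration of order-constrained estimation, not the authors' estimator or test, and the means and weights are hypothetical.

```python
def monotone_domain_means(means, weights):
    """Weighted pool-adjacent-violators.

    Returns the non-decreasing sequence closest to `means`
    in weighted least squares.
    """
    # Each block holds [pooled mean, total weight, number of domains pooled].
    blocks = []
    for m, w in zip(means, weights):
        blocks.append([m, w, 1])
        # Merge backwards while the ordering constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w_tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_tot, w_tot, n1 + n2])
    # Expand pooled blocks back to one value per domain.
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out

unconstrained = [50.0, 54.0, 52.0, 60.0]  # hypothetical domain means
sizes = [100, 100, 100, 100]              # hypothetical domain weights
constrained = monotone_domain_means(unconstrained, sizes)
```

Domains two and three violate the ordering, so they are pooled to a common value; in the constrained fit that constraint is binding, while the others are not.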
3. Growing Regression Trees that Use Sampling Frame Covariates to Explore Response Burden for Use in Survey Design (Archived)
Articles and reports: 11-522-X202100100024
Description: The Economic Directorate of the U.S. Census Bureau is developing coordinated design and sample selection procedures for the Annual Integrated Economic Survey. The unified sample will replace the directorate’s existing practice of independently developing sampling frames and sampling procedures for a suite of separate annual surveys, which optimizes sample design features at the cost of increased response burden. Size attributes of business populations, e.g., revenues and employment, are highly skewed. A high percentage of companies operate in more than one industry. Therefore, many companies are sampled into multiple surveys, compounding the response burden, especially for “medium-sized” companies.
This component of response burden is reduced by selecting a single coordinated sample but will not be completely alleviated. Response burden is a function of several factors, including (1) questionnaire length and complexity, (2) accessibility of data, (3) expected number of repeated measures, and (4) frequency of collection. The sample design can have profound effects on the third and fourth factors. To help inform decisions about the integrated sample design, we use regression trees to identify covariates from the sampling frame that are related to response burden. Using historic frame and response data from four independently sampled surveys, we test a variety of algorithms, then grow regression trees that explain relationships between expected levels of response burden (as measured by response rate) and frame covariates common to more than one survey. We validate initial findings by cross-validation, examining results over time. Finally, we make recommendations on how to incorporate our robust findings into the coordinated sample design.
Release date: 2021-10-29
4. Technical supplement for the Investment Banking Services Price Index (Archived)
Articles and reports: 62F0014M2019005
Description:
This document describes the updated methodology for the Investment Banking Services Price Index (IBSPI).
Release date: 2019-07-08
5. Pseudo-likelihood-based Bayesian information criterion for variable selection in survey data (Archived)
Articles and reports: 12-001-X201300211871
Description:
Regression models are routinely used in the analysis of survey data, where one common issue of interest is to identify influential factors that are associated with certain behavioral, social, or economic indices within a target population. When data are collected through complex surveys, the properties of classical variable selection approaches developed in i.i.d. non-survey settings need to be re-examined. In this paper, we derive a pseudo-likelihood-based BIC criterion for variable selection in the analysis of survey data and suggest a sample-based penalized likelihood approach for its implementation. The sampling weights are appropriately assigned to correct the biased selection result caused by the distortion between the sample and the target population. Under a joint randomization framework, we establish the consistency of the proposed selection procedure. The finite-sample performance of the approach is assessed through analysis and computer simulations based on data from the hypertension component of the 2009 Survey on Living with Chronic Diseases in Canada.
Release date: 2014-01-15
6. Imputation for nonmonotone nonresponse in the survey of industrial research and development (Archived)
Articles and reports: 12-001-X201200211753
Description:
Nonresponse in longitudinal studies often occurs in a nonmonotone pattern. In the Survey of Industrial Research and Development (SIRD), it is reasonable to assume that the nonresponse mechanism is past-value-dependent in the sense that the response propensity of a study variable at time point t depends on response status and observed or missing values of the same variable at time points prior to t. Since this nonresponse is nonignorable, the parametric likelihood approach is sensitive to the specification of parametric models on both the joint distribution of variables at different time points and the nonresponse mechanism. The nonmonotone nonresponse also limits the application of inverse propensity weighting methods. By discarding all observed data from a subject after its first missing value, one can create a dataset with a monotone ignorable nonresponse and then apply established methods for ignorable nonresponse. However, discarding observed data is not desirable and it may result in inefficient estimators when many observed data are discarded. We propose to impute nonrespondents through regression under imputation models carefully created under the past-value-dependent nonresponse mechanism. This method does not require any parametric model on the joint distribution of the variables across time points or the nonresponse mechanism. Performance of the estimated means based on the proposed imputation method is investigated through some simulation studies and empirical analysis of the SIRD data.
Release date: 2012-12-19
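The baseline that entry 6 argues against, discarding everything a subject reports after its first missing value, can be sketched as follows. This is only an illustration of how a nonmonotone pattern is forced into a monotone one; the panel values are hypothetical and `None` marks a missing response.

```python
def truncate_after_first_missing(values):
    """Return a copy in which every value after the first missing one is dropped."""
    out = []
    seen_missing = False
    for v in values:
        if v is None:
            seen_missing = True
        out.append(None if seen_missing else v)
    return out

# Hypothetical panel: one list of time-ordered responses per unit.
panel = {
    "unit_a": [10.0, None, 12.5, 13.0],  # nonmonotone: responds again after a gap
    "unit_b": [7.0, 7.5, 8.0, None],     # already monotone
    "unit_c": [5.0, 5.5, 6.0, 6.5],      # complete
}

monotone = {unit: truncate_after_first_missing(v) for unit, v in panel.items()}

# Count observed values that the truncation throws away.
discarded = sum(
    1
    for unit in panel
    for before, after in zip(panel[unit], monotone[unit])
    if before is not None and after is None
)
```

Here two observed values from `unit_a` are lost, which is the inefficiency the abstract refers to when many observed data are discarded.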
7. Low-income Dynamics and Determinants Under Different Thresholds: New Findings for Canada in 2000 and Beyond (Archived)
Articles and reports: 75F0002M2011003
Description:
Existing studies on Canadian poverty (or low-income) dynamics are mainly based on 1990s data from the Longitudinal Administrative Database or the Survey of Labour and Income Dynamics (SLID). These studies typically rely on a single low-income threshold.
Our work extends the existing studies beyond 1999 by using SLID data from Panel 3 (1999 to 2004) and Panel 4 (2002 to 2007). We consider all three low-income thresholds established by federal departments: Statistics Canada's low-income cut-off (LICO) and low-income measure (LIM), and the market basket measure (MBM) of Human Resources and Skills Development Canada.
Release date: 2011-10-21
8. Treatments for link nonresponse in indirect sampling (Archived)
Articles and reports: 12-001-X200900211038
Description:
We examine how to overcome the overestimation that link nonresponse causes when the generalized weight share method (GWSM) is used in indirect sampling. A few adjustment methods that incorporate link nonresponse into the GWSM have been constructed for situations both with and without auxiliary variables. A simulation study on a longitudinal survey is presented using some of the adjustment methods we recommend. The simulation results show that these adjusted GWSMs perform well in reducing both estimation bias and variance. The improvement in bias reduction is significant.
Release date: 2009-12-23
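A simplified sketch of why link nonresponse inflates GWSM weights in entry 8: the weight for a target cluster divides the design-weighted links from sampled frame units by the total number of links into the cluster from the whole frame population, so an undercounted total makes the weight too large. The function name and all numbers are hypothetical, and none of the paper's adjustment methods is reproduced here.

```python
def gwsm_cluster_weight(sampled_links, total_links):
    """Weight shared by all units of one target cluster.

    sampled_links: (design_weight, links_to_cluster) for each sampled frame unit
    total_links:   links into the cluster from the whole frame population
    """
    if total_links <= 0:
        raise ValueError("cluster must have at least one link")
    return sum(w * links for w, links in sampled_links) / total_links

# One sampled frame unit with design weight 10 has 2 links to the cluster.
sampled = [(10.0, 2)]

# All 4 links into the cluster are observed.
w_full = gwsm_cluster_weight(sampled, total_links=4)

# Link nonresponse hides one link, so only 3 are counted.
w_undercounted = gwsm_cluster_weight(sampled, total_links=3)
```

With the full link count the weight is 10 × 2 / 4; with one link unobserved the denominator shrinks and the weight grows, which is the overestimation the adjustments aim to correct.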
9. Comparisons of collection follow-up score functions (Archived)
Articles and reports: 11-522-X200800010959
Description:
The Unified Enterprise Survey (UES) at Statistics Canada is an annual business survey that unifies more than 60 surveys from different industries. Two types of collection follow-up score functions are currently used in the UES data collection. The objective of using a score function is to maximize the economically weighted response rates of the survey in terms of the primary variables of interest, under the constraint of a limited follow-up budget. Since the two types of score functions are based on different methodologies, they could have different impacts on the final estimates.
This study compares the two types of score functions using the collection data from the two most recent years. For comparison purposes, it applies each score function method to the same data and computes, for each method, various estimates of the published financial and commodity variables, their deviation from the true pseudo value, and their mean square deviation. These estimates of deviation and mean square deviation are then used to measure the impact of each score function on the final estimates of the financial and commodity variables.
Release date: 2009-12-03
10. Imputation for nonmonotone last-value-dependent nonrespondents in longitudinal surveys (Archived)
Articles and reports: 12-001-X200800210756
Description:
In longitudinal surveys nonresponse often occurs in a pattern that is not monotone. We consider estimation of time-dependent means under the assumption that the nonresponse mechanism is last-value-dependent. Since the last value itself may be missing when nonresponse is nonmonotone, the nonresponse mechanism under consideration is nonignorable. We propose an imputation method by first deriving some regression imputation models according to the nonresponse mechanism and then applying nonparametric regression imputation. We assume that the longitudinal data follow a Markov chain with finite second-order moments. No other assumption is imposed on the joint distribution of longitudinal data and their nonresponse indicators. A bootstrap method is applied for variance estimation. Some simulation results and an example concerning the Current Employment Survey are presented.
Release date: 2008-12-23

Date modified: 2024-10-18