Results
All (30) (0 to 10 of 30 results)
- 1. Using bootstrap weights with WesVar and SUDAAN (Archived) - Articles and reports: 12-002-X20040027032 - Description:
This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not directly supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce bootstrap variance estimates.
The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.
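To make the mechanics concrete, here is a minimal sketch (not taken from the article; the data layout and function name are assumptions) of how replicate estimates computed from bootstrap weights yield a design-based variance estimate:

```python
import numpy as np

def bootstrap_variance(y, full_weight, bootstrap_weights):
    """Design-based bootstrap variance of a weighted total.

    y                 : (n,) analysis variable
    full_weight       : (n,) final survey weights
    bootstrap_weights : (n, B) one column of replicate weights per bootstrap sample
    """
    theta_hat = np.sum(full_weight * y)           # full-sample estimate
    theta_b = bootstrap_weights.T @ y             # B replicate estimates
    B = bootstrap_weights.shape[1]
    # Scaling conventions vary by survey; this uses the common 1/B form.
    return np.sum((theta_b - theta_hat) ** 2) / B

# Hypothetical toy data: 5 respondents, 4 bootstrap replicates.
rng = np.random.default_rng(0)
y = np.array([12.0, 7.5, 3.2, 9.9, 5.1])
w = np.array([100.0, 80.0, 120.0, 90.0, 110.0])
bsw = w[:, None] * rng.uniform(0.5, 1.5, size=(5, 4))   # stand-in replicate weights
print(bootstrap_variance(y, w, bsw))
```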
Release date: 2004-10-05
- Articles and reports: 11-522-X20020016430 - Description:
Linearization (or Taylor series) methods are widely used to estimate standard errors for the coefficients of linear regression models fit to multi-stage samples. When the number of primary sampling units (PSUs) is large, linearization can produce accurate standard errors under quite general conditions. However, when the number of PSUs is small or a coefficient depends primarily on data from a small number of PSUs, linearization estimators can have large negative bias.
In this paper, we characterize features of the design matrix that produce large bias in linearization standard errors for linear regression coefficients. We then propose a new method, bias-reduced linearization (BRL), based on residuals adjusted to better approximate the covariance of the true errors. When the errors are independent and identically distributed (i.i.d.), the BRL estimator is unbiased for the variance. Furthermore, a simulation study shows that BRL can greatly reduce the bias, even if the errors are not i.i.d. We also propose using a Satterthwaite approximation to determine the degrees of freedom of the reference distribution for tests and confidence intervals about linear combinations of coefficients based on the BRL estimator. We demonstrate that the jackknife estimator also tends to be biased in situations where linearization is biased. However, the jackknife's bias tends to be positive. Our bias-reduced linearization estimator can be viewed as a compromise between the traditional linearization and jackknife estimators.
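As a rough illustration of the estimators being compared (a sketch under simplifying assumptions, not the authors' code), the function below computes the usual linearization (sandwich) variance for OLS coefficients with clustered PSUs, with an optional BRL-style correction that rescales each PSU's residuals by (I - H_gg)^(-1/2); the exact adjustment used in the paper may differ:

```python
import numpy as np
from scipy.linalg import sqrtm

def cluster_sandwich(X, y, clusters, bias_reduced=False):
    """Linearization (sandwich) variance of OLS coefficients with clustered data.

    With bias_reduced=True, each cluster's residuals are premultiplied by
    (I - H_gg)^(-1/2), a BRL-style correction of the residual covariance.
    """
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        Xg, eg = X[idx], resid[idx]
        if bias_reduced:
            Hgg = Xg @ XtX_inv @ Xg.T                     # within-cluster hat matrix block
            Ag = np.real(sqrtm(np.linalg.pinv(np.eye(idx.sum()) - Hgg)))
            eg = Ag @ eg                                   # adjusted residuals
        meat += Xg.T @ np.outer(eg, eg) @ Xg
    return XtX_inv @ meat @ XtX_inv                        # variance of beta-hat

# Hypothetical data: 6 PSUs of 5 observations each.
rng = np.random.default_rng(1)
clusters = np.repeat(np.arange(6), 5)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)
print(np.sqrt(np.diag(cluster_sandwich(X, y, clusters, bias_reduced=True))))
```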
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016716 - Description:
Missing data are a constant problem in large-scale surveys. Such incompleteness is usually dealt with either by restricting the analysis to the cases with complete records or by imputing, for each missing item, an efficiently estimated value. The deficiencies of these approaches will be discussed in this paper, especially in the context of estimating a large number of quantities. The main part of the paper will describe two examples of analyses using multiple imputation.
In the first, the International Labour Organization (ILO) employment status is imputed in the British Labour Force Survey by a Bayesian bootstrap method. It is an adaptation of the hot-deck method, which seeks to fully exploit the auxiliary information. Important auxiliary information is given by the previous ILO status, when available, and the standard demographic variables.
Missing data can be interpreted more generally, as in the framework of the expectation maximization (EM) algorithm. The second example is from the Scottish House Condition Survey, and its focus is on the inconsistency of the surveyors. The surveyors assess the sampled dwelling units on a large number of elements or features of the dwelling, such as internal walls, roof and plumbing, that are scored and converted to a summarizing 'comprehensive repair cost.' The level of inconsistency is estimated from the discrepancies between the pairs of assessments of doubly surveyed dwellings. The principal research questions concern the amount of information that is lost as a result of the inconsistency and whether the naive estimators that ignore the inconsistency are unbiased. The problem is solved by multiple imputation, generating plausible scores for all the dwellings in the survey.
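Here is a stylized sketch of the two ingredients mentioned above, the Bayesian bootstrap hot deck and Rubin's combining rules, applied to a hypothetical binary status (illustrative only; the variable names and the toy variance formula are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def bayesian_bootstrap_impute(donor_values, n_missing):
    """Draw imputations using Dirichlet(1,...,1) weights over the donor pool
    (the Bayesian bootstrap), rather than equal hot-deck probabilities."""
    probs = rng.dirichlet(np.ones(len(donor_values)))
    return rng.choice(donor_values, size=n_missing, p=probs)

def rubin_combine(estimates, variances):
    """Rubin's rules for M completed-data estimates and their variances."""
    M = len(estimates)
    q_bar = np.mean(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    return q_bar, within + (1 + 1 / M) * between

# Hypothetical example: estimate an employment rate with 3 missing values.
observed = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])      # 1 = employed
M = 20
ests, vars_ = [], []
for _ in range(M):
    imputed = bayesian_bootstrap_impute(observed, n_missing=3)
    completed = np.concatenate([observed, imputed])
    p = completed.mean()
    ests.append(p)
    vars_.append(p * (1 - p) / len(completed))            # naive SRS variance
print(rubin_combine(np.array(ests), np.array(vars_)))
```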
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016717 - Description:
In the United States, the National Health and Nutrition Examination Survey (NHANES) is linked to the National Health Interview Survey (NHIS) at the primary sampling unit level (the same counties, but not necessarily the same persons, are in both surveys). The NHANES examines about 5,000 persons per year, while the NHIS samples about 100,000 persons per year. In this paper, we present and develop properties of models that allow NHIS and administrative data to be used as auxiliary information for estimating quantities of interest in the NHANES. The methodology, related to Fay-Herriot (1979) small-area models and to calibration estimators in Deville and Särndal (1992), accounts for the survey designs in the error structure.
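For orientation only (a generic sketch, not the models developed in the paper), a basic Fay-Herriot area-level estimator shrinks each direct survey estimate toward a regression prediction built from auxiliary data, with shrinkage driven by the sampling and model variances:

```python
import numpy as np

def fay_herriot_eblup(y_direct, X, D, sigma2_v):
    """Basic Fay-Herriot composite estimator for m small areas.

    y_direct : (m,) direct survey estimates
    X        : (m, p) area-level auxiliary data (e.g., from another survey or admin records)
    D        : (m,) known sampling variances of the direct estimates
    sigma2_v : model (area-effect) variance, assumed known or estimated separately
    """
    V_inv = 1.0 / (sigma2_v + D)                          # GLS weights
    beta = np.linalg.solve(X.T @ (V_inv[:, None] * X), X.T @ (V_inv * y_direct))
    gamma = sigma2_v / (sigma2_v + D)                     # shrinkage factors
    return gamma * y_direct + (1 - gamma) * (X @ beta)

# Hypothetical toy example with 4 areas and an intercept-only model.
y = np.array([0.22, 0.31, 0.18, 0.27])
X = np.ones((4, 1))
D = np.array([0.004, 0.010, 0.002, 0.008])
print(fay_herriot_eblup(y, X, D, sigma2_v=0.003))
```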
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016718 - Description:
Cancer surveillance research requires accurate estimates of risk factors at the small area level. These risk factors are often obtained from surveys such as the National Health Interview Survey (NHIS) or the Behavioral Risk Factor Surveillance System (BRFSS). Unfortunately, no one population-based survey provides ideal prevalence estimates of such risk factors. One strategy is to combine information from multiple surveys, using the complementary strengths of one survey to compensate for the weaknesses of the other. The NHIS is a nationally representative, face-to-face survey with a high response rate; however, it cannot produce state or substate estimates of risk factor prevalence because sample sizes are too small. The BRFSS is a state-level telephone survey that excludes non-telephone households and has a lower response rate, but does provide reasonable sample sizes in all states and many counties. Several methods are available for constructing small-area estimators that combine information from both the NHIS and the BRFSS, including direct estimators, estimators under hierarchical Bayes models and model-assisted estimators. In this paper, we focus on the latter, constructing generalized regression (GREG) and 'minimum-distance' estimators and using existing and newly developed small-area smoothing techniques to smooth the resulting estimators.
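To illustrate one of the building blocks named above (a minimal sketch, not the estimators constructed in the paper), a generalized regression (GREG) estimator adjusts the design-weighted total using known population totals of auxiliary variables:

```python
import numpy as np

def greg_total(y, x, d, tx_pop):
    """Generalized regression (GREG) estimator of the population total of y.

    y      : (n,) study variable from the survey
    x      : (n, p) auxiliary variables observed in the survey
    d      : (n,) design weights
    tx_pop : (p,) known population totals of the auxiliary variables
    """
    # Survey-weighted regression coefficient of y on x.
    B = np.linalg.solve(x.T @ (d[:, None] * x), x.T @ (d * y))
    ht_y = np.sum(d * y)                  # Horvitz-Thompson total of y
    ht_x = x.T @ d                        # Horvitz-Thompson totals of x
    return ht_y + (tx_pop - ht_x) @ B     # regression adjustment

# Hypothetical toy data.
rng = np.random.default_rng(3)
n = 50
x = np.column_stack([np.ones(n), rng.uniform(1, 5, n)])
y = 2 * x[:, 1] + rng.normal(size=n)
d = np.full(n, 20.0)                      # equal design weights
tx_pop = np.array([1000.0, 3100.0])       # known N and population total of x
print(greg_total(y, x, d, tx_pop))
```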
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016728 - Description:
Nearly all surveys use complex sampling designs to collect data, and these data are frequently used for statistical analyses beyond the estimation of simple descriptive parameters of the target population. Many procedures available in popular statistical software packages are not appropriate for this purpose because the analyses are based on the assumption that the sample has been drawn with simple random sampling. Therefore, the results of the analyses conducted using these software packages would not be valid when the sample design incorporates multistage sampling, stratification, or clustering. Two commonly used methods for analysing data from complex surveys are replication and Taylor linearization techniques. We discuss the use of WesVar software to compute estimates and replicate variance estimates by properly reflecting complex sampling and estimation procedures. We also illustrate the WesVar features by using data from two Westat surveys that employ complex survey designs: the Third International Mathematics and Science Study (TIMSS) and the National Health and Nutrition Examination Survey (NHANES).
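This is not WesVar code, but a minimal sketch of the replication idea it implements: delete one PSU at a time within each stratum, reweight the remaining PSUs, and measure how the estimate moves across replicates (a stratified delete-one jackknife):

```python
import numpy as np

def jackknife_variance(y, w, stratum, psu):
    """Stratified delete-one-PSU jackknife (JKn) variance of a weighted total."""
    theta_hat = np.sum(w * y)
    var = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        psus_h = np.unique(psu[in_h])
        n_h = len(psus_h)
        for dropped in psus_h:
            w_rep = w.copy()
            w_rep[in_h & (psu == dropped)] = 0.0                 # remove one PSU
            w_rep[in_h & (psu != dropped)] *= n_h / (n_h - 1.0)  # reweight the rest
            theta_rep = np.sum(w_rep * y)
            var += (n_h - 1.0) / n_h * (theta_rep - theta_hat) ** 2
    return var

# Hypothetical design: 2 strata, 3 PSUs each, 2 respondents per PSU.
rng = np.random.default_rng(4)
stratum = np.repeat([1, 1, 1, 2, 2, 2], 2)
psu = np.repeat(np.arange(6), 2)
y = rng.normal(10, 2, size=12)
w = np.full(12, 50.0)
print(jackknife_variance(y, w, stratum, psu))
```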
Release date: 2004-09-13
- 7. Inferences for finite populations using multiple data sources with different reference times (Archived) - Articles and reports: 11-522-X20020016733 - Description:
While censuses and surveys are often said to measure populations as they are, most reflect information about individuals as they were at the time of measurement, or even at some prior time point. Inferences from such data therefore should take into account change over time at both the population and individual levels. In this paper, we provide a unifying framework for such inference problems, illustrating it through a diverse series of examples including: (1) estimating residency status on Census Day using multiple administrative records, (2) combining administrative records for estimating the size of the US population, (3) using rolling averages from the American Community Survey, and (4) estimating the prevalence of human rights abuses.
Specifically, at the population level, the estimands of interest, such as the size or mean characteristics of a population, might be changing. At the same time, individual subjects might be moving in and out of the frame of the study or changing their characteristics. Such changes over time can affect statistical studies of government data that combine information from multiple data sources, including censuses, surveys and administrative records, an increasingly common practice. Inferences from the resulting merged databases often depend heavily on specific choices made in combining, editing and analysing the data that reflect assumptions about how populations of interest change or remain stable over time.
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016734 - Description:
According to recent literature, the calibration method has gained much popularity in survey sampling, and calibration estimators are routinely computed by many survey organizations. The choice of calibration variables for all existing approaches, however, remains ad hoc. In this article, we show that the model-calibration estimator for the finite population mean, which was proposed by Wu and Sitter (2001) through an intuitive argument, is indeed optimal among a class of calibration estimators. We further present optimal calibration estimators for the finite population distribution function, the population variance, the variance of a linear estimator and other quadratic finite population functions under a unified framework. A limited simulation study shows that the improvement of these optimal estimators over the conventional ones can be substantial. The question of when and how auxiliary information can be used for both the estimation of the population mean using a generalized regression estimator and the estimation of its variance through calibration is addressed clearly under the proposed general methodology. Constructions of the proposed estimators under two-phase sampling and some fundamental issues in using auxiliary information from survey data are also addressed in the context of optimal estimation.
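For context (a sketch of ordinary linear calibration, not the optimal model-calibration estimators studied in the article), chi-square distance calibration has a closed form: the design weights are adjusted so that the weighted auxiliary totals reproduce known benchmarks exactly:

```python
import numpy as np

def linear_calibration_weights(d, x, tx_pop):
    """Calibrated weights w = d * (1 + x @ lam), minimizing the chi-square
    distance to the design weights d subject to sum_i w_i * x_i = tx_pop."""
    ht_x = x.T @ d                                        # weighted totals before calibration
    lam = np.linalg.solve(x.T @ (d[:, None] * x), tx_pop - ht_x)
    return d * (1.0 + x @ lam)

# Hypothetical toy example: calibrate on N and a known total of one x-variable.
rng = np.random.default_rng(5)
n = 40
x = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
d = np.full(n, 25.0)
tx_pop = np.array([1000.0, 5200.0])
w = linear_calibration_weights(d, x, tx_pop)
print(x.T @ w)                                            # reproduces tx_pop (up to rounding)
```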
Release date: 2004-09-13
- Articles and reports: 11-522-X20020016735 - Description:
In the 2001 Canadian Census of Population, calibration or regression estimation was used to calculate a single set of household-level weights to be used for all census estimates based on a one-in-five national sample of more than two million households. Because many auxiliary variables were available, only a subset of them could be used. Otherwise, some of the weights would have been smaller than one or even negative. In this technical paper, a forward selection procedure was used to discard auxiliary variables that caused weights to be smaller than one or that caused a large condition number for the calibration weight matrix being inverted. Also, two calibration adjustments were done to achieve close agreement between auxiliary population counts and estimates for small areas. Prior to 2001, the projection generalized regression (GREG) estimator was used and the weights were required to be greater than zero. For the 2001 Census, a switch was made to a pseudo-optimal regression estimator that kept more auxiliary variables and, at the same time, required that the weights be one or more.
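A schematic sketch of the kind of screening described above (hypothetical code, not the census production system): candidate auxiliary variables are added one at a time, and any candidate whose inclusion drives a calibrated weight below one or leaves the calibration matrix ill-conditioned is rejected:

```python
import numpy as np

def forward_select_calibration(d, X, tx_pop, max_cond=1e8):
    """Greedily keep auxiliary columns whose addition leaves all calibrated
    weights >= 1 and the calibration matrix well-conditioned."""
    kept = []
    for j in range(X.shape[1]):
        trial = kept + [j]
        Xt = X[:, trial]
        A = Xt.T @ (d[:, None] * Xt)
        if np.linalg.cond(A) > max_cond:
            continue                                      # unstable: skip this variable
        lam = np.linalg.solve(A, tx_pop[trial] - Xt.T @ d)
        w = d * (1.0 + Xt @ lam)                          # linear calibration weights
        if np.min(w) < 1.0:
            continue                                      # would create weights below one
        kept = trial
    return kept

# Hypothetical toy run with 3 candidate auxiliary variables.
rng = np.random.default_rng(6)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n), rng.uniform(0, 1, n)])
d = np.full(n, 5.0)
tx_pop = np.array([300.0, 160.0, 140.0])
print(forward_select_calibration(d, X, tx_pop))
```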
Release date: 2004-09-13
- 10. Accuracy estimation with clustered dataset (Archived) - Articles and reports: 11-522-X20020016737 - Description:
If the dataset available for machine learning results from cluster sampling (e.g., patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. In this technical paper, an adapted cross-validation is described for this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under either a cluster sampling or a simple random sampling hypothesis, is compared with the true value. The results highlight the impact of the sampling design on inference: clustering has a significant impact, and the split between the learning set and the test set should be based on a random partition of the clusters, not of the individual examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is deficient for model selection. These results are illustrated with a real application of automatic identification of spoken language.
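A compact sketch of the contrast described above, on hypothetical clustered data rather than the spoken-language application: splitting by cluster keeps whole wards together on one side of the split, whereas a naive example-level split lets observations from the same cluster leak across it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Hypothetical clustered data: 20 "wards" of 25 "patients" each, with
# within-ward dependence induced by a shared ward effect.
rng = np.random.default_rng(7)
n_clusters, size = 20, 25
ward = np.repeat(np.arange(n_clusters), size)
ward_effect = rng.normal(0.0, 1.5, n_clusters)
x = ward_effect[ward, None] + rng.normal(size=(n_clusters * size, 3))
logit = ward_effect[ward] + 0.3 * x[:, 0]
y = (rng.uniform(size=logit.size) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000)

# Naive cross-validation: random partition of individual examples.
naive = cross_val_score(model, x, y, cv=KFold(5, shuffle=True, random_state=0))

# Cluster-aware cross-validation: partition of whole clusters.
grouped = cross_val_score(model, x, y, cv=GroupKFold(5), groups=ward)

# Under cluster sampling, the cluster-level split is the appropriate one;
# the example-level split can understate the generalization error rate.
print(naive.mean(), grouped.mean())
```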
Release date: 2004-09-13
Data (0) (0 results)
No content available at this time.
Analysis (28)
Reference (2) (2 results)
- Surveys and statistical programs – Documentation: 11F0026M2004002 - Description:
This paper discusses the productivity program at Statistics Canada, covering topics such as international efforts to provide more comparable statistics, attempts to expand our knowledge of the factors behind productivity growth, and challenges facing the program.
Release date: 2004-08-06
- Surveys and statistical programs – Documentation: 12-002-X20040016891 - Description:
These two programs are designed to estimate variability due to measurement error beyond the sampling variance introduced by the survey design in the Youth in Transition Survey / Programme of International Student Assessment (YITS/PISA). Program code is included in an appendix.
Release date: 2004-04-15