All (10 results)

  • Articles and reports: 75F0002M2004010
    Description:

    This document offers a set of guidelines for analysing income distributions. It focuses on the basic intuition of the concepts and techniques instead of the equations and technical details.

    Release date: 2004-10-08

  • Articles and reports: 12-002-X20040027032
    Description:

This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce bootstrap variance estimates.

    The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.

    Release date: 2004-10-05
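
    A minimal sketch of the bootstrap variance computation described above (my illustration, not code from the article; the data and weight-generation scheme are fabricated): the point estimate is computed with the main survey weights, each replicate recomputes it with one set of bootstrap weights, and the variance estimate is the average squared deviation of the replicates around the full-sample estimate.

```python
import numpy as np

def bootstrap_variance(y, w, bootstrap_weights, estimator=None):
    """Design-based bootstrap variance from supplied bootstrap weights.

    y: (n,) responses; w: (n,) main survey weights;
    bootstrap_weights: (B, n), one row per bootstrap replicate.
    """
    if estimator is None:
        # Default statistic: the weighted mean.
        estimator = lambda yy, ww: np.sum(ww * yy) / np.sum(ww)
    theta_hat = estimator(y, w)                      # full-sample estimate
    replicates = np.array([estimator(y, wb) for wb in bootstrap_weights])
    # Average squared deviation of the replicates around the full-sample
    # estimate, the usual mean-bootstrap variance formula.
    return theta_hat, np.mean((replicates - theta_hat) ** 2)

# Fabricated data for illustration only.
rng = np.random.default_rng(0)
n, B = 200, 500
y = rng.normal(50, 10, n)
w = rng.uniform(1, 3, n)
bw = w * rng.gamma(1.0, 1.0, size=(B, n))   # stand-in for supplied weights
est, var = bootstrap_variance(y, w, bw)
print(f"estimate {est:.2f}, bootstrap SE {var ** 0.5:.2f}")
```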

  • Articles and reports: 11-522-X20020016724
    Description:

    Some of the most commonly used statistical models are fitted using maximum likelihood (ML) or some extension of ML. Stata's ML command provides researchers and data analysts with a tool to develop estimation commands to fit their models using their data. Such models may include multiple equations, clustered observations, sampling weights and other survey design characteristics. These elements are discussed in this paper.

    Release date: 2004-09-13
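
    The general pattern behind such estimation commands can be sketched outside Stata as well (a Python illustration under my own assumptions, not the paper's code): maximize a pseudo-log-likelihood in which each observation's contribution is multiplied by its sampling weight.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_logit_nll(beta, X, y, w):
    """Negative pseudo-log-likelihood for logistic regression, with each
    observation's contribution scaled by its sampling weight."""
    eta = X @ beta
    # log L_i = y*eta - log(1 + exp(eta)); np.logaddexp is numerically stable.
    return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))

# Fabricated data for illustration only.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + X[:, 1])))).astype(float)
w = rng.uniform(1, 4, n)                      # sampling weights
fit = minimize(weighted_logit_nll, x0=np.zeros(2), args=(X, y, w),
               method="BFGS")
print("pseudo-ML estimates:", fit.x)
```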

  • Articles and reports: 11-522-X20020016725
    Description:

In 1997, the US Office of Management and Budget issued revised standards for the collection of race information within the federal statistical system. One revision allows individuals to choose more than one race group when responding to federal surveys and other federal data collections. This change presents challenges for analyses that involve data collected under both the old and new race-reporting systems, since the data on race are not comparable. This paper discusses the problems created by these changes and the methods developed to overcome them.

Since most people under both systems report only a single race, a common proposed solution is to try to bridge the transition by assigning a single-race category to each multiple-race reporter under the new system, and to conduct analyses using just the observed and assigned single-race categories. Thus, the problem can be viewed as a missing-data problem, in which single-race responses are missing for multiple-race reporters and need to be imputed.

    The US Office of Management and Budget suggested several simple bridging methods to handle this missing-data problem. Schenker and Parker (Statistics in Medicine, forthcoming) analysed data from the National Health Interview Survey of the US National Center for Health Statistics, which allows multiple-race reporting but also asks multiple-race reporters to specify a primary race, and found that improved bridging methods could result from incorporating individual-level and contextual covariates into the bridging models.

    While Schenker and Parker discussed only three large multiple-race groups, the current application requires predicting single-race categories for several small multiple-race groups as well. Thus, problems of sparse data arise in fitting the bridging models. We address these problems by building combined models for several multiple-race groups, thus borrowing strength across them. These and other methodological issues are discussed.

    Release date: 2004-09-13
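
    A minimal sketch of one model-based bridging strategy in this spirit (my illustration, not the authors' implementation; data and variables are fabricated): fit a multinomial logistic model for the primary race of multiple-race reporters who specified one, using individual-level covariates, then impute a single-race category for the remaining multiple-race reporters from the predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated illustration: X holds individual-level/contextual covariates;
# primary_race is observed for only some multiple-race reporters.
rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 3))
primary_race = rng.integers(0, 3, size=n).astype(float)
primary_race[rng.uniform(size=n) < 0.3] = np.nan   # unreported primary race

observed = ~np.isnan(primary_race)
model = LogisticRegression(max_iter=1000)          # multinomial logit
model.fit(X[observed], primary_race[observed])

# Impute: draw a single-race category from each reporter's predicted
# probabilities (stochastic imputation preserves between-category variation).
probs = model.predict_proba(X[~observed])
imputed = np.array([rng.choice(model.classes_, p=p) for p in probs])
print("imputed categories:", np.unique(imputed, return_counts=True))
```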

  • Articles and reports: 11-522-X20020016737
    Description:

If the dataset available for machine learning results from cluster sampling (e.g., patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. This technical paper describes an adapted cross-validation for this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under the cluster or simple random sampling hypothesis, is compared with the true value. The results highlight the impact of the sampling design on inference: clustering clearly has a significant impact, and the split between the learning set and the test set should result from a random partition of the clusters, not from a random partition of the individual examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is deficient for model selection. These results are illustrated with a real application: the automatic identification of spoken language.

    Release date: 2004-09-13
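
    The key point of the adapted procedure, partitioning whole clusters rather than individual examples into learning and test sets, can be sketched with scikit-learn's GroupKFold (my illustration with fabricated data, not the paper's code):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

# Fabricated clustered data: 20 "wards", outcomes correlated within ward.
rng = np.random.default_rng(3)
clusters = np.repeat(np.arange(20), 25)            # cluster id per example
ward_effect = rng.normal(0, 1.5, 20)[clusters]
X = rng.normal(size=(500, 4)) + ward_effect[:, None]
y = (ward_effect + rng.normal(size=500) > 0).astype(int)

# Folds split whole clusters, never the examples inside them, so each test
# fold mimics generalization to unseen clusters.
errors = []
for train, test in GroupKFold(n_splits=5).split(X, y, groups=clusters):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    errors.append(1.0 - model.score(X[test], y[test]))
print("cluster-level CV error rate:", np.mean(errors))
```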

  • Articles and reports: 12-001-X20040016995
    Description:

    One of the main objectives of a sample survey is the computation of estimates of means and totals for specific domains of interest. Domains are determined either before the survey is carried out (primary domains) or after it has been carried out (secondary domains). The reliability of the associated estimates depends on the variability of the sample size as well as on the y-variables of interest. This variability cannot be controlled in the absence of auxiliary information for subgroups of the population. However, if auxiliary information is available, the estimated reliability of the resulting estimates can be controlled to some extent. In this paper, we study the potential improvements in terms of the reliability of domain estimates that use auxiliary information. The properties (bias, coverage, efficiency) of various estimators that use auxiliary information are compared using a conditional approach.

    Release date: 2004-07-14
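
    To fix ideas (my notation, not the paper's): the unadjusted expansion estimator of a domain total and a simple auxiliary-assisted alternative are

    $$\hat{Y}_d^{\mathrm{HT}} = \sum_{k \in s_d} w_k\, y_k, \qquad \hat{Y}_d^{\mathrm{ratio}} = X_d\, \frac{\sum_{k \in s_d} w_k\, y_k}{\sum_{k \in s_d} w_k\, x_k},$$

    where $s_d$ is the part of the sample falling in domain $d$, $w_k$ are the design weights, and $X_d$ is a known auxiliary total for the domain; the ratio form damps the variability induced by the random domain sample size.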

  • Articles and reports: 12-001-X20040016996
    Description:

This article studies the use of the sample distribution for the prediction of finite population totals under single-stage sampling. The proposed predictors employ the sample values of the target study variable, the sampling weights of the sample units and possibly known population values of auxiliary variables. The prediction problem is solved by estimating the expectation of the study values for units outside the sample as a function of the corresponding expectation under the sample distribution and the sampling weights. The prediction mean square error is estimated by a combination of an inverse sampling procedure and a re-sampling method. An interesting outcome of the present analysis is that several familiar estimators in common use are shown to be special cases of the proposed approach, thus providing them with a new interpretation. The performance of the new predictors and of some existing ones in common use is evaluated and compared in a Monte Carlo simulation study using a real data set.

    Release date: 2004-07-14
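
    In the notation common to this literature (my rendering, not copied from the article), the predictor decomposes the total into its sampled and non-sampled parts, with the non-sample expectation expressed through the sample distribution and the weights:

    $$\hat{T} = \sum_{i \in s} y_i + \sum_{i \notin s} \widehat{E}\left(y_i \mid x_i,\, i \notin s\right), \qquad E\left(y_i \mid x_i,\, i \notin s\right) = \frac{E_s\{(w_i - 1)\, y_i \mid x_i\}}{E_s\{(w_i - 1) \mid x_i\}},$$

    where $E_s$ denotes expectation under the sample distribution and $w_i$ are the sampling weights.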

  • Articles and reports: 12-001-X20040016997
    Description:

Multilevel models are often fitted to survey data gathered with a complex multistage sampling design. However, if such a design is informative, in the sense that the inclusion probabilities depend on the response variable even after conditioning on the covariates, then standard maximum likelihood estimators are biased. In this paper, following the Pseudo Maximum Likelihood (PML) approach of Skinner (1989), we propose a probability-weighted estimation procedure for multilevel ordinal and binary models which eliminates the bias generated by the informativeness of the design. The reciprocals of the inclusion probabilities at each sampling stage are used to weight the log-likelihood function, and the weighted estimators obtained in this way are tested by means of a simulation study for the simple case of a binary random intercept model with and without covariates. The variance estimators are obtained by a bootstrap procedure. The weighted log-likelihood of the model is maximized with the NLMIXED procedure of SAS, which is based on adaptive Gaussian quadrature. The bootstrap estimation of variances is also implemented in the SAS environment.

    Release date: 2004-07-14
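
    For the binary random intercept case, the weighted (pseudo) log-likelihood being maximized has, in the notation usual for this approach (my rendering, not the article's), the form

    $$\ell_w(\theta) = \sum_{j} w_j \log \int \prod_{i \in j} \left[ p_{ij}(u)^{\,y_{ij}} \left(1 - p_{ij}(u)\right)^{1 - y_{ij}} \right]^{w_{i \mid j}} \phi(u)\, du,$$

    where $w_j$ is the reciprocal of the inclusion probability of cluster $j$, $w_{i \mid j}$ is the reciprocal of the conditional inclusion probability of unit $i$ within cluster $j$, and the integral over the random intercept $u$ is evaluated numerically, e.g., by adaptive Gaussian quadrature as in NLMIXED.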

  • Articles and reports: 62F0014M2004017
    Geography: Canada
    Description:

This paper documents the approach used to construct the shelter element of the current spatial index program. The intercity indexes of the retail price differentials program of Prices Division had, until recently, excluded any reference to shelter because of conceptual issues. A rental equivalence approach is used to measure spatial variations in the cost of shelter services among cities. To control for quality variations across areas, a separate semi-log hedonic regression methodology is used to construct the Laspeyres, Paasche, Fisher and Törnqvist interarea indices.

    Release date: 2004-04-16
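
    A stylized sketch of a semi-log hedonic regression with area indicators (my illustration with fabricated data; the actual characteristics and index aggregation are documented in the paper): regressing log rent on quality characteristics plus city indicators and exponentiating the city coefficients yields quality-adjusted rent relatives between cities, which can then feed the index formulas.

```python
import numpy as np

# Fabricated rental data: rooms and age as quality characteristics,
# three cities with city 0 as the base area.
rng = np.random.default_rng(4)
n, n_cities = 600, 3
rooms = rng.integers(1, 6, n)
age = rng.uniform(0, 60, n)
city = rng.integers(0, n_cities, n)

city_dummies = np.eye(n_cities)[city][:, 1:]        # base city omitted
X = np.column_stack([np.ones(n), rooms, age, city_dummies])
log_rent = (6.0 + 0.15 * rooms - 0.004 * age
            + np.array([0.0, 0.10, -0.05])[city] + rng.normal(0, 0.2, n))

# Semi-log hedonic regression by ordinary least squares.
beta, *_ = np.linalg.lstsq(X, log_rent, rcond=None)
# Quality-adjusted rent level of each city relative to the base city.
print("rent relatives vs. base city:", np.exp(beta[3:]))
```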

  • Articles and reports: 12-002-X20040016890
    Description:

This note introduces a Stata command that calculates variance estimates using bootstrap weights.

    Release date: 2004-04-15