
Results

All (23) (0 to 10 of 23 results)

  • Articles and reports: 12-001-X201400214089
    Description:

    This manuscript describes the use of multiple imputation to combine information from multiple surveys of the same underlying population. We use a newly developed method to generate synthetic populations nonparametrically using a finite population Bayesian bootstrap that automatically accounts for complex sample designs. We then analyze each synthetic population with standard complete-data software for simple random samples and obtain valid inference by combining the point and variance estimates using extensions of existing combining rules for synthetic data. We illustrate the approach by combining data from the 2006 National Health Interview Survey (NHIS) and the 2006 Medical Expenditure Panel Survey (MEPS).

    Release date: 2014-12-19
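
    As background, a minimal sketch of the classical multiple-imputation combining rules that such synthetic-data methods extend; the paper's own rules for synthetic populations modify the variance term, so the form below is for orientation only. Given m completed datasets yielding point estimates \hat{q}_\ell and variance estimates u_\ell:

    \bar{q}_m = \frac{1}{m}\sum_{\ell=1}^{m}\hat{q}_\ell, \qquad \bar{u}_m = \frac{1}{m}\sum_{\ell=1}^{m}u_\ell, \qquad b_m = \frac{1}{m-1}\sum_{\ell=1}^{m}\left(\hat{q}_\ell - \bar{q}_m\right)^2

    T_m = \bar{u}_m + \left(1 + \frac{1}{m}\right) b_m

    Here \bar{q}_m is the combined point estimate and T_m its estimated variance (Rubin's rules); the between-dataset component b_m captures imputation uncertainty.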

  • Articles and reports: 75-006-X201400114120
    Description:

    This study examines the characteristics of Canadian workers aged 25 to 54 who are covered by defined benefit (DB) registered pension plans (RPPs), as well as those covered by defined contribution RPPs or hybrid plans. It does so by taking advantage of data from the new Longitudinal and International Survey of Adults (LISA), first conducted in 2012.

    Release date: 2014-12-18

  • Articles and reports: 11-522-X201300014255
    Description:

    The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the characteristics of webpages. Studies of the characteristics and dimensions of the web require collecting and analyzing information from a dynamic and complex environment. The core idea was to collect data from a sample of webpages automatically, using software known as a web crawler. The motivation for this paper is to disseminate the methods and results of this study, as well as to show current developments related to sampling techniques in a dynamic environment.

    Release date: 2014-10-31
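
    To make the idea concrete, a hypothetical minimal sketch (not NIC.br's actual crawler): discover pages breadth-first from seed URLs, then draw a uniform random subsample. Every name and parameter below is illustrative.

    import random
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        """Collects href targets from anchor tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_and_sample(seeds, max_pages=50, sample_size=10):
        """Discover up to max_pages pages breadth-first, then sample uniformly."""
        frontier, seen, discovered = list(seeds), set(seeds), []
        while frontier and len(discovered) < max_pages:
            url = frontier.pop(0)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip unreachable or malformed pages
            discovered.append(url)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return random.sample(discovered, min(sample_size, len(discovered)))

    Note that a uniform draw from crawled pages is not a uniform sample of the web; the sampling issues raised by such dynamic environments are exactly what the paper discusses.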

  • Articles and reports: 11-522-X201300014261
    Description:

    National statistical offices are subject to two requirements that are difficult to reconcile. On the one hand, they must provide increasingly precise information on specific subjects and hard-to-reach or minority populations, using innovative methods that make the measurement more objective or ensure its confidentiality, and so on. On the other hand, they must deal with budget restrictions in a context where households are increasingly difficult to contact. This twofold demand has an impact on survey quality in the broad sense, that is, not only in terms of precision, but also in terms of relevance, comparability, coherence, clarity and timeliness. Because the cost of Internet collection is low and a large proportion of the population has an Internet connection, statistical offices see this modern collection mode as a solution to their problems. Consequently, the development of Internet collection and, more generally, of multimode collection is supposedly the solution for maximizing survey quality, particularly in terms of total survey error, because it addresses the problems of coverage, sampling, non-response or measurement while respecting budget constraints. However, while Internet collection is an inexpensive mode, it presents serious methodological problems: coverage, self-selection or selection bias, non-response and non-response adjustment difficulties, ‘satisficing,’ and so on. As a result, before developing or generalizing the use of multimode collection, the National Institute of Statistics and Economic Studies (INSEE) launched a wide-ranging set of experiments to study the various methodological issues, and the initial results show that multimode collection is a source of both solutions and new methodological problems.

    Release date: 2014-10-31

  • Articles and reports: 11-522-X201300014263
    Description:

    Collecting information from sampled units over the Internet or by mail is much more cost-efficient than conducting interviews. These methods make self-enumeration an attractive data-collection method for surveys and censuses. Despite the benefits associated with self-enumeration data collection, in particular Internet-based data collection, self-enumeration can produce low response rates compared with interviews. To increase response rates, nonrespondents are subjected to a mix of follow-up treatments, which influence the resulting probability of response, to encourage them to participate. Factors and interactions are commonly used in regression analyses, and have important implications for the interpretation of statistical models. Because response occurrence is intrinsically conditional, we first record response occurrence in discrete intervals, and we characterize the probability of response by a discrete-time hazard. This approach facilitates examining when a response is most likely to occur and how the probability of responding varies over time. Nonresponse bias can be avoided by multiplying the sampling weight of respondents by the inverse of an estimate of the response probability. Estimators for model parameters as well as for finite population parameters are given. Simulation results on the performance of the proposed estimators are also presented.

    Release date: 2014-10-31
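
    A hedged sketch of the two quantities the abstract mentions, not the authors' estimators: the cumulative response probability implied by discrete-time hazards, and the inverse-probability weight adjustment. The hazard values are illustrative.

    def response_probability(hazards):
        """P(response by the last interval) = 1 - prod_t (1 - h_t),
        where h_t is the discrete-time hazard of responding in interval t."""
        p_never = 1.0
        for h in hazards:
            p_never *= (1.0 - h)
        return 1.0 - p_never

    def adjusted_weight(design_weight, hazards):
        """Multiply the sampling weight by the inverse of the estimated
        response probability to correct for nonresponse."""
        p = response_probability(hazards)
        if p <= 0.0:
            raise ValueError("estimated response probability must be positive")
        return design_weight / p

    # Example: three follow-up waves with estimated hazards 0.20, 0.15, 0.10
    print(adjusted_weight(100.0, [0.20, 0.15, 0.10]))  # 100 / 0.388, about 257.7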

  • Articles and reports: 11-522-X201300014265
    Description:

    Exact record linkage is an essential tool for exploiting administrative files, especially when one is studying the relationships among many variables that are not contained in a single administrative file. It is aimed at identifying pairs of records associated with the same individual or entity. The result is a linked file that may be used to estimate population parameters, including totals and ratios. Unfortunately, the linkage process is complex and error-prone because it usually relies on linkage variables that are non-unique and recorded with errors. As a result, the linked file contains linkage errors, including bad links between unrelated records and missing links between related records. These errors may lead to biased estimators when they are ignored in the estimation process. Previous work in this area has accounted for these errors using assumptions about their distribution. In general, the assumed distribution is in fact a very coarse approximation of the true distribution because the linkage process is inherently complex. Consequently, the resulting estimators may be subject to bias. A new methodological framework, grounded in traditional survey sampling, is proposed for obtaining design-based estimators from linked administrative files. It consists of three steps. First, a probabilistic sample of record pairs is selected. Second, a manual review is carried out for all sampled pairs. Finally, design-based estimators are computed based on the review results. This methodology leads to estimators with a design-based sampling error, even when the process is based solely on two administrative files. It departs from previous model-based work and provides more robust estimators. This result is achieved by placing manual reviews at the center of the estimation process. Effectively using manual reviews is crucial because they are a de facto gold standard regarding the quality of linkage decisions. The proposed framework may also be applied when estimating from linked administrative and survey data.

    Release date: 2014-10-31
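
    To illustrate the third step under simplifying assumptions (a generic Horvitz-Thompson form, not the paper's exact estimator): after manual review of a probability sample of record pairs, a design-based total weights each confirmed link by its inverse inclusion probability. The inclusion probabilities and values below are made up.

    def ht_total(reviewed_pairs):
        """reviewed_pairs: iterable of (inclusion_prob, is_true_link, y).
        Returns the Horvitz-Thompson estimate of the total of y over true links."""
        total = 0.0
        for prob, is_true_link, y in reviewed_pairs:
            if is_true_link:          # confirmed by manual review
                total += y / prob     # inverse-probability weighting
        return total

    # Example: three pairs sampled with inclusion probability 0.01 each;
    # manual review confirms two of them as true links.
    sample = [(0.01, True, 5.0), (0.01, False, 3.0), (0.01, True, 2.0)]
    print(ht_total(sample))  # (5.0 + 2.0) / 0.01 = 700.0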

  • Articles and reports: 11-522-X201300014273
    Description:

    More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives, and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. However, initial experience with analyses of large amounts of Dutch traffic-loop detection records, call detail records of mobile phones and Dutch social media messages reveals that a number of challenges need to be addressed to enable the application of these data sources to official statistics. These challenges, and the lessons learned during the initial studies, are addressed and illustrated with examples. More specifically, the following topics are discussed: the three general types of Big Data discerned, the need to access and analyse large amounts of data, how we deal with noisy data and look at selectivity (and our own bias towards this topic), how to go beyond correlation, how we found people with the right skills and mind-set to perform the work, and how we have dealt with privacy and security issues.

    Release date: 2014-10-31

  • Articles and reports: 11-522-X201300014275
    Description:

    In July 2014, the Office for National Statistics committed to a predominantly online 2021 UK Census. Item-level imputation will play an important role in adjusting the 2021 Census database. Research indicates that the internet may yield cleaner data than paper-based capture and may attract people with particular characteristics. Here, we provide preliminary results from research directed at understanding how we might manage these features in a 2021 UK Census imputation strategy. Our findings suggest that a donor-based imputation method may need to include response mode as a matching variable in the underlying imputation model.

    Release date: 2014-10-31
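
    A hypothetical illustration of that finding (not the ONS imputation system): a donor-based method that uses response mode as a matching variable, so missing values are donated only within the same mode. The record structure and field name are assumptions.

    import random

    def donor_impute(records, field):
        """Fill missing `field` values from a random donor with the same
        response mode; records are dicts with a 'mode' key."""
        donors = {}
        for r in records:
            if r.get(field) is not None:
                donors.setdefault(r["mode"], []).append(r[field])
        for r in records:
            if r.get(field) is None and donors.get(r["mode"]):
                r[field] = random.choice(donors[r["mode"]])
        return records

    data = [{"mode": "online", "age": 34},
            {"mode": "paper", "age": 61},
            {"mode": "online", "age": None}]  # receives an online donor's value
    print(donor_impute(data, "age"))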

  • Articles and reports: 11-522-X201300014283
    Description:

    The MIAD project of the Statistical Network aims to develop methodologies for the integrated use of administrative data (AD) in the statistical process. MIAD's main goal is to provide guidelines for exploiting AD for statistical purposes. In particular, a quality framework has been developed, a mapping of possible uses has been provided, and a schema of alternative informative contexts has been proposed. This paper focuses on the latter aspect. In particular, we distinguish between dimensions that relate to features of the source connected with accessibility, and characteristics that are connected to the AD structure and its relationships with statistical concepts. We call the first class of features the framework for access and the second the data framework. In this paper we concentrate mainly on the second class of characteristics, which relate specifically to the kind of information that can be obtained from the secondary source. In particular, these features concern the target administrative population, the measurements on this population, and how they are (or may be) connected with the target population and target statistical concepts.

    Release date: 2014-10-31

  • Articles and reports: 11-522-X201300014287
    Description:

    The purpose of the EpiNano program is to monitor workers who may be exposed to intentionally produced nanomaterials in France. This program is based both on industrial hygiene data collected in businesses for the purpose of gauging exposure to nanomaterials at workstations and on data from self-administered questionnaires completed by participants. These data will subsequently be matched with health data from national medical-administrative databases (passive monitoring of health events). Follow-up questionnaires will be sent regularly to participants. This paper describes the arrangements for optimizing data collection and matching.

    Release date: 2014-10-31