Health Reports
International population-based health surveys linked to outcome data: A new resource for public health and epidemiology

by Stacey Fisher, Carol Bennett, Deirdre Hennessy, Tony Robertson, Alastair Leyland, Monica Taljaard, Claudia Sanmartin, Prabhat Jha, John Frank, Jack V. Tu, Laura C. Rosella, JianLi Wang, Christopher Tait, and Douglas G. Manuel

Release date: July 29, 2020


In Canada and elsewhere, national population-based health surveys are increasingly being linked to vital statistics and health care data, bringing together large amounts of high-quality, nationally representative information about health risk factors with individual-level health outcomes.Note 1Note 2Note 3Note 4Note 5 Health surveys in Canada and the United States alone have collected detailed sociodemographic and health behaviour information from over 1 million respondents since 1997, and have been linked to over 6 million person-years of mortality follow-up.Note 1Note 6 Because national health surveys often have similar surveillance objectives and designs, pooling data from linked population health surveys could create a new resource for health surveillance and research, with an unparalleled international and population perspective.

National population health surveys vs. traditional epidemiology studies

National population health surveys collect a broad range of information about health status, health behaviours and sociodemographic characteristics from a representative sample of a country’s community-dwelling population. These surveys are a cornerstone of population health surveillance (Table 1) and have a population perspective—they use a sampling approach that is designed to produce a population-representative sample (in terms of sociodemographic characteristics). This sample is used to estimate the prevalence of health conditions and risk factors within the population. It is also used to monitor population trends; inform policy development, implementation and evaluation; inform decisions about health resource allocation; and assess progress toward national health goals.

National population health surveys are typically conducted at regular intervals (often yearly) to provide up-to-date snapshots of the population’s health. In contrast, traditional epidemiology studies—studies that use conventions taught in most introductory epidemiology courses to investigate a specific exposure–outcome relationship—typically have an etiological focus, and often use convenience sampling.

Population health surveys do not usually collect longitudinal information about survey respondents, unlike most traditional epidemiology studies, which typically involve actively determining outcomes, often with repeat exposure assessment during follow-up. Typically, population health surveys determine only baseline exposures, through self-response, and most surveys do not ascertain temporal outcomes because of their cross-sectional nature. However, linking health surveys to outcome data, such as vital statistics and health care data, introduces a longitudinal perspective that greatly increases the surveys’ utility.

In addition to population health surveillance, national health surveys are used for population health research since they collect information that is not available in administrative health files (e.g., health behaviours). These data are used by researchers to study the relationships between social determinants and health outcomes, to evaluate disease and risk factor burden, and to study the role of risk factor modification in prevention. These data are also used to assess the performance of the health care system across sociodemographic and economic groups, and across groups with varying levels of illness. Data are also used to inform the development of health policy. National health surveys are key tools for understanding, monitoring and improving population health.

Individual-level pooling of national population health surveys

Meta-analyses have long been used to summarize collections of traditional epidemiology studies, offering increased statistical power and more precise effect estimates. Individual-level pooling of linked population health surveys may confer similar benefits to population health questions, and could produce a valuable new resource for modern population health planning, including the use of population-level multivariable risk algorithmsNote 7Note 8Note 9Note 10Note 11 and microsimulationNote 12Note 13 to project disease burden and evaluate risk reduction strategies.

Meta-analysis of traditional epidemiology studies

Meta-analysis is a statistical procedure used to summarize the results from multiple independent clinical trials or observational studies investigating a specific exposure–outcome association.

The key value of meta-analysis is that it involves aggregating data from all relevant studies, which produces a quantitative summary of a body of research with higher statistical power and more precise effect estimates than the individual studies alone. Meta-analysis can be used to reconcile inconsistent results from previous studies, and to investigate rare diseases and uncommon or weak risk factors that individual studies were unable to investigate.Note 14Note 15Note 16

Meta-analyses also offer the opportunity to produce new insights through the exploration of statistical heterogeneity. Statistical heterogeneity is present in a meta-analysis when the effect estimate of interest differs across the studies by more than can be accounted for by sampling variation. This can be caused by differences in study design, statistical methodology or study quality (methodological heterogeneity), or by differences in exposure or outcome definitions or population characteristics (clinical heterogeneity).Note 17 In all meta-analyses, it is important to identify the presence or absence of heterogeneity because aggregating studies with inconsistent results can lead to inaccurate or misleading conclusions.Note 17Note 18Note 19Note 20Note 21 However, heterogeneity can also be “our greatest ally”Note 20 since investigating its causes can lead to significant scientific and clinical results.Note 17Note 21
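As a rough illustration of how heterogeneity is commonly quantified, the sketch below pools study-level estimates with inverse-variance (fixed-effect) weights and computes Cochran's Q and the I² statistic. The three log hazard ratios and standard errors are hypothetical values chosen for illustration and do not come from any study cited here. The function name is likewise an assumption.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2 (%)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (est - pooled) ** 2 for w, est in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled_se, q, i2

# Hypothetical log hazard ratios and standard errors for three studies
log_hr = [0.50, 0.60, 0.40]
se = [0.10, 0.10, 0.10]
pooled, pooled_se, q, i2 = pool_fixed_effect(log_hr, se)
print(f"pooled log HR = {pooled:.2f}, Q = {q:.2f}, I^2 = {i2:.0f}%")
```

When Q is large relative to its degrees of freedom, a single fixed-effect summary is usually not appropriate, and the sources of the differences are worth investigating.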

Individual patient data (IPD) meta-analyses involve pooling and reanalyzing raw data from eligible studies.Note 22 These meta-analyses are considered the gold standard of systematic reviews.Note 23 Pooling and reanalysis allow for the standardization of participant inclusion and exclusion criteria, variable definitions, confounder adjustment, and modelling. This leads to more accurate summary effect estimates,Note 24 and makes it easier to investigate the influence of participant-level characteristics on the effect estimate, and to identify subgroups where risk factor associations may vary. Despite the substantial advantages over meta-analyses without individual-level data, IPD meta-analyses are not frequently performed since they require substantial cooperation and organization, data sharing, and advanced statistical expertise.Note 24Note 25

Application to national population health surveys

The aggregation of linked international health surveys creates a valuable resource for modern population health care planning and research. Pooling and analyzing individual-level data from national population health surveys using methods similar to IPD meta-analysis could produce more accurate effect estimates with less statistical uncertainty. Additionally, investigating survey-level heterogeneity and subgroups could produce new insights. Survey aggregation could produce improved comparisons of disease risk, burden and trends internationally; facilitate equity analyses; and support health policy and priority setting.

This paper’s objectives are to examine the feasibility of pooling linked population health surveys from three countries, facilitate the examination of health behaviours, and present useful information to assist in the planning of international population health surveillance and research studies. Detailed comparisons of the design, methodologies and content of national health surveys from Canada, the United States and Scotland were performed. Common variables were constructed, and estimates of sample size and outcome counts are provided.


Survey designs for the Canadian Community Health Survey (CCHS) (cycles 2.1 [2003], 3.1 [2005] and 4.1 [2007], and CCHS 2008),Note 26 the United States National Health Interview Survey (NHIS) (2000 and 2005)Note 27Note 28 and the Scottish Health Survey (SHeS) (2003, 2008 to 2010)Note 29Note 30Note 31Note 32 were examined for comparability. This involved evaluating survey content, target populations and exclusions, sampling and administration methods, sample size and response rates, and linkage. Survey year, inclusion of health behaviour topics of interest (e.g., the NHIS collects detailed diet information every five years) and availability of mortality linkage were considered to select the relevant survey cycles.

Questions on smoking, alcohol, physical activity and diet were identified, and question construction, response categorization and structure were compared. Health behaviours were the focus since they are important health risk factors that are collected in virtually all health surveys, and because they are conceptually complex and are observed using different approaches. Health behaviour concepts were assessed for comparability, and existing variables were used to create new common variables. Common variables were constructed to achieve the highest level of detail possible in all surveys, and were assessed and discussed by three reviewers.

Public use files were used to obtain sex-specific sample size estimates of survey respondents aged 20 and older from the three countries. CCHS estimates were obtained through collaboration with Statistics Canada. Public use NHIS data were downloaded from the Centers for Disease Control and Prevention (CDC) website. Public use SHeS data were obtained from the United Kingdom Data Service website.

CCHS mortality estimates were obtained through collaboration with Statistics Canada. NHIS mortality estimates were obtained from public use NHIS files linked to the National Death Index, which is also available for download from the CDC’s website. Mortality estimates for the SHeS were obtained through collaboration with Scotland’s Information Services Division. Mortality follow-up data for the CCHS and the NHIS went to December 31, 2011, while SHeS follow-up data went to December 31, 2014. Research ethics approval was obtained from the Ottawa Health Science Network Research Ethics Board.


Survey comparability

The CCHS, NHIS and SHeS are government-funded, cross-sectional household surveys designed to support national health surveillance efforts in Canada, the United States and Scotland, respectively.Note 2Note 6Note 26 The CCHS was administered biennially from 2001 to 2007, and has been administered annually since 2008. The NHIS has been administered annually since 1957. The SHeS was administered in 1995, 1998 and 2003, and annually since 2008. Results of the comparability analysis are summarized in Table 2.


Core questionnaires collect information about sociodemographic characteristics, health status, health care services and health determinants. Information about additional health topics of interest is collected in rapid response modules (CCHS), survey supplements (NHIS) and a rotating biennial module (SHeS). The SHeS also collects anthropometric measurements and blood, saliva and urine samples from a subsample of survey respondents.

Target population and exclusions

The CCHS, NHIS and SHeS have comparable target populations that include the non-institutionalized national population and exclude active members of the military, those in prison and long-term care facilities, and those living in some remote areas (CCHS and SHeS) or outside the country (NHIS). The CCHS also excludes those living on reserves. The CCHS collects information only for those aged 12 and older, while both the NHIS and the SHeS collect information on all individuals, regardless of age.

Sampling methods

Although the countries’ populations vary (Canada has 37.6 million residents, the United States has 329 million residents, and Scotland has 5.3 million residents), similar multistage area sampling methods, designed to produce annual national-level data, are used in the CCHS, NHIS and SHeS. The CCHS also produces annual estimates at the levels of the provinces, territories and 110 health regions. The SHeS produces health-board-level data every four years. The sample size of the NHIS is not large enough to provide state-level data with acceptable precision, but data can be evaluated over multiple survey years to obtain estimates.

For the CCHS, sampling was done by allocating the annual sample size among the provinces and territories according to their population size and number of health regions, and then further allocating the sample among the health regions. The NHIS sampling frame for the 2000 and 2005 surveys used 358 primary sampling units—within which two further sampling units were used—and involved oversampling of both Blacks and Hispanics. For the SHeS, each year’s sample was clustered, and the four-year sample was unclustered. In 2008, 25 strata of area deprivation were used in the SHeS to produce estimates at the health-board level, allowing for the oversampling of deprived areas. All the surveys used sample weights to account for selection probabilities and non-response bias.
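The role of sample weights can be sketched with a weighted prevalence estimate. The snippet below is a minimal illustration, assuming each respondent carries a single final weight that already reflects selection probability and non-response adjustment. The respondents, weights and function name are hypothetical, and variance estimation for the complex design is not shown.

```python
def weighted_prevalence(indicators, weights):
    """Weighted proportion of respondents with indicator == 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * x for w, x in zip(weights, indicators)) / total

# Hypothetical respondents: 1 = current smoker, 0 = not a current smoker
smoker = [1, 0, 0, 1]
# Hypothetical survey weights (number of people each respondent represents)
weight = [100.0, 300.0, 400.0, 200.0]

unweighted = sum(smoker) / len(smoker)      # ignores the sampling design
weighted = weighted_prevalence(smoker, weight)
print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
```

In this toy example the smokers carry smaller weights than the non-smokers, so the weighted prevalence is lower than the unweighted proportion.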

Administration methods

All the surveys used computer-assisted personal interviews administered by trained interviewers. Approximately half of the CCHS interviews were administered using computer-assisted telephone interviews.

Sample size and response rates

The CCHS was administered to 130,000 respondents every two years when it began in 2001. Since 2007, it has been administered to 65,000 respondents annually. The total adult response rate was 81% in 2003 and 76% in 2008. The NHIS has been administered to approximately 30,000 adult respondents annually since 1997. Response rates from a non-conditional sample of adults were 72% in 2000 and 69% in 2005.Note 27Note 28 The SHeS has a much smaller sample size than both the CCHS and the NHIS. Until 2011, the SHeS surveyed approximately 7,000 adults per cycle. Since 2011, it has surveyed approximately 4,500 adult respondents annually. From 2003 to 2010, response rates for eligible adults were between 55% and 60%.Note 29Note 32

Available linkages

The CCHS has been linked to vital statistics data up to December 31, 2011, and to hospital discharge abstracts, with plans for further data linkages.Note 1 Access to these data is restricted to Statistics Canada and the Statistics Canada research data centres. The NHIS has been linked to the National Death Index, with follow-up to December 31, 2011. Information on accessing public use data files, feasibility data files and restricted-access data is available from the CDC. The SHeS has been linked to mortality and health administrative databases, including hospitalizations,Note 2 with mortality follow-up to December 31, 2014. Access to these data can be requested from the Public Benefit and Privacy Panel for Health and Social Care.

Question construction, response categorization and structure

Common variables were created to measure smoking, alcohol consumption, physical activity and diet in the three surveys (Table 3). The common variables for smoking and physical activity are comparable between the CCHS, the NHIS and the SHeS. The common variables for alcohol and fruit and vegetable consumption are comparable between the CCHS and the NHIS. However, the SHeS collects and reports alcohol consumption and diet information using more detailed and standardized measures. Therefore, comparisons of alcohol consumption and diet data between the SHeS and the other two surveys should be performed with caution. In the SHeS, alcohol consumption is reported using units of alcohol. This is not directly comparable with the CCHS and the NHIS, which use the more subjective “number of drinks.” Similarly, the SHeS collects detailed fruit and vegetable consumption information, including amount consumed, while the CCHS and the NHIS collect only frequency information. In-depth descriptions of how survey similarities and differences influenced common variable creation, and how these differences may affect their interpretation, are available online.
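The general idea of a common variable can be sketched as a recode table that maps each survey's own response categories onto one shared set. The survey names, response labels and mappings below are hypothetical placeholders, not the actual CCHS, NHIS or SHeS codebook values, and this is not the procedure the authors used.

```python
# Shared categories for a hypothetical common "smoking status" variable
COMMON_CATEGORIES = ("current", "former", "never")

# Hypothetical source labels; real survey codebooks differ
RECODE = {
    "survey_a": {
        "daily": "current",
        "occasional": "current",
        "former daily": "former",
        "never smoked": "never",
    },
    "survey_b": {
        "every day": "current",
        "some days": "current",
        "former": "former",
        "never": "never",
    },
}

def harmonize(survey, response):
    """Return the common category, or None if the response cannot be mapped."""
    return RECODE[survey].get(response)

print(harmonize("survey_a", "occasional"))   # current
print(harmonize("survey_b", "not stated"))   # None
```

Collapsing to the coarsest shared categories is what makes the variable usable in every survey, and it is also why common variables tend to be less detailed than the originals.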

Mortality linkage sample size estimates

Approximately 87%, 94% and 83% of CCHS, NHIS and SHeS respondents, respectively, who agreed to data sharing and linkage were successfully linked to national mortality data (Table 4). Among those successfully linked, 19,227 deaths occurred among CCHS respondents during 1.8 million person-years of mortality follow-up. Among NHIS respondents, 6,341 deaths occurred during almost half a million person-years of follow-up. Among SHeS respondents, 2,856 deaths occurred during 160,000 person-years of follow-up.
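For orientation, the death counts and follow-up time quoted above can be turned into crude mortality rates. The sketch below uses the rounded person-years from the text, with 500,000 assumed as a stand-in for "almost half a million" NHIS person-years, so the results are approximate. The rates are unadjusted and should not be compared directly across surveys, since follow-up length, survey years and age structure differ.

```python
def crude_rate_per_1000(deaths, person_years):
    """Crude mortality rate per 1,000 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * deaths / person_years

# Deaths are from the text; person-years are the rounded figures quoted there.
# The NHIS person-years value is an assumption ("almost half a million").
follow_up = {
    "CCHS": (19_227, 1_800_000),
    "NHIS": (6_341, 500_000),
    "SHeS": (2_856, 160_000),
}

rates = {name: crude_rate_per_1000(d, py) for name, (d, py) in follow_up.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} deaths per 1,000 person-years")
```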


National population health surveys are the largest population-based cohorts with information on health status, health behaviours, sociodemographic characteristics, health care use and health-related quality-of-life measures. Given the surveys’ broad objectives, these data are well suited for many purposes, especially when linked to health outcome data. Pooling linked health survey data could produce a new population health research resource.

There are two main benefits of pooling linked population health surveys at the individual level across jurisdictions. First, combined data have a larger sample with greater statistical power, which produces more precise effect estimates. This enables additional subgroup analyses and more detailed examinations of mediation and interaction effects. Increased sample size also allows for improved examination of uncommon or weak risk factors, and uncommon outcomes, such as cancer and many chronic diseases. Second, these data could improve the generalizability of study findings. Relationships between survey exposures and linked outcomes that are consistent across jurisdictions are potentially more robust, compared with inconsistent relationships. The investigation of inconsistent relationships can also lead to new insights, similar to the investigation of heterogeneity in meta-analyses. For example, the effect of health behaviours on mortality risk may be associated with country-level differences in socioeconomic inequality or access to health care services.Note 33

Larger sample size and improved generalizability lead to many research and surveillance opportunities. For example, most international analyses rely on aggregated results from different sources, so many studies have difficulties considering sociodemographic variables and addressing mediation, interaction and exposure–outcome lag time.Note 8Note 34 Because health surveys typically include sociodemographic questions regarding education, work history, income, ethnicity and immigrant status, these data are well suited for investigating health risks from an equity perspective.

Combined health survey data also enhance the ability to monitor the relationship between survey exposures and outcomes. For example, there are concerns that the relationship between smoking and health outcomes has changed over time, given changes in smoking patterns and the composition of smoking products. An international, longitudinal investigation of this relationship is possible with pooled, linked international population health surveys.

Furthermore, pooled, linked international population health surveys could be used to produce improved international comparisons of disease burden estimates. Disease burden reporting requires information about risk factor prevalence, outcome counts and relative risk estimates associated with the exposure of interest. Most current disease burden estimates, including those from the Global Burden of Disease study,Note 35 use the aggregated data approach first described by Levin (1978).Note 36 With this method, aggregate measures of risk factor prevalence and outcome counts are obtained from independent sources. Risk factor prevalence is obtained from population health surveys, outcome counts are obtained from vital statistics data sources, and relative risk estimates are obtained from independent epidemiology studies to describe the association between the risk factor and outcome. However, national health surveys that have been linked to outcome data can be used as single data sources for these studies,Note 8 and pooling these data from multiple countries would allow for a standardized analysis methodology.
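The aggregated data approach can be illustrated with the population attributable fraction in the form usually attributed to Levin, PAF = p(RR − 1) / (1 + p(RR − 1)), where p is the risk factor prevalence and RR is the relative risk. The sketch below is a minimal example with hypothetical inputs. It assumes a single dichotomous exposure and ignores confounding and uncertainty.

```python
def levin_paf(prevalence, relative_risk):
    """Population attributable fraction for one dichotomous exposure."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative_risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# Hypothetical inputs, each of which would come from a different source
# under the aggregated data approach
prevalence = 0.20      # e.g., from a population health survey
relative_risk = 2.0    # e.g., from an independent epidemiology study
total_deaths = 1_000   # e.g., from vital statistics

paf = levin_paf(prevalence, relative_risk)
attributable_deaths = paf * total_deaths
print(f"PAF = {paf:.3f}, attributable deaths = {attributable_deaths:.0f}")
```

With a linked survey, all three inputs can in principle be estimated from the same respondents, which is the advantage described above.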

Limitations and challenges

One of the greatest challenges of pooling linked population health surveys from different countries is the heterogeneity caused by survey question dissimilarities. It was difficult and labour-intensive to create common health behaviour variables using the surveys from Canada, the United States and Scotland for this study, and the variables created were less detailed and were not entirely comparable across all surveys.

Over time, the ascertainment of behavioural risk factors has become more consistent across countries, and an increasing number of validation studies exist that indicate acceptable ascertainment bias.Note 40 However, there is a need for more consistency. For example, despite international recommendations for smoking ascertainment that are used in over 100 countries,Note 41 the lack of smoking history information in the NHIS prevented the calculation of pack years—a more detailed measure of smoking behaviour than the categorical measure of “smoking status” that was created in this study. Changes to the CCHS also prevented differentiation between former drinkers and non-drinkers in cycle 4.1. In this study, even if a concept was present, the time frame over which the exposure was ascertained often varied. Furthermore, some questions were collected optionally by geographic region, and there were differences in variable definitions and classification.

The comparison of exposures in multiple health surveys is challenging, and health surveys are constantly changing. To help with this, “cchsflow,” an open-source library to support the harmonization of CCHS variables across survey cycles, was developed.Note 37 This approach to variable harmonization can be extended to other international population health surveys, and can be used to harmonize variables both across cycles within a single survey and between surveys from different countries. Survey metadata also support harmonization by improving survey cataloguing. Survey metadata are available in Data Documentation Initiative format, an international metadata standard developed for this purpose.Note 38Note 39

Another challenge is decreasing response rates. If participants systematically differ from those who do not participate, the survey sample will be non-representative of the target population, and valid inference will be impeded. Non-respondents are repeatedly found to have unfavourable health behaviours and excess mortality compared with respondents.Note 42Note 43Note 44Note 45 Data linkage can be used to assess and, potentially, adjust for non-response bias. The extent of non-response bias in the SHeS was evaluated by Gorman et al. (2014)Note 46 by comparing rates of all-cause mortality and alcohol-related harm among survey respondents and the general population. Incidence rates were found to be lower among survey respondents, with survey-to-population rate ratios of 0.69 for alcohol-related harm and 0.89 for all-cause mortality. They concluded that heavy drinkers were less likely to respond to the SHeS than moderate or light drinkers. This type of comparison of respondents and non-respondents can inform weighting and imputation procedures to adjust for non-response bias.
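A survey-to-population rate ratio of this kind can be sketched as the ratio of two event rates. The counts below are hypothetical and were chosen only so that the result matches the 0.89 quoted for all-cause mortality. They are not the figures from Gorman et al., whose analysis may have used standardized rather than crude rates.

```python
def rate_ratio(events_a, person_years_a, events_b, person_years_b):
    """Ratio of the event rate in group A to the event rate in group B."""
    rate_a = events_a / person_years_a
    rate_b = events_b / person_years_b
    return rate_a / rate_b

# Hypothetical counts: linked survey respondents vs. the general population
survey_deaths, survey_py = 89, 10_000
population_deaths, population_py = 100, 10_000

ratio = rate_ratio(survey_deaths, survey_py, population_deaths, population_py)
print(f"survey-to-population rate ratio = {ratio:.2f}")
```

A ratio below 1 suggests that respondents are healthier on that outcome than the population they are meant to represent.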

Approaches to combining cycles of population health surveys from a single country have been developed,Note 47 but approaches to pooling surveys from different countries are more complex because of differences in each country’s survey design. Modified meta-analytic methods and techniques used by internationally pooled epidemiology cohort studies, such as the European Prospective Investigation into Cancer and Nutrition,Note 48 may be used. However, new methodologies will need to be established. Additionally, differences in the underlying survey populations may also prevent the estimation of a pooled effect estimate for an exposure–outcome effect of interest. However, this will not be the case for all effects, and investigations into the sources of this heterogeneity could also produce important new insights.

Lastly, the largest practical limitation to pooling international linked population health surveys is data access. The accessibility of outcome-linked health survey data varies across countries. For example, mortality-linked NHIS data are publicly available on the CDC website. In contrast, access to the equivalent information in Canada, Scotland and many other countries is restricted. That said, unlinked health survey data are often publicly available, including data from the CCHS, which has a Statistics Canada Open Licence. The NHIS has demonstrated that it is possible to add linked outcomes to existing public use survey files without increasing re-identification risk and while adhering to existing data sharing principles.

The Strategy for Patient-Oriented Research was developed to address data access and harmonization efforts across Canada. A similar model could be used to facilitate analogous tasks for multi-country studies, including the pooling of linked population health surveys. Within networks such as the International Population Data Linkage Network, there has also been more discussion and interest in conducting studies using data from multiple countries. Improvements to cross-jurisdictional data sharing and privacy issues are necessary for the benefits of pooled health survey analyses to be fully realized. This is beyond the scope of this paper.


The use of pooled national population health surveys linked to health outcomes has enormous potential for international health risk evaluation and comparison, equity analysis, disease burden estimation, and ongoing surveillance. Innovative methodologies will be required to mitigate challenges introduced by survey dissimilarities, and these methodologies can be improved with the introduction of international standards for collecting core health-related measures. Jurisdictional data restrictions and privacy issues require discussion and resolution.

List of abbreviations

CAPI: computer-assisted personal interview
CCHS: Canadian Community Health Survey
CDC: Centers for Disease Control and Prevention
IPD: Individual patient data
NHIS: National Health Interview Survey
SHeS: Scottish Health Survey


Availability of data and material

Unlinked CCHS public use files are available from Statistics Canada and through Ontario Data Documentation, Extraction Service and Infrastructure. Linked CCHS data are available for use by approved researchers at Statistics Canada and at research data centres. Both linked and unlinked NHIS public use data files are available for download from the CDC website. Public use SHeS data can be obtained from the United Kingdom Data Service website. Access to linked SHeS data can be requested from the Public Benefit and Privacy Panel for Health and Social Care.

Competing interests

The authors declare that they have no competing interests.


This work was supported by the Canadian Institutes of Health Research, operating grant MOP-142177. The study sponsor had no role in study design; collection, analysis or interpretation of data; manuscript writing; or the decision to submit for publication. Alastair Leyland is funded by the Medical Research Council (MC_UU_12017/13) and the Scottish Government Chief Scientist Office (SPHSU13).

