Linking the Canadian Community Health Survey and the Canadian Mortality Database: An enhanced data source for the study of mortality
by Claudia Sanmartin, Yves Decady, Richard Trudeau, Abel Dasylva, Michael Tjepkema, Philippe Finès, Rick Burnett, Nancy Ross and Douglas G. Manuel
In most industrialized countries, vital statistics registries and national health surveys are cornerstones of health surveillance.
Mortality data compiled by vital statistics registries for administrative purposes can be tabulated by basic demographic characteristics (age and sex), province and cause of death. However, little is known about the socioeconomic, cultural or linguistic characteristics of those who die, or about the contributions of lifestyle and social factors to mortality risk.
Health surveys, by contrast, collect information about the health status and health behaviours of individuals, as well as their socioeconomic and cultural characteristics. However, these surveys are typically cross-sectional; no information is provided about respondent health beyond the date of the survey.
When complementary health survey and administrative data are combined through record linkage,Note 1Note 2Note 3 relationships between social determinants and health outcomes such as death can be analyzed in more depth.Note 4Note 5Note 6Note 7Note 8 For example, in Canada, mortality data have been linked to census results to examine differences in the risk of death across socioeconomic groups.Note 9Note 10 These linked data have also been used to calculate mortality rates for subpopulations such as immigrants and Aboriginal groups.Note 11Note 12Note 13 Similarly, researchers in other countries have used linked data to investigate mortality among immigrants, Indigenous peoples and prisoners.Note 14Note 15Note 16Note 17 A landmark Ontario study used linked health survey and mortality data to explore associations with smoking, poor diet, physical inactivity and high stress. Results demonstrated that 60% of Ontario deaths were attributable to these risk factors.Note 18
The linkage described in the present study combines information from the Canadian Community Health Survey with mortality data from the Canadian Mortality Database. This article explains the record linkage process and presents results about associations between health behaviours and mortality among a representative sample of Canadians.
Data and methods
Canadian Community Health Survey
The cross-sectional Canadian Community Health Survey (CCHS) collects information about the health, health behaviours and health care use of the non-institutionalized household population aged 12 or older. The survey excludes full-time members of the Canadian Forces and residents of reserves and some remote areas, together representing about 4% of the target population. The CCHS was first conducted in 2000/2001 (cycle 1), and again in 2003 (cycle 2) and 2005 (cycle 3), each time with a sample size of approximately 130,000. Starting in 2007, the survey was conducted annually (sample size of 65,000). Response rates ranged from 69.8% to 78.9%. Details about the sampling strategy and content are available elsewhere (www.statcan.gc.ca).Note 19
The CCHS records eligible for linkage to the Canadian Mortality Database were respondents who consented to share their survey information with provincial and federal ministries of health and to link their responses to administrative data. Overall, 85.3% (n = 614,774) of CCHS respondents were eligible for linkage (Table 1). Special sampling weights were created for the linked CCHS data to adjust for those who did not agree to share and link.
Canadian Mortality Database
The Canadian Mortality Database (CMDB) is a census of all deaths registered in Canada. Deaths are reported by the provincial and territorial Vital Statistics Registries to Statistics Canada; the information provided includes cause and date of death, names, date of birth, and postal code at the time of death. Cause-of-death information is coded using the version of the International Classification of Diseases (ICD) in effect at the time of death. CMDB records eligible for linkage were deaths of people aged 12 or older (n = 2.774 million) that occurred from January 1, 2000 through December 31, 2011.
Historical Tax Summary File
The Historical Tax Summary File (HTSF) is a compilation of annual tax return files representing unique individuals for whom a tax declaration was produced in a given year. The HTSF contains only personal identifier information (names, date of birth, sex and postal code) from the T1 Income Tax Personal Master File; the file does not include income data. Statistics Canada uses the HTSF to assist in record linkage, specifically, through the provision of additional linkage information (names, postal code), and in manual resolution of doubtful links.
HTSF data for 1996 through 2011, representing about 33.5 million records, were used in the linkage. Information derived from the HTSF and used during linkage is removed from the final analytical file.
Linkage involved three steps: 1) data preparation; 2) record linkage; and 3) quality assessment. The linkage was approved by Statistics Canada’s Executive Management Board.Note 20 Use of the linked data is governed by the Directive on Record Linkage.Note 21
The following variables, which were chosen based on commonality among datasets, data quality and discriminatory power, were used for the linkage: given name, last name, date of birth, sex and postal code. Invalid responses for any of the variables were set to missing.
The rate of missing values among the linkage variables was typically less than 3%. The exception was last names captured in the CCHS, where an average of 10% of respondents had missing information across survey cycles. The percentage of missing last names ranged from 4.1% in 2000/2001 to 17.1% in 2011. With linkage to the HTSF, the percentage of CCHS respondents with no last name was reduced to 2%. To reduce missed links due to misspelling, all names (given and last) were converted to their phonetic forms using the New York State Identification and Intelligence System (NYSIIS).
The discriminatory power of the linkage variables was assessed with Shannon entropy scores.Note 22 Higher levels of entropy reflect higher discriminatory power. Variables with larger numbers of distinct values (for example, last name) have higher discriminatory power and entropy levels, compared with variables with fewer distinct values (for example, sex).Note 23 Entropy levels among the linkage variables ranged from 0.43 for given names to 0.85 for postal codes (data not shown in tables). This information was employed in the record linkage strategy to assign initial linkage weights.
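The entropy idea can be illustrated with a minimal sketch. It computes Shannon entropy in bits for a single variable, which is an assumption: the article does not state how its scores (which fall between 0 and 1) were scaled. The function name and the toy values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the observed values; missing values (None) are ignored."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    total = len(observed)
    # H = -sum(p * log2(p)) over the distinct values of the variable
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data: two distinct values versus eight distinct values
sex = ["F", "M", "F", "M", "F", "M", "F", "M"]
last_name = ["Roy", "Tremblay", "Smith", "Gagnon", "Lee", "Martin", "Brown", "Wilson"]

entropy_sex = shannon_entropy(sex)
entropy_last_name = shannon_entropy(last_name)
```

In this toy example the variable with more distinct, evenly spread values (last name) has the higher entropy, which is the property used to judge discriminatory power.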
A single linkage cohort was created by concatenating the linkage variables of all CCHS cycles. This approach was more efficient and reduced the probability of false links, compared with linking each cycle separately. Additional details about the record linkage are available in an internal report.Note 24
Record linkage of the CCHS and CMDB was accomplished in two steps. First, eligible CCHS respondents were linked to the HTSF to attach alternative postal codes and names. This information was used in the second step, where the CCHS was linked to the CMDB. All eligible CCHS respondents were linked to the CMDB, regardless of whether the record was linked to the HTSF.
The linkages were conducted in G-Link using probabilistic linkage methodology based on the Fellegi-Sunter theory of record linkage.Note 25 G-Link is a SAS-based generalized record linkage software developed at Statistics Canada to facilitate large-scale probabilistic record linkage.Note 26
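For readers unfamiliar with the Fellegi-Sunter approach, the sketch below shows the standard way agreement and disagreement weights are derived from m-probabilities (agreement among true matches) and u-probabilities (agreement among non-matches). It is not G-Link code: the probabilities, field names and the treatment of missing values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical (m, u) probabilities per linkage variable; not taken from the study.
PARAMS = {
    "given_name": (0.95, 0.01),
    "last_name": (0.95, 0.005),
    "date_of_birth": (0.98, 0.001),
    "sex": (0.99, 0.5),
    "postal_code": (0.80, 0.0001),
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio for one field: positive on agreement, negative on disagreement."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_weight(rec_a, rec_b, params=PARAMS):
    """Sum the field weights for a record pair; fields missing on either side are skipped."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # assumption: a missing value contributes no evidence either way
        total += field_weight(m, u, a == b)
    return total


survey = {"given_name": "MARY", "last_name": "SMITH", "date_of_birth": "1940-02-01",
          "sex": "F", "postal_code": "K1A0T6"}
death = dict(survey)  # identical on every field
other = {"given_name": "JOHN", "last_name": "ROY", "date_of_birth": "1975-09-12",
         "sex": "M", "postal_code": "H2X1Y4"}

weight_match = pair_weight(survey, death)
weight_nonmatch = pair_weight(survey, other)
```

Pairs whose total weight exceeds a chosen threshold are treated as links. Variables with a small u-probability, such as postal code, contribute the largest agreement weights.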
The CCHS was linked to the HTSF using the following variables: given and last names, date of birth, postal code and sex. The CCHS was linked to the CMDB using the same variables, which were available in both datasets. For records that linked to the HTSF, the following variables were also used: alternative postal codes, alternative names (for example, maiden name, father’s name), death date and interview date. The availability of alternative postal codes in the HTSF has been found to improve linkage rates, particularly over time.Note 27
Linkage rates were calculated for both the CCHS-to-HTSF linkage and the CCHS-to-CMDB linkage. Rates represent the number of linked CCHS respondents divided by the number of CCHS respondents eligible for linkage (unweighted and weighted).
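The rate definition above can be expressed as a short sketch. The function name, the record layout and the sample weights are hypothetical; the weighted rate is assumed here to be the sum of weights of linked respondents divided by the sum of weights of all eligible respondents.

```python
def linkage_rates(records):
    """Return (unweighted, weighted) linkage rates.

    records: iterable of (linked, weight) pairs, one per eligible respondent.
    """
    records = list(records)
    n_eligible = len(records)
    n_linked = sum(1 for linked, _ in records if linked)
    total_weight = sum(weight for _, weight in records)
    linked_weight = sum(weight for linked, weight in records if linked)
    return n_linked / n_eligible, linked_weight / total_weight


# Four eligible respondents, one of whom linked (toy sampling weights)
sample = [(True, 100.0), (False, 300.0), (False, 200.0), (False, 400.0)]
unweighted_rate, weighted_rate = linkage_rates(sample)
```

The two rates differ whenever linked respondents carry sampling weights that are not typical of the eligible sample as a whole.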
Internal and external validation was conducted to evaluate the accuracy of the linkage process and the fitness of the data for use in analysis.
False positive and false negative rates were calculated. Manual review (blind) was conducted on a random sample of 1,000 pairs uniformly selected across eight strata of linkage weights representing the full range of linkage weights above the threshold (that is, links), but only a limited range of weights just below the threshold (non-links). This approach is recommended given the large number of non-links generated in the initial creation of pairs, which, if fully represented, would result in an artificially low false negative rate.Note 28
Three reviewers independently conducted a manual review of the same information used in the linkage (names, date of birth, postal code and sex). Additional information available for the manual review included date of death from the HTSF for CCHS respondents who linked to the tax data and had died. Each pair was then assigned a link or non-link status.
Results of the manual review were compared with G-Link results to calculate stratum-specific estimates of false positive and false negative links.Note 26Note 29 Global false positive and false negative rates were estimated using a weighting function based on the distribution of all possible pairs. Although a uniform sample allocation was used to select pairs from each weight stratum, the sampling fractions differed because some strata contained more pairs than did others. About 75% of pairs belonged to the first stratum, where linkage weights were lowest; fewer than 1% of pairs belonged to the last stratum (stratum 8), where linkage weights were highest.
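Because the sampling fractions differ by stratum, a pooled error rate from the reviewed sample would be misleading. The sketch below shows one plausible weighting, assumed for illustration: each stratum-specific error rate from the manual review is scaled by the number of pairs in that stratum. The exact weighting function used in the study is not given here, and all counts are hypothetical.

```python
import math


def global_error_rate(strata):
    """Combine stratum-specific error rates, weighting by the number of pairs in each stratum.

    strata: list of dicts with keys
        n_pairs   - all pairs in the stratum
        n_sampled - pairs manually reviewed
        n_errors  - reviewed pairs where the reviewers disagreed with the automated status
    """
    total_pairs = sum(s["n_pairs"] for s in strata)
    estimated_errors = sum(s["n_pairs"] * s["n_errors"] / s["n_sampled"] for s in strata)
    return estimated_errors / total_pairs


# Two hypothetical link strata with equal sample sizes but unequal populations
link_strata = [
    {"n_pairs": 1000, "n_sampled": 125, "n_errors": 5},
    {"n_pairs": 9000, "n_sampled": 125, "n_errors": 0},
]
weighted_estimate = global_error_rate(link_strata)
naive_estimate = sum(s["n_errors"] for s in link_strata) / sum(s["n_sampled"] for s in link_strata)
```

In this toy case the naive pooled estimate overstates the error rate, because the small stratum where the errors occur is over-represented in a uniform sample.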
Linkage rates to the CMDB were calculated by basic respondent characteristics (for instance, age and sex). It was anticipated that patterns of linkage rates would reflect the differential risk of death among groups at higher risk, such as males and older adults.Note 30Note 31
The linked data were also externally validated by comparing the distribution of deaths identified in the linked cohort with the distribution of deaths derived from the CMDB alone for people aged 12 or older. Distributions by geography, age, sex and major cause of death were examined. Age-standardized mortality rates (ASMRs) were calculated by calendar year for each CCHS cycle and compared with official mortality statistics.Note 32
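As an illustration of how an ASMR is obtained, the sketch below applies direct standardization: age-specific death rates are averaged using the age distribution of a standard population. Direct standardization, the age groups, the counts and the standard population are all assumptions made for the example; the article does not specify them in this passage.

```python
import math


def age_standardized_rate(deaths, population, standard_population, per=100_000):
    """Directly standardized mortality rate per `per` persons.

    deaths, population, standard_population: dicts keyed by age group.
    """
    total_standard = sum(standard_population.values())
    rate = 0.0
    for age_group, standard_count in standard_population.items():
        age_specific_rate = deaths[age_group] / population[age_group]
        # Weight each age-specific rate by the standard population's share of that age group
        rate += age_specific_rate * (standard_count / total_standard)
    return rate * per


deaths = {"12-44": 10, "45-74": 50, "75+": 120}
population = {"12-44": 50_000, "45-74": 25_000, "75+": 4_000}
standard = {"12-44": 600, "45-74": 300, "75+": 100}

asmr = age_standardized_rate(deaths, population, standard)
```

Using the same standard population for the linked cohort and for the official statistics removes differences that are due only to age structure, so the remaining gap can be attributed to other factors.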
Cox proportional hazards survival models (age-adjusted) were used to assess the impact of several behavioural risk factors on mortality. The analysis pertained to respondents aged 20 or older from the 2003 and 2005 CCHS linked data, with up to eight years of mortality follow-up. Based on the external validation, the first year of follow-up was excluded from the analysis.
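For reference, the general form of an age-adjusted Cox model can be written as follows. This is the standard textbook formulation, not the authors' exact specification; the symbols $x$ (an indicator for a risk-factor category) and $\beta_{x}$ are introduced here for illustration.

```latex
h(t \mid \text{age}, x) = h_0(t)\,\exp\!\left(\beta_{\text{age}}\,\text{age} + \beta_{x}\,x\right),
\qquad
\mathrm{HR}_{x} = \exp(\beta_{x})
```

Here $h_0(t)$ is the unspecified baseline hazard, and $\mathrm{HR}_{x}$ is the hazard ratio for the risk-factor category relative to its reference group, holding age constant.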
Results of the linkage are presented by province and territory, sex and age at the time of the CCHS interview. Results of the survival analysis are reported for five risk factors as defined in the CCHS: smoking status, body mass index (BMI), alcohol consumption, physical activity, and fruit and vegetable consumption. Self-reported BMI values were adjusted with the recommended correction factor.Note 33 Analyses were conducted using the special sampling weights to adjust for the exclusion of CCHS respondents who did not agree to share and link their data.
Respecting respondent privacy
Statistics Canada ensures respondent privacy during the linkage process and subsequent use of the linked files. Only employees directly involved in the process had access to the unique identifying information required for linkage (such as names); these individuals did not access health-related information. Once the data linkage was completed, an analytical file was created from which identifying information was removed. This de-identified file could then be accessed by analysts for validation and analysis.
A large majority (84.2%) of eligible CCHS respondents linked to the HTSF, with rates ranging across survey years from 79.3% to 86.6% (Table 2). Linkage rates were low (78%) for British Columbia and Nunavut compared with other regions (data not shown in tables). As well, linkage rates were low among respondents younger than 25 (74.3%) or older than 74 (79.4%), and high among 25- to 34-year-olds (89%) (data not shown in tables). These patterns by region and age persisted over the survey years (data not shown).
Overall, 5.3% of eligible CCHS respondents linked to the CMDB (Table 2). The linkage rate to the CMDB was slightly higher among CCHS respondents who linked to the HTSF (6.0%). As expected, given the shorter follow-up period, linkage rates were lower for more recent survey years. Respondents in the territories were less likely to link to the CMDB (2.4% to 3.0%) than were those in other regions.
The false positive and false negative rates were 0.04% and 2.43%, respectively (data not shown in tables). As expected, linkage rates (unweighted) to the CMDB were higher among males (5.8%) than females (5.0%) and rose with age, with the highest rates among those aged 75 or older (27.0%) (Table 3). This pattern of linkage rates was similar across survey years (data not shown in tables).
Geographic distributions of deaths in the linked CCHS-CMDB data were similar to those derived from the CMDB alone. The majority of deaths occurred in the largest provinces (Ontario, Quebec and British Columbia) (Table 4). While overall patterns persisted, the weighted distribution of deaths across provinces and territories resembled results from the CMDB more closely than did the unweighted distributions. Differences were particularly evident for the Atlantic Provinces.
The distribution of deaths by age group revealed the expected increase at older ages (Table 4). However, as a percentage of all deaths, 49% were among people aged 75 or older in the linked data, compared with 61% in the CMDB data. The difference largely reflects deaths among the institutionalized population in the CMDB, which are not represented in the linked CCHS-CMDB data. When only deaths captured in the CMDB among people aged 12 through 74 were considered, the distributions were similar (data not shown). The distribution of deaths by cause derived from the linked data was similar to the distribution in the CMDB (Table 4).
Annual age-standardized mortality rates (ASMRs) derived from the linked CCHS-CMDB data were generally lower than official mortality rates (Figure 1). ASMRs were lowest in the first year of follow-up for all CCHS years, indicating a potential “healthy respondent bias” and supporting the decision to exclude the first year of follow-up from the survival models.
The survival analysis generally reveals higher hazard ratios (HRs) for mortality among groups at greater risk (Table 5). Heavy smokers (males = 2.36; females = 2.91) and light smokers (males = 1.92; females = 1.81) had elevated HRs compared with non-smokers. The mortality risk associated with BMI was U-shaped, with elevated HRs for individuals in the underweight class (males = 1.77; females = 1.50) or in obese class II or above (males = 1.51; females = 1.20). HRs for former drinkers (males = 1.65; females = 1.56) were high, compared with moderate drinkers. Reporting no physical activity or less than 30 minutes a day was associated with greater mortality risks, compared with reporting 60 or more minutes a day. Finally, eating fewer than five servings of fruit and vegetables per day was associated with elevated mortality risks, compared with consuming more than five servings; findings for no servings were not significant, likely because of small sample sizes.
Overall, 5.3% of eligible Canadian Community Health Survey respondents were linked to a death record in the Canadian Mortality Database; false positive and false negative rates were 0.04% and 2.43%, respectively. Use of the Historical Tax Summary File yielded slightly higher linkage rates to the CMDB. External validation revealed patterns in mortality rates in the linked CCHS-CMDB data that were comparable to nationally reported estimates. These patterns attest to the quality of the linked data and the suitability of these data for mortality research at the population level.
Age-standardized mortality rates derived from the linked data were consistently lower in the first and second years of follow-up, compared with national official mortality estimates. This was anticipated, given that the official estimates are based on the entire population, which includes institutionalized individuals, whereas rates estimated from the linked CCHS-CMDB data represent only the household population.
Low ASMRs in the first two years of follow-up may also reflect a “healthy respondent” bias. CCHS respondents may have a more favourable morbidity and mortality profile than non-respondents. People who are ill or near death may be less likely to respond to a survey. Evidence from a Scottish health survey linked to mortality data revealed higher mortality rates among non-respondents (55 or older) than respondents.Note 34 Similarly, in Finland, linked survey and mortality data showed higher mortality rates among non-respondents than respondents to the annual health survey.Note 35
Differences in mortality rates could be influenced by the demographic and socioeconomic status of survey respondents versus non-respondents.Note 35Note 36 In addition, non-respondents may be at higher risk for alcohol-, drug- and smoking-related mortality than are respondents.Note 37 This suggests that survey non-response may lead to underestimation of overall mortality and of the influence of risk behaviours.
Consistent with international findings,Note 18Note 38Note 39Note 40Note 41 the survival analysis based on the linked CCHS-CMDB data revealed elevated hazard ratios for mortality among smokers, people with very low or very high BMI, those who reported fewer than five servings of fruit and vegetables per day, and those who reported less than 30 minutes of physical activity per day.
For more than half a century, large epidemiological studies have documented risks of premature mortality related to smoking.Note 42 A U-shaped curve for BMI and mortality has been shown in both Canadian and American research.Note 43 The pattern for alcohol consumption and mortality aligns with an international meta-analysis,Note 44 and the elevated risk among former drinkers compared with moderate drinkers supports the “sick quitter” effect (individuals stop drinking because of disease onset).Note 45
Recent researchNote 46Note 47 has indicated that, paradoxically, some traditional risk factors might actually reverse at older ages, hypotheses that future analyses might test with linked CCHS-CMDB data. Other social determinants of health, such as social interactions and support that may contribute to mortality risk, particularly among the elderly,Note 48Note 49Note 50Note 51 could be investigated using the linked data.
A degree of caution is warranted in using the linked CCHS-CMDB data. The low mortality rates in the first year of follow-up suggest a “healthy respondent” bias. Consequently, it is recommended that at least the first year of follow-up be excluded from mortality estimates. As well, although the CCHS is cross-sectional, individuals may be represented more than once within and across cycles. These respondents were not removed, but rather, identified and flagged in order to link them to the same death record. However, this affects only analyses requiring more than one survey year.
The CCHS can be linked to the CMDB to examine associations between health risk factors and mortality. The linkage constitutes a resource for research on mortality outcomes. These data can be used to refine estimates of associations between risk behaviours and mortality in predictive models of all-cause and cause-specific mortality.
The authors acknowledge funding from Health Canada, Institute for Clinical and Evaluative Sciences/Ottawa Hospital Research Institute and McGill University to support the linkage project.