Two approaches to linking census and hospital data
by Michelle Rotermann, Claudia Sanmartin, Gisèle Carrière, Richard Trudeau, Hélène St-Jean, Abdelnasser Saïdi, Alexander Reicker, Aimé Ntwari and Eric Hortop
Record linkage, the process of matching records across or within datasets, is common in health research.Note 1-7 The goal is to create an enriched dataset with wider applications.Note 6-11 The data suited for linkage are those that are complementary—information unavailable in one source is available in the other.
Accurate data linkage depends on the availability of co-occurring unique identifiers.Note 12,Note 13 Each identifier should pertain to only one person, and each person should have only one identifier.Note 14 For health-related linkages in Canada, provincial health insurance numbers (HINs) have been used.Note 2,Note 3,Note 11,Note 15 However, in most databases (for instance, vital statistics, census and tax files), HINs are not available.
One option is a registry approach to linking databases. Provincial health insurance registries can act as “bridge” files for linkage because they contain HINs along with names and other identifying variables.Note 2,Note 3,Note 11,Note 15 However, registries are not always available. Alternatively, a non-registry approach can be taken. This involves matching records in different databases, using combinations of co-occurring person-level information, such as birth date and postal code.Note 13,Note 14,Note 16,Note 17
If databases pertain to the same population, it is generally expected that most records will link. But if only a fraction of records is expected to link, determining a reasonable linkage rate a priori is problematic. This is typical of health-based linkages: for example, only a limited number of individuals will be hospitalized or die during follow-up. In such situations, linkages have been evaluated by comparing the results of different approaches using the same dataNote 12,Note 14,Note 18-20; by comparing outcome rates and percentage distributions of variables available in linked and unlinked dataNote 14,Note 21,Note 22; and by calculating sensitivity and specificity.Note 19,Note 23,Note 24 Based on Canadian studies, match rates around 75% among records that are expected to link are considered appropriate for research applications.Note 5,Note 25,Note 26
This study compares a registry and non-registry approach to linking the 2006 Census of Population with hospital data from the Discharge Abstract Database (DAD) for Manitoba and Ontario, two provinces for which health insurance registry data (HINs) are available to Statistics Canada. The aim is to determine if a research-quality dataset can be produced without the aid of “bridge” data from provincial health insurance registries. The linkage was approved by Statistics Canada’s Policy Committee,Note 27 and was governed by the Record Linkage Directive.Note 28
Data and methods
Data sources
2006 Census of Population
The 2006 Census collected information using both short- and long-form questionnaires. The entire population answered the 7 basic demographic questions on the short form, which included the birth date, sex and postal code of all members of each household.Note 29 As well, about 20% of private households were randomly selected to answer the long-form questionnaire, which asked 52 additional questions about income, education, ethnicity, Aboriginal status, etc.Note 29
The census short-form data file includes long-form respondents, and so contains records pertaining to nearly the entire population (97% in Manitoba; 96% in Ontario). The long-form data file pertains only to the 20% of households selected to provide in-depth information.
Before linkage, the census file was cleaned, unduplicated and validated.Note 29,Note 30 The short-form file was linked to the provincial health insurance registries and then to the DAD (registry approach), or directly to the DAD (non-registry approach). Inclusion of all census records in the linkage made it possible to identify and remove provincial health insurance registry and DAD records that pertained to short-form-only respondents. Only records pertaining to people who completed a long-form census questionnaire were included in the study cohort.
Provincial health insurance registries (registry approach)
For the registry approach, the Manitoba Health Services Insurance Plan (MHSIP) Registration File and the Ontario Registered Persons Database (RPDB) were used as “bridge” files. Linking the census data to a provincial health insurance registry makes it possible to then link the census data to the DAD, based on HIN agreement.
The MHSIP and RPDB contain records for all individuals registered to receive health services in Manitoba and Ontario, respectively. Because registrants are not obliged to pay, population coverage is high.Note 31-33 Records for people no longer living in the province, but who retain their coverage for up to three months after moving, are included in both registries. New residents must wait three months for coverage. People covered by other plans (for instance, prison inmates, members of the RCMP and Canadian Forces) are excluded.
Before linkage to the census data, the MHSIP and RPDB were pre-processed, including identification of people with multiple HINs (Manitoba = 0.2% or 3,588; Ontario = 1% or 165,123). Only records with surnames, birth dates before January 1, 2007, insurance coverage effective between December 31, 2005 and January 1, 2007, and where applicable, death dates after December 31, 2005, were eligible for the study cohort.Note 36
2006/2007 Discharge Abstract Database
The DAD contains demographic, administrative (including HIN) and clinical data for all acute-care and some psychiatric, chronic rehabilitation and day-surgery hospital discharges.Note 34,Note 35 The 2006/2007 version pertains to hospital discharges from April 1, 2006 through March 31, 2007 (n = 3,186,079).
2005/2006/2007 T1 Personal Master Files (non-registry approach)
To account for changes in postal codes over time and to improve the chances of linkage to the DAD using the non-registry approach, postal codes from Statistics Canada’s 2005, 2006 and 2007 tax files (T1 Personal Master Files - T1PMF) were added to the short-form census file. Based on sex, birth date, and partial family and given names, most census records (91%) linked to at least one year of tax data. For people who did not file taxes annually and/or were not required to (for instance, children), postal codes were identified and assigned using information from other household members who were tax-filers.
Record linkage
Registry approach
The registry-based linkage was conducted in two steps: 1) Manitoba and Ontario short-form census records were probabilistically linked to the provincial health insurance registries to obtain HINs; and 2) based on HINs, the registry-linked long-form census records were linked deterministically to the DAD (Figure 1).
Probability scores based on similarities of birth date, postal code, sex, surname, and given name were used to estimate the likelihood that linking records represented the same person.Note 19,Note 37,Note 38 Weights (positive/negative) were assigned to the comparison fields, which were summed to create a total linkage weight. Separate thresholds to distinguish true matches from non-matches were pre-determined for Manitoba and Ontario based on distributions of these weights. Because of the size of the Ontario files, comparing every census record with every registry record was prohibitive. Consequently, the Ontario file was split, so that only records sharing the same sex were compared. This was not done for Manitoba, because the population and files were smaller.
Record-pairs were ranked. Those scoring above the pre-determined threshold were accepted as matches. Before the thresholds were finalized, record-pairs ranking close to them were reviewed and thresholds were adjusted where necessary.
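The weight-summing and thresholding steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the article does not publish its weights or thresholds, so every number in FIELD_WEIGHTS, the threshold of 10, the field names and the example records are hypothetical. Exact equality also stands in for the similarity comparisons mentioned in the text.

```python
# Hypothetical (agreement, disagreement) weights per comparison field.
# The article does not report its weights; these values are illustrative only.
FIELD_WEIGHTS = {
    "birth_date": (6.0, -4.0),
    "postal_code": (5.0, -2.0),
    "sex": (1.0, -3.0),
    "surname": (5.0, -4.0),
    "given_name": (3.0, -2.0),
}


def total_weight(census_rec, registry_rec):
    """Sum the positive (agreement) or negative (disagreement) weight of each field."""
    total = 0.0
    for field, (agree, disagree) in FIELD_WEIGHTS.items():
        a, b = census_rec.get(field), registry_rec.get(field)
        if a is None or b is None:
            continue  # assumption: a missing value adds no evidence either way
        total += agree if a == b else disagree
    return total


def accept_matches(pairs, threshold):
    """Rank candidate pairs by total weight and keep those scoring above the threshold."""
    scored = [(total_weight(c, r), c["id"], r["hin"]) for c, r in pairs]
    scored.sort(reverse=True)
    return [(cid, hin) for weight, cid, hin in scored if weight > threshold]


# Invented example records.
census = {"id": "C1", "birth_date": "1950-03-02", "postal_code": "R3C0V8",
          "sex": "F", "surname": "TREMBLAY", "given_name": "ANNE"}
reg_good = {"hin": "H1", "birth_date": "1950-03-02", "postal_code": "R2M1A1",
            "sex": "F", "surname": "TREMBLAY", "given_name": "ANNE"}
reg_bad = {"hin": "H2", "birth_date": "1961-09-30", "postal_code": "R3C0V8",
           "sex": "F", "surname": "SINGH", "given_name": "MARIE"}

matches = accept_matches([(census, reg_good), (census, reg_bad)], threshold=10.0)
```

In this sketch the first pair disagrees only on postal code and scores 13, while the second pair agrees only on postal code and sex and scores -4, so only the first is accepted. In practice, pairs scoring near the threshold would be reviewed before the threshold is finalized, as the text describes.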
The linked census-registry files containing both the census identifiers and HIN were then used to deterministically link the census identifiers to the DAD files. Only census identifiers corresponding to long-form census respondents were retained for analysis.Note 38
Non-registry approach
The non-registry approach used hierarchical deterministic exact matching on linkage keys composed of combinations of three variables (birth date, postal code and sex) common to the census short-form and DAD records (Figure 2). Matching involved comparing census-DAD pairs to determine if they referred to the same person. If only matches using one key had been accepted, linkage rates would be reduced.Note 12 By using multiple keys successively, hierarchical deterministic exact matching is a refinement that maximizes the discriminating power of the linking information and minimizes the impact of missing values and errors.Note 16
Data were reformatted in preparation for linkage. When records did not indicate sex (n ~ 312,000 census; n ~ 300 DAD), the existing record was assigned a sex, and an additional record was created with the same birth date and postal code, but the opposite sex. When the original census postal code and post-processed postal code differed (n~525,000), an additional record was also created.
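The record expansion described above might look like the following sketch. The field names (group_id, postal_code_processed) are hypothetical, and two behaviours are assumptions not stated in the article: how the sex was first assigned, and that a record with both problems is expanded into every combination of sex and postal code.

```python
def expand_record(rec):
    """Return one linkage-key record per plausible (sex, postal code) combination.

    Assumptions: when sex is missing, one record per sex is produced; when the
    original and post-processed postal codes differ, both are kept.
    """
    sexes = [rec["sex"]] if rec.get("sex") in ("F", "M") else ["F", "M"]
    postal_codes = [rec["postal_code"]]
    processed = rec.get("postal_code_processed")
    if processed and processed != rec["postal_code"]:
        postal_codes.append(processed)
    return [
        {"group_id": rec["group_id"], "birth_date": rec["birth_date"],
         "sex": sex, "postal_code": pc}
        for sex in sexes
        for pc in postal_codes
    ]


# Invented examples: one incomplete record and one complete record.
incomplete = {"group_id": "G1", "birth_date": "1980-02-14", "sex": None,
              "postal_code": "R3C0V8", "postal_code_processed": "R3C0V9"}
complete = {"group_id": "G2", "birth_date": "1990-06-01", "sex": "M",
            "postal_code": "K1A0B1", "postal_code_processed": "K1A0B1"}

expanded = expand_record(incomplete)
unchanged = expand_record(complete)
```

Every expanded record carries the same group identifier, which is what allows the extra copies to be recognized and removed once linkage is complete.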
Multiple records pertaining to the same person were identified in the census and DAD files based on GroupID and HINKey (HIN+Province). This facilitated their removal after the linkage was completed.
The non-registry linkage used an iterative approach that applied 28 rules to the census-DAD files in succession. Early iterations observed stringent rules; later passes tolerated divergence. For example, the first iteration required an exact match in the census and DAD records on birth date, sex and postal code. Iterations 2 to 4 required an exact match on birth date, sex and postal code in the DAD and T1PMF files for 2005, 2006 and 2007. Iterations 5 to 10 relaxed the rules for postal code, allowing one of the six characters to be dropped. This process was repeated using the T1PMF 2005, 2006 and 2007 postal codes (iterations 11 to 28). After each iteration, census records with the same GroupID and DAD records with the same HINKey as those that linked were removed from future iterations to ensure that people who were represented in the census and DAD files because of multiple linkage keys were linked only once.
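A compact sketch of this successive-pass logic is given below. It is an illustration under stated assumptions, not the production rules. Only the exact-match pass and six "drop one postal code character" passes are shown, and the tax-file passes are omitted. Dropping a character is interpreted as removing the same position from both postal codes before comparison. The field names (group_id, hin_key), the helper names and the example records are hypothetical, and the inputs are assumed to be already unduplicated on the linkage key.

```python
def exact_key(rec):
    """Linkage key for the strictest pass: birth date, sex and full postal code."""
    return (rec["birth_date"], rec["sex"], rec["postal_code"])


def drop_char_key(position):
    """Build a key function that ignores one postal code character (assumed reading)."""
    def key(rec):
        pc = rec["postal_code"]
        return (rec["birth_date"], rec["sex"], pc[:position] + pc[position + 1:])
    return key


def hierarchical_link(census, dad, key_funcs):
    """Apply key functions in order; linked records are excluded from later passes."""
    links = {}           # group_id -> (hin_key, pass number)
    linked_hins = set()  # DAD persons already linked
    for pass_no, key in enumerate(key_funcs, start=1):
        dad_index = {}
        for d in dad:
            if d["hin_key"] not in linked_hins:
                dad_index.setdefault(key(d), d["hin_key"])
        for c in census:
            if c["group_id"] in links:
                continue
            hin = dad_index.get(key(c))
            if hin is not None and hin not in linked_hins:
                links[c["group_id"]] = (hin, pass_no)
                linked_hins.add(hin)
    return links


# Invented example records.
census = [
    {"group_id": "G1", "birth_date": "1940-01-05", "sex": "F", "postal_code": "K1A0B1"},
    {"group_id": "G2", "birth_date": "1975-07-20", "sex": "M", "postal_code": "R3C0V8"},
    {"group_id": "G3", "birth_date": "2001-11-11", "sex": "F", "postal_code": "M5V2T6"},
]
dad = [
    {"hin_key": "ON-111", "birth_date": "1940-01-05", "sex": "F", "postal_code": "K1A0B1"},
    {"hin_key": "MB-222", "birth_date": "1975-07-20", "sex": "M", "postal_code": "R3C0V9"},
]

passes = [exact_key] + [drop_char_key(i) for i in range(6)]
links = hierarchical_link(census, dad, passes)
```

Here G1 links in the first (exact) pass, G2 links only in the seventh pass, when the last postal code character is ignored, and G3 never links. Recording the pass number with each link makes it possible to tabulate how many links each rule contributes.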
Records with duplicate linkage keys could exist in each dataset, particularly the DAD, given that in a single year one person could have multiple hospital records with the same birth date, sex and postal code. To improve efficiency and remove the possibility of ties, the census and DAD files were unduplicated using this linkage key prior to linking. After the linkage was completed, the census records that had been added to allow for inconsistent census postal codes and/or missing sex were removed, and the dropped DAD records (multiple hospitalizations of the same person) were added back to the file. Finally, only records of hospitalizations of Manitoba and Ontario residents were retained for this comparison study.
Protecting respondent privacy
Statistics Canada ensures respondent privacy during the linkage process and subsequent use of linked files. Only employees directly involved in the linkage process have access to the unique identifying information required for linkage (such as names and health insurance numbers), and these employees do not access health-related information. Once the data linkage process is completed, an analytical file is created from which the identifying information is removed. This de-identified file is accessed by analysts for validation and analysis.
Record counts
For the registry approach, 1,111,133 Manitoba and 11,704,729 Ontario short-form census records were sent to linkage with 1,201,152 MHSIP and 13,121,593 RPDB records. Of these, 246,578 of the 278,937 long-form Manitoba and 2,136,455 of the 2,387,911 long-form Ontario census records linked to a health insurance file, and thus, were in-scope for the registry-based linkage to the DAD (Table 1).
Because provincial health insurance registries were not required for the non-registry approach, most short-form census records for residents of most provinces/territories were eligible for linkage to the DAD (n = 23,592,671). Census records were excluded if the postal code or birth date was invalid or missing, or if both the original census postal code and post-censal postal code were from Quebec. (Quebec census respondents were not eligible for linkage because Statistics Canada does not have access to Quebec inpatient hospital data.) The final non-registry-based study cohort consisted of 278,937 Manitoba and 2,387,911 Ontario long-form census records.
Both linkage approaches excluded DAD records with birth dates after Census Day (May 16, 2006). Records for out-of-country residents and for stillbirths were also removed. For the registry-based linkage, only DAD records with a valid Manitoba (227,069) or Ontario (1,081,443) HIN were retained. For the non-registry linkage, only DAD records with non-missing dates of birth and postal codes were retained (2,106,104).
Validation
Linkage rates
To evaluate the linkage rates achieved with the registry approach, the percentages of census respondents linking to either the MHSIP or RPDB were examined. Given differences in the population coverage of the census and provincial health insurance registry files, linkage rates should approach, but not equal, 100%.
For both the registry and non-registry approaches, the percentages of Manitoba and Ontario long-form census respondents linking to the DAD were calculated overall and by selected socio-demographic characteristics. These linkage rates should reflect the prevalence of being hospitalized at least once in the 2006/2007 fiscal year, and are expected to be higher among groups more likely to be hospitalized (for example, seniors).
Linkage accuracy
Sensitivity (true positives) and specificity (true negatives) were calculated to assess linkage accuracy at the record level. The registry-based results were used as the “gold standard” against which the non-registry results were compared.Note 2,Note 3,Note 11,Note 15
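As an illustration of the calculation, the sketch below treats the registry-based link status of each cohort member as the reference and compares the non-registry status against it. It considers only whether a census record linked to the DAD at all, which is a simplifying assumption. The function name and the example identifiers are invented.

```python
def sensitivity_specificity(registry_linked, non_registry_linked, cohort):
    """Compare non-registry link status with the registry-based reference standard."""
    true_pos = len(registry_linked & non_registry_linked)
    false_neg = len(registry_linked - non_registry_linked)
    false_pos = len(non_registry_linked - registry_linked)
    true_neg = len(cohort - registry_linked - non_registry_linked)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Invented cohort of 100 census records, identified by integers.
cohort = set(range(100))
registry_linked = set(range(10))                # records 0-9 link via the registry
non_registry_linked = set(range(1, 10)) | {50}  # misses record 0, adds record 50

sens, spec = sensitivity_specificity(registry_linked, non_registry_linked, cohort)
```

With these invented inputs, 9 of the 10 reference links are recovered (sensitivity 0.9) and 89 of the 90 reference non-links are left unlinked (specificity of about 0.989).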
For the census records that linked to the DAD in both approaches, HINs were compared to assess internal consistency. Matches were taken as evidence of correct links.
Coverage analysis
The coverage rates for each approach were calculated by dividing the number of acute-care hospital discharges of long-form census respondents in Manitoba and Ontario using the linked census-DAD data (numerator) by the number of acute-care hospital discharges reported in the 2006/2007 unlinked DAD data (denominator). Unweighted and weighted coverage rates were calculated for each approach overall and for hospitalizations attributable to three “most responsible” diagnoses—circulatory system diseases; injury and poisoning; and pregnancy, childbirth and the puerperium.
Weights are necessary to make inferences at the population level.Note 39 When weights are applied, coverage rates should approach, but not equal, 100%. When weights are not applied, coverage rates should approach the percentage of the population completing the long-form census (25% in Manitoba; 20% in Ontario). Owing to differences in the populations covered by the linked census-DAD data and the unlinked DAD data, exact agreement is not expected. For instance, the institutionalized population, who are high users of hospital services,Note 40,Note 41 is represented in the unlinked DAD data, but not in the linked census-DAD data. Census weights were not adjusted for such anomalies, so applying the weights may distort some estimates.
To more closely match the target population of the linked files, DAD records for the following populations were removed from the denominator: residents of seniors’ homes (2,368 Manitoba and 24,487 Ontario DAD discharges), people born after Census Day, stillbirths, and non-Canadians.
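The coverage calculation can be sketched as below, assuming each linked discharge carries the census weight of the respondent it belongs to. The function name and the numbers are invented for illustration and do not reproduce any figure from the study.

```python
def coverage_rates(linked_discharge_weights, dad_discharge_count):
    """Return (unweighted, weighted) coverage of DAD discharges by the linked file.

    linked_discharge_weights: one census weight per linked discharge (numerator).
    dad_discharge_count: in-scope discharges in the unlinked DAD (denominator).
    """
    unweighted = len(linked_discharge_weights) / dad_discharge_count
    weighted = sum(linked_discharge_weights) / dad_discharge_count
    return unweighted, weighted


# Invented example: 20 linked discharges, each with a census weight of 5,
# against 120 in-scope discharges in the unlinked DAD.
unweighted, weighted = coverage_rates([5.0] * 20, 120)
```

In this example the unweighted rate is about 17% (close to a one-in-five sampling fraction) and the weighted rate is about 83%, which mirrors the pattern the text expects: unweighted rates near the long-form sampling fraction and weighted rates approaching, but not reaching, 100%.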
Socio-demographic characteristics
For each economic family or unattached individual, total after-tax income from all sources and from all family members was summed, adjusted for family composition and size, and separated into income quintiles.
Highest level of education for people aged 18 or older was categorized as secondary school graduation, or less than secondary graduation.
Employment status was defined as employed, unemployed or not in the labour force.
Respondents were divided into four categories based on their self-reported knowledge of Canada’s official languages: English, French, both, or neither.
Information on Aboriginal status was derived from the question: “Is this person an Aboriginal person, that is, North American Indian, Métis or Inuit (Eskimo)?” Respondents marked all that applied. Responses were grouped into six categories: North American Indian (only), Métis (only), Inuit (only), other Aboriginal (multiple or indeterminate), Aboriginal (total of four preceding categories), and non-Aboriginal.
Country of birth, citizenship, and immigration status were combined into an immigrant status variable: immigrant, non-immigrant or non-permanent resident.
A one-year residential mobility variable was created to reflect address changes: same address, moved within Canada, or moved from outside Canada. This was derived by comparing each respondent’s municipality and province of residence on Census Day and one year earlier.
A rural-urban variable reflected residence location and community size. Farm or non-farm residences in areas with a population of less than 1,000 were considered rural/farm. Other residences were categorized as being in centres with small (1,000 to 29,999), medium (30,000 to 99,999) or large (100,000 or more) populations.
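The population cut-offs above translate directly into a small classification rule. The function name and category labels below are hypothetical.

```python
def community_size_category(population):
    """Classify a residence by the population of the area it is located in."""
    if population < 1_000:
        return "rural/farm"
    if population < 30_000:
        return "small"
    if population < 100_000:
        return "medium"
    return "large"


categories = [community_size_category(p) for p in (500, 1_000, 29_999, 30_000, 100_000)]
```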
Hospital outcomes
Based on the DAD, the total number of all-cause hospitalizations with discharge dates from April 1, 2006 through March 31, 2007 was determined for Manitoba and Ontario.
Hospitalizations attributable to circulatory system disease required a most-responsible diagnosis (MRDx) coded to I00-I99 using the International Classification of Diseases, 10th Revision. Injury and poisoning hospitalizations were those coded S00-S99 or T00-T98. Pregnancy, childbirth and the puerperium hospitalizations were those coded O00-O99.
Linkage results
Census-to-provincial health insurance registry (registry approach)
For the registry approach, 88% (246,578) of Manitoba long-form census respondents linked to the MHSIP, and 89% (2,136,455) of Ontario long-form census respondents linked to the RPDB, and thus, constituted the study cohort eligible for registry-based linkage to the DAD (Table 1). (The percentages of short-form census respondents linking to the registries were somewhat higher: 93% for Manitoba and 90% for Ontario.)
Long-form census respondents’ linkage rates to the health insurance registries differed by socio-demographic characteristics. For example, rates in Manitoba ranged from 85% for children younger than age one to 92% at ages 65 to 74. Linkage rates were relatively low for people in the lowest income quintile, those reporting no knowledge of Canada’s official languages, non-permanent residents, and people who had not lived in Canada the year before the census. Linkage rates among Aboriginal groups ranged from 76% to 89%.
Census-to-DAD (non-registry approach)
With the non-registry approach, 80% or 1.69 million DAD keys (Canada excluding Quebec) were linked to a short-form census record (Table 2). The majority of links between the census and DAD were achieved in the first iteration (72% or 1.52 million), which required an exact match on birth date, sex and postal code. Using postal code information from the T1PMF tax files, iterations 2 to 4 resulted in 80,000 more links (4%); iterations 5 to 28 added another 68,000 (3%).
Comparison of approaches
The percentages of Manitoba and Ontario census long-form respondents linking to the DAD were similar for both approaches (Table 3). Based on the registry approach, 7% of Manitobans had been hospitalized in an acute-care facility; based on the non-registry approach, the figure was 6%. In Ontario, the rate for both approaches was 5%.
As expected, the linked data reflected differential use of hospital services. For example, regardless of province or linkage approach, a higher percentage of females than males had been hospitalized. Children younger than age one and seniors were more likely than other age groups to link to hospital records. Other characteristics correlated with age and/or disability, such as not being in the labour force, were associated with higher rates of hospitalization.
Sensitivity and specificity
The majority of long-form census records had the same DAD linkage outcome, regardless of approach. The sensitivity and specificity of the Manitoba linkages were 87.9% and 98.8%, respectively; the corresponding Ontario figures were 89.4% and 99.6% (Table 4).
The consistency of HINs among census respondents who linked to the DAD with both approaches was also high. Virtually all (99%) of the 28,428 Manitoba and 106,968 Ontario long-form census respondents who linked to the DAD according to both approaches linked to the same HIN in each approach.
Coverage evaluation
Coverage rates for all-cause hospitalizations in both provinces were comparable for the registry and non-registry linkages. Regardless of approach, the unweighted coverage rates were 23% in Manitoba and 17% in Ontario (Table 5). The weighted coverage rates were also similar: in both Manitoba and Ontario, 84% for the registry approach and 82% for the non-registry approach.
Coverage rates varied by age. Unweighted rates for Manitobans aged 75 or older were 5 or 6 percentage points below the all-ages total. For Manitoba children aged 1 to 4, weighted rates were 18 (registry) and 16 (non-registry) percentage points below the all-ages total. In both Manitoba and Ontario, and according to both approaches, weighted rates for 20- to 24-year-olds were lower (6 to 7 percentage points) than the all-ages total.
For cause-specific hospitalizations, unweighted coverage rates were closer to their Ontario and Manitoba targets than were weighted coverage rates. Unweighted cause-specific coverage rates also tended to be more similar across approaches than were those calculated with weights.
Discussion
The registry approach linked approximately 90% of long-form census respondents to the provincial health insurance registries, thereby allowing subsequent linkage to the DAD based on HINs. This rate is high, given that large-scale Canadian studies have used thresholds of about 75% as the point at which linked data are deemed appropriate for research applications.Note 8,Note 15,Note 26 Consistent with other studies, this analysis shows that name-based linkages (registry approach) yield slightly higher linkage rates than do non-name-based linkages (non-registry approach), but that both are sufficient for research.Note 1
Many characteristics associated with lower linkage rates in this analysis have been reported previously.Note 5,Note 42-44 Linkage rates were relatively low for individuals with lower socio-economic status, those identifying as Aboriginal, people without knowledge of Canada’s official languages, rural residents, and people who recently moved to Canada.
The results of the registry and non-registry linkages to the DAD were similar: 5% of Ontario and 6% to 7% of Manitoba census respondents linked to the hospital data. As well, linked data were consistent with expected patterns of hospital use in that higher percentages of the poor, the elderly, and people identifying as Aboriginal had been hospitalized.Note 45,Note 46 This suggests that the non-registry approach can yield a research-quality dataset.
The coverage evaluation demonstrated consistency across linkage approaches. Unweighted coverage rates were higher in Manitoba than in Ontario, reflecting the higher percentage of Manitoba’s population who completed a long-form census questionnaire and higher rates of hospitalization in Manitoba.Note 47 When weights were applied, the Manitoba and Ontario coverage rates often became more similar.
The linked data tended to underestimate hospitalizations of seniors, children aged 1 to 4, and people aged 15 to 44. To some extent, this is because the population represented in the long-form census data does not exactly match the groups captured by hospital data. For example, the linked data do not include the institutionalized population, who are partially included in the hospital data. Lower coverage rates of younger adults may be related to census under-coverage of populations with less stable living arrangements and/or incomplete coverage of some Aboriginal people in Ontario.Note 48,Note 49
Limitations
The linked data have several limitations. Because this study concerns only two provinces, the generalizability of the results to other jurisdictions is unclear. Preliminary analyses of coverage rates suggested potential difficulties achieving statistical significance when covariates are too narrowly defined. The results showing that coverage sometimes worsened when weights were applied indicate that the use of census weights should be considered on a study-by-study basis.
Conclusion
The comparison of linkage approaches provides evidence that research-quality data can be produced without recourse to provincial health insurance registries, most of which are not available to Statistics Canada. The research opportunities offered by the non-registry linked file are great due to the nationally representative sample and the statistical power provided by its size and population coverage. Nonetheless, users of linked data should consider the impact of the linkage methodology, the linkage and coverage rates, and population exclusions on their analyses.