Statistics Canada

Findings

Record linkage is used in health studies to obtain more complete information, to fill gaps in existing datasets, and/or to improve data quality.1, 2 For instance, prospective death clearance of survey respondents, study cohorts or administrative data sources, such as inpatient hospital records, has made it possible to study associations between death and factors such as lifestyle, occupation, treatment modalities, patient histories and geography.3-20

Similarly, linking birth and stillbirth records with death registrations and/or hospitalization data has enabled the study of maternal, fetal and infant morbidity and mortality by maternal and infant characteristics.21-24 Record linkage has also been used to validate self-reported information,25, 26 describe the characteristics of unmatched records,27 assess the comparability or quality of data files generated using probabilistic and deterministic linkage approaches,28 reduce under-ascertainment of disease prevalence,29 and monitor health system performance.30, 31 In the absence of disease registries, record linkage is a cost-effective and efficient way to monitor disease incidence and prevalence.32-35

This study was motivated by the need to assess the coverage of the linkage between the Canadian Community Health Survey (CCHS) and Health Person-Oriented Information (HPOI), an administrative database of hospital records. Initial research on the rate of linkage between the CCHS and HPOI estimated the proportion of CCHS respondents who had been hospitalized during the 1994/1995 to 2004/2005 period, but coverage has yet to be assessed.36 Evaluation of the coverage is essential if the linked file is to be used for epidemiologic research. It is important to know if findings will be biased, that is, if survey respondents with certain characteristics are more likely than others to have been linked.

HPOI and the CCHS are complementary sources of data. HPOI does not have information about non-medical determinants of health, such as socio-economic and lifestyle factors. For example, hospital records do not contain information about smoking status or body mass index (BMI), two important risk factors. The CCHS, by contrast, is a rich source of information about health status and determinants of health, but lacks the detail needed to study hospitalization. Combining HPOI with the CCHS reduces many of the limitations of each source, and thereby facilitates a more complete understanding of what brings Canadians in contact with the health care system and how they fare within the system. 

The two main objectives of this study were to: 1) evaluate the coverage of the linked CCHS and HPOI by calculating coverage rates; and 2) identify characteristics of CCHS cycle 1.1 respondents who were less likely to be in the linked file.

Methods

Data sources

Canadian Community Health Survey

The Canadian Community Health Survey is a cross-sectional survey that collects information about health status, health care use and health determinants. It covers the household population aged 12 or older in the provinces and territories, except members of the regular Forces and residents of institutions, Indian reserves and other Aboriginal settlements, and some remote areas. The rate of coverage is in the 98% range in the provinces, 97% in the Northwest Territories, 90% in the Yukon, and 71% in Nunavut.

Data for cycle 1.1 were collected from September 1, 2000 through November 3, 2001 from a sample of 131,535 people; the response rate was 84.7%. All CCHS information, including provincial health care numbers (HNs) and postal codes, is self-reported by respondents, and the extent of error in these variables is unknown. However, data capture applications used by interviewers contain features that check for inconsistent answers, out-of-range responses or invalid alpha-numeric sequences. More information about the CCHS is available in a published report.37   

CCHS respondents were asked for permission to link information collected during the interview with their provincial health information, including past and continuing use of services such as hospitals, clinics, doctors’ offices or other services provided by the province; 91% of respondents gave permission. The sample used for this study consists of 72,354 (66.5%) respondents aged 12 or older in all provinces and territories except Quebec who agreed to link and provided a valid HN (Appendix A). Quebec HPOI records cannot be linked to CCHS records because the Quebec hospital records provided to Statistics Canada contain scrambled HNs, no date of birth and incomplete postal codes.

Survey weights were used so that estimates produced from the CCHS data were representative of the target population, not just the sample itself. The survey weight is the number of people in the population represented by each respondent. Survey weights reflect the differing probabilities of selection and response. Each record is, therefore, weighted by the inverse of the probability of selecting the person and getting a response from him or her.38 Additional survey weights are required for record linkage because not all respondents agree to link, and not all those who agree to link provide a valid HN. For this study, survey weights adjusted for agreement to link and provision of a valid HN were calculated.
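The weighting logic described above can be illustrated with a minimal Python sketch. It assumes a single adjustment cell and made-up probabilities; the function names and the adjustment method are hypothetical and are not the CCHS weighting procedure, which the source does not detail.

```python
def design_weight(p_select: float, p_respond: float) -> float:
    """Inverse of the probability of selecting a person and getting a response."""
    return 1.0 / (p_select * p_respond)


def linkage_adjusted_weights(weights, linkable):
    """Rescale the weights of linkable respondents so that they sum to the
    weighted total of all respondents (one adjustment cell, for illustration)."""
    total = sum(weights)
    linkable_total = sum(w for w, ok in zip(weights, linkable) if ok)
    factor = total / linkable_total
    return [w * factor if ok else 0.0 for w, ok in zip(weights, linkable)]


# Hypothetical selection and response probabilities for three respondents.
weights = [
    design_weight(0.01, 0.8),
    design_weight(0.02, 0.8),
    design_weight(0.01, 0.8),
]
# True = agreed to link AND provided a valid HN.
linkable = [True, False, True]
adjusted = linkage_adjusted_weights(weights, linkable)
```

In this sketch the respondents who can be linked carry the weight of those who cannot, so the adjusted weights still sum to the original weighted population total.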

Statistics Canada does not have access to provincial health insurance databases against which the HNs provided by CCHS respondents could be verified. Instead, all provinces and territories provide check-digit formulas that are used to verify that the HNs are at least plausible. Although check-digits are not a substitute for databases that contain first and last names, birth dates, addresses and HNs, they can detect accidental transcription errors, such as the inversion of two numbers, and offer a simple method of distinguishing meaningful numbers from strings of random digits.
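The provincial check-digit formulas themselves are not given here. As a stand-in, the following Python sketch uses the widely known Luhn (mod 10) algorithm to show how a check digit can flag a transposition error; the choice of algorithm and the sample numbers are assumptions for illustration only, and the numbers are not real HNs.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn (mod 10) check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit and
    # subtract 9 when the doubled value exceeds 9.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


plausible = luhn_is_valid("79927398713")    # standard Luhn test number
transposed = luhn_is_valid("79927398731")   # last two digits inverted
```

A check of this kind catches most single-digit errors and most inversions of adjacent digits, but a number that passes is only plausible, not confirmed as belonging to the respondent.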

Hospital data

The Health Person-Oriented Information (HPOI) database, maintained by Statistics Canada, contains information about inpatient hospital separations (discharges and in-hospital deaths) from virtually all acute-care and some psychiatric, chronic and rehabilitative hospitals.

HPOI is a person-level dataset derived from discharge records (which can reflect multiple discharges of the same person) in the Hospital Morbidity Database (HMDB). Sequential person-level HPOI records can be used to construct each patient’s hospitalization history. During the linkage process, records belonging to the same individual are identified from the patient’s HN and demographic and diagnosis/intervention information (for example, sex, birth date, sex-specific procedures).39
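A simplified Python sketch of how discharge records might be rolled up into person-level histories follows. It assumes an exact match on the HN alone, whereas the HPOI process described above also uses demographic and diagnosis/intervention information; the record layout and identifiers are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical discharge records; "hn" values are placeholders, not real HNs.
discharges = [
    {"hn": "A001", "admitted": date(2001, 3, 2), "discharged": date(2001, 3, 9)},
    {"hn": "B002", "admitted": date(2000, 10, 15), "discharged": date(2000, 10, 18)},
    {"hn": "A001", "admitted": date(2000, 9, 20), "discharged": date(2000, 9, 22)},
]


def build_histories(records):
    """Group discharge records by HN and order each person's stays by admission date."""
    histories = defaultdict(list)
    for record in records:
        histories[record["hn"]].append(record)
    for stays in histories.values():
        stays.sort(key=lambda r: r["admitted"])
    return dict(histories)


histories = build_histories(discharges)
```

Here three discharge records collapse into two person-level histories, one of which contains two sequential stays.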

Hospital records pertaining to the past fiscal year are added to HPOI annually. With each additional year of data, the entire HPOI process is rerun to ensure internal consistency of the demographic information at the person-level for patients with multiple hospital discharges.

Reabstraction studies, which validate the accuracy of hospital records, have found that the non-medical administrative data elements (essential for record linkage) are of high quality. For example, 99% of a random sample of discharge records for hospital stays from September through November 2000 had correct HNs, and 91% of postal codes were error-free.40   

Statistics Canada has hospital data with HNs for all provinces (except Quebec) and the Northwest Territories from fiscal year 1994/1995 onwards; data for 1992/1993 and 1993/1994 are available for some provinces.

While the HPOI database includes the vast majority of records from HMDB, about 3% of records for patients aged 12 or older (the target population of this study) were excluded because of missing or invalid HNs.39

From September 1, 2000 through November 3, 2001, there were 2.3 million discharges of 1,624,972 people aged 12 or older from acute-care hospitals outside Quebec. Discharges from non-acute hospitals were excluded from this study because coverage of such hospitals is inconsistent across provinces.  

The target populations of the Canadian Community Health Survey (CCHS) and HPOI differ somewhat. The CCHS excludes full-time members of the Canadian Forces and residents of Indian Reserves, of institutions (for instance, nursing homes and prisons) and of some remote areas. HPOI is a census and, therefore, these groups are included among hospitalizations.  In an effort to match the target populations of the CCHS and HPOI more closely, hospitalizations that could be identified as pertaining to the on-reserve or the institutionalized population were removed from this analysis.     

The on-reserve population is a derived census variable created by identifying census sub-division (CSD) type according to criteria established by Indian and Northern Affairs Canada (INAC), as well as selected CSDs that correspond to northern communities in Saskatchewan, the Northwest Territories, and the Yukon.41 The postal code conversion file (PCCF+)42 and a list of facilities used by the Residential Care Facility survey43 were used to identify institutional residents. Hospitalizations pertaining to 31,330 residents of Reserves and associated lands were removed from HPOI, as were hospitalizations of 21,299 residents of institutions. Removal of these 52,629 records, which amounted to about 3% of the HPOI patients hospitalized during the study period, brought the population covered by HPOI more in line with the CCHS target population.
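The exclusion counts can be checked with a few lines of Python, using only figures reported in this article (the variable names are illustrative):

```python
patients_before = 1_624_972   # people aged 12 or older discharged, before exclusions
on_reserve = 31_330           # residents of Reserves and associated lands
institutional = 21_299        # residents of institutions

removed = on_reserve + institutional
patients_after = patients_before - removed
share_removed = 100 * removed / patients_before

print(removed, patients_after, round(share_removed, 1))
```

The remaining count of 1,572,343 people is the HPOI figure used as the standard in the Results.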

Analytical techniques

Probabilistic record linkage

Probabilistic record linkage was used to identify CCHS respondents who were hospitalized. The linkage between the CCHS and HPOI was done with Generalized Record Linkage software (GRLS) developed at Statistics Canada. The two data sources contain many variables, but only a few fields appear in both and are distinct enough to be useful in matching for linkage. A CCHS respondent was considered to have been hospitalized if a record containing an HN and/or similar demographic characteristics (for example, birth date, sex, postal code) and an admission date to an acute-care facility between September 1, 2000 and November 3, 2001 was found in HPOI.

Probabilistic linkage does not require complete agreement on the matching variables. Rather, the quality of the match between pairs of records is rated with algorithms that evaluate the likelihood of a correct match1, 44 (Figure 1). Points were given or subtracted depending on the similarity of the values between fields. For instance, high positive scores were assigned if the HNs were identical and the issuing province of the HN matched; if the values were similar but not exact, a lower positive score was assigned, reflecting partial agreement; if the values on the two records were totally different, points were subtracted.
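The three levels of agreement described above can be sketched in Python as follows. The point values and the rule for "similar" HNs (same length, at most one differing character) are assumptions made for illustration; they are not the rules or weights configured in GRLS for this study.

```python
def score_hn(hn_a: str, prov_a: str, hn_b: str, prov_b: str) -> float:
    """Score one pair of health numbers: full, partial or no agreement."""
    if hn_a == hn_b and prov_a == prov_b:
        return 10.0   # identical HN and same issuing province
    if len(hn_a) == len(hn_b):
        mismatches = sum(1 for x, y in zip(hn_a, hn_b) if x != y)
        if mismatches <= 1:
            # Similar but not exact (including an identical HN with a
            # different issuing province): partial agreement.
            return 4.0
    return -6.0       # totally different values: points are subtracted


full = score_hn("1234567897", "ON", "1234567897", "ON")
partial = score_hn("1234567897", "ON", "1234567807", "ON")
none = score_hn("1234567897", "ON", "9876501234", "BC")
```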

Figure 1
Example of how pairs of Canadian Community Health Survey (CCHS) and Health Person-oriented Information (HPOI) records were assessed and scored using Generalized Record Linkage Software (GRLS)

The number of points assigned to each pair of linking variables reflected their importance as matching variables, which typically was related to uniqueness. For example, because there are only two possible values for the sex of the respondent/patient, matches on this field scored fewer points than if the postal codes or HNs matched.

Total linkage weights for each pair of CCHS-HPOI records were calculated by summing the scores assigned to each pair of linking variables. The higher the total linkage weight, the more likely the two records pertained to the same individual. Total linkage weights ideally form a bi-modal distribution. When pairs of records scored above the selected threshold, they were accepted as “true” matches; pairs below the threshold were rejected. To eliminate the need for manual review, the upper and lower cut-off points chosen for this study were identical, which meant that each pair of records could be classified in only one of two ways: match or non-match.
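A minimal Python sketch of summing field scores and applying a single cut-off is shown below. The field list, the agreement and disagreement points, the threshold value and the treatment of a score exactly equal to the threshold are all hypothetical; the sketch also omits partial agreement for brevity.

```python
# Hypothetical (agreement, disagreement) points per linking variable.
# Less distinctive fields such as sex earn fewer points for agreement.
FIELD_POINTS = {
    "hn": (10.0, -6.0),
    "birth_date": (6.0, -4.0),
    "postal_code": (5.0, -2.0),
    "sex": (1.0, -3.0),
}
THRESHOLD = 8.0   # single cut-off: no "possible match" zone for manual review


def total_linkage_weight(survey_rec: dict, hospital_rec: dict) -> float:
    """Sum the points earned or lost on each linking variable."""
    total = 0.0
    for field, (agree, disagree) in FIELD_POINTS.items():
        total += agree if survey_rec[field] == hospital_rec[field] else disagree
    return total


def classify(survey_rec: dict, hospital_rec: dict, threshold: float = THRESHOLD) -> str:
    """Return 'match' or 'non-match'; a tie counts as a match (an assumption)."""
    weight = total_linkage_weight(survey_rec, hospital_rec)
    return "match" if weight >= threshold else "non-match"


survey = {"hn": "1234567897", "birth_date": "1950-04-12", "postal_code": "K1A0T6", "sex": "F"}
moved = dict(survey, postal_code="K2P1L4")                    # same person, new address
other = dict(survey, hn="9876501234", birth_date="1972-01-30")

result_moved = classify(survey, moved)    # 10 + 6 - 2 + 1 = 15
result_other = classify(survey, other)    # -6 - 4 + 5 + 1 = -4
```

With one threshold instead of separate upper and lower cut-offs, every pair falls on one side or the other, which is why no pairs are left for manual review.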

Results

To evaluate the coverage of the linkage between cycle 1.1 of the CCHS and HPOI, the number of people admitted to hospital according to each data source was compared. Survey weights, adjusted for agreement to link and HN validity, were applied to the records of CCHS respondents for whom records were also found in the HPOI database. The HPOI count of hospitalizations was regarded as the standard. The coverage rate was calculated by dividing the weighted estimate of CCHS respondents who successfully linked to HPOI by the HPOI count (after removal of records identified as pertaining to residents of Indian Reserves or associated lands or of institutions), and then multiplying by 100. Differences between the HPOI counts and the weighted estimates from the CCHS were examined. Standard errors and 95% confidence intervals were calculated for the coverage rates using the bootstrap technique. Statistical significance was tested using the t-test (p<0.05).45, 46
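The coverage rate and its bootstrap standard error can be sketched in Python as follows. The sketch assumes a replicate-weight form of the bootstrap, in which the estimate is recomputed with each set of bootstrap weights; the source states only that "the bootstrap technique" was used, and all records, weights and counts below are made up.

```python
import math


def coverage_rate(linked, weights, hpoi_count):
    """Weighted number of linked respondents as a percentage of the HPOI count."""
    weighted_linked = sum(w for w, is_linked in zip(weights, linked) if is_linked)
    return 100.0 * weighted_linked / hpoi_count


def bootstrap_se(linked, replicate_weights, hpoi_count, point_estimate):
    """Root mean squared deviation of the replicate estimates from the point estimate."""
    squared = [
        (coverage_rate(linked, rep, hpoi_count) - point_estimate) ** 2
        for rep in replicate_weights
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: four respondents, three sets of bootstrap weights.
linked = [True, False, True, True]
weights = [100.0, 120.0, 80.0, 90.0]
replicates = [
    [110.0, 115.0, 70.0, 95.0],
    [95.0, 125.0, 85.0, 85.0],
    [100.0, 118.0, 90.0, 88.0],
]
hpoi_count = 300

rate = coverage_rate(linked, weights, hpoi_count)
se = bootstrap_se(linked, replicates, hpoi_count, rate)
ci_95 = (rate - 1.96 * se, rate + 1.96 * se)
```

In practice many more replicates would be used; three are shown only to keep the arithmetic easy to follow.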

According to HPOI, from September 1, 2000 through November 3, 2001, 1,572,343 people were admitted to an acute-care hospital (excluding Quebec) (Table 1). Weighted estimates from the CCHS, adjusted for agreement to link and valid HN, were 7.7% lower (1,451,272).
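As a worked check of the overall figures, using the two counts reported above (variable names are illustrative):

```python
hpoi_count = 1_572_343             # people admitted, according to HPOI
cchs_weighted_linked = 1_451_272   # weighted CCHS estimate of linked respondents

coverage = 100 * cchs_weighted_linked / hpoi_count
shortfall = 100 - coverage

print(round(coverage, 1), round(shortfall, 1))
```

The overall coverage rate is therefore about 92.3%, which corresponds to the 7.7% shortfall relative to HPOI.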

Table 1
Number hospitalized in acute-care hospitals and coverage rates, September 1, 2000 through November 3, 2001, by selected characteristics and data source, population aged 12 or older, Canada excluding Quebec

Coverage rates were similar for males and females (91.0% and 93.1%). Provincial rates did not differ significantly from the rate for the rest of Canada. However, based on the CCHS, the estimated number of residents of the territories who were hospitalized was considerably higher than the HPOI number. As a result, the coverage rate for the territories exceeded 100%.

Coverage rates for most age groups were similar. The exception was seniors aged 75 or older, whose rate (76.2%) was significantly lower than that of people aged 12 to 74 (96.4%).

Discussion

The significantly lower coverage rate for seniors aged 75 or older was anticipated because the two data sources did not pertain to exactly the same populations. The CCHS excludes residents of institutions, but they are included in the hospital data (HPOI). Institutionalization is considerably more common among seniors than among younger people: overall, fewer than 2% of Canadians live in an institution, but at age 75 or older, the figure is 16%.47

In the absence of direct information in HPOI records about patients’ place of residence, the postal code in combination with the PCCF+ and the Residential Care Facilities list was used to determine if patients lived in an institution. More than 20,000 institutional residents were identified and subsequently removed from HPOI using the PCCF+. Nonetheless, the coverage rate for seniors aged 75 or older remained significantly below the rates for younger people.

Use of the PCCF+ and the Residential Care Facilities list to identify institutions based only on the postal code is not ideal. Institutions that accounted for the majority of the population sharing a postal code had a higher chance of being identified and subsequently removed from the HPOI counts. As well, institutions in urban areas have more precise postal codes, and therefore, residents of such institutions were more likely to have been removed from HPOI. Rural and outlying suburban areas and smaller towns often have the same postal code for multiple enumeration/dissemination areas. Consequently, the coding is far less precise than for centralized urban postal codes, which are usually linked to a single enumeration/dissemination area. Therefore, residents of institutions in rural and outlying suburban areas and smaller towns likely remained in the HPOI counts.

The coverage rate in the territories is also problematic, in that the linked CCHS-HPOI estimates exceeded the standard (HPOI). This, however, is less of a concern, because the small number of CCHS records linking to HPOI (124) precludes future analyses featuring this subpopulation. Before the removal of on-reserve residents from the HPOI count, the coverage rate for the territories was 113%; after their removal, the rate was 163%. It is unclear why the linked HPOI-CCHS estimate is so much higher than HPOI. Records of CCHS respondents identified as living in the territories were reviewed to determine if some had high survey weights, which might explain the discrepancy between the HPOI and HPOI-CCHS counts. No discrepant weights were found; the average weight was 51, with weights ranging in value from 11 to 178.

In addition, hospitalizations pertaining to military personnel could not be identified and removed from HPOI. Full-time members of the Armed Forces are excluded from CCHS, and their inclusion may affect the coverage rate.  

Conclusion

The value of record linkage is well established in epidemiological studies of population health. Linking information from routinely collected administrative health data such as HPOI with survey data like the CCHS holds promise for discoveries about health determinants, different types of health care use and health outcomes. Coverage evaluation is a fundamental prerequisite to analyses based on the CCHS-HPOI linked file, which integrate health-related information from multiple sources.

This evaluation shows that the overall coverage rate is high, often over 90%, although some CCHS respondents, notably seniors, had lower rates. Even this limitation is manageable, however, as long as users of the file explicitly acknowledge that findings pertain only to the general household population (the target population of the CCHS), and not to the total population, particularly residents of institutions. 

Acknowledgements

The author thanks Claude Nadeau for his assistance and thoughtful comments.