Longitudinal Immigration Database (IMDB) Technical Report, 2018
4 Record linkage

As described in this document, the IMDB is the product of numerous record linkages. It was created for the purpose of providing statistical information in an anonymous format. This section gives an overview of the record linkage methods used to create the IMDB. For more details regarding data processing related to record linkage, see Section 5.

Record linkage is the process of matching records between or within databases. This approach is commonly used to fill data gaps and create a dataset with broad applications (Rotermann et al. 2015).

The IMDB was produced using the Social Data Linkage Environment (SDLE), a highly secure linkage environment that facilitates the creation of integrated population data files for social analysis.

At the core of the SDLE is a Derived Record Depository (DRD or Depot), a national dynamic relational database containing only basic personal identifiers. The DRD is created by integrating selected Statistics Canada source index files for the purpose of producing a list of unique individuals. These files, which contain personal identifiers without analysis variables, are brought into the environment, processed and integrated only once to the DRD. Updates to these data files are integrated to the DRD on an ongoing basis.

In 2018, the linkage rate to the Depot for immigration records was 97.4% (Cascagnette, 2019). The probabilistic method was used to link IRCC’s immigration data to CRA’s tax data, and the record linkage was performed with G-Link.

Records are compared field by field using comparison rules. Based on the similarity of the values in a pair of records, each rule generates an outcome such as exact match, string proximity, missing information or field disagreement. Each pair of records is assigned a comparison result pattern, and that pattern is evaluated to classify the pair as linked, possibly linked or not linked.
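
The field-by-field comparison described above can be sketched as follows. This is a minimal illustration, not the rules used by G-Link: the field names, the outcome labels, the similarity measure (Python's `difflib.SequenceMatcher`) and the 0.85 proximity threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def compare_field(a, b, proximity_threshold=0.85):
    """Return one rule outcome for a single field of a record pair.

    The outcome labels and the threshold are hypothetical.
    """
    if not a or not b:
        return "missing"
    if a == b:
        return "exact"
    if SequenceMatcher(None, a, b).ratio() >= proximity_threshold:
        return "close"
    return "disagree"


def comparison_pattern(rec_1, rec_2, fields):
    """Combine the per-field outcomes into one comparison result pattern."""
    return tuple(compare_field(rec_1.get(f), rec_2.get(f)) for f in fields)


# Hypothetical linkage fields and records.
FIELDS = ("surname", "given_name", "birth_date")
rec_a = {"surname": "Tremblay", "given_name": "Marie", "birth_date": "1980-04-12"}
rec_b = {"surname": "Trembley", "given_name": "Marie", "birth_date": None}

pattern = comparison_pattern(rec_a, rec_b, FIELDS)
```

Here the pair receives one outcome per field, and the resulting tuple plays the role of the comparison result pattern that is later evaluated to classify the pair.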

The theory of probabilistic record linkage is based on the premise that certain comparison result patterns are characteristic of truly linked pairs, while others are characteristic of truly unlinked pairs. Therefore, each rule outcome is assigned a weight based on the ratio of the estimated probability of the outcome occurring for true matches to the estimated probability of the outcome occurring for non-matches.
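
As a rough illustration of this weighting, the sketch below computes a weight for each rule outcome from the two estimated probabilities. Taking the base-2 logarithm of the ratio is a common convention in probabilistic linkage and is an assumption here, as are the probability values, which are invented for the example.

```python
import math


def outcome_weight(m_prob, u_prob):
    """Weight of a rule outcome: log2 of P(outcome | true match)
    over P(outcome | non-match). The log base is an assumed convention."""
    return math.log2(m_prob / u_prob)


# Hypothetical probabilities for a single comparison rule.
M = {"exact": 0.90, "disagree": 0.10}  # among true matches
U = {"exact": 0.01, "disagree": 0.99}  # among non-matches

weights = {outcome: outcome_weight(M[outcome], U[outcome]) for outcome in M}
```

An outcome that is much more likely among true matches than among non-matches receives a positive weight, and an outcome that is more likely among non-matches receives a negative one.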

The composition of the linked set is not known in advance, so the probabilities of result patterns for truly linked records are not known. Linked weight components are estimated from prior knowledge and early iterations of the linkage process, and are refined over successive iterations of the linkage process.

The unlinked weight components are calculated based on the frequency with which the rule outcomes were observed among record pairs that do not belong together, which is approximately equal to the frequency with which the rule outcomes would be observed among randomly paired records. After repeated iteration of the linkage process, linked weight components stabilize and final weights are ready for use.
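
The approximation used for the unlinked weight components can be illustrated by sampling random pairs and counting how often each outcome occurs. The sketch below is an assumption-laden example (a single hypothetical field, month of birth, with only two outcomes), not the procedure implemented in the SDLE.

```python
import random
from collections import Counter


def estimate_u(values_1, values_2, n_pairs=10_000, seed=0):
    """Estimate outcome frequencies among randomly paired records."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        a = rng.choice(values_1)
        b = rng.choice(values_2)
        counts["agree" if a == b else "disagree"] += 1
    return {k: counts[k] / n_pairs for k in ("agree", "disagree")}


# Hypothetical field: month of birth, evenly distributed in both files.
months_1 = list(range(1, 13)) * 50
months_2 = list(range(1, 13)) * 50

u = estimate_u(months_1, months_2)
```

With twelve equally likely values, random pairs agree on the month roughly one time in twelve, so the estimated frequency of agreement among unlinked pairs is close to 1/12.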

The strategy for the probabilistic record linkage involves the following six steps:

  1. Generate potential pairs using an initial criterion
  2. Develop and apply comparison rules to potential pairs to derive probability ratios
  3. Apply frequency weights
  4. Assign linkage states to the pairs using probability ratios and thresholds
  5. Form groups
  6. Resolve conflicts using mapping

Steps 2 to 4 are repeated iteratively.
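
The overall flow of these steps can be sketched in a few functions. This is a simplified illustration under stated assumptions: the blocking key, field names, weights and thresholds are hypothetical, frequency weights (step 3) and the iteration of steps 2 to 4 are not modelled, and a greedy one-to-one assignment stands in for group formation and conflict resolution.

```python
from collections import defaultdict

# Hypothetical weights for agreement (True) and disagreement (False).
WEIGHTS = {
    "surname": {True: 4.0, False: -3.0},
    "birth_date": {True: 5.0, False: -4.0},
}
UPPER, LOWER = 6.0, 0.0  # hypothetical thresholds


def potential_pairs(file_1, file_2, key="birth_year"):
    """Step 1: pair only records that share the initial criterion."""
    index = defaultdict(list)
    for rec in file_2:
        index[rec[key]].append(rec)
    for rec_1 in file_1:
        for rec_2 in index.get(rec_1[key], ()):
            yield rec_1, rec_2


def total_weight(rec_1, rec_2):
    """Step 2: apply the comparison rules and sum the outcome weights."""
    return sum(WEIGHTS[f][rec_1[f] == rec_2[f]] for f in WEIGHTS)


def classify(weight):
    """Step 4: assign a linkage state using the thresholds."""
    if weight >= UPPER:
        return "linked"
    if weight > LOWER:
        return "possibly linked"
    return "not linked"


def link(file_1, file_2):
    scored = [(total_weight(a, b), a["id"], b["id"])
              for a, b in potential_pairs(file_1, file_2)]
    linked = [s for s in scored if classify(s[0]) == "linked"]
    # Steps 5-6 (simplified): keep the highest-weighted pair for each
    # record so that the final mapping is one-to-one.
    used_1, used_2, mapping = set(), set(), {}
    for _, id_1, id_2 in sorted(linked, reverse=True):
        if id_1 not in used_1 and id_2 not in used_2:
            mapping[id_1] = id_2
            used_1.add(id_1)
            used_2.add(id_2)
    return mapping


file_1 = [
    {"id": "i1", "surname": "Singh", "birth_date": "1975-02-03", "birth_year": 1975},
    {"id": "i2", "surname": "Chen", "birth_date": "1982-11-30", "birth_year": 1982},
]
file_2 = [
    {"id": "t1", "surname": "Singh", "birth_date": "1975-02-03", "birth_year": 1975},
    {"id": "t2", "surname": "Singh", "birth_date": "1975-07-19", "birth_year": 1975},
    {"id": "t3", "surname": "Chan", "birth_date": "1982-11-30", "birth_year": 1982},
]

mapping = link(file_1, file_2)
```

In this toy example only the pair that agrees on both fields exceeds the upper threshold. The pair that agrees on date of birth alone falls between the thresholds and would be treated as possibly linked.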

Users of a dataset created as a result of record linkage need to be aware that linkage errors are possible. A record linkage will have one of four outcomes: true matches correctly classified as matches, true matches falsely classified as non-matches, true non-matches falsely classified as matches, or true non-matches correctly classified as non-matches (Winkler 2009). As shown in the example in Table 2, where records from file 1 are linked to records from file 2, the result of the record linkage between two records will be either a match or a non-match. A good record linkage will maximize the proportion of true matches correctly classified as matches and the proportion of true non-matches correctly classified as non-matches, and minimize the other record linkage outcomes.


Table 2
Example of record linkage outcomes

                  File 2 record
File 1 record     A            B            D            Type of outcome
A                 Match        Non-match    Non-match    True match
C                 Non-match    Match        Non-match    False match
D                 Non-match    Non-match    Non-match    False non-match
E                 Non-match    Non-match    Non-match    True non-match
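
The four outcome types in Table 2 can be reproduced with a small function that compares the link assigned to each file 1 record with its true counterpart. The true counterparts are assumed here from the record labels (A and D exist in both files, C and E only in file 1); in practice the truth is not known and must be estimated.

```python
def outcome_type(assigned, truth):
    """Classify one file 1 record given its assigned and true file 2 record."""
    if assigned is not None:
        return "true match" if assigned == truth else "false match"
    return "false non-match" if truth is not None else "true non-match"


# Links produced by the linkage in Table 2 (None means no match was made).
assigned = {"A": "A", "C": "B", "D": None, "E": None}
# Assumed true counterparts in file 2, inferred from the record labels.
truth = {"A": "A", "C": None, "D": "D", "E": None}

outcomes = {rec: outcome_type(assigned[rec], truth[rec]) for rec in assigned}
```

Record D illustrates a missed link: it is present in both files, but the linkage did not pair the two records.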

The results of probabilistic record linkage are dependent on the quality of the linkage variables. For example, misspelled names or typos in the date of birth can create missed or erroneous matches. A non-match does not necessarily mean that the person did not file taxes. The record linkage rates for the most recent IMDB are available in Section 7.2.1.

For the 2018 IMDB, to improve the record linkage results, the SDLE linkage results were combined with the results of linking the immigration data to the Linkage Control File (LCF), as in the 2015 IMDB instalment.

In order to produce the 2015 IMDB, the hierarchical deterministic method was used to link immigration records to the Linkage Control File (LCF), a database of personal identification numbers (see Section 2 for the descriptions of these files). This method consists of matching records between multiple files (or within a given file) by means of common variables (Dusetzina et al. 2014). Over the course of waves of matches, the linkage criteria become less and less stringent. The LCF is not available to researchers; it is used only to produce the IMDB.
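
A hierarchical deterministic linkage of this kind can be sketched as a series of waves, each matching on fewer variables than the one before. The variables, their order and the rule of accepting only unambiguous matches are assumptions made for illustration; the criteria actually used with the LCF are not described in this section.

```python
# Hypothetical waves, from the most to the least stringent criteria.
WAVES = [
    ("surname", "given_name", "birth_date", "sex"),
    ("surname", "birth_date", "sex"),
    ("surname", "birth_date"),
]


def deterministic_link(file_1, file_2, waves=WAVES):
    """Link records wave by wave; return {file 1 id: (file 2 id, wave)}."""
    links = {}
    remaining_1, remaining_2 = list(file_1), list(file_2)
    for wave_no, keys in enumerate(waves, start=1):
        index = {}
        for rec in remaining_2:
            index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
        unmatched, matched_2 = [], set()
        for rec in remaining_1:
            candidates = index.get(tuple(rec[k] for k in keys), [])
            if len(candidates) == 1:  # accept unambiguous matches only
                match = candidates.pop()
                links[rec["id"]] = (match["id"], wave_no)
                matched_2.add(match["id"])
            else:
                unmatched.append(rec)
        # Only records still unlinked go on to the next, looser wave.
        remaining_1 = unmatched
        remaining_2 = [r for r in remaining_2 if r["id"] not in matched_2]
    return links


file_1 = [
    {"id": "i1", "surname": "Okafor", "given_name": "Ada", "birth_date": "1990-01-05", "sex": "F"},
    {"id": "i2", "surname": "Nguyen", "given_name": "Minh", "birth_date": "1985-06-21", "sex": "M"},
    {"id": "i3", "surname": "Rossi", "given_name": "Luca", "birth_date": "1979-09-09", "sex": "M"},
]
file_2 = [
    {"id": "p1", "surname": "Okafor", "given_name": "Ada", "birth_date": "1990-01-05", "sex": "F"},
    {"id": "p2", "surname": "Nguyen", "given_name": "Min", "birth_date": "1985-06-21", "sex": "M"},
    {"id": "p3", "surname": "Russo", "given_name": "Luca", "birth_date": "1979-09-09", "sex": "M"},
]

links = deterministic_link(file_1, file_2)
```

In this example the first record links in wave 1 on all four variables, the second links in wave 2 once the given name is dropped from the criteria, and the third remains unlinked because its surname differs in every wave.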

The December 2019 release of the 2018 IMDB included only tax files for immigrants and non-permanent residents who arrived between 1974 and 2016. The IMDB was updated in January 2020 to include tax files for immigrants who arrived between 1952 and 1973, as well as those who arrived in 2017 and 2018. Tax files for non-permanent residents who arrived in 2017 and 2018 were also added to the IMDB.

