Longitudinal Immigration Database (IMDB) Technical Report, 2019
4 Record linkage


As described in this document, the IMDB is the product of numerous record linkages. It was created for the purpose of providing statistical information in an anonymous format. This section gives an overview of the record linkage methods used to create the IMDB. For more details regarding data processing related to record linkage, see Section 5.

Record linkage is the process of matching records between or within databases. This approach is commonly used to fill data gaps and create a dataset with broad applications (Rotermann et al. 2015).

To produce the IMDB, the Social Data Linkage Environment (SDLE) was used. The SDLE is a highly secure linkage environment that facilitates the creation of integrated population data files for social analysis.

At the core of the SDLE is a Derived Record Depository (DRD or Depot), a national dynamic relational database containing only basic personal identifiers. The DRD is created by integrating selected Statistics Canada source index files to produce a list of unique individuals. These files, which contain personal identifiers without analysis variables, are brought into the environment, processed and integrated into the DRD only once. Updates to these data files are integrated into the DRD on an ongoing basis.

The probabilistic method was used to integrate IRCC’s immigration data with CRA’s tax data, and the record linkage was performed using G-Link. In 2019, the linkage rate to the Depot for immigration records was 97.1% (Cascagnette, 2020).

The generalized record linkage system used at Statistics Canada, G-Link, is based on the mathematical theory of record linkage developed by Ivan P. Fellegi and Alan B. Sunter. Probabilistic record linkage methodology compares non-unique identifiers (e.g., name and birth date) and estimates the likelihood that the records being matched refer to the same entity (e.g., the same individual). Probabilistic record linkage is especially valuable when the identifiers are prone to change (e.g., surnames that change at marriage), error-prone or frequently missing.

Comparisons between records are made field by field using comparison rules. Based on the similarity of the values in a pair of records, each rule generates an outcome such as exact match, string proximity, missing information or field disagreement. Each pair of records is assigned a comparison result pattern, and that pattern is evaluated to classify the pair as linked, possibly linked or not linked.
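The field-by-field comparison described above can be sketched as follows. This is a minimal illustration, not the actual G-Link rules: the field names, the 0.85 proximity threshold and the use of `difflib` as a string comparator are all assumptions made for the example.

```python
# Sketch of field-by-field comparison rules producing a comparison result
# pattern. Field names, threshold and comparator are illustrative only.
from difflib import SequenceMatcher

def compare_field(a, b, fuzzy=False):
    """Classify one field comparison as an outcome label."""
    if a is None or b is None:
        return "missing"
    if a == b:
        return "exact"
    if fuzzy and SequenceMatcher(None, a, b).ratio() >= 0.85:
        return "proximate"  # string proximity, e.g. a minor typo
    return "disagree"

def comparison_pattern(rec1, rec2):
    """Apply one rule per field and return the comparison result pattern."""
    return (
        compare_field(rec1["surname"], rec2["surname"], fuzzy=True),
        compare_field(rec1["given"], rec2["given"], fuzzy=True),
        compare_field(rec1["dob"], rec2["dob"]),
    )

# Hypothetical records: a likely surname typo and a missing birth date.
r1 = {"surname": "Tremblay", "given": "Marie", "dob": "1985-03-12"}
r2 = {"surname": "Tremblai", "given": "Marie", "dob": None}
print(comparison_pattern(r1, r2))  # → ('proximate', 'exact', 'missing')
```

The resulting tuple is the comparison result pattern that the linkage then evaluates to classify the pair.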

The theory of probabilistic record linkage is based on the premise that certain comparison result patterns are characteristic of truly linked pairs, while others are characteristic of truly unlinked pairs. Therefore, each rule outcome is assigned a weight based on the ratio of the estimated probability of the outcome occurring for true matches to the estimated probability of the outcome occurring for non-matches.
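In Fellegi–Sunter terms, the two probabilities in that ratio are often called m (outcome probability among true matches) and u (outcome probability among non-matches), and the weight is the log of their ratio. The sketch below uses invented m- and u-values purely for illustration; they are not values from the IMDB linkage.

```python
# Sketch of outcome weights as log-likelihood ratios.
# m: P(outcome | pair is a true match); u: P(outcome | pair is a non-match).
# All probabilities below are illustrative assumptions.
import math

rules = {
    ("surname", "exact"):    (0.92, 0.004),
    ("surname", "disagree"): (0.08, 0.996),
    ("dob",     "exact"):    (0.97, 0.0001),
    ("dob",     "disagree"): (0.03, 0.9999),
}

weights = {k: math.log2(m / u) for k, (m, u) in rules.items()}

# Agreement outcomes get large positive weights, disagreements negative.
print(round(weights[("surname", "exact")], 2))  # log2(0.92/0.004) ≈ 7.85
print(round(weights[("surname", "disagree")], 2))
```

Because birth date agreement is far rarer among random pairs than surname agreement, its exact-match weight comes out larger, which matches the intuition that rarer agreements carry more evidence.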

The composition of the linked set is not known in advance, so the probabilities of result patterns for truly linked records are not known. Linked weight components are estimated from prior knowledge and early iterations of the linkage process, and refined over successive iterations.

The unlinked weight components are calculated based on the frequency with which the rule outcomes were observed among record pairs that do not belong together, which is approximately equal to the frequency with which the rule outcomes would be observed among randomly paired records. After repeated iteration of the linkage process, linked weight components stabilize and final weights are ready for use.
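The approximation described above, that the unlinked outcome rate resembles the rate among randomly paired records, can be simulated directly. The mini-files below are hypothetical; only the year of birth is compared.

```python
# Sketch: approximating a u-probability by the rate at which an outcome
# occurs among randomly paired records (hypothetical mini-files).
import random

file1 = [{"yob": 1980}, {"yob": 1985}, {"yob": 1990}, {"yob": 1985}]
file2 = [{"yob": 1985}, {"yob": 1972}, {"yob": 1990}, {"yob": 1961}]

random.seed(0)
pairs = [(random.choice(file1), random.choice(file2)) for _ in range(10000)]

# Chance agreement rate on year of birth among random pairs.
u_exact = sum(a["yob"] == b["yob"] for a, b in pairs) / len(pairs)
print(u_exact)
```

With these value distributions the chance agreement rate settles near 0.19, which is the kind of estimate that feeds the denominator of the weight ratio.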

The strategy for the probabilistic record linkage involves the following six steps:

  1. Generate potential pairs using an initial criterion
  2. Develop and apply comparison rules to potential pairs to derive probability ratios
  3. Apply frequency weights
  4. Assign linkage states to the pairs using probability ratios and thresholds
  5. Form groups
  6. Resolve conflicts using mapping

Steps 2 to 4 are repeated iteratively.
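The six steps above can be sketched end to end on toy data. Everything here is an assumption for illustration: the blocking criterion, the field weights and the thresholds are invented, and the real G-Link implementation is far more elaborate.

```python
# Compact sketch of the six-step strategy on toy files.
# Weights, thresholds and blocking criterion are illustrative only.
from itertools import product

file1 = [{"id": 1, "surname": "Chan", "yob": 1979},
         {"id": 2, "surname": "Ali",  "yob": 1990}]
file2 = [{"id": "A", "surname": "Chan", "yob": 1979},
         {"id": "B", "surname": "Chan", "yob": 1955},
         {"id": "C", "surname": "Ali",  "yob": 1990}]

# Step 1: generate potential pairs with an initial criterion (block on surname).
pairs = [(a, b) for a, b in product(file1, file2)
         if a["surname"] == b["surname"]]

# Steps 2 to 4: compare fields, sum weights, classify against thresholds.
def score(a, b):
    w = 4.0 if a["surname"] == b["surname"] else -4.0
    w += 9.0 if a["yob"] == b["yob"] else -6.0
    return w

UPPER, LOWER = 10.0, 0.0
links = {}
for a, b in pairs:
    s = score(a, b)
    status = ("linked" if s >= UPPER
              else "possible" if s >= LOWER
              else "not linked")
    links[(a["id"], b["id"])] = (round(s, 1), status)

# Steps 5 and 6 (grouping and conflict resolution) would then retain at most
# one accepted link per record.
print(links)
```

Pair (1, A) agrees on both fields and clears the upper threshold; pair (1, B) agrees only on surname and falls below the lower threshold; pairs across different surnames are never generated at all.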

Users of a dataset created through record linkage need to be aware that linkage errors are possible. A record linkage will have one of four outcomes: true matches correctly classified as matches, true matches falsely classified as non-matches, true non-matches falsely classified as matches, or true non-matches correctly classified as non-matches (Winkler 2009). As shown in the example in Table 2, where records from File 1 are linked to records from File 2, the result of comparing two records will be either a match or a non-match. A good record linkage maximizes the proportion of true matches correctly classified as matches and the proportion of true non-matches correctly classified as non-matches, and minimizes the other two outcomes.
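The four outcomes form a simple two-by-two classification, which can be written out directly. This is a generic sketch of the taxonomy, not IMDB-specific logic.

```python
# Sketch: classifying a record pair into one of the four linkage outcomes,
# given (hypothetically known) truth and the linkage's decision.
def outcome(is_true_match, classified_match):
    if is_true_match and classified_match:
        return "true match"        # correctly linked
    if is_true_match and not classified_match:
        return "false non-match"   # missed link
    if not is_true_match and classified_match:
        return "false match"       # erroneous link
    return "true non-match"        # correctly left unlinked

print(outcome(True, True))    # → true match
print(outcome(False, True))   # → false match
```

In practice the truth column is unknown, which is why linkage quality must be assessed indirectly, for example through clerical review of samples.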

Table 2
Example of record linkage outcomes

File 1 record   Comparisons with File 2 records     Type of outcome
A               Match, Non-match, Non-match         True match
C               Non-match, Match, Non-match         False match
D               Non-match, Non-match, Non-match     False non-match
E               Non-match, Non-match, Non-match     True non-match

The results of probabilistic record linkage are dependent on the quality of the linkage variables. For example, misspelled names or typos in the date of birth can create missed or erroneous matches. A non-match does not necessarily mean that the person did not file taxes. The record linkage rates for the most recent IMDB are available in Section 7.2.1.
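The sensitivity to data quality can be illustrated with a string comparator. Here `difflib`'s similarity ratio is a stand-in for the comparators a real linkage would use, and the names are hypothetical.

```python
# Sketch: how a single typo in a linkage variable erodes string similarity.
from difflib import SequenceMatcher

def sim(a, b):
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(sim("Gagnon", "Gagnon"), 2))   # 1.0 — exact agreement
print(round(sim("Gagnon", "Gagnnon"), 2))  # high — likely a typo
print(round(sim("Gagnon", "Martin"), 2))   # low — clearly different names
```

A proximity outcome with an appropriate threshold can still capture the typo case, but a purely exact comparison would record it as a disagreement and risk a missed match.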
