Longitudinal Immigration Database (IMDB) Technical Report, 2019
5 Data processing

5.1 Processing

A number of government agencies are involved in the creation and processing of the IMDB. From initial data collection, to processing and dissemination, their cooperation is required to ensure the high standard of data quality that data users expect from Statistics Canada. At each step in the processing sequence, thorough manual and automated data quality checks are performed, and feedback loops are in place to correct any detected errors at the source. The following section briefly describes the annual processing that updates the IMDB.

Figure 3
Summary of the IMDB process flow

Description for Figure 3

Input files are received. Immigration data, namely the Immigrant Landing File, the Non-permanent Resident File, and the Citizenship File, are received from IRCC. Tax data, namely the T1 file, the Canada Child Tax Benefit (CCTB) and T4, are received from CRA. After the files are received, record linkages are done using the Social Data Linkage Environment (SDLE) and the Linkage Control File (LCF) to identify individuals. Then the final core files of the Longitudinal Immigration Database (IMDB) are produced. From the immigration data, the PNRF, PNRF_Nonfilers, PNRF_Extra, NRF_Permit and NRF_Person datasets are created. From the tax data, the T1FF files are created for years 1982 to 2018.

Integrated within the IMDB are several modules: the Settlement Services Module (based on the 2018 IMDB), the Wages module, and the Children module, as well as the Express Entry file. Note: See glossary of terms for definitions of acronyms. The source for this figure is Statistics Canada.

As shown in Figure 3, Statistics Canada first receives from the Canada Revenue Agency (CRA) the T1 data, in a file called “Personal Master File” (PMF), and other tax files. The tax files are then used to create the T1 Family File (T1FF), where individuals are linked to spouses and children via a common identifier, and geographic variables are created. Statistics Canada performs manual quality checks, and compares estimates from the T1FF with other data sources, such as the census (in census years) and the Survey of Labour and Income Dynamics, as well as annual income statistics produced by the CRA.

On the immigration side, IRCC provides the data on landed immigrants, non-permanent residents and citizens used to produce the IMDB. These data serve to create the Immigrant Landing File (ILF) and the Non-permanent Resident File (NRF). The ILF and NRF are assumed to be complete censuses of permanent and temporary resident permits issued by IRCC since 1980.

In addition to adding the information for the most recent tax year, a full back-sweep of previous years is done in order to add tax information for any new individuals that have been linked. This could mean that a landed immigrant’s or non-permanent resident’s filed tax records are not linked in the IMDB one year but that their subsequent tax filings could still be linked in a later year. As methodology improves, the back-sweep could ensure that all their previous tax filings, if they are on the T1FF, can become linked as well. This is how, after the processing of the most recent tax data, individuals who had landed and filed taxes many years earlier could still be added to the IMDB. For individuals with multiple admissions since 1980, data from the time of the first admission are retained.
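The effect of the back-sweep can be pictured with a minimal Python sketch. The function name `back_sweep`, the dictionary layout and the single-letter person keys are illustrative assumptions; the production linkage runs through the SDLE and is not reproduced here.

```python
def back_sweep(t1ff_by_year, linked_ids):
    """Return {person: sorted tax years found on the T1FF} for every linked person.

    t1ff_by_year: {tax_year: set of person keys present on that year's T1FF}
    linked_ids:   person keys linked to an immigration record (hypothetical keys)
    """
    found = {pid: [] for pid in linked_ids}
    for year in sorted(t1ff_by_year):
        for pid in linked_ids:
            if pid in t1ff_by_year[year]:
                found[pid].append(year)
    return found


# Toy data: person "B" is linked only in the current cycle, yet every earlier
# year on file is picked up because all years are swept again.
t1ff = {2016: {"A", "B"}, 2017: {"A", "B"}, 2018: {"A", "B", "C"}}
previous_cycle = back_sweep(t1ff, {"A"})
current_cycle = back_sweep(t1ff, {"A", "B"})
print(current_cycle["B"])
```

In this toy run, person "B" is absent from the previous cycle but gains all three tax years once linked, which mirrors how earlier filings can enter the IMDB after a later linkage.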

Although taxes for a given year are usually filed in the spring of the following year (e.g., 2013 income is claimed in 2014), there are exceptions. Someone may file taxes later in the year and therefore not be included in that year’s T1 processing done by Statistics Canada. When that file is handed down for IMDB processing, these late filers are excluded, and they will not be included in the next year’s processing, as the T1FF is not updated. Similarly, individuals who file taxes for previous years are not added to the IMDB for those years, as previous years’ T1FF files are not updated. In that case, a person’s first on-time filing will show up as their first year in the database.

At this point, a series of programs is run to assess the data quality and linkage rates, ensuring that there are no duplicates and flagging outliers. Once the database is linked and these checks are done, it is deemed complete and ready for dissemination.

In the end, the database consists of SAS files, one tax file per year since 1982 (IMDB_T1FF_&year), and Immigration data files (PNRF_1980_2019, PNRF_EXTRA_1980_2013, PNRF_1952_1979 and NRF_PERMIT_1980_2019). All these files are described in Section 2. The IMDB Unique Person Identifier (IMDB_ID) is used to connect all these files (see Appendix D.1 for programming tips).
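As a rough illustration of how IMDB_ID connects the files, the following Python sketch left-joins one annual tax file onto the PNRF. The IMDB files themselves are SAS datasets; the function name `attach_tax_year`, the row layout and the values are made up for the example, and which variables sit on which file is assumed only for illustration.

```python
def attach_tax_year(pnrf_rows, t1ff_rows):
    """Left-join one annual tax file onto the PNRF using IMDB_ID."""
    tax_by_id = {row["IMDB_ID"]: row for row in t1ff_rows}
    merged = []
    for person in pnrf_rows:
        # Persons with no tax record that year keep only their PNRF variables.
        tax = tax_by_id.get(person["IMDB_ID"], {})
        merged.append({**tax, **person})
    return merged


# Hypothetical rows; real files carry many more variables.
pnrf = [
    {"IMDB_ID": "IM_A", "LANDING_YEAR": 1980},
    {"IMDB_ID": "IM_B", "LANDING_YEAR": 2010},
]
t1ff_2011 = [{"IMDB_ID": "IM_A", "OUTLIER_IND": 0}]
merged = attach_tax_year(pnrf, t1ff_2011)
print(merged)
```

A left join is used here so that non-filers stay in the result; an analysis restricted to taxfilers would keep only rows that found a match.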

5.2 Non-permanent Resident File (NRF) linkage

The Non-permanent Resident File (NRF), provided by IRCC, covers records of temporary resident permits issued for 1980 and subsequent years. It provides some demographic information about non-permanent residents as well as detailed information regarding their permits, such as permit type and the valid-date range.

The NRF contains millions of observations. These, however, include duplicates, whereby a single individual may have a number of different IDs. This issue is due mainly to records from the late 1980s where the original person identification number was lost. These records have been removed by linking the NRF to itself. This has resulted in approximately 220,000 records (roughly 400,000 observations) being identified as duplicates. In cases where both non-permanent resident records had their own landing record, the duplication link has been nullified (applicable to fewer than 1,000 records), as it is assumed that the landing file contains unique identifiers. After cleaning, only distinct non-permanent residents remain.
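The clean-up logic described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the candidate pairs are taken as given (the matching rules used to link the NRF to itself are not described here), and the names `resolve_duplicates`, `candidate_pairs` and `has_landing_record` are hypothetical.

```python
def resolve_duplicates(candidate_pairs, has_landing_record):
    """Map each duplicate NRF ID to the ID that is kept.

    candidate_pairs:    (keep_id, duplicate_id) pairs found by linking the NRF to itself
    has_landing_record: set of NRF IDs that have their own landing record
    """
    mapping = {}
    for keep_id, dup_id in candidate_pairs:
        # Two IDs that each have a landing record are treated as two people,
        # so the duplication link is nullified.
        if keep_id in has_landing_record and dup_id in has_landing_record:
            continue
        mapping[dup_id] = keep_id
    return mapping


nrf_ids = ["N1", "N2", "N3", "N4"]
pairs = [("N1", "N2"), ("N3", "N4")]
landed = {"N3", "N4"}
mapping = resolve_duplicates(pairs, landed)
distinct = sorted({mapping.get(i, i) for i in nrf_ids})
print(distinct)
```

Here "N2" collapses into "N1", while "N3" and "N4" both survive because each has its own landing record.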

Both immigration files (ILF and NRF) contain some demographic information. However, the demographic information contained in the two files may not always be consistent. When more than one source is available and the values conflict, the information from the ILF is retained on the Integrated Permanent and Non-permanent Resident File (PNRF), in light of data quality issues with the NRF in its earlier years.

5.3 Derived variables included in T1FF

Once record linkages have been performed, immigration-specific variables for immigrants and temporary residents are added to the T1FF.

To identify a taxfiler’s immigration status, two variables have been created: the admission year (LANDING_YEAR) and the first effective year (FIRST_EFFECTIVE_YEAR), which is the year in which the person first obtained a non-permanent residence permit. As a result, the indicator of presence on the non-permanent resident file (TR_IND) has been removed.

Derived variables that identify and describe families are also created. In each annual T1FF, it is possible to have an estimate of the number of immigrants in a family who were admitted in 1980 or thereafter (variable IMM80F&year). However, this can be an underestimate, because the variable includes only filers and not imputed records; children are therefore under-counted. It is also possible to determine whether the immigrant has a spouse (in the given taxation year) and whether this spouse is an immigrant or a non-permanent resident (variable SP_IDI&year). Data users can identify immigrants in the same family, each tax year, by using the variable Family Identification Number (FIN_). All members of a family have the same value for this variable, namely the IMDB_ID of the oldest family member who landed in 1980 or thereafter. The quality of these variables depends on the quality of the record linkage and the T1FF files, since only linked individuals will be counted (see Section 7.5).
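The rule for the family identifier can be illustrated with a short Python sketch. It is not the production derivation: the field BIRTH_YEAR, the IDs and the function names are hypothetical, and "oldest" is assumed to mean the earliest year of birth.

```python
def landed_since_1980(members):
    """Family members with an admission year of 1980 or later."""
    return [m for m in members
            if m["LANDING_YEAR"] is not None and m["LANDING_YEAR"] >= 1980]


def family_identifier(members):
    """IMDB_ID of the oldest member admitted in 1980 or later, else None."""
    eligible = landed_since_1980(members)
    if not eligible:
        return None
    # Assumption: oldest = smallest BIRTH_YEAR (a hypothetical field).
    return min(eligible, key=lambda m: m["BIRTH_YEAR"])["IMDB_ID"]


family = [
    {"IMDB_ID": "IM_A", "LANDING_YEAR": 1987, "BIRTH_YEAR": 1960},
    {"IMDB_ID": "IM_B", "LANDING_YEAR": 1999, "BIRTH_YEAR": 1955},
    {"IMDB_ID": None, "LANDING_YEAR": None, "BIRTH_YEAR": 1950},  # not an immigrant
]
fin = family_identifier(family)
n_immigrants = len(landed_since_1980(family))
print(fin, n_immigrants)
```

The third member is older but was not admitted in 1980 or later, so the identifier comes from "IM_B", and the count of immigrants in the family is 2.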

The variables with the prefix TNK are counts of the number of claimed children of a given age in the families of immigrants and non-permanent residents (see the tax component of the data dictionary for more details). The term “children” (“child”) is defined as any person who is single and living with one or two parents; a child can be of any age. For example, in Table 3, the family of the immigrant identified as IM19801 has two children aged 1 in 2011 (TNK01I2011), while the family of IM19873 has a total of three children in 2011 (TNKIDI2011): one aged 0 (TNK00I2011), one aged 1 (TNK01I2011), and one who is older than 18 years of age (TNK19I2011). The immigrant IM20105 has no children in 2011.

Table 3
Example on variables related to number of children in family
IMDB_ID TNK00I2011 TNK01I2011 TNKxxI2011 TNK19I2011 TNKIDI2011
IM19801 0 2 0 0 2
IM19802 0 1 0 0 1
IM19873 1 1 0 1 3
IM19994 0 0 0 1 1
IM20105 0 0 0 0 0
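The relationship between the age-specific counts and the total in Table 3 can be sketched in Python. The sketch assumes one variable per single year of age from 00 to 18 plus a top category 19 for children older than 18, and that TNKID is the total; this layout is inferred from the table and should be checked against the data dictionary. The function name `child_counts` is hypothetical.

```python
from collections import Counter


def child_counts(ages, year):
    """Build TNK-style counts from a list of children's ages for one family."""
    top = 19  # assumed top category: children older than 18
    counts = Counter(min(age, top) for age in ages)
    out = {f"TNK{a:02d}I{year}": counts.get(a, 0) for a in range(top + 1)}
    out[f"TNKIDI{year}"] = len(ages)  # total number of children
    return out


# Ages chosen to reproduce the IM19873 row of Table 3 (the age 25 is made up;
# the table only says the child is older than 18).
row = child_counts([0, 1, 25], 2011)
print(row["TNK00I2011"], row["TNK01I2011"], row["TNK19I2011"], row["TNKIDI2011"])
```

Under these assumptions the age-specific counts always sum to the total, which matches every row shown in Table 3.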

Another variable added to the T1FF is OUTLIER_IND (1: outlier; 0: not an outlier). This flag identifies records with extreme incomes (see Section 5.5 for more details) so that they can be removed from any tables or calculations. Records identified as outliers contain extreme incomes that could bias analysis results.

5.4 Derived variables included in PNRF

When the PNRF is produced, some variables relating to tax filing patterns are derived and included in the file. The variable FIRST_TAX_YEAR indicates the first year for which a tax record is available for a given individual, while LAST_TAX_YEAR indicates the last year for which a tax record is available. Note that a tax record does not necessarily exist for every year between the first tax year and the last tax year. For example, a case where FIRST_TAX_YEAR=1982 and LAST_TAX_YEAR=2012 does not necessarily indicate that the taxfiler has filed taxes continuously, as the tax file for 2006, say, may be missing. Missing values of FIRST_TAX_YEAR and LAST_TAX_YEAR denote non-filers, that is, people who have never filed income tax. This is an update since the 2018 IMDB; in the past, filers and non-filers were merged together.

The variable PREFILER_IND is used to identify immigrants who have T1FF data prior to their admission year. Most have been linked to a non-permanent resident record, as expected (see Section 7.2.4 for more details).
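The filing-pattern variables in this section can be illustrated with a small Python sketch. It is only a sketch: the function names, the `missing_years` helper output and the 1/0 coding of PREFILER_IND are assumptions made for the example, not the documented derivation.

```python
def filing_summary(tax_years):
    """Derive first/last tax year and list the gap years in between."""
    if not tax_years:
        # Non-filers: both variables are missing.
        return {"FIRST_TAX_YEAR": None, "LAST_TAX_YEAR": None, "missing_years": []}
    first, last = min(tax_years), max(tax_years)
    present = set(tax_years)
    gaps = [y for y in range(first, last + 1) if y not in present]
    return {"FIRST_TAX_YEAR": first, "LAST_TAX_YEAR": last, "missing_years": gaps}


def prefiler_ind(landing_year, tax_years):
    """1 if any tax record precedes the admission year, else 0 (assumed coding)."""
    return int(any(year < landing_year for year in tax_years))


# A filer present from 1982 to 2012 except for 2006.
years = [y for y in range(1982, 2013) if y != 2006]
summary = filing_summary(years)
print(summary["FIRST_TAX_YEAR"], summary["LAST_TAX_YEAR"], summary["missing_years"])
print(prefiler_ind(1990, years))
```

The example reproduces the case in the text: the first and last tax years are 1982 and 2012, yet 2006 is absent, so continuity has to be checked year by year rather than inferred from the two endpoints.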

5.5 Outlier detection

After the IMDB_T1FF files are created, outlier detection is performed on all tax files to identify outlier records. A record is deemed to be an outlier when it is determined to contain one or more extreme income values compared with other records. The criteria used to identify the outliers are confidential. The variable OUTLIER_IND is created to identify the records with extreme values.

The outlier flag, OUTLIER_IND, is in the tax files, but is not in the PNRF. A given person’s record may be flagged as an outlier in a specific year without necessarily being found to be an outlier for all years for which the person filed taxes. All outliers are to be removed from analysis. As shown in Table 4, for person IM19802, only the 1983 record has been flagged as being an outlier, while person IM19801 has no tax files flagged as outlier. No outlier flag is available in 2012 for IM19994 because no tax records are available for that person in 2012.

Table 4
Example related to the outlier flag
IMDB_ID OUTLIER_IND1982 OUTLIER_IND1983 OUTLIER_INDyyyy OUTLIER_IND2012 OUTLIER_IND2014
IM19801 0 0 0 0 0
IM19802 0 1 0 0 0
IM19873 1 1 1 0 ...
IM19994 0 0 0 ... 0
... not applicable
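A minimal Python sketch of year-specific outlier removal follows. The flags for 1983 mirror Table 4; the income values, the key INCOME1983 and the function name are invented for illustration, and the actual outlier criteria remain confidential.

```python
def mean_income_excluding_outliers(rows, year):
    """Average income for one tax year, using only records flagged 0 that year."""
    flag, income = f"OUTLIER_IND{year}", f"INCOME{year}"  # INCOME is a hypothetical name
    # Records flagged 1, or with no flag because no tax record exists, are skipped.
    kept = [r[income] for r in rows if r.get(flag) == 0]
    return sum(kept) / len(kept) if kept else None


# Flags as in Table 4 for 1983; incomes are made up.
rows = [
    {"IMDB_ID": "IM19801", "OUTLIER_IND1983": 0, "INCOME1983": 30_000},
    {"IMDB_ID": "IM19802", "OUTLIER_IND1983": 1, "INCOME1983": 9_000_000},
    {"IMDB_ID": "IM19873", "OUTLIER_IND1983": 1, "INCOME1983": 7_000_000},
    {"IMDB_ID": "IM19994", "OUTLIER_IND1983": 0, "INCOME1983": 50_000},
]
print(mean_income_excluding_outliers(rows, 1983))
```

Because the flag is year-specific, IM19802 would be excluded from the 1983 calculation only and would still contribute to other years.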

The outliers can be removed from tabulations and any analysis. The IMDB excludes (the very few) large incomes as they would skew averages and give users an incorrect impression of the income situation for certain types of immigrants. Consider a fictitious example where the average income of Czech-Canadians is $40,000 in a given year and, the next year, it suddenly jumps to $500,000 because, by chance, a Czech hockey player was admitted. This would bias the “real” income situation for Czech-Canadians. For that reason, the “un-representative” Czech hockey player’s income would be removed from calculations. There is a confidentiality component to this example, as well. If such a jump in average income were observed, one could deduce the Czech hockey player’s income, which would be a breach of confidentiality. Incidentally, in some IMDB products, median income, which is more resistant to the changing influence of large individual values, is also provided as a measure.
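The fictitious example can be worked through numerically. The group size and the individual incomes below are invented so that the averages match the $40,000 and $500,000 figures in the text; they are not IMDB data.

```python
from statistics import mean, median

# Year 1: ten people, each earning $40,000 (made-up group).
year1 = [40_000] * 10
# Year 2: the same ten people plus one very high earner.
year2 = year1 + [5_100_000]

print(mean(year1), median(year1))   # average and median before the arrival
print(mean(year2), median(year2))   # average jumps, median does not move

# Confidentiality risk: the newcomer's income can be backed out of the averages.
implied_income = mean(year2) * len(year2) - mean(year1) * len(year1)
print(implied_income)
```

With these numbers the average rises from 40,000 to 500,000 while the median stays at 40,000, and the difference of the two totals recovers the newcomer's income of 5,100,000, which is the disclosure problem described above.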

When producing tables or analyzing data, records deemed to be outliers for a given year have to be removed from calculations relating to that year, for the reasons mentioned above, or a dominance rule needs to be applied. For further details, see Appendix D.6.

