Longitudinal Immigration Database (IMDB) Technical Report, 2022
7 Data evaluation and quality indicators
7.1 Error sources
Because the IMDB is the product of several record linkages, it is subject to different sources of errors, including record linkage errors, measurement errors, and coverage errors. In this section, the sources of errors are explained and the prevalence of some of these errors is presented.
Because the IMDB is a census of immigrant taxfilers who were admitted in 1980 or thereafter, no weights are created. No adjustments are made for the missing tax years of filers or for linkage errors; no sampling is performed; and every linked taxfiler is kept in the final dataset. However, the linkage itself introduces a form of sampling error when links are missed.
7.1.1 Record linkage errors
Datasets produced from the results of record linkages are subject to record linkage errors. Two types of errors are possible—false positives (false matches) and false negatives (false non-matches). A link is considered a false positive when two records not belonging to the same person are deemed a match. A link is considered a false negative when two records belonging to the same person are deemed a non-match.
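The two error types can be illustrated with a minimal sketch. It assumes a hypothetical evaluation set in which the true identity behind each record pair is known; the function name and the example pairs are illustrative and are not part of the IMDB.

```python
def classify_link(same_person: bool, linked: bool) -> str:
    """Classify one record pair given the truth and the linkage decision."""
    if linked and not same_person:
        return "false positive"   # false match
    if not linked and same_person:
        return "false negative"   # false non-match
    return "correct"

# Hypothetical record pairs: (truly the same person?, deemed a match?)
pairs = [(True, True), (False, True), (True, False), (False, False)]
outcomes = [classify_link(same, linked) for same, linked in pairs]
```

In practice the truth is rarely known for every pair, so error rates of this kind are usually estimated from a reviewed subset of links.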
It is possible to miss part of an immigrant’s fiscal history since some immigrants have more than one social insurance number (SIN) over time (a temporary SIN assigned at arrival to the individual as a non-permanent resident, and later a permanent SIN assigned after admission). Both SINs are required in order to have a complete fiscal history from arrival in Canada. The SDLE (described in Section 2.3) allows for identification of these SINs. It is possible that, in a few instances, some SIN connections are missed or false connections are made.
7.1.2 Measurement errors
Measurement error is the difference between a variable’s measured value and its true value. This type of error can be attributed to a number of factors, including data capture (e.g., typos) and respondent error (e.g., misinterpretation of the question asked). This type of error was taken into account in the creation of the Integrated Permanent and Non-permanent Resident File (PNRF) to avoid conflicting information for any individual. For example, when a person has a record on both the ILF and the NRF, and the sociodemographic variables have inconsistent values, the values at admission (in the ILF) are kept. See sections 7.2 and 7.5 for some counts.
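The precedence rule for conflicting values can be sketched as follows. The field names and values are hypothetical, and the handling of a value that is missing on the ILF (falling back to the NRF value) is an assumption made for this illustration; the report states only that ILF values are kept when the two files disagree.

```python
def reconcile(ilf: dict, nrf: dict) -> dict:
    """Combine one person's ILF and NRF values; the ILF (admission) value wins on conflict."""
    merged = dict(nrf)
    for key, value in ilf.items():
        # Assumption: a missing ILF value does not overwrite an NRF value.
        if value is not None:
            merged[key] = value
    return merged

# Hypothetical records for the same person on both files
ilf = {"birth_year": 1975, "gender": "F", "country_of_birth": None}
nrf = {"birth_year": 1976, "gender": "F", "country_of_birth": "PH"}
person = reconcile(ilf, nrf)
```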
7.1.3 Coverage errors
Coverage errors are the result of omissions, erroneous additions, duplicates, and errors of classification of records in the database. Coverage errors can result from inadequate coverage of the population. They can create biased estimates, and the impact can vary for different sub-groups of the population. These errors often result in undercoverage. Undercoverage in the IMDB is in part the result of the exclusion of tax files of immigrant taxfilers from the database. Immigrants who do not file taxes for a given year, or who file late, would not have an IMDB_T1FF record for that year even though they are linked to tax data and are part of the population of interest. If, for any reason, an immigrant record was not included in the Immigrant Landing File (ILF), it would not be part of the IMDB. Overcoverage is the result of the addition to the database of records excluded from the target population. For example, an immigrant could have more than one ILF record as a result of multiple admissions not identified as such. Please refer to Section 7.4 and Appendix B for more information on IMDB coverage.
7.2 Data accuracy
This section will discuss the accuracy of the immigration data. For details on the accuracy of the T1 Family File (T1FF), please refer to the T1FF entry (record number 4105).
The accuracy of the IMDB is dependent on the representativeness of the population included in it. A study conducted in the first years of the IMDB concluded that the IMDB “appears to be representative of the population most likely to file tax returns. Therefore, the results obtained from the IMDB should not be inferred to the immigrant population as a whole, but rather to the universe of tax-filing immigrants” (Carpentier and Pinsonneault 1994).
The reasons for the differences between taxfilers and the entire foreign-born population are explained in an article by Badets and Langlois (2000) describing the challenges of using the IMDB:
The characteristics of the immigrant taxfiler population will differ from those of the entire foreign-born population because the tendency or requirement to file a tax return will vary in relation to a person’s age, family status, and other factors. One would expect a higher percentage of males to file a tax return, for example, because males have higher labour force participation rates than females. The extent to which immigrants are “captured” in the IMDB will also be influenced by changes to the income tax. For example, the introduction of federal and provincial non-refundable tax credit programs encourage individuals with no taxable income to file a return to qualify for certain tax credits. (Badets and Langlois 2000)
7.2.1 2022 IMDB: Linkage rates
This section is based on the 2022 IMDB. The overall linkage rate between the IRCC immigration file and the SDLE Derived Record Depository was 97.0% (see Section 4). A link does not necessarily mean that a tax file is available, since it is possible to link dependents of taxfilers or immigrants who have yet to file taxes. This theoretical SDLE linkage rate mostly indicates how well IRCC files could be associated within a larger repository environment.
Of the immigrants who landed in any year from 1980 to 2021, 85.7% were linked to at least one T1FF record. This rate represents the effective coverage of immigrant linkage to tax files. As presented in the following statistics, this coverage rate may change according to gender and age.
The proportion of linked taxfilers by age group at admission and sex is shown in Table 4. The lower rates for the 0-to-14 age group are expected since those in this age group are not of working age. See Appendix B for rates by sex, age group and admission cohort.
Table 4: Proportion of linked taxfilers by age group at landing and sex (percent)
Sex | 0 to 14 | 15 to 24 | 25 to 34 | 35 to 49 | 50 to 64 | 65 and older | Total |
---|---|---|---|---|---|---|---|
Male | 58.0 | 90.4 | 91.7 | 91.2 | 87.7 | 74.9 | 83.4 |
Female | 57.0 | 89.7 | 91.2 | 91.8 | 85.9 | 73.7 | 83.4 |
Total | 57.5 | 90.1 | 91.4 | 91.5 | 86.7 | 74.2 | 83.4 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
As immigrants become older, they start filing taxes and are included in the IMDB. Chart 1 shows that, among immigrants who landed at any age from birth to age 14, the proportion of linked taxfilers is higher for immigrants who landed prior to 2000 than for immigrants who have landed since 2000. Recent immigrants also have lower linkage rates. See Appendix B for a table showing the proportion of linked taxfilers by age group at admission, sex and admission decade.
Data table for Chart 1
Proportion of linked taxfilers, by cohort and age group
Cohorts | 0 to 14 years | 15 to 24 years | 25 to 34 years | 35 to 49 years | 50 to 64 years | 65 years and older |
---|---|---|---|---|---|---|
1980 to 1989 cohorts | 0.82 | 0.93 | 0.94 | 0.93 | 0.83 | 0.61 |
1990 to 1999 cohorts | 0.81 | 0.92 | 0.93 | 0.93 | 0.89 | 0.76 |
2000 to 2009 cohorts | 0.75 | 0.93 | 0.92 | 0.93 | 0.93 | 0.88 |
2010 to 2019 cohorts | 0.26 | 0.95 | 0.95 | 0.95 | 0.92 | 0.85 |
2020 to 2021 cohorts | 0.00 | 0.80 | 0.94 | 0.94 | 0.83 | 0.71 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
Chart 2 illustrates the proportion of filers, and the number of filers and non-filers by landing year, where the term “non-filer” means that no T1FF records are available. For the 2022 IMDB, the filing rate varies by landing year, ranging from 74.8% for those who landed in 2019 to 91.2% for those who landed in 1989. The filing rates increase with the number of years that immigrants stay in Canada; this may explain why the linkage rates are higher for those who landed in the 1990s and early 2000s. See Appendix B, tables 14 and 15, for detailed distribution numbers by landing year.
Data table for Chart 2
Landing year | Taxfilers (number of immigrants) | Non-taxfilers (number of immigrants) | Rate (percent) |
---|---|---|---|
1980 | 120,460 | 22,660 | 84.2 |
1981 | 107,730 | 20,850 | 83.8 |
1982 | 103,460 | 17,630 | 85.4 |
1983 | 77,130 | 11,900 | 86.6 |
1984 | 77,490 | 10,540 | 88.0 |
1985 | 75,050 | 8,890 | 89.4 |
1986 | 89,090 | 9,680 | 90.2 |
1987 | 137,620 | 13,550 | 91.0 |
1988 | 146,350 | 14,410 | 91.0 |
1989 | 173,960 | 16,710 | 91.2 |
1990 | 192,320 | 23,110 | 89.3 |
1991 | 208,730 | 23,080 | 90.0 |
1992 | 228,870 | 25,060 | 90.1 |
1993 | 230,970 | 24,710 | 90.3 |
1994 | 198,900 | 24,690 | 89.0 |
1995 | 188,930 | 23,230 | 89.1 |
1996 | 198,820 | 26,550 | 88.2 |
1997 | 189,670 | 25,790 | 88.0 |
1998 | 155,630 | 18,050 | 89.6 |
1999 | 169,140 | 20,230 | 89.3 |
2000 | 204,050 | 22,700 | 90.0 |
2001 | 225,440 | 24,330 | 90.3 |
2002 | 205,320 | 22,890 | 90.0 |
2003 | 198,430 | 22,100 | 90.0 |
2004 | 211,790 | 23,560 | 90.0 |
2005 | 233,060 | 28,720 | 89.0 |
2006 | 222,770 | 28,340 | 88.7 |
2007 | 207,970 | 28,210 | 88.1 |
2008 | 214,250 | 32,360 | 86.9 |
2009 | 218,400 | 33,190 | 86.8 |
2010 | 237,430 | 42,640 | 84.8 |
2011 | 207,670 | 40,450 | 83.7 |
2012 | 215,390 | 41,850 | 83.7 |
2013 | 215,060 | 43,440 | 83.2 |
2014 | 216,740 | 42,760 | 83.5 |
2015 | 222,260 | 48,740 | 82.0 |
2016 | 231,260 | 64,130 | 78.3 |
2017 | 229,630 | 55,690 | 80.5 |
2018 | 247,600 | 72,190 | 77.4 |
2019 | 253,980 | 85,760 | 74.8 |
2020 | 140,020 | 43,760 | 76.2 |
2021 | 313,230 | 91,220 | 77.4 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
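The rates in the Chart 2 data table can be reproduced from the two counts. The following is a minimal sketch, assuming the rate is the number of taxfilers divided by the sum of taxfilers and non-taxfilers; the function name is illustrative, and the counts are the rounded values published in the table.

```python
def filing_rate(taxfilers: int, non_taxfilers: int) -> float:
    """Percent of landed immigrants with at least one T1FF record."""
    return round(100 * taxfilers / (taxfilers + non_taxfilers), 1)

# Counts copied from the Chart 2 data table (rounded in the source)
chart2 = {
    1980: (120_460, 22_660),
    1989: (173_960, 16_710),
    2019: (253_980, 85_760),
}
rates = {year: filing_rate(*counts) for year, counts in chart2.items()}
```

With these inputs the sketch returns 84.2, 91.2 and 74.8 for 1980, 1989 and 2019, matching the published rates.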
7.2.2 Availability of date of death
The year and month of death, as well as a death flag, are included in the PNRF. Starting in 2018, these variables were linked by using the Canadian Mortality Database (CMDB). In the past, these variables were based on Statistics Canada’s Amalgamated Mortality Database (AMDB), a retired dataset that combined records from the CMDB, vital statistics and tax files. The CMDB is an administrative database that collects information on death dates and cause of death from all provincial and territorial vital statistics registries in Canada. Some undercoverage, while minimal, exists in the database as it does not include deaths of Canadians (1) who died outside of Canada, with the exception of the United States; (2) who served as members of the Canadian military; or (3) whose bodies were unidentified. Note that the CMDB does not include deaths which were reported in the tax files.
Chart 3 describes the general trend in the number of deaths per year since 1974 for immigrants admitted since 1952. Data for pre-1980 admissions were recently added to the IMDB. The value “9999” represents the records of deceased immigrants for which the year of death is not available.
Data table for Chart 3
Year of death | Permanent residents from 1952 to 1979 (number of deaths) | Permanent residents since 1980 (number of deaths) |
---|---|---|
1974 | 4,840 | . |
1975 | 5,270 | . |
1976 | 5,580 | . |
1977 | 6,220 | . |
1978 | 6,530 | . |
1979 | 7,090 | . |
1980 | 7,700 | 80 |
1981 | 7,610 | 300 |
1982 | 7,950 | 510 |
1983 | 8,530 | 740 |
1984 | 8,940 | 940 |
1985 | 9,370 | 1,110 |
1986 | 10,090 | 1,340 |
1987 | 10,470 | 1,630 |
1988 | 11,090 | 1,900 |
1989 | 11,550 | 2,210 |
1990 | 11,850 | 2,440 |
1991 | 12,660 | 2,920 |
1992 | 13,220 | 3,210 |
1993 | 14,000 | 3,750 |
1994 | 14,510 | 4,280 |
1995 | 15,300 | 4,730 |
1996 | 15,720 | 5,130 |
1997 | 16,180 | 5,450 |
1998 | 16,750 | 5,770 |
1999 | 17,530 | 6,190 |
2000 | 17,550 | 6,400 |
2001 | 18,010 | 6,940 |
2002 | 18,710 | 7,320 |
2003 | 19,190 | 8,100 |
2004 | 19,420 | 8,280 |
2005 | 20,150 | 8,670 |
2006 | 20,390 | 9,150 |
2007 | 21,350 | 9,860 |
2008 | 21,850 | 10,290 |
2009 | 22,360 | 10,780 |
2010 | 22,710 | 11,140 |
2011 | 23,000 | 11,940 |
2012 | 23,540 | 12,200 |
2013 | 24,830 | 13,210 |
2014 | 25,400 | 14,150 |
2015 | 26,280 | 15,040 |
2016 | 26,900 | 15,900 |
2017 | 27,730 | 17,240 |
2018 | 27,950 | 18,080 |
2019 | 28,060 | 18,950 |
2020 | 31,210 | 23,290 |
2021 | 30,630 | 24,600 |
Symbol legend: . not available for any reference period. Note: The value 9999 is assigned when the date of death is missing. Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
7.2.3 Prefilers compared to records on the Non-permanent Resident File (NRF)
The results included in this section are drawn from a study based on the 2014 IMDB. Prefilers are immigrants who filed taxes prior to their landing year. It is sometimes assumed that all prefilers are immigrants who were non-permanent residents prior to admission. This section discusses why this is not the case. A total of 1.26 million individuals filed taxes before their official admission in 1980 or a subsequent year. Of these, 212,500 are not linked to a non-permanent resident record, as might otherwise be expected. Further investigation showed that most permanent resident prefilers not linked to a non-permanent resident record are likely immigrants who filed taxes when not required: 96% of these prefilers filed taxes only for the year prior to admission, and 75% reported no income (96% had no wages). As shown in Chart 4, most of these prefilers landed in the first months of the year, prior to the deadline to file taxes for the previous year. It appears some immigrants who landed prior to the month of May filed taxes for the year prior to their landing year, for which they were not required to file.
Given these findings, whether it is appropriate to remove records with Prefiler_ind=1 and FIRST_EFFECTIVE_YEAR=. from studies on immigrants with pre-admission experience depends on the analysis since FIRST_EFFECTIVE_YEAR=. means no record is available on the non-permanent permit file.
Data table for Chart 4
Landing month | Number of immigrants |
---|---|
January | 32,300 |
February | 36,100 |
March | 35,500 |
April | 24,100 |
May | 20,500 |
June | 18,200 |
July | 16,100 |
August | 11,200 |
September | 9,800 |
October | 5,500 |
November | 2,000 |
December | 1,200 |
Source: Statistics Canada, 2014 Longitudinal Immigration Database. |
Not all immigrants with pre-admission experience are identified as prefilers: 478,100 immigrants have non-permanent resident records with Prefiler_ind=0. Depending on the subject of interest, using FIRST_EFFECTIVE_YEAR<>. or the number of temporary resident permits (variable NUMBER_ALL_PERMITS) is more appropriate to study immigrants with pre-admission experience. Prefiler_ind=0 indicates that no tax records have been filed prior to admission, but this does not mean that the individual had no pre-admission Canadian experience.
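The distinction between the two indicators can be sketched with a few hypothetical records. The sketch assumes that the missing value written as "." in the text is represented by None, and the "id" field and the example values are illustrative only.

```python
# Hypothetical PNRF-style records; None stands in for the missing value "."
records = [
    {"id": 1, "Prefiler_ind": 1, "FIRST_EFFECTIVE_YEAR": 2005},
    {"id": 2, "Prefiler_ind": 1, "FIRST_EFFECTIVE_YEAR": None},
    {"id": 3, "Prefiler_ind": 0, "FIRST_EFFECTIVE_YEAR": 2010},
    {"id": 4, "Prefiler_ind": 0, "FIRST_EFFECTIVE_YEAR": None},
]

def has_permit_history(rec: dict) -> bool:
    """True when a non-permanent resident record exists (FIRST_EFFECTIVE_YEAR<>.)."""
    return rec["FIRST_EFFECTIVE_YEAR"] is not None

# Immigrants with pre-admission experience according to the permit file
with_permit = [r["id"] for r in records if has_permit_history(r)]

# Prefilers with no permit record (Prefiler_ind=1 and FIRST_EFFECTIVE_YEAR=.)
prefiler_without_permit = [
    r["id"] for r in records
    if r["Prefiler_ind"] == 1 and not has_permit_history(r)
]
```

Record 3 shows why Prefiler_ind alone is not sufficient: it has a permit history but no tax record before admission.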
7.2.4 Spouse indicator
The IMDB contains variables that enable data users to obtain information on marital status and spouses. The following section contains results of a study done on the 2012 IMDB. No major changes have occurred since then in the marital status codes or family flag.
The spouse identification number (SP__IDI) is derived from tax files. This information can be derived only when the respondent claims his or her spouse or common-law partner while filing taxes; this causes an underestimation of couples as compared to the marital status declared in the tax files. From the T1FF, it is also possible to obtain the marital status at time of filing.
Prior to 1991, the “single” category was not available as marital status (MSTCO). The “common-law” status was made available as of 1992 for all datasets (1982 to 2012). Since 1992, the proportion of IMDB records indicating marital status as “single” has ranged from 20% to 30%. The proportion of “separated” has declined from 30% prior to 1992 to 4% after. The other marital status categories have not been affected by pattern changes.
Analysis done on the distribution of marital status (MSTCO from tax files) and the spouse ID (SP__IDI) shows differences between the two variables. This is because values for marital status are missing for some records. In a perfect situation, the records of all married persons would have spousal information, and the records of all single persons would have no spousal information. This analysis shows data quality to be better after 1992, when separate statuses for “common-law” and “single” were introduced.
Presence of spouse reporting gaps
A review of the longitudinal history of immigrants on the 2012 IMDB found some cases where the spouse or common-law partner is missing (or different) for a given year and the same spouse is declared two or three years later. Chart 5 gives a summary of these gaps.
Data table for Chart 5
Tax year | Percent |
---|---|
1980 | 16.8 |
1981 | 16.6 |
1982 | 17.1 |
1983 | 17.7 |
1984 | 17.4 |
1985 | 17.7 |
1986 | 17.7 |
1987 | 16.9 |
1988 | 14.4 |
1989 | 13.9 |
1990 | 13.4 |
1991 | 14.4 |
1992 | 13.5 |
1993 | 12.7 |
1994 | 9.8 |
1995 | 9.4 |
1996 | 8.7 |
1997 | 8.1 |
1998 | 7.8 |
1999 | 7.1 |
2000 | 6.3 |
2001 | 5.6 |
2002 | 5.0 |
2003 | 4.6 |
2004 | 4.0 |
2005 | 3.1 |
2006 | 2.9 |
2007 | 2.4 |
2008 | 1.8 |
2009 | 1.1 |
2010 | 0.7 |
2011 | 0.5 |
Source: Statistics Canada, 2012 Longitudinal Immigration Database. |
Most immigrants on the file have one or no spouse in the years from 1980 to 2012 according to the IMDB_T1FF files. It is to be noted that no marital status (and no spouse info) is available for 1.2 million immigrants out of approximately 6 million immigrants.
7.3 Imputed variables
7.3.1 Imputation of education variables
A data quality issue regarding the variables for education qualifications and years of schooling was identified. A non-negligible proportion of individuals who did not state their education qualifications or years of schooling were coded as “0” or “None” instead of “Missing” on EDUCATION_QUALIFICATIONS and YEARS_OF_SCHOOLING. This problem was prevalent from 2011 to 2014. In 2011, 35% of immigrants stated that they had no education qualifications, compared to roughly 10% in the 1990s.
This issue was resolved by imputation: values of the education variables from 2008 to 2010 were used to model the most recent year’s education variables. The imputation used variables such as admission age, immigration_category_rollup2, intended occupation, gender and country of last permanent residence, and employed the nearest-neighbour method. The variable Education_imputation_ind (0: no; 1: yes), available in the PNRF, was created to identify records with imputed education variables.
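Nearest-neighbour imputation can be sketched in general terms as follows. This is not the procedure used for the IMDB: the distance function, the short field names and the example values are all assumptions chosen for illustration, since the report does not describe them.

```python
def distance(a: dict, b: dict) -> float:
    """Assumed distance: mismatches on categorical fields plus a scaled age gap."""
    mismatches = sum(a[k] != b[k] for k in ("category", "gender", "country"))
    return mismatches + abs(a["age"] - b["age"]) / 10

def impute_education(recipient: dict, donors: list) -> dict:
    """Copy the education value of the closest donor and flag the record as imputed."""
    donor = min(donors, key=lambda d: distance(recipient, d))
    return {**recipient, "education": donor["education"], "Education_imputation_ind": 1}

# Hypothetical donors with reported education (e.g. records from 2008 to 2010)
donors = [
    {"age": 30, "category": "economic", "gender": "F", "country": "IN", "education": "Bachelor"},
    {"age": 62, "category": "family", "gender": "M", "country": "CN", "education": "Secondary"},
]
recipient = {"age": 28, "category": "economic", "gender": "F", "country": "IN", "education": None}
imputed = impute_education(recipient, donors)
```

Here the first donor is the nearest neighbour (distance 0.2 against 6.4), so its education value is copied to the recipient.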
For immigrants admitted in 2016, the number of cases where a non-stated education was coded to “0” or “None” instead of “Missing” was reduced. However, a non-negligible number of records had a missing education qualification with a valid number of years of schooling. For these records, years of schooling was used to impute a value for education qualifications.
For principal applicants admitted since 2015 under Express Entry, the years of schooling are underestimated in most cases.
For the 2022 IMDB, those who were admitted between 2015 and 2021 and who were connected to an Express Entry record had their education imputed using values found in the Express Entry file. A more comprehensive education variable called Education_Derived was created, combining data from Education_Qualification, for those who were not found in the Express Entry file, and the new values from Express Entry. Users should note that the values from the Express Entry file are based on the immigration officer’s assessment of the educational qualifications of the applicant in the context of a Canadian equivalency, whereas the values from Education_Qualification are self-declared by the applicant, and do not necessarily reflect a Canadian equivalency.
Education_Derived was set to missing for those admitted in 2021 or later.
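The combination of sources described above can be sketched as a simple precedence rule. The function and parameter names and the education labels are hypothetical, and the order in which the rules are applied (the "set to missing" rule first) is an assumption, since the report does not state how that rule interacts with the 2015 to 2021 Express Entry window.

```python
def education_derived(landing_year, education_qualification,
                      express_entry_education=None, missing_from_year=2021):
    """Sketch of Education_Derived under the assumptions stated in the lead-in."""
    # Assumption: the "set to missing" rule takes precedence over the others.
    if landing_year >= missing_from_year:
        return None
    # Express Entry values are used for linked records admitted from 2015 onward.
    if landing_year >= 2015 and express_entry_education is not None:
        return express_entry_education
    # Otherwise fall back to the self-declared Education_Qualification value.
    return education_qualification

examples = [
    education_derived(2016, "Bachelor", "Master"),   # linked to an Express Entry record
    education_derived(2016, "Bachelor"),             # not found in the Express Entry file
    education_derived(2010, "Secondary", "Master"),  # admitted before 2015
    education_derived(2021, "Bachelor", "Master"),   # admitted in 2021 or later
]
```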
7.3.2 Imputation of language variables
For the 2019 IMDB, two new language variables were added, English_IND and French_IND, reporting the first official language known at admission. For those admitted in 2018 or earlier, each indicator identifies permanent residents having English (French) as their mother tongue, or having a mother tongue other than English or French while declaring English only (French only) as their knowledge of official language at admission.
For those admitted in 2019 or later, each indicator identifies permanent residents having declared English only (French only) as their knowledge of official language at admission, or having declared English and French as their knowledge of official language at admission and reporting English (French) as the language in which they are most at ease.
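The two definitions can be expressed as a short derivation rule. This is a minimal sketch: the function name, the argument names and the category labels (for example "English only" and "English and French") are hypothetical stand-ins for the coded values on the file.

```python
def language_indicators(landing_year, mother_tongue, knowledge, most_at_ease=None):
    """Derive English_IND and French_IND from the definitions in this section."""
    def flag(lang):
        if landing_year <= 2018:
            # Mother tongue, or other mother tongue with knowledge of this language only
            other_tongue = mother_tongue not in ("English", "French")
            return int(mother_tongue == lang
                       or (other_tongue and knowledge == f"{lang} only"))
        # 2019 or later: knowledge of this language only, or both with this one preferred
        return int(knowledge == f"{lang} only"
                   or (knowledge == "English and French" and most_at_ease == lang))
    return {"English_IND": flag("English"), "French_IND": flag("French")}

early = language_indicators(2010, "Other", "English only")
recent = language_indicators(2020, "Other", "English and French", most_at_ease="French")
```

Under these rules a person can have both indicators equal to 0, for example someone admitted in 2020 who declared knowledge of neither official language.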
7.4 Coverage
7.4.1 Coverage of the Integrated Permanent and Non-permanent Resident File (PNRF)
The 2022 Integrated Permanent and Non-permanent Resident File (PNRF) contains over 9.2 million records (Table 5); of these, over 7.9 million records (85.7%) are linked to at least one tax file. It is to be noted that immigration data belonging to non-taxfilers and taxfilers alike are included in a file named PNRF_1980_2022. The following table shows the distribution of records depending on their presence in the different immigration and tax files. Over 2.4 million records belong to immigrants who were temporary residents prior to becoming permanent residents; over 2.2 million of these records are linked to at least one tax file. See Appendix B for detailed distribution numbers by landing year.
Number of records | Permanent resident | Permanent resident with non-permanent resident permit | Total |
---|---|---|---|
Total filers | 5,651,430 | 2,290,610 | 7,942,040 |
Total non-filers | 1,193,990 | 126,390 | 1,320,370 |
Total | 6,845,420 | 2,417,000 | 9,262,410 |
Percent taxfilers | 82.6 | 94.8 | 85.7 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
Data on immigrants with non-permanent resident permits are available. The proportion of immigrants with pre-admission experience varies by landing year (Chart 6); it ranges from 3.8% in 1980 to 69.4% in 2021. As a result, the proportion of immigrants with pre-admission experience in the early 1980s is underrepresented. The proportion of immigrant filers with pre-admission experience (solid line) is higher than the overall proportion of immigrants with pre-admission experience (dotted line) because the linkage rate for these immigrants is higher than that for immigrants without pre-admission experience.
Data table for Chart 6
Landing year | All immigrants (percent) | Taxfilers (percent) | Non-taxfilers (percent) |
---|---|---|---|
1980 | 3.8 | 4.1 | 2.6 |
1981 | 11.3 | 12.4 | 5.6 |
1982 | 14.2 | 15.5 | 6.8 |
1983 | 17.3 | 18.8 | 7.6 |
1984 | 19.8 | 21.3 | 8.9 |
1985 | 20.0 | 21.3 | 8.7 |
1986 | 24.7 | 26.3 | 9.9 |
1987 | 23.4 | 24.8 | 8.7 |
1988 | 11.5 | 12.1 | 6.1 |
1989 | 13.6 | 14.3 | 7.0 |
1990 | 16.6 | 17.6 | 7.9 |
1991 | 32.0 | 34.2 | 12.7 |
1992 | 34.7 | 36.9 | 14.7 |
1993 | 26.8 | 28.4 | 12.2 |
1994 | 18.2 | 19.5 | 7.6 |
1995 | 19.9 | 21.4 | 7.5 |
1996 | 19.6 | 21.4 | 6.5 |
1997 | 17.5 | 19.0 | 5.7 |
1998 | 19.3 | 20.8 | 7.0 |
1999 | 19.0 | 20.5 | 6.6 |
2000 | 18.3 | 19.6 | 6.1 |
2001 | 16.3 | 17.5 | 5.7 |
2002 | 15.2 | 16.3 | 4.9 |
2003 | 15.8 | 17.0 | 4.7 |
2004 | 19.2 | 20.8 | 5.2 |
2005 | 20.1 | 21.9 | 5.0 |
2006 | 22.7 | 25.0 | 5.1 |
2007 | 23.4 | 25.9 | 4.9 |
2008 | 23.3 | 26.1 | 4.6 |
2009 | 24.2 | 27.2 | 4.9 |
2010 | 23.0 | 26.4 | 4.4 |
2011 | 23.6 | 27.2 | 5.3 |
2012 | 25.4 | 29.3 | 5.3 |
2013 | 26.9 | 31.2 | 5.8 |
2014 | 34.1 | 39.5 | 6.8 |
2015 | 33.2 | 39.0 | 6.7 |
2016 | 30.0 | 36.7 | 5.9 |
2017 | 37.8 | 44.9 | 8.4 |
2018 | 37.2 | 44.9 | 10.6 |
2019 | 36.8 | 45.6 | 10.9 |
2020 | 47.3 | 56.3 | 18.6 |
2021 | 69.4 | 78.9 | 36.9 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
7.4.1.2 Coverage of non-permanent residents
This section describes the coverage of individuals who have had only non-permanent resident permits since 1980. Overall, tax records are available for 29.6% of them. Among individuals who have not become permanent residents, asylum seekers have the highest coverage rate: tax records are available for 39.7% of them (see Table 6). There is a wide variety of non-permanent resident permits; some permits are as short as one day.
Number of individuals | With work permits | With study permits | Asylum claimants | Total |
---|---|---|---|---|
Total filers | 1,598,650 | 811,850 | 196,430 | 1,884,670 |
Total non-filers | 2,572,300 | 1,903,120 | 298,200 | 4,486,880 |
Total | 4,170,950 | 2,714,970 | 494,630 | 6,371,550 |
Percent taxfilers | 38.3 | 29.9 | 39.7 | 29.6 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
7.4.2 T1 Family File (T1FF) size and coverage by year
Tax files for 1982 and subsequent years are available for linked non-permanent and permanent residents. Some permanent residents were non-permanent residents prior to admission. Table 7 gives details on the distribution of linked permanent residents with and without non-permanent permits prior to admission, by tax year. At least one tax file is available for 82.6% of permanent residents without a non-permanent permit prior to admission and for 94.8% of permanent residents who were non-permanent residents prior to admission. The higher filing rate among permanent residents with pre-admission temporary permits can be explained by a requirement in the permanent resident application process: non-permanent residents who apply for permanent residency are required to fulfil their obligation to file taxes in Canada. The number of taxfilers on the IMDB_T1FF increases as the years pass since the size of the in-scope population increases.
Tax year | Permanent resident admitted prior to 1980 (number) | Permanent resident since 1980 (number) | Permanent resident with non-permanent resident permit (number) | Non-permanent resident only (number) | Number of taxfilers |
---|---|---|---|---|---|
1982 | 1,644,610 | 189,810 | 56,180 | 25,480 | 1,911,270 |
1983 | 1,625,930 | 225,610 | 66,100 | 23,650 | 1,936,470 |
1984 | 1,621,630 | 264,860 | 80,570 | 24,050 | 1,986,070 |
1985 | 1,602,750 | 299,150 | 95,630 | 22,740 | 2,015,040 |
1986 | 1,658,390 | 357,320 | 125,710 | 26,790 | 2,162,220 |
1987 | 1,638,520 | 416,990 | 159,190 | 26,990 | 2,235,100 |
1988 | 1,658,960 | 509,400 | 201,070 | 36,030 | 2,397,860 |
1989 | 1,686,360 | 623,390 | 264,540 | 48,380 | 2,614,030 |
1990 | 1,695,460 | 744,830 | 311,980 | 51,740 | 2,794,500 |
1991 | 1,692,660 | 842,460 | 360,750 | 51,500 | 2,937,360 |
1992 | 1,699,130 | 948,600 | 404,690 | 51,040 | 3,092,970 |
1993 | 1,731,600 | 1,092,740 | 444,080 | 51,010 | 3,308,160 |
1994 | 1,716,340 | 1,214,140 | 469,540 | 50,160 | 3,438,690 |
1995 | 1,700,230 | 1,326,300 | 494,870 | 52,500 | 3,562,260 |
1996 | 1,681,320 | 1,433,840 | 515,770 | 54,400 | 3,673,630 |
1997 | 1,656,820 | 1,548,270 | 536,370 | 56,410 | 3,786,220 |
1998 | 1,632,800 | 1,648,110 | 556,570 | 55,340 | 3,881,190 |
1999 | 1,626,410 | 1,772,520 | 592,510 | 59,970 | 4,039,770 |
2000 | 1,608,650 | 1,916,170 | 633,170 | 67,830 | 4,214,280 |
2001 | 1,595,230 | 2,073,150 | 682,440 | 77,840 | 4,417,080 |
2002 | 1,565,250 | 2,203,130 | 720,290 | 83,430 | 4,560,620 |
2003 | 1,545,270 | 2,325,380 | 757,650 | 86,620 | 4,703,490 |
2004 | 1,529,220 | 2,456,490 | 799,810 | 88,130 | 4,862,320 |
2005 | 1,504,050 | 2,572,530 | 833,890 | 93,600 | 4,992,890 |
2006 | 1,486,340 | 2,721,860 | 891,510 | 98,380 | 5,186,970 |
2007 | 1,467,490 | 2,845,350 | 960,990 | 110,080 | 5,372,890 |
2008 | 1,447,220 | 2,970,260 | 1,040,070 | 131,960 | 5,578,630 |
2009 | 1,426,780 | 3,085,610 | 1,108,130 | 141,270 | 5,751,030 |
2010 | 1,401,960 | 3,211,470 | 1,166,910 | 149,380 | 5,919,090 |
2011 | 1,382,920 | 3,342,050 | 1,235,190 | 155,740 | 6,105,370 |
2012 | 1,356,240 | 3,460,390 | 1,307,750 | 164,200 | 6,278,200 |
2013 | 1,337,810 | 3,593,450 | 1,390,150 | 179,810 | 6,490,950 |
2014 | 1,316,110 | 3,713,020 | 1,481,040 | 192,750 | 6,692,820 |
2015 | 1,288,720 | 3,840,510 | 1,557,910 | 200,020 | 6,877,270 |
2016 | 1,260,480 | 3,962,460 | 1,646,490 | 216,290 | 7,076,030 |
2017 | 1,234,500 | 4,066,680 | 1,764,190 | 269,060 | 7,324,950 |
2018 | 1,211,140 | 4,208,320 | 1,895,310 | 366,530 | 7,672,040 |
2019 | 1,172,190 | 4,344,850 | 1,991,280 | 534,530 | 8,033,840 |
2020 | 1,149,650 | 4,414,040 | 2,015,950 | 593,300 | 8,164,080 |
2021 | 1,119,090 | 4,498,230 | 2,022,890 | 816,230 | 8,447,800 |
Total taxfilers | 2,063,990 | 5,651,430 | 2,290,610 | 1,876,210 | |
Total non-taxfilers | 2,049,690 | 1,193,990 | 126,390 | 3,694,710 | |
Percent taxfilers | 50.2 | 82.6 | 94.8 | 33.7 | |
Notes: Permanent resident statistics are for people who were admitted between 1980 and 2022. Non-permanent resident statistics are for people who obtained their first permits between 1980 and 2022. Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
Chart 7 shows the proportion of permanent residents who were non-permanent residents prior to admission. This varies by tax year from a low of 22.7% for the 1983 tax year to a high of 31.4% for the 2019 tax year.
Data table for Chart 7
Tax year | Percent |
---|---|
1982 | 22.8 |
1983 | 22.7 |
1984 | 23.3 |
1985 | 24.2 |
1986 | 26.0 |
1987 | 27.6 |
1988 | 28.3 |
1989 | 29.8 |
1990 | 29.5 |
1991 | 30.0 |
1992 | 29.9 |
1993 | 28.9 |
1994 | 27.9 |
1995 | 27.2 |
1996 | 26.5 |
1997 | 25.7 |
1998 | 25.2 |
1999 | 25.1 |
2000 | 24.8 |
2001 | 24.8 |
2002 | 24.6 |
2003 | 24.6 |
2004 | 24.6 |
2005 | 24.5 |
2006 | 24.7 |
2007 | 25.2 |
2008 | 25.9 |
2009 | 26.4 |
2010 | 26.7 |
2011 | 27.0 |
2012 | 27.4 |
2013 | 27.9 |
2014 | 28.5 |
2015 | 28.9 |
2016 | 29.4 |
2017 | 30.3 |
2018 | 31.1 |
2019 | 31.4 |
2020 | 31.4 |
2021 | 31.0 |
Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
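The Chart 7 percentages appear to be reproducible from two columns of Table 7: the number of permanent residents with a non-permanent resident permit divided by the sum of that column and the "Permanent resident since 1980" column. This relationship is inferred from the published figures rather than stated in the report, and the function name below is illustrative.

```python
def share_with_prior_permit(without_permit: int, with_permit: int) -> float:
    """Percent of permanent resident taxfilers who held a permit before admission."""
    return round(100 * with_permit / (without_permit + with_permit), 1)

# Tax year: (permanent resident since 1980, permanent resident with permit), from Table 7
table7 = {
    1983: (225_610, 66_100),
    2019: (4_344_850, 1_991_280),
    2021: (4_498_230, 2_022_890),
}
chart7 = {year: share_with_prior_permit(*counts) for year, counts in table7.items()}
```

With these inputs the sketch gives 22.7, 31.4 and 31.0 for the 1983, 2019 and 2021 tax years, in line with the data table for Chart 7.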
An immigrant who filed taxes for a given year will not necessarily file taxes the next year. For example, if Person A landed in 1983, this individual might be found on tax files from 1984 to 1999, but not be found on the 2000 file, and again be found on the 2001 to 2013 files. Among filers from the 1980 cohort, 24.7% had tax files available for all years. Out-migration, death and late filing are some of the reasons immigrant filers might stop filing permanently or for some years.
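Gaps of this kind can be identified from the list of tax years in which a person appears. The following is a minimal sketch using the Person A example above; the function name is illustrative, and the input is assumed to be a non-empty list of tax years with a T1FF record.

```python
def filing_gaps(filed_years):
    """Return the tax years with no return between the first and last year filed."""
    years = set(filed_years)
    return [y for y in range(min(years), max(years) + 1) if y not in years]

# Person A: found on the 1984 to 1999 files and on the 2001 to 2013 files
person_a = list(range(1984, 2000)) + list(range(2001, 2014))
gaps = filing_gaps(person_a)
```

For Person A the only gap is the 2000 tax year. Years before the first file or after the last one are not counted as gaps, since they may reflect arrival, out-migration or death rather than an interruption.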
Most immigrants file taxes for the first time in the year they land or one year before or after. For example, of the 251,110 immigrants who landed in 2006, 100,600 (40.1%) filed taxes for the first time in 2006, while 15,550 (6.2%) did so in 2007 and 3,200 (1.3%) did so in 2015.
7.5 Quality assessment of immigration data
7.5.1 Quality assessment of the Integrated Permanent and Non-permanent Resident File (PNRF)
A validation of the content of the PNRF_1980_2022 was done. While admission and tax data are collected mandatorily from those in scope, some fields may not have been completed. They could be left empty because the response was unknown, or for other reasons unbeknownst to database users (e.g., refusal) (McLeish 2011). Item non-response can present issues when one is considering the IMDB for statistical purposes, including the following:
- If the database user is interested in producing a sample based on characteristics for which there are missing records, there will be coverage error (i.e., those being included in the sampling frame may not be representative of the target population).
- If the non-response is non-ignorable (i.e., the fact that information is missing is not a random occurrence; the fact that there is no response is indicative of what the response would have been), any analysis using those variables would be biased.
The presence of missing variables and invalid values was assessed. The numbers presented in this section are rounded. Invalid values are either inconsistent or not listed in the metadata tables available to users (see the immigration component of the data dictionary appendix). Most of the quality issues listed in Table 8 are for data collected in the 1980s and 1990s. It should be noted that some seemingly valid values may be erroneous as well.
The variable Case Identification Number (CASE_ID) has item response rates generally in the high 90% range (usually over 99%). However, for some landing years, the response rate drops significantly (to as low as 80% in 1991 and 1992). Therefore, any analysis using this variable for all landing years will under-represent those years where the item non-response is higher (e.g., 1986, 1987, 1991, 1992, 1993, 2020). No detection of invalid values was performed for the variable Case Identification Number (CASE_ID).
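Item response rates by landing year, as discussed above, can be tabulated with a few lines of code. The sketch below is illustrative only: it assumes records are available as Python dictionaries with a `Landing_Year` key, and it treats `None` and empty strings as missing. Both the record layout and the function name `response_rate_by_year` are assumptions, and the sample records are made up.

```python
from collections import defaultdict


def response_rate_by_year(records, variable):
    """Return {landing_year: percent of records with a non-missing `variable`}."""
    totals = defaultdict(int)
    answered = defaultdict(int)
    for rec in records:
        year = rec["Landing_Year"]
        totals[year] += 1
        if rec.get(variable) not in (None, ""):
            answered[year] += 1
    return {year: 100.0 * answered[year] / totals[year] for year in totals}


# Made-up records for illustration only (not IMDB data).
sample = [
    {"Landing_Year": 1991, "CASE_ID": "C001"},
    {"Landing_Year": 1991, "CASE_ID": "C002"},
    {"Landing_Year": 1991, "CASE_ID": "C003"},
    {"Landing_Year": 1991, "CASE_ID": "C004"},
    {"Landing_Year": 1991, "CASE_ID": None},
    {"Landing_Year": 2000, "CASE_ID": "C005"},
    {"Landing_Year": 2000, "CASE_ID": "C006"},
]
rates = response_rate_by_year(sample, "CASE_ID")
print(rates)
```

Comparing these rates across landing years shows which cohorts would be under-represented in an analysis restricted to records with a non-missing value.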
The variable Landing_age was defined as invalid when it was greater than 100, although it is possible in some instances that these values are accurate. It should be noted that, according to the values for this variable, the number of immigrants who landed after age 100 was much higher between 1987 and 1995 than in other landing years. This could be the result of a data capture issue.
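The validity rule for Landing_age can be expressed as a small classification function. This is a sketch under stated assumptions: ages are held as integers, missing values are represented as `None`, and the function name `classify_landing_age` is hypothetical. Only the threshold of 100 comes from the text.

```python
from collections import Counter


def classify_landing_age(age, max_valid_age=100):
    """Classify a landing age as 'missing', 'invalid' (above the threshold) or 'valid'."""
    if age is None:
        return "missing"
    if age > max_valid_age:
        return "invalid"
    return "valid"


# Made-up ages for illustration only (not IMDB data).
ages = [34, None, 101, 100, 7, 115]
counts = Counter(classify_landing_age(a) for a in ages)
print(counts)
```

Because some ages above 100 may be accurate, users may prefer to flag such records for review rather than drop them.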
In the PNRF of the 2022 IMDB, 25 records had a birth year prior to 1880; 18 of these had a birth year of 1753, with corresponding landing years after 1985 and as late as 2012.
The variables related to country have quality issues as well. The country of birth is missing for some records in almost all landing years. For example, values are missing for over 100 records in each of the years from 1985 to 1993. The country of citizenship is missing for fewer than 20 records per landing year for most years (with the major exceptions of 2005, 2006, 2016 and 2017, where over 80 records were missing per landing year). The country of residence is missing for many admission records from 2013 (1,190 records, or 0.5% of admissions that year), 2014 (5,845 records, or 2.3% of admissions that year) and 2015 (7,360 records, or 2.8% of admissions that year).
The education variables, prior to the 2017 cohort and after imputation (see Section 6.3), have over 150 missing values per landing year from 1980 to 1984; this translates to a rate of missing values per landing year of less than 0.5%. A new derived variable using Express Entry data was used to help fill some missing data for the education variables for those admitted from 2015 to 2020.
The percentage of valid responses for the occupation variables is above 99% for all landing years.
The variables Family_Status and CSQ_IND have most of their missing values for records with a landing year prior to 1999.
Mother_Tongue is missing for 550 records from the 2011 admissions and 535 records from the 1991 admissions.
The variable Official_Language has an increasing number of missing values: from 2016 to 2021, between 1,810 and 13,045 values are missing per cohort.
The variable Marital_Status has over 200 missing values per cohort since 2012.
The variables Destination_CD, Destination_CMA, Destination_CSD and Destination_province have few missing values; the 2022 IMDB uses the Standard Geographical Classification (SGC) to update the geographical region and code.
The year and month of death were missing for some individuals identified as deceased (Death_Indicator=1). The value “9999” was assigned to Death_Year and the value “99” was assigned to Death_Month in cases where the year and month of death were unknown.
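Because unknown dates of death are stored as the codes 9999 and 99, these values should be recoded before any date arithmetic. The following is a minimal sketch; the function name `clean_death_date` and the choice to represent unknown values as `None` are assumptions, and the comparison is done on strings so that the codes are recognized whether they are stored as numbers or as text.

```python
UNKNOWN_DEATH_YEAR = "9999"   # code for an unknown year of death
UNKNOWN_DEATH_MONTH = "99"    # code for an unknown month of death


def clean_death_date(death_indicator, death_year, death_month):
    """Return (year, month), with the unknown-value codes replaced by None.

    Returns (None, None) when the person is not flagged as deceased.
    """
    if death_indicator != 1:
        return (None, None)
    year = None if str(death_year) == UNKNOWN_DEATH_YEAR else death_year
    month = None if str(death_month) == UNKNOWN_DEATH_MONTH else death_month
    return (year, month)


print(clean_death_date(1, 2005, 99))    # (2005, None)
print(clean_death_date(1, 9999, 99))    # (None, None)
```

Without such recoding, a calculation such as years between landing and death would treat 9999 as a real year.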
PNRF variables | Valid responses (number) | Valid responses (percent) | Blanks / missing (number) | Blanks / missing (percent) | Invalid responses (number) | Invalid responses (percent) |
---|---|---|---|---|---|---|
Case_ID | 9,513,290 | 98.09 | 184,900 | 1.91 | 0 | 0.00 |
Landing_age | 9,696,730 | 99.98 | 730 | 0.01 | 730 | 0.01 |
Birth_Year | 9,698,140 | 100.00 | 30 | 0.00 | 30 | 0.00 |
Gender | 9,698,190 | 100.00 | 0 | 0.00 | 0 | 0.00 |
Country_Birth | 9,695,050 | 99.97 | 3,140 | 0.03 | 0 | 0.00 |
Country_Citizenship | 9,697,250 | 99.99 | 940 | 0.01 | 0 | 0.00 |
Country_Residence | 9,681,670 | 99.83 | 16,520 | 0.17 | 0 | 0.00 |
Education_Qualification | 9,030,710 | 93.12 | 667,480 | 6.88 | 0 | 0.00 |
Level_of_Education | 9,534,020 | 98.31 | 164,170 | 1.69 | 0 | 0.00 |
Years_of_Schooling | 9,481,450 | 97.77 | 216,740 | 2.23 | 0 | 0.00 |
Education_Derived | 9,327,830 | 96.18 | 370,360 | 3.82 | 0 | 0.00 |
Landing_age_6_groups | 9,697,460 | 99.99 | 730 | 0.01 | 0 | 0.00 |
Landing_age_9_groups | 9,697,460 | 99.99 | 730 | 0.01 | 0 | 0.00 |
Occupation_CD | 9,690,900 | 99.92 | 7,290 | 0.08 | 0 | 0.00 |
NOC5-NOC2 | 9,690,900 | 99.92 | 7,290 | 0.08 | 0 | 0.00 |
Skill_level_CD11 | 9,690,830 | 99.92 | 7,360 | 0.08 | 0 | 0.00 |
Family_Status | 9,695,600 | 99.97 | 2,590 | 0.03 | 0 | 0.00 |
Family_Status_rollup | 9,695,600 | 99.97 | 2,590 | 0.03 | 0 | 0.00 |
Marital_status | 9,692,280 | 99.94 | 5,910 | 0.06 | 0 | 0.00 |
Marital_status_rollup | 9,692,280 | 99.94 | 5,910 | 0.06 | 0 | 0.00 |
Mother_Tongue | 9,695,430 | 99.97 | 2,760 | 0.03 | 0 | 0.00 |
Official_Language | 9,643,180 | 99.43 | 55,020 | 0.57 | 0 | 0.00 |
Special_Program | 1,791,420 | 18.47 | 7,906,770 | 81.53 | 0 | 0.00 |
CSQ_ind | 9,697,960 | 100.00 | 230 | 0.00 | 0 | 0.00 |
Destination_CD | 9,697,400 | 99.99 | 790 | 0.01 | 0 | 0.00 |
Destination_CMA | 9,697,400 | 99.99 | 790 | 0.01 | 0 | 0.00 |
Destination_CSD | 9,697,400 | 99.99 | 790 | 0.01 | 0 | 0.00 |
Destination_Province | 9,697,420 | 99.99 | 780 | 0.01 | 0 | 0.00 |
Permits and NPR-specific variables | 2,598,340 | 100.00 | 0 | 0.00 | 0 | 0.00 |
Death_Year | 9,697,580 | 99.99 | 610 | 0.01 | 0 | 0.00 |
Death_Month | 9,697,540 | 99.99 | 650 | 0.01 | 0 | 0.00 |
Notes: PNRF: Integrated Permanent and Non-permanent Resident File. NPR: non-permanent resident. Only variables with missing or invalid values were included in the table. All numbers are rounded. Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
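The percentages in the table follow directly from the counts: each is the count divided by the sum of valid, missing and invalid records. The sketch below recomputes them for one row; the function name `response_shares` is hypothetical, and the counts are the rounded values published for Case_ID.

```python
def response_shares(valid, missing, invalid):
    """Return the (valid, missing, invalid) shares as percentages rounded to 2 decimals."""
    total = valid + missing + invalid
    return tuple(round(100.0 * n / total, 2) for n in (valid, missing, invalid))


# Case_ID row: 9,513,290 valid, 184,900 missing, 0 invalid.
print(response_shares(9_513_290, 184_900, 0))  # (98.09, 1.91, 0.0)
```

Since the published counts are rounded, percentages recomputed this way may differ slightly from the published ones for some rows.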
7.5.2 Quality assessment of the Non-permanent Resident File (NRF)
A validation of the content of the NRF_PERMIT_1980_2022 and NRF_PERSON_1980_2022 files was done. These files contain different sets of variables from each other. In Table 8B, variables Landing_Year to Number_All_Permits appear on the person file, while the remainder appear on the permits file. While admission and tax data are collected mandatorily from those in scope, some fields may not have been completed. They could be left empty because the response was unknown, or for other reasons unbeknownst to database users (e.g., refusal) (McLeish 2011). Item non-response can present issues when one is considering the IMDB for statistical purposes, including the following:
- If the database user is interested in producing a sample based on characteristics for which there are missing records, there will be coverage error (i.e., those being included in the sampling frame may not be representative of the target population).
- If the non-response is non-ignorable (i.e., the fact that information is missing is not a random occurrence; the fact that there is no response is indicative of what the response would have been), any analysis using those variables would be biased.
The presence of missing variables and invalid values was assessed. The numbers presented in this section are rounded. Invalid values are either inconsistent or not listed in the metadata tables available to users (see the immigration component of the data dictionary appendix). It should be noted that some seemingly valid values may be erroneous as well.
The variable Landing_Year has a high percentage of missing values (71.0%). This is expected, as only landed individuals have a landing year, and the NRF includes all non-permanent residents, both landed and unlanded.
In the NRF_PERSON of the 2022 IMDB, 360 records had a birth year prior to 1880, with 350 records having birth year 1753.
While most records have a birth country, those with a missing Country_Birth also have a missing landing year.
The variables Effective_Date and Valid_Date do not have invalid responses themselves, but comparing them can reveal invalid responses. For example, the Valid_Date must always fall after the Effective_Date. Records in which the Valid_Date falls before the Effective_Date could be considered invalid responses for one or both of these variables. Additionally, any record with a duration of 5 or more years between the Effective_Date and the Valid_Date is suspicious and likely includes an invalid value for one of the variables. Overall, 0.005% of effective/valid date comparisons could be considered invalid because of these two issues.
Over 99% of missing Valid_Date values occur when the Document_Type variable is 46 (refugee claim). This is because refugee claims are not assigned an end date.
The variables Destination_CD, Destination_ER, Destination_CMA, Destination_CSD and Destination_province have a smaller proportion of missing values than other variables, but a much larger proportion than in the PNRF. Most years (based on the effective_date variable) have a very low rate of missing values, around 1%. However, the years 1989, 2018 and 2019 have a missing rate of almost 12%, and the years 2014, 2015, 2016 and 2017 have a missing rate between 32% and 37%. The 2022 IMDB uses the Standard Geographical Classification (SGC) to update the geographical region and code.
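The two date comparisons described above can be sketched as a single check. This is an illustration under assumptions: dates are held as Python `date` objects, a missing Valid_Date is represented as `None`, and the function name `date_pair_issue` and its return labels are hypothetical. Equal dates are not flagged, since the text only identifies a Valid_Date before the Effective_Date as a potential invalid response.

```python
from datetime import date


def date_pair_issue(effective_date, valid_date, max_years=5):
    """Return a label for a suspicious effective/valid date pair, or None if no issue."""
    if valid_date is None:
        return None  # nothing to compare (e.g., no end date assigned)
    if valid_date < effective_date:
        return "valid_before_effective"
    try:
        limit = effective_date.replace(year=effective_date.year + max_years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year; use February 28.
        limit = effective_date.replace(year=effective_date.year + max_years, day=28)
    if valid_date >= limit:
        return "duration_5_years_or_more"
    return None


print(date_pair_issue(date(2010, 1, 1), date(2009, 12, 31)))  # valid_before_effective
print(date_pair_issue(date(2010, 1, 1), date(2015, 1, 1)))    # duration_5_years_or_more
print(date_pair_issue(date(2010, 1, 1), date(2012, 6, 30)))   # None
```

A flag from this check does not indicate which of the two dates is wrong, only that the pair deserves review.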
NRF variables | Valid responses (number) | Valid responses (percent) | Blanks / missing (number) | Blanks / missing (percent) | Invalid responses (number) | Invalid responses (percent) |
---|---|---|---|---|---|---|
Landing Year | 2,598,340 | 28.97 | 6,371,550 | 71.03 | 0 | 0.00 |
birth_year | 8,968,260 | 99.98 | 1,280 | 0.01 | 360 | 0.00 |
birth_month | 8,968,570 | 99.99 | 1,330 | 0.01 | 0 | 0.00 |
gender | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
COUNTRY_BIRTH | 8,949,870 | 99.78 | 20,030 | 0.22 | 0 | 0.00 |
NUMBER_OTHER_PERMITS | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
NUMBER_REFUGEE_CLAIMS | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
NUMBER_WORK_PERMITS | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
NUMBER_STUDY_PERMITS | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
NUMBER_ALL_PERMITS | 8,969,900 | 100.00 | 0 | 0.00 | 0 | 0.00 |
COUNTRY_RESIDENCE | 20,789,740 | 96.48 | 759,190 | 3.52 | 0 | 0.00 |
COUNTRY_CITIZENSHIP | 21,490,530 | 99.73 | 58,400 | 0.27 | 0 | 0.00 |
LEVEL_OF_STUDY_ROLLUP | 7,471,830 | 34.67 | 14,077,090 | 65.33 | 0 | 0.00 |
LEVEL_OF_STUDY | 7,471,830 | 34.67 | 14,077,090 | 65.33 | 0 | 0.00 |
SKILL_LEVEL_CD11 | 13,973,970 | 64.85 | 7,574,960 | 35.15 | 0 | 0.00 |
OCCUPATION_CD | 13,981,170 | 64.88 | 7,567,750 | 35.12 | 0 | 0.00 |
NOC5_CD11 | 13,973,970 | 64.85 | 7,574,960 | 35.15 | 0 | 0.00 |
NOC4_CD11 | 13,973,970 | 64.85 | 7,574,960 | 35.15 | 0 | 0.00 |
NOC3_CD11 | 13,973,970 | 64.85 | 7,574,960 | 35.15 | 0 | 0.00 |
NOC2_CD11 | 13,973,970 | 64.85 | 7,574,960 | 35.15 | 0 | 0.00 |
DESTINATION_CSD | 19,872,150 | 92.22 | 1,676,780 | 7.78 | 0 | 0.00 |
DESTINATION_CMA | 19,872,150 | 92.22 | 1,676,780 | 7.78 | 0 | 0.00 |
DESTINATION_PROVINCE | 19,872,190 | 92.22 | 1,676,740 | 7.78 | 0 | 0.00 |
DESTINATION_CD | 19,872,150 | 92.22 | 1,676,780 | 7.78 | 0 | 0.00 |
DESTINATION_ER | 19,872,150 | 92.22 | 1,676,780 | 7.78 | 0 | 0.00 |
effective_date | 21,548,930 | 100.00 | 0 | 0.00 | 0 | 0.00 |
valid_date | 20,496,420 | 95.12 | 1,052,510 | 4.88 | 0 | 0.00 |
DOCUMENT_TYPE | 21,548,930 | 100.00 | 0 | 0.00 | 0 | 0.00 |
SPECIAL_PROGRAM | 4,008,430 | 18.60 | 17,540,500 | 81.40 | 0 | 0.00 |
CLASSIFICATION_ID | 8,012,660 | 37.18 | 13,536,270 | 62.82 | 0 | 0.00 |
LMIA_EXEMPTIONS | 9,224,990 | 42.81 | 12,323,930 | 57.19 | 0 | 0.00 |
Notes: NPR: non-permanent resident. Variables come from either the person level file or the permit level file. Only variables with missing or invalid values were included in the table. All numbers are rounded. Effective_date and Valid_date variables can be invalid when compared against each other; see Section 7.5.2 for details. Source: Statistics Canada, 2022 Longitudinal Immigration Database. |
7.6 Quality assessment of the Province of Residence Variable (PRCO_)
A validation of the geography variables included in the IMDB tax files was done. This section discusses how the variable Province of Residence (PRCO_) was derived and its quality.
The Province of residence (PRCO_) is based on information from taxfilers when available. Missing information from the province of residence is replaced by information collected on the postal code of the mailing address either from the individual (PSCO_I), if available, otherwise from the family (PSCO_F).
PRCO | Province and Territories | First character of the postal code (PSCO) |
---|---|---|
0 | Newfoundland and Labrador | A |
2 | Nova Scotia | B |
1 | Prince Edward Island | C |
3 | New Brunswick | E |
4 | Quebec | G, H, J |
5 | Ontario | K, L, M, N, P |
6 | Manitoba | R |
7 | Saskatchewan | S |
8 | Alberta | T |
9 | British Columbia | V |
10 | Northwest Territories | X |
11 | Yukon | Y |
12 | Non-residents | missing |
14 | Nunavut | X |
Note: For some records, the postal code value is U, F or blank; U indicates a United States address and F indicates a foreign address. |
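The fallback rule described above (use the reported province of residence when available, otherwise the first character of the individual's postal code, otherwise that of the family's) can be sketched as follows. This is an illustration, not the production derivation: the function name `derive_prco` and the use of `None` for missing values are assumptions. The letter X is deliberately left unmapped because the table assigns it to both code 10 and code 14, so the postal code alone cannot resolve it.

```python
# First character of the postal code -> PRCO code, taken from the table above.
# "X" is omitted on purpose: it corresponds to both code 10 and code 14.
POSTAL_LETTER_TO_PRCO = {
    "A": 0, "B": 2, "C": 1, "E": 3,
    "G": 4, "H": 4, "J": 4,
    "K": 5, "L": 5, "M": 5, "N": 5, "P": 5,
    "R": 6, "S": 7, "T": 8, "V": 9, "Y": 11,
}


def derive_prco(prco, psco_i=None, psco_f=None):
    """Return the reported PRCO if present; otherwise infer it from the
    individual postal code, or from the family postal code if the
    individual one is not available. Returns None when nothing can be inferred."""
    if prco is not None:
        return prco
    postal_code = psco_i or psco_f
    if not postal_code:
        return None
    first_letter = postal_code.strip()[:1].upper()
    return POSTAL_LETTER_TO_PRCO.get(first_letter)


print(derive_prco(None, "V6B 1A1", "K1A 0B1"))  # 9 (individual postal code is used)
print(derive_prco(None, None, "K1A 0B1"))       # 5 (falls back to the family postal code)
print(derive_prco(None, "X0A 0H0"))             # None (X is ambiguous)
```

The check `prco is not None` rather than a truthiness test matters here, because 0 is a valid province code.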
While the Province of residence (PRCO_) is more reliable than the Taxing province (TXPCO_), some anomalies were observed, mostly for the non-resident code, in the reporting for taxation years 1989, 1993 and 1998. These may impact specific provinces.
For the 1993 tax year, IMDB_T1FF includes anomalies for the province of Manitoba, which shows an unusual number of residents (48,130 in 1993, compared with 33,650 the tax year before and 37,365 the tax year after). Similar changes are observed for the Northwest Territories. Additionally, 740 individuals are coded as residing in Nunavut, although Nunavut was not created until 1999, and 725 individuals are coded as residing in multiple jurisdictions. Users can use the information from the variable PSCO_F to diminish the effect of the anomalies on analyses that include province of residence. However, as stated above, the reference times differ between PSCO_ (based on residence at the time of filing) and PRCO_ (residence on December 31).
Non-resident (PRCO_=12) records appear to be overestimated in the 1989 IMDB_T1FF. The file includes 79,210 non-residents of Canada, many of whom have a non-permanent residency status. Users can decide to use the postal code of the mailing address (PSCO_ at the individual or family level) to derive the value of PRCO_ or to remove the non-residents from their analysis.
In the 1998 IMDB_T1FF, a higher than expected number of records are assigned to Newfoundland and Labrador (PRCO_). In these cases the place of residence of the family at the time of filing is also Newfoundland based on variable PSCO_F.